Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries

The rapid digitization of cultural heritage collections, especially those featuring Arabic-script texts, presents distinct challenges related to access, discoverability, and long-term preservation. This paper proposes a metadata-driven framework for Arabic-script digital libraries, which is currently being explored through initial prototyping as part of the Digital Maktaba project. The framework leverages validated metadata from the Diamond catalogue and the La Pira Library’s extensive collection. To address the technical complexities of Arabic script, including calligraphy, diacritics, and ligatures, the project employs frontispiece images and the Kraken OCR engine within the eScriptorium platform to train high-accuracy recognition models. The cataloging workflow is structured around international standards such as Dublin Core and is informed by both topographic and thematic classification practices. As part of ongoing development, the project is evaluating the use of large language models (LLMs), including Arabic-specialized models, to assess their potential for extracting semantic metadata from digitized texts. This effort aims to enrich subject headings and improve classification depth, contributing to AI-assisted indexing and enhanced resource discovery. By combining human-validated metadata with machine learning pipelines, the Digital Maktaba project aims to provide a scalable, standards-aligned approach for building Arabic digital libraries, with broader applicability to other underrepresented language collections.
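The abstract states that the cataloging workflow is structured around standards such as Dublin Core. As a rough illustration only, the sketch below serializes a bibliographic record as simple Dublin Core elements using Python's standard library. The `to_dublin_core` helper, the flat `<record>` wrapper, and the choice of fields are assumptions made for this example; they are not taken from the Digital Maktaba project, whose actual schema and tooling are not described here.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dublin_core(record: dict[str, list[str]]) -> ET.Element:
    """Build a flat <record> element with one dc:* child per value.

    Hypothetical helper: the wrapper element name and the flat layout
    are illustrative choices, not the project's format.
    """
    root = ET.Element("record")
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return root


# Example record built from this paper's own bibliographic data.
# The mapping of fields to DC elements is an assumption for illustration.
example = {
    "title": [
        "Digital Maktaba project: Toward a metadata-driven, "
        "LLM-assisted framework for Arabic digital libraries"
    ],
    "creator": ["El Ganadi, Amina", "Gagliardelli, Luca", "Ruozzi, Federico"],
    "date": ["2025"],
    "type": ["Journal article"],
    "identifier": ["https://dx.doi.org/10.1007/s00799-025-00432-w"],
    "source": ["International Journal on Digital Libraries, 26(4), pp. 1-16"],
}

xml_text = ET.tostring(to_dublin_core(example), encoding="unicode")
print(xml_text)
```

Repeating an element (here `creator`) once per value follows the usual Dublin Core convention that every element is optional and repeatable, which keeps multi-author records simple to emit and to parse.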

Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries / El Ganadi, Amina; Gagliardelli, Luca; Ruozzi, Federico. - In: INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES. - ISSN 1432-5012. - 26:4(2025), pp. 1-16. [10.1007/s00799-025-00432-w]


El Ganadi, Amina (Writing – Review & Editing); Gagliardelli, Luca (Methodology); Ruozzi, Federico (Supervision)

2025



Year of publication: 2025
Date of first publication: 15 Oct 2025
Journal: INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES
Volume: 26
Issue: 4
First page: 1
Last page: 16
DOI: https://dx.doi.org/10.1007/s00799-025-00432-w
All authors: El Ganadi, Amina; Gagliardelli, Luca; Ruozzi, Federico
Type: Journal article


Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1388270
