The rapid digitization of cultural heritage collections, especially those featuring Arabic-script texts, presents distinct challenges related to access, discoverability, and long-term preservation. This paper proposes a metadata-driven framework for Arabic-script digital libraries, which is currently being explored through initial prototyping as part of the Digital Maktaba project. The framework leverages validated metadata from the Diamond catalogue and the La Pira Library’s extensive collection. To address the technical complexities of Arabic script, including calligraphy, diacritics, and ligatures, the project employs frontispiece images and the Kraken OCR engine within the eScriptorium platform to train high-accuracy recognition models. The cataloging workflow is structured around international standards such as Dublin Core and is informed by both topographic and thematic classification practices. As part of ongoing development, the project is evaluating the use of large language models (LLMs), including Arabic-specialized models, to assess their potential for extracting semantic metadata from digitized texts. This effort aims to enrich subject headings and improve classification depth, contributing to AI-assisted indexing and enhanced resource discovery. By combining human-validated metadata with machine learning pipelines, the Digital Maktaba project aims to provide a scalable, standards-aligned approach for building Arabic digital libraries, with broader applicability to other underrepresented language collections.
Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries / El Ganadi, Amina; Gagliardelli, Luca; Ruozzi, Federico. - In: INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES. - ISSN 1432-5012. - 26:4(2025), pp. 1-16. [10.1007/s00799-025-00432-w]
Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries
El Ganadi, Amina (Writing – Review & Editing); Gagliardelli, Luca (Methodology); Ruozzi, Federico (Supervision)
2025

The metadata in IRIS UNIMORE are released under the Creative Commons CC0 1.0 Universal license, while the publication files are released under the Attribution 4.0 International (CC BY 4.0) license, unless otherwise indicated.




