Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries / El Ganadi, Amina; Gagliardelli, Luca; Ruozzi, Federico. - In: INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES. - ISSN 1432-5012. - 26:4(2025), pp. 1-16. [10.1007/s00799-025-00432-w]

Digital Maktaba project: Toward a metadata-driven, LLM-assisted framework for Arabic digital libraries

El Ganadi, Amina (Writing – Review & Editing); Gagliardelli, Luca (Methodology); Ruozzi, Federico (Supervision)
2025

Abstract

The rapid digitization of cultural heritage collections, especially those featuring Arabic-script texts, presents distinct challenges related to access, discoverability, and long-term preservation. This paper proposes a metadata-driven framework for Arabic-script digital libraries, which is currently being explored through initial prototyping as part of the Digital Maktaba project. The framework leverages validated metadata from the Diamond catalogue and the La Pira Library’s extensive collection. To address the technical complexities of Arabic script, including calligraphy, diacritics, and ligatures, the project employs frontispiece images and the Kraken OCR engine within the eScriptorium platform to train high-accuracy recognition models. The cataloging workflow is structured around international standards such as Dublin Core and is informed by both topographic and thematic classification practices. As part of ongoing development, the project is evaluating the use of large language models (LLMs), including Arabic-specialized models, to assess their potential for extracting semantic metadata from digitized texts. This effort aims to enrich subject headings and improve classification depth, contributing to AI-assisted indexing and enhanced resource discovery. By combining human-validated metadata with machine learning pipelines, the Digital Maktaba project aims to provide a scalable, standards-aligned approach for building Arabic digital libraries, with broader applicability to other underrepresented language collections.
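The abstract names Dublin Core as the standard behind the cataloging workflow. As a rough illustration only, the sketch below serializes this article's own bibliographic data as a flat Dublin Core record using Python's standard library. The `record` wrapper element and the `build_dc_record` helper are hypothetical names chosen for this example, and nothing here is taken from the Digital Maktaba implementation; only the namespace URI is the standard Dublin Core Elements 1.1 namespace.

```python
import xml.etree.ElementTree as ET

# Standard namespace of the Dublin Core Metadata Element Set 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Serialize (element_name, value) pairs as a flat Dublin Core XML string.

    The <record> wrapper is an assumption made for this sketch; real
    catalogues usually embed dc:* elements in a container such as oai_dc.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Values copied from this page's citation and handle.
fields = [
    ("title", "Digital Maktaba project: Toward a metadata-driven, "
              "LLM-assisted framework for Arabic digital libraries"),
    ("creator", "El Ganadi, Amina"),
    ("creator", "Gagliardelli, Luca"),
    ("creator", "Ruozzi, Federico"),
    ("date", "2025"),
    ("source", "International Journal on Digital Libraries, 26(4), pp. 1-16"),
    ("identifier", "https://hdl.handle.net/11380/1388270"),
    ("identifier", "doi:10.1007/s00799-025-00432-w"),
]

record_xml = build_dc_record(fields)
print(record_xml)
```

Repeating an element, as done here for `creator` and `identifier`, is permitted in Dublin Core, which is why the helper accepts a list of pairs rather than a dictionary.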
Year: 2025
Date: 15 October 2025
Volume: 26
Issue: 4
Pages: 1-16


Use this identifier to cite or link to this document: https://hdl.handle.net/11380/1388270