Model Card Metadata Collection from Hugging Face to Foster Multidisciplinary AI Research: A Dataset

Muhammad Asif Suryani, Saurav Karmakar, Brigitte Mathiak and Philipp Mayr

Knowledge Technologies for the Social Sciences

GESIS – Leibniz-Institut für Sozialwissenschaften, Köln, Germany

Keywords:

Hugging Face, Metadata Exploration, Metadata Collection, Large Language Models, Research Data

Management, Multidisciplinary Research, Dataset.

Abstract:

Metadata features carry valuable meta information that can support researchers in their tasks. Several studies have used scholarly metadata at various levels of granularity to assist a range of research tasks. The emergence of Large Language Models (LLMs) has brought an exciting change to the fields of Artificial Intelligence (AI) and Machine Learning (ML), a change that is supported by the Open Science initiative and the FAIR principles. One of the prominent platforms that makes these models available to research communities is Hugging Face. It provides democratized access to models while experiencing rapid growth as a repository. As of March 2025, Hugging Face hosts more than 1.4 million models, up from approximately 0.5 million in February 2024. In this dataset paper, we provide information on a large fraction of Hugging Face model cards. Our dataset comprises a wide range of metadata features that describe each model card. In this work, we aim to provide democratized access to a collection of diverse metadata features from Hugging Face model cards and to present an insightful overview of these cards, leveraging the metadata to support research communities and facilitate model adoption.

1 INTRODUCTION

Open Science is one of the driving factors that has transformed scientific disciplines by supporting access to research artifacts, e.g. publications and machine learning models. Research communities support this initiative by sharing research artifacts on various platforms to facilitate the dissemination of information. In this ecosystem, researchers and repositories are equally important: the former share the resources, and the latter aim to democratize access to these resources under certain guidelines to ensure that technology is available to all (Warzel et al., 2020; Dang et al., 2023).

Researchers across domains are continuously developing and tailoring intelligent solutions that leverage large language models (LLMs), which have recently transformed scientific disciplines. These models are actively adopted by research communities and are available at various repositories such as Hugging Face and GitHub. For LLMs, however, Hugging Face is one of the most prominent platforms: it hosts state-of-the-art models and provides access to them through its user-friendly interface and features. Hugging Face is widely adopted by research communities because of the availability and accessibility of diverse models (Yang et al., 2023; Face, 2025).

https://orcid.org/0000-0003-1669-5524
https://orcid.org/0009-0007-0124-5316
https://orcid.org/0000-0003-1793-9615
https://orcid.org/0000-0002-6656-1658

These models are becoming an essential part of research and development in every domain because they cover diverse applications. During the curation process, Hugging Face asks for information about each model in order to collect metadata features that enhance the user experience, similar to scholarly metadata features. These metadata features help users search for an appropriate model conveniently, and they provide an abstract overview of the models that can facilitate the model adoption process. However, this task is challenging and crucial because accessing the models themselves is generally expensive in terms of time. Metadata features could therefore play an essential role in coping with the rapid increase in the number of models and in facilitating model discovery for diverse research tasks (McMillan-Major et al., 2021; Face, 2025; Yang et al., 2024).

https://huggingface.co/
https://github.com/

Suryani, M. A., Karmakar, S., Mathiak, B. and Mayr, P.
Model Card Metadata Collection from Hugging Face to Foster Multidisciplinary AI Research: A Dataset.
DOI: 10.5220/0013571800003967
In Proceedings of the 14th International Conference on Data Science, Technology and Applications (DATA 2025), pages 583-590
ISBN: 978-989-758-758-0; ISSN: 2184-285X

This paper focuses on Hugging Face model cards and presents the coverage of their associated metadata features over two time periods between February 2024 and March 2025. The study also aims to show where such a collection of metadata features could be useful and how research communities may benefit from it. In addition, it highlights the dynamics of the metadata features to capture the latest trends on Hugging Face, including:

• The number of model cards over the two stages and the factor by which model cards are increasing.
• How many models provide complete metadata features?
• How can the adoption of model cards be indicated by the number of downloads and likes?
• The distribution of model cards by downloads over the two stages.
• Which models are the most influential in terms of downloads (top k models by month)?
• How many models provide information about licensing?
• How many models provide multilingual support?
• How many models link to external scholarly repositories such as arXiv?

The remainder of this paper is organized as follows. Section 2 presents the metadata feature coverage by describing the set of features, followed by related work in Section 3. Section 4 describes the metadata feature collection and presents initial explorations. The paper then discusses the community impact of this study, followed by the discussion and conclusion.

2 METADATA COVERAGE

This section briefly lists the heterogeneous metadata features associated with Hugging Face model cards. These features cover a wide range of technical and generic information. They are set by Hugging Face by default, but a model card does not necessarily carry all of them. For uniformity and simplicity, all standardized features are described with their default naming conventions as available in the data collection (Face, 2025; Suryani et al., 2025):

• The huggingface link directs to the landing page of the model card.
• The modelId/id indicates the name of the model and its hierarchy, which could lead to key references.
• The pipeline tag presents the primary task associated with each model card.
• The downloads feature is a key aspect that represents the number of times a model has been downloaded. In earlier settings it indicated cumulative downloads, but currently it reflects only the last month’s downloads.
• The likes feature indicates the cumulative number of likes a model has received over its lifetime.
• The library name specifies the library or framework associated with each model card.
• The private feature indicates the access status of the repository and is generally set to “False” by default.
• The createdAt feature shows the creation time, but it has been observed that it tends to reflect the date of access.
• The tags feature is key among the metadata features, as it provides further useful information associated with the model card, such as license, dataset, languages, parameters and arXiv ids. If no such information is provided, it carries “region:us” or “region:eu” by default.
• The trendingScore indicates trends regarding recently introduced model cards; its observed minimum value is 0 and its observed maximum is 999. The mechanism used to calculate it is not yet clear.

The information carried by this set of features can provide deeper insights into the models, and access to the metadata features of model cards may help research communities find the models relevant to them. Furthermore, these metadata features together convey meta information that is essential for supporting model adoption in multidisciplinary research.
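As an illustration of the nested information carried by the tags feature, the following minimal sketch groups tags by their prefix. The “key:value” encoding for license, dataset and arXiv tags is an assumption modeled on the “region:us” form mentioned above; the function name split_tags and the sample record are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict

# Assumed prefixes: the section names the kinds of information found in
# `tags` but not their encoding, so the "key:value" form is an assumption.
KNOWN_PREFIXES = ("license", "dataset", "arxiv", "region")


def split_tags(tags):
    """Group 'key:value' tags by key; keep all other tags under 'other'."""
    grouped = defaultdict(list)
    for tag in tags:
        key, separator, value = tag.partition(":")
        if separator and key in KNOWN_PREFIXES:
            grouped[key].append(value)
        else:
            grouped["other"].append(tag)
    return dict(grouped)


# Illustrative record only; the arXiv id is a placeholder.
example_tags = ["transformers", "en", "license:apache-2.0",
                "dataset:demo-corpus", "arxiv:0000.00000", "region:us"]
parsed = split_tags(example_tags)
print(parsed)
```

Tags without a recognized prefix, such as library names or language codes, are kept under “other” so that no information is dropped.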

3 RELATED WORK

In this problem setting, it is equally essential to highlight studies in which data sources from scholarly publications and model card features are used to improve the user experience alongside research data management practices. Such metadata features generally provide an abstract overview and have become an integral part of the research ecosystem.

Scholarly metadata is available at numerous repositories that cover different granularities. The Digital Bibliography and Library Project (DBLP) acquires metadata from scholarly resources covering a wide range of features and is currently one of the largest repositories for informatics and interdisciplinary research. OpenAlex is also one of the largest repositories and aims to collect and curate scholarly metadata across numerous scientific disciplines. The Research Organization Registry (ROR) streamlines metadata features for research organizations across the globe. These repositories cover distinct sets of metadata features and ultimately contribute towards a better community experience, discoverability, accessibility and adoption of technological advancements across research domains. For datasets, Zenodo and Kaggle are prominent repositories that host heterogeneous data sources to support research communities under the FAIR principles, thus promoting open science. It is also important to mention that ROR releases its data dump every month (Registry, 2025).

Owing to the large increase in the number of publications, numerous studies have targeted scholarly metadata to assist researchers through network exploration. One study utilizes Heterogeneous Information Networks (HIN), which can indicate potential relationships such as citation links, author collaborations, and research areas. The populated network is explored with a random walk strategy that simulates the relevance between papers (Du et al., 2020). Another study incorporates a large number of entities such as authors, papers and citations. Such network exploration models diverse data and has the potential to uncover patterns across scholarly resources (Liu et al., 2018).

Recently, an exploratory study of Hugging Face was conducted that considers heterogeneous metadata features of model cards. The study presents an insightful overview of model cards, using downloads and likes as the primary features for segregation. It also presents an overview of linking scholarly repositories and modeling the information as a Heterogeneous Information Network (HIN). Such informative representations highlight the importance of linking scholarly repositories at the metadata level and emphasize the need for scholarly repository harmonization (Suryani et al., 2024).

https://dblp.org/
https://openalex.org/
https://ror.org/
https://zenodo.org/
https://kaggle.com/

Moreover, a study of Hugging Face shows how its ecosystem struggles to provide a categorization for models targeting software development. It discusses an approach that may fill this gap by automating the classification of models for software development tasks, starting by extracting relevant information, including documentation and tags, from the model cards (Di Sipio et al., 2024).

However, another study highlights that a lack of transparency across these models can raise issues related to bias, fairness and potential legal risks. It provides insights into these models by examining model descriptions, pre-training datasets, training biases and licenses. The results show limited transparency regarding training datasets, biases, and licenses, along with several licensing violations, and the study stresses the need for improved transparency (Pepe et al., 2024; Castaño et al., 2023).

A technical report on 4chan gathers observations regarding various sensitive topics. The report describes the approach adopted to collect data from its various boards and briefly discusses how such a dataset can drive social studies of digital behavior (Culbert, 2023).

4 MODEL CARD DATASET COLLECTION AND EXPLORATION

Datasets are one of the key elements driving research, especially in this era of Large Language Models (LLMs). Datasets are generally made available across open repositories. However, it is equally important to make these datasets discoverable and accessible, in addition to providing useful meta information such as READMEs. A recent study on dataset search indicates that it is still difficult to find an appropriate dataset (Hulsebos et al., 2024).

4.1 Metadata Collection and Timeline

Over the course of this study, we collect the metadata features of the model cards available at Hugging Face via the Hugging Face API. A brief overview is presented in Figure 1. The process commences with metadata retrieval, which extracts unstructured metadata features from the Hugging Face API; these need to be processed further and converted into a structured format such as CSV. This ensures that all features remain intact with their respective descriptions as presented in Section 2.


The information parser handles inconsistencies through pre-processing steps that cover encoding issues, missing values, parsing of nested information and removal of redundancies, to make sure that the data is consistent. These processes are responsible for providing the metadata features in a structured format, which matters because metadata modeling is the key process: it studies the full feature set and extracts useful information from the metadata features individually and collectively by forming informative links among features. For example, downloads and likes values indicate model adoption. The dataset exploration module provides insights covering the distribution of models by features, top model cards by timeline, and pipeline/library tags, which can drive the interest of research communities by facilitating AI adoption across domains.

Figure 1: Model Card Metadata Collection Overview.
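The pre-processing steps described above can be sketched as follows, assuming each retrieved record is a dictionary holding some of the features from Section 2. The field names, the use of “;” to join nested tags, and the helper names to_rows and to_csv are illustrative assumptions, not the pipeline used to build the dataset.

```python
import csv
import io

# Assumed column names; the dataset's actual headers may differ.
FIELDS = ["id", "pipeline_tag", "library_name", "downloads", "likes", "tags"]


def blank_if_missing(value):
    """Represent a missing value as an empty cell instead of guessing one."""
    return "" if value is None else value


def to_rows(records):
    """Flatten raw records, skipping duplicates and records without an id."""
    seen_ids = set()
    rows = []
    for record in records:
        model_id = record.get("id") or record.get("modelId")
        if not model_id or model_id in seen_ids:
            continue  # redundancy removal
        seen_ids.add(model_id)
        rows.append({
            "id": model_id,
            "pipeline_tag": blank_if_missing(record.get("pipeline_tag")),
            "library_name": blank_if_missing(record.get("library_name")),
            "downloads": blank_if_missing(record.get("downloads")),
            "likes": blank_if_missing(record.get("likes")),
            # Nested list of tags flattened into a single cell.
            "tags": ";".join(record.get("tags") or []),
        })
    return rows


def to_csv(rows):
    """Serialize the flattened rows as CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical raw records, including a duplicate and missing values.
raw = [
    {"id": "org/model-a", "pipeline_tag": "text-generation",
     "library_name": "transformers", "downloads": 12, "likes": 3,
     "tags": ["en", "license:mit"]},
    {"modelId": "org/model-b", "downloads": None, "tags": None},
    {"id": "org/model-a", "downloads": 12},
]
rows = to_rows(raw)
csv_text = to_csv(rows)
print(csv_text)
```

Keeping missing values as empty cells, rather than replacing them with zero, avoids confusing a model that reports no downloads with one whose download count was not available.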

The metadata acquisition process commenced in February 2024 and was later divided into two stages by timeline, i.e. February 2024 to April 2024 and September 2024 to March 2025. The selection of these timelines is based on the availability and interpretability of the model card metadata features. For example, Hugging Face currently provides monthly downloads for model cards, whereas previously it provided cumulative downloads. Over these timelines we aim to present consolidated insights to research communities on how metadata features can help in the model discovery process. Moreover, to support the study, a couple of dataset instances are available at Zenodo (Suryani et al., 2025).
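To make the two-stage timeline concrete, the sketch below assigns a snapshot date to a stage. The months are those stated above; treating each stage as running from the first day of its first month to the last day of its last month is an assumption, and the name stage_of is hypothetical.

```python
from datetime import date

# Stage boundaries by month as stated in the text; the exact start and
# end days are assumptions made for this illustration.
STAGES = {
    1: (date(2024, 2, 1), date(2024, 4, 30)),
    2: (date(2024, 9, 1), date(2025, 3, 31)),
}


def stage_of(snapshot_date):
    """Return the stage containing the date, or None if it lies outside both."""
    for stage, (start, end) in STAGES.items():
        if start <= snapshot_date <= end:
            return stage
    return None


print(stage_of(date(2024, 3, 15)))
print(stage_of(date(2024, 6, 1)))
print(stage_of(date(2025, 1, 10)))
```

Dates between the two stages map to None, so that snapshots outside the collection windows are not silently mixed into either stage.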

4.2 Dataset Exploration

This section presents comprehensive insights into the Hugging Face model cards by exploiting the metadata features over the two stages.

Figure 2: Model Card Overview by Downloads Stage 1.

Figure 3: Model Card Overview by Downloads Stage 2.

Figure 2 and Figure 3 present the distribution of model cards, in which the cards are divided into two groups: models with zero downloads and models with more than zero downloads.

It is evident that almost half of the model cards have no downloads, which is an important observation for research communities as well as for the repository.
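The two-group split used in these figures can be reproduced with a few lines of code. The sketch below is a minimal illustration on made-up counts; the function name split_by_zero and the sample values are hypothetical, and the same function applies to likes.

```python
def split_by_zero(values):
    """Count entries equal to zero and entries greater than zero."""
    zero = sum(1 for value in values if value == 0)
    positive = sum(1 for value in values if value > 0)
    total = zero + positive
    return {
        "zero": zero,
        "positive": positive,
        # Share of entries with a zero count; 0.0 for an empty input.
        "zero_share": zero / total if total else 0.0,
    }


# Hypothetical download counts for four model cards.
sample_downloads = [0, 0, 5, 12]
summary = split_by_zero(sample_downloads)
print(summary)
```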

Moreover, Figure 4 and Figure 5 show the distribution of model cards by likes over the two stages, respectively. As with downloads, the model cards are split into two sets, i.e. models with likes and models without likes. These insights show the extent to which this feature is adopted by the communities, and hence likes cannot be used as a primary parameter to measure the impact of model cards. However, the figures show a substantial increase in the total number of models over time. It is also important to mention that a considerable number of models have zero downloads, which may indicate limited adoption across research communities. The focus will therefore be on the model cards with downloads greater than zero.

Figure 4: Model Card Overview by Likes Stage 1.

Figure 5: Model Card Overview by Likes Stage 2.

Besides downloads and likes, tags is also an important feature. It comprises nested information that gives a useful overview of the model cards, for example arXiv ids, dataset mentions, license information, language support and more. However, it is essential to mention that such information is not uniform across model cards and depends on the curation process.

Figure 6 and Figure 7 show the total number of model cards at these timestamps and also present the auxiliary information available across these models.

These figures highlight that a relatively small number of model cards provide useful information. Only about one third of the model cards, or even fewer, provide information about licenses, which is a crucial indicator in the adoption process. Moreover, a very low number of models provide information about the relevant datasets, and the same trend persists for models mentioning publications. The availability of such valuable meta information would definitely enhance overall reproducibility and user experience.

Figure 6: Model Card Overview by Tags Features Stage 1.

Figure 7: Model Card Overview by Tags Features Stage 2.
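Coverage figures of this kind can be computed by checking, for each model card, whether its tags contain a given kind of entry. The sketch below is a minimal illustration: the “license:”, “dataset:” and “arxiv:” prefixes are assumed encodings, the function name coverage is hypothetical, and the sample cards are invented.

```python
def coverage(cards, prefixes=("license:", "dataset:", "arxiv:")):
    """For each prefix, return (count, share) of cards with a matching tag."""
    total = len(cards)
    result = {}
    for prefix in prefixes:
        count = sum(
            1 for tags in cards
            if any(tag.startswith(prefix) for tag in tags)
        )
        result[prefix.rstrip(":")] = (count, count / total if total else 0.0)
    return result


# Each inner list holds the tags of one hypothetical model card.
sample_cards = [
    ["license:mit", "dataset:demo-corpus", "en"],
    ["region:us"],
    ["license:apache-2.0", "arxiv:0000.00000"],
]
stats = coverage(sample_cards)
print(stats)
```

A card is counted once per kind of information, no matter how many matching tags it carries, so the shares describe model cards rather than tags.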

Furthermore, it is also important to consider features covering the technological aspects of model cards, as access to such information is key in the adoption process. Figure 8 and Figure 9 present insights on the library and pipeline features.

The aim of this representation is to indicate the trends in the libraries and pipelines that research communities are using.

Moreover, for simplicity we present the top 10 most frequent items. The trends across both timestamps remain similar, with “transformers” and “text-generation” being the most widely adopted under the library and pipeline features, respectively. In the second stage, however, a fluctuation is observed for image-based pipelines. From these explorations it is evident that the metadata features can provide useful insights to research communities, which could pave the way for the uptake of these models across multidisciplinary research while assisting the model discovery process.

Figure 8: Pipeline and Library Overview Stage 1.

Figure 9: Pipeline and Library Overview Stage 2.
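A ranking of the most frequent library or pipeline values can be obtained with a simple frequency count. The following sketch assumes records are dictionaries with a library_name field; the field name, the function name top_values and the sample records are illustrative assumptions.

```python
from collections import Counter


def top_values(records, field, k=10):
    """Return the k most frequent non-empty values of a field."""
    counts = Counter(
        record[field] for record in records if record.get(field)
    )
    return counts.most_common(k)


# Hypothetical records; one of them has no library information.
sample_records = [
    {"library_name": "transformers"},
    {"library_name": "diffusers"},
    {"library_name": "transformers"},
    {"library_name": None},
    {"library_name": "transformers"},
]
ranking = top_values(sample_records, "library_name", k=2)
print(ranking)
```

Records without a value for the field are skipped, so the ranking reflects only model cards that actually declare a library or pipeline.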

5 COMMUNITY IMPACT

It is essential to discuss how exploring model card metadata can provide valuable insights by targeting pipelines, libraries, language coverage, and datasets. These trends may reflect model popularity, preferred licensing, and dataset diversity, enabling researchers to align their work with the evolving needs of the community. Furthermore, tracking model cards with arXiv ids can help identify models with papers and foster research collaborations by enhancing reproducibility. These insights also support data-driven decision making and facilitate future research and open science initiatives within the research communities.

Figure 10: Top 5 Model Cards in Stage 1.

To support these arguments, Figure 10 presents the top five model cards from stage 1, which illustrate a shift in technology as well as the adoption of various model cards across research communities. It also highlights the potential contributors driving advancements in research and technology.

Similarly, Figure 11 illustrates the top models for each month across stage 2, revealing notable trends in model adoption, pipeline preferences, and dataset utilization. In stage 1, a strong presence of transformers-based models was observed, particularly for text-classification and text-generation tasks. The presence of models with papers in stage 1, although not dominant, suggests that some models originate from published research. However, the limited diversity in datasets and licenses can be a crucial concern for community-driven innovation.

In contrast, stage 2 showcases a broader shift in model diversity and complexity. Notably, speech recognition and multi-modal models emerge alongside traditional text-based models, signifying growing interest in speech and vision applications. The adoption of models utilizing diffusers for generative tasks is also notable, indicating a rise in image generation and diffusion-based techniques. Furthermore, the higher number of models with associated publications in this stage suggests that research-driven contributions are becoming more prevalent and may foster stronger linkages between academia and industry.

Figure 11: Top 5 Model Cards in Stage 2.
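A per-month ranking such as the one behind the top model cards can be sketched as follows. The input is assumed to be a list of (month, model id, downloads) tuples taken from monthly snapshots; this layout, the function name top_k_by_month and the sample values are hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_by_month(snapshots, k=5):
    """Return, for each month, the ids of the k most downloaded models."""
    by_month = defaultdict(list)
    for month, model_id, downloads in snapshots:
        by_month[month].append((downloads, model_id))
    return {
        month: [model_id for _, model_id in heapq.nlargest(k, entries)]
        for month, entries in sorted(by_month.items())
    }


# Hypothetical monthly snapshots.
sample_snapshots = [
    ("2024-09", "org/a", 100),
    ("2024-09", "org/b", 250),
    ("2024-09", "org/c", 10),
    ("2024-10", "org/a", 300),
    ("2024-10", "org/c", 50),
]
top_models = top_k_by_month(sample_snapshots, k=2)
print(top_models)
```

Ranking within each month separately matters when downloads are reported per month, since counts from different snapshots are then not directly comparable as a single cumulative total.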

Moreover, this exploration highlights the impor-

tance of metadata features within model cards and

emphasizes the need to collect such metadata as a

dataset to support the research community in the

model adoption process. These insights hold signif-

icant implications for both the research communities

and the Hugging Face ecosystem. By analyzing these

trends, this exploration may serve as a valuable ref-

erence point for researchers across various domains,

offering guidance in shaping future directions and

addressing the needs of model development aligned

with open science initiatives and research data man-

agement.

6 DISCUSSION AND CONCLUSION

In this era, data-driven applications are becoming indispensable across research communities, increasing the need for studies that focus not only on data acquisition but also on the necessary exploration. Such efforts not only provide an insightful overview but also help in managing rapidly evolving technological dynamics. These data sources can play a crucial role in driving and reshaping research practices.

This study aims to democratize access to the metadata features of Hugging Face model cards, offering them as a valuable data source in various meaningful configurations. Such exploration, alongside Hugging Face’s existing search mechanism, would help researchers find appropriate models conveniently. However, it is also important for Hugging Face to introduce a deprecation policy to further optimize the user experience related to model cards. In the future, more dataset instances will be shared on Zenodo to facilitate the adoption of AI across research communities.

This work has exciting future prospects, such as a Model Recommender System and Hugging Face Model Card Leaderboards that provide users with valuable recommendations based on the metadata features of model cards alongside the technical descriptions available on the landing page of each model card. Moreover, including the current state of development of the models in these recommendations would be interesting.

Furthermore, an exciting direction is the harmonization of research artifacts across scholarly repositories, which would enrich the research ecosystem with more linked information for researchers, such as “Authors with Models” and “Organization with Models”. In addition, exploiting model card provenance will also be interesting and may yield valuable insights.

ACKNOWLEDGEMENTS

This work has been partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), NFDI4DS (Grant number 460234259). The authors also acknowledge Hugging Face as the data source and thank the individuals involved in this research.

REFERENCES

Castaño, J., Martínez-Fernández, S., Franch, X., and Bogner, J. (2023). Exploring the carbon footprint of hugging face’s ml models: A repository mining study. In 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE.

Culbert, J. H. (2023). 4tct, a 4chan text collection tool. arXiv preprint arXiv:2307.03556.

Dang, V.-N., Aussenac-Gilles, N., Megdiche, I., and Ravat, F. (2023). Interoperability of open science metadata: What about the reality? In International Conference on Research Challenges in Information Science, pages 467–482. Springer.

Di Sipio, C., Rubei, R., Di Rocco, J., Di Ruscio, D., and Nguyen, P. T. (2024). Automated categorization of pre-trained models for software engineering: A case study with a hugging face dataset. arXiv preprint arXiv:2405.13185.

Du, N., Guo, J., Wu, C. Q., Hou, A., Zhao, Z., and Gan, D. (2020). Recommendation of academic papers based on heterogeneous information networks. In 2020 IEEE/ACS 17th International Conference on Computer Systems and Applications (AICCSA), pages 1–6. IEEE.

Face, H. (2025). Hugging face apis. https://huggingface.co. Accessed: 2025-02-20.

Hulsebos, M., Lin, W., Shankar, S., and Parameswaran, A. (2024). It took longer than i was expecting: Why is dataset search still so hard? In Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics, pages 1–4.

Liu, J., Tang, T., Wang, W., Xu, B., Kong, X., and Xia, F. (2018). A survey of scholarly data visualization. IEEE Access, 6:19205–19221.

McMillan-Major, A., Osei, S., Rodriguez, J. D., Ammanamanchi, P. S., Gehrmann, S., and Jernite, Y. (2021). Reusable templates and guides for documenting datasets and models for natural language processing and generation: A case study of the huggingface and gem data and model cards. arXiv preprint arXiv:2108.07374.

Pepe, F., Nardone, V., Mastropaolo, A., Bavota, G., Canfora, G., and Di Penta, M. (2024). How do hugging face models document datasets, bias, and licenses? an empirical study. In Proceedings of the 32nd IEEE/ACM International Conference on Program Comprehension, pages 370–381.

Registry, R. O. (2025). Ror data.

Suryani, M. A., Karmakar, S., and Mathiak, B. (2024). Exploration of hugging face models by heterogeneous information network and linking across scholarly repositories. In International Conference on Advances in Social Networks Analysis and Mining, pages 371–386. Springer.

Suryani, M. A., Karmakar, S., Mathiak, B., Mutschke, P., and Mayr, P. (2025). Hugging face model cards metadata dataset.

Warzel, D., Fitzmartin, R., Zhou, F., et al. (2020). Fair data sharing: the roles of common data elements and harmonization. Journal of biomedical informatics, 107:103421.

Yang, Z., Shi, J., Devanbu, P., and Lo, D. (2024). Ecosystem of large language models for code. arXiv preprint arXiv:2405.16746.

Yang, Z., Wang, C., Shi, J., Hoang, T., Kochhar, P., Lu, Q., Xing, Z., and Lo, D. (2023). What do users ask in open-source ai repositories? an empirical study of github issues. In 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), pages 79–91. IEEE.
