Authors: Andreas Waldis, Luca Mazzola and Michael Kaufmann

Affiliation: Lucerne University of Applied Sciences, School of Information Technology, 6343 Rotkreuz, Switzerland

Keyword(s): Natural Language Processing, Concept Extraction, Convolutional Neural Network.

Related Ontology Subjects/Areas/Topics: Artificial Intelligence ; Biomedical Engineering ; Business Analytics ; Data Engineering ; Data Management and Quality ; Data Mining ; Databases and Information Systems Integration ; Datamining ; Enterprise Information Systems ; Health Information Systems ; Semi-Structured and Unstructured Data ; Sensor Networks ; Signal Processing ; Soft Computing ; Text Analytics

Abstract: For knowledge management purposes, it would be useful to classify and tag documents automatically based on their content. Concept extraction is one way of achieving this, using statistical or semantic methods. Whereas index-based keyphrase extraction can extract relevant concepts for documents, the inverse document index grows exponentially with the number of words that candidate concepts can have. To address this issue, the present work trains convolutional neural networks (CNNs) containing vertical and horizontal filters to learn, from a training set of labeled examples, whether an N-gram (i.e., a consecutive sequence of N characters or words) is a concept or not. The classification training signal is derived from the Wikipedia corpus, knowing that an N-gram certainly represents a concept if a corresponding Wikipedia page title exists. The CNN input feature is the vector representation of each word, derived from a word embedding model; the output is the probability that an N-gram represents a concept. Multiple configurations for vertical and horizontal filters were analyzed and configured through a hyper-parameterization process. The results demonstrated a precision of between 60 and 80 percent on average. This precision decreased drastically as N increased. However, combined with a TF-IDF based relevance ranking, the top five N-gram concepts calculated for Wikipedia articles showed a high precision of 94%, similar to part-of-speech (POS) tagging for concept recognition combined with TF-IDF, but with a much better recall for higher N. The CNN seems to prefer longer sequences of N-grams as identified concepts, and can also correctly identify sequences of words normally ignored by other methods. Furthermore, in contrast to POS filtering, the CNN method does not rely on predefined rules, and could thus provide language-independent concept extraction.
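
The labeling idea described in the abstract (an N-gram counts as a concept if a matching Wikipedia page title exists) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngrams` and `label_ngrams`, the toy title set, and the lowercase matching rule are all assumptions made for illustration, and the CNN classifier itself is not shown.

```python
from typing import Iterable


def ngrams(tokens: list[str], max_n: int) -> list[tuple[str, ...]]:
    """Return all consecutive word N-grams with 1 <= N <= max_n."""
    return [
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


def label_ngrams(
    tokens: list[str], titles: Iterable[str], max_n: int = 3
) -> dict[tuple[str, ...], int]:
    """Label each N-gram 1 if it matches a known page title, else 0.

    Assumption: matching is a case-insensitive exact string comparison.
    """
    normalized = {title.lower() for title in titles}
    return {
        gram: int(" ".join(gram).lower() in normalized)
        for gram in ngrams(tokens, max_n)
    }


# Hypothetical toy data standing in for the set of Wikipedia page titles.
TITLES = {"neural network", "convolutional neural network", "concept"}

tokens = "a convolutional neural network learns".split()
labels = label_ngrams(tokens, TITLES)
positives = sorted(" ".join(gram) for gram, y in labels.items() if y == 1)
print(positives)
```

In a setup like the one the abstract describes, such labeled N-grams would be the training examples, with each word replaced by its embedding vector before being passed to the classifier.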

CC BY-NC-ND 4.0

Paper citation in several formats:
Waldis, A., Mazzola, L. and Kaufmann, M. (2018). Concept Extraction with Convolutional Neural Networks. In Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA; ISBN 978-989-758-318-6; ISSN 2184-285X, SciTePress, pages 118-129. DOI: 10.5220/0006901201180129

@conference{data18,
author={Andreas Waldis and Luca Mazzola and Michael Kaufmann},
title={Concept Extraction with Convolutional Neural Networks},
booktitle={Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA},
year={2018},
pages={118-129},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006901201180129},
isbn={978-989-758-318-6},
issn={2184-285X},
}

TY - CONF

JO - Proceedings of the 7th International Conference on Data Science, Technology and Applications - DATA
TI - Concept Extraction with Convolutional Neural Networks
SN - 978-989-758-318-6
IS - 2184-285X
AU - Waldis, A.
AU - Mazzola, L.
AU - Kaufmann, M.
PY - 2018
SP - 118
EP - 129
DO - 10.5220/0006901201180129
PB - SciTePress