Data-driven Relation Discovery from Unstructured Texts

Marilena Ditta, Fabrizio Milazzo, Valentina Ravì, Giovanni Pilato, Agnese Augello

2015

Abstract

This work proposes a data-driven methodology for extracting subject-verb-object triplets from a text corpus. Previous work in the field solved the problem by means of complex learning algorithms that require hand-crafted examples; our proposal avoids learning triplets from a dataset altogether and is built on top of a well-known baseline algorithm designed by Delia Rusu et al. The baseline algorithm uses only syntactic information to generate triplets and is characterized by very low precision, i.e., very few of the extracted triplets are meaningful. Our idea is to integrate the semantics of the words in order to filter out the wrong triplets, thus increasing the overall precision of the system. The algorithm has been tested on the Reuters Corpus and has shown good performance with respect to the baseline algorithm for triplet extraction.
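
The filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy word vectors, the cosine-similarity test, the 0.5 threshold, and the names `VECTORS`, `is_plausible`, and `filter_triplets` are all assumptions made for illustration. The sketch takes candidate triplets as already produced by some syntactic extractor and keeps only those whose words are semantically close.

```python
from math import sqrt

# Hypothetical toy word vectors (assumption). A real system would derive
# them from a corpus, for example with a latent semantic analysis model.
VECTORS = {
    "company": [0.9, 0.1, 0.0],
    "acquires": [0.8, 0.2, 0.1],
    "shares": [0.7, 0.3, 0.0],
    "banana": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_plausible(triplet, vectors, threshold=0.5):
    """Accept a (subject, verb, object) triplet only if the verb is
    semantically close to both the subject and the object.
    The criterion and the threshold are illustrative assumptions."""
    subj, verb, obj = triplet
    if any(word not in vectors for word in triplet):
        return False  # unknown words cannot be scored, so reject
    return (cosine(vectors[subj], vectors[verb]) >= threshold
            and cosine(vectors[verb], vectors[obj]) >= threshold)


def filter_triplets(candidates, vectors, threshold=0.5):
    """Drop the candidate triplets that fail the semantic test."""
    return [t for t in candidates if is_plausible(t, vectors, threshold)]


# Candidates as a purely syntactic extractor might emit them.
candidates = [
    ("company", "acquires", "shares"),
    ("company", "acquires", "banana"),
]
kept = filter_triplets(candidates, VECTORS)
print(kept)
```

With these toy vectors the second candidate is rejected because "banana" lies far from "acquires" in the vector space. Raising the threshold trades recall for precision, which is the trade-off the abstract targets.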

References

  1. Atapattu Mudiyanselage, T., Falkner, K., and Falkner, N. (2014). Acquisition of triples of knowledge from lecture notes: a natural language processing approach. In 7th International Conference on Educational Data Mining (04 Jul 2014-07 Jul 2014: London, United Kingdom).
  2. Bach, N. and Badaskar, S. (2007). A Survey on Relation Extraction.
  3. Berry, M., Do, T., O'Brien, G., Krishna, V., and Varadhan, S. (2015). Svdlibc. http://tedlab.mit.edu/~dr/svdlibc/.
  4. Callan, J. (2009). The clueweb09 dataset. http://www.lemurproject.org/clueweb09.php/.
  5. Ceran, B., Karad, R., Mandvekar, A., Corman, S. R., and Davulcu, H. (2012). A semantic triplet based story classifier. In Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pages 573-580. IEEE.
  6. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., and Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, pages 391-407.
  7. Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web. Commun. ACM, 51(12):68-74.
  8. Fader, A., Soderland, S., and Etzioni, O. (2011). Identifying relations for open information extraction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1535-1545, Stroudsburg, PA, USA. Association for Computational Linguistics.
  9. Kubler, S., McDonald, R., Nivre, J., and Hirst, G. (2009). Dependency Parsing. Morgan and Claypool Publishers.
  10. Landauer, T. K. and Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240.
  11. Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S. J., and McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55-60.
  12. Nadeau, D. and Sekine, S. (2007). A survey of named entity recognition and classification. Journal of Linguisticae Investigationes, 30(1):1-20.
  13. Pilato, G. and Vassallo, G. (2015). Tsvd as a statistical estimator in the latent semantic analysis paradigm. Emerging Topics in Computing, IEEE Transactions on, 3(2):185-192.
  14. Richard Socher, John Bauer, C. D. M. and Ng., A. Y. (2015a). Stanford ner. http://nlp.stanford.edu/software/CRF-NER.shtml.
  15. Richard Socher, John Bauer, C. D. M. and Ng., A. Y. (2015b). Stanford parser. http://nlp.stanford.edu/software/lex-parser.shtml.
  16. Rose, T., Stevenson, M., and Whitehead, M. (2002). The Reuters corpus volume 1 - from yesterday's news to tomorrow's language resources. In Proceedings of the Third International Conference on Language Resources and Evaluation, pages 29-31.
  17. Rusu, D., Dali, L., Fortuna, B., Grobelnik, M., and Mladenic, D. (2007). Triplet Extraction From Sentences. Proceedings of the 10th International Multiconference "Information Society - IS 2007", A:218-222.
  18. Schmid, H. (1995). Treetagger a language independent part-of-speech tagger. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.
  19. Shinyama, Y. and Sekine, S. (2006). Preemptive information extraction using unrestricted relation discovery. pages 304-311.
  20. Soderland, S., Roof, B., Qin, B., Xu, S., Etzioni, O., et al. (2010). Adapting open information extraction to domain-specific relations. AI magazine, 31(3):93-102.
  21. Surdeanu, M. (2015). Penn treebank project. http://www.surdeanu.info/mihai/teaching/ista555-spring15/readings/PennTreebankConstituents.html.


Paper Citation


in Harvard Style

Ditta M., Milazzo F., Ravì V., Pilato G. and Augello A. (2015). Data-driven Relation Discovery from Unstructured Texts. In Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: DART, (IC3K 2015) ISBN 978-989-758-158-8, pages 597-602. DOI: 10.5220/0005614205970602


in Bibtex Style

@conference{dart15,
author={Marilena Ditta and Fabrizio Milazzo and Valentina Ravì and Giovanni Pilato and Agnese Augello},
title={Data-driven Relation Discovery from Unstructured Texts},
booktitle={Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: DART, (IC3K 2015)},
year={2015},
pages={597-602},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005614205970602},
isbn={978-989-758-158-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: DART, (IC3K 2015)
TI - Data-driven Relation Discovery from Unstructured Texts
SN - 978-989-758-158-8
AU - Ditta M.
AU - Milazzo F.
AU - Ravì V.
AU - Pilato G.
AU - Augello A.
PY - 2015
SP - 597
EP - 602
DO - 10.5220/0005614205970602