Rule Management for Information Extraction from Title Pages of Academic Papers

Atsuhiro Takasu, Manabu Ohta

2014

Abstract

This paper discusses the problem of managing rules for page layout analysis and information extraction. We have been developing a system that extracts information from academic papers by exploiting both page layout and textual information. For this purpose, a conditional random field (CRF) analyzer is designed according to the layout of the target pages. Because academic papers use various layouts, we must prepare a set of rules for each type of layout to achieve high extraction accuracy. As the number of papers in a system grows, rule management becomes a significant problem. For example, when should we create a new set of rules, and how can we acquire them efficiently while new articles keep arriving? This paper examines two scores that measure the fitness of rules and the applicability of rules learned for another type of layout. We evaluate the scores for bibliographic information extraction from title pages of academic papers and show that they are effective for measuring the fitness. We also examine the sampling of training data when learning a new set of rules.
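The rule-management question raised in the abstract (reuse an existing rule set or learn a new one) can be illustrated with a minimal sketch. The abstract does not define the two scores, so everything here is an assumption made for illustration: `fitness_score` is a hypothetical stand-in that averages the log-confidence an analyzer assigns to its best label for each token, and `choose_rule_set`, the layout names, and the threshold are likewise invented. It is not the authors' method.

```python
import math
from typing import Dict, List, Optional, Tuple


def fitness_score(confidences: List[float]) -> float:
    """Hypothetical fitness: mean log-confidence over a page's tokens.

    Each value is assumed to be the probability (in (0, 1]) that an
    analyzer assigns to its best label for one token. Scores closer
    to 0 indicate a better fit between the rule set and the page.
    """
    if not confidences:
        raise ValueError("page has no tokens")
    return sum(math.log(p) for p in confidences) / len(confidences)


def choose_rule_set(
    scores_by_layout: Dict[str, float], threshold: float
) -> Tuple[Optional[str], float]:
    """Pick the best-scoring rule set, or None if none clears the threshold.

    Returning None stands for "no existing rule set fits; consider
    learning a new one". The threshold is an illustrative assumption.
    """
    best_layout = max(scores_by_layout, key=scores_by_layout.get)
    best_score = scores_by_layout[best_layout]
    if best_score < threshold:
        return None, best_score
    return best_layout, best_score


# Made-up confidences for one page under two hypothetical rule sets.
page_confidences = {
    "journal_a": [0.9, 0.8, 0.95],
    "journal_b": [0.5, 0.4, 0.6],
}
scores = {name: fitness_score(c) for name, c in page_confidences.items()}
layout, score = choose_rule_set(scores, threshold=math.log(0.7))
```

With these made-up numbers the page is assigned to `journal_a`. A page whose best score falls below the threshold would instead be flagged as a candidate for a new rule set.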



Paper Citation


in Harvard Style

Takasu A. and Ohta M. (2014). Rule Management for Information Extraction from Title Pages of Academic Papers. In Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-018-5, pages 438-444. DOI: 10.5220/0004827204380444


in Bibtex Style

@conference{icpram14,
author={Atsuhiro Takasu and Manabu Ohta},
title={Rule Management for Information Extraction from Title Pages of Academic Papers},
booktitle={Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2014},
pages={438-444},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004827204380444},
isbn={978-989-758-018-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Rule Management for Information Extraction from Title Pages of Academic Papers
SN - 978-989-758-018-5
AU - Takasu A.
AU - Ohta M.
PY - 2014
SP - 438
EP - 444
DO - 10.5220/0004827204380444