Teaching Machines to Recognize Idiomatic Expressions - A Comparative Analysis of Compound Word Recognition Results between Human and Machine Annotation

Totok Suhardijanto, Zahroh Nuriah, Setiawati Darmojuwono

2017

Abstract

This paper presents our research progress in building an automatic recognition system for compound words in Bahasa Indonesia. Our goal is to develop a system that is able to distinguish significant multiword expressions from other, insignificant groups of words. For instance, rumah tangga ‘household’ should be treated as a significant cluster of words, whereas rumah kayu ‘wooden house’ should not. It is not easy to differentiate a compound word from an ordinary phrase in Bahasa Indonesia because there are no specific phonological markers, such as accent in German or Dutch. Orthographical markers are not always present either: rumah tangga is written with a space, while kacamata ‘glasses’ is not. In this paper, we compare and analyze the results of machine and human annotation. The automatic annotation system is built with a statistical machine learning algorithm called conditional random fields. Data for the annotation task were collected from newspaper and magazine articles. In this analysis, a mixed-methods approach was applied to reveal the differences between human and machine annotation. The results showed that the machine achieved 69% accuracy and still showed several error patterns in compound word recognition tasks. Human annotation is trivial due to personal annotator backgrounds.
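To make the comparison between human and machine annotation concrete, the following is a minimal sketch of one common way such a task can be set up: compound words are encoded as B/I/O token labels (the usual input and output format for a conditional random field tagger), and the two annotations are compared by token-level accuracy. The abstract does not state the labeling scheme or how accuracy was computed, so the BIO scheme, the functions `bio_encode` and `token_accuracy`, and the example sentence are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple


def bio_encode(tokens: List[str], compounds: List[Tuple[int, int]]) -> List[str]:
    """Label each token B (begins a compound), I (inside one), or O (outside).

    `compounds` holds (start, end) token spans with `end` exclusive.
    This labeling scheme is an assumption, not taken from the paper.
    """
    labels = ["O"] * len(tokens)
    for start, end in compounds:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


def token_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens on which two label sequences agree."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# Made-up example sentence (not from the paper's data).
tokens = ["ibu", "mengurus", "rumah", "tangga", "di", "rumah", "kayu"]

# Hypothetical human annotation: only "rumah tangga" is a compound.
human = bio_encode(tokens, [(2, 4)])

# Hypothetical machine output: also marks the ordinary phrase "rumah kayu".
machine = bio_encode(tokens, [(2, 4), (5, 7)])

print(human)    # ['O', 'O', 'B', 'I', 'O', 'O', 'O']
print(machine)  # ['O', 'O', 'B', 'I', 'O', 'B', 'I']
print(round(token_accuracy(human, machine), 3))  # 0.714
```

In this toy case the machine agrees with the human annotator on 5 of 7 tokens; the two disagreements come from treating the ordinary phrase rumah kayu as a compound, which is the kind of confusion the abstract describes.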


Paper Citation


in Harvard Style

Suhardijanto T., Nuriah Z. and Darmojuwono S. (2017). Teaching Machines to Recognize Idiomatic Expressions - A Comparative Analysis of Compound Word Recognition Results between Human and Machine Annotation. In The Tenth Conference on Applied Linguistics and The Second English Language Teaching and Technology Conference in collaboration with The First International Conference on Language, Literature, Culture, and Education - Volume 1: CONAPLIN and ICOLLITE, ISBN 978-989-758-332-2, pages 376-380. DOI: 10.5220/0007167603760380


in Bibtex Style

@conference{conaplin and icollite17,
author={Totok Suhardijanto and Zahroh Nuriah and Setiawati Darmojuwono},
title={Teaching Machines to Recognize Idiomatic Expressions - A Comparative Analysis of Compound Word Recognition Results between Human and Machine Annotation},
booktitle={The Tenth Conference on Applied Linguistics and The Second English Language Teaching and Technology Conference in collaboration with The First International Conference on Language, Literature, Culture, and Education - Volume 1: CONAPLIN and ICOLLITE},
year={2017},
pages={376-380},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0007167603760380},
isbn={978-989-758-332-2},
}


in EndNote Style

TY - CONF

JO - The Tenth Conference on Applied Linguistics and The Second English Language Teaching and Technology Conference in collaboration with The First International Conference on Language, Literature, Culture, and Education - Volume 1: CONAPLIN and ICOLLITE
TI - Teaching Machines to Recognize Idiomatic Expressions - A Comparative Analysis of Compound Word Recognition Results between Human and Machine Annotation
SN - 978-989-758-332-2
AU - Suhardijanto T.
AU - Nuriah Z.
AU - Darmojuwono S.
PY - 2017
SP - 376
EP - 380
DO - 10.5220/0007167603760380