ALL ABOUT MICROTEXT

A Working Definition and a Survey of Current Microtext Research within

Artificial Intelligence and Natural Language Processing

Jeffrey Ellen

SPAWAR Systems Center Pacific, 53560 Hull St, San Diego, CA, U.S.A.

Keywords: Microtext, Natural language processing, Text classification, Semi-structured data, Information extraction,

Sentiment analysis, Topic summarization.

Abstract: This paper defines a new term, ‘Microtext’, and takes a survey of the most recent and promising research

that falls under this new definition. Microtext has three distinct attributes that differentiate it from the

traditional free-text or unstructured text considered within the AI and NLP communities. Microtext is text

that is generally very short in length, semi-structured, and characterized by amorphous or informal grammar

and language. Examples of microtext include chatrooms (such as IM, XMPP, and IRC), SMS, voice

transcriptions, and micro-blogging such as Twitter(tm). This paper expands on this definition, and provides

some characterizations of typical microtext data. Microtext is becoming more prevalent. It is the thesis of

this paper that the three distinct attributes of microtext yield different results and require different

techniques than traditional AI and NLP techniques on long-form free text. By creating a working definition

for microtext and providing a survey of the current state of research in the area, it is the goal of this paper to

create an understanding of microtext within the AI and NLP communities.

1 INTRODUCTION

Information retrieval and extraction on free text (e.g.

long form prose, newswire releases, emails, etc) is a

relatively vibrant and burgeoning research area

within the AI and NLP communities, but by

comparison there is a lack of studies and

experiments on shorter texts, especially where

grammar is less formal and abbreviations are more

common. One of the difficulties in organizing or

tracking this type of research is that there is no

common term differentiating these shorter, less

formal texts. This paper suggests ‘Microtext’ as

an appropriate term for this type of text. As

electronic communications become more prevalent,

we expect Microtext sources to become more

common, and more important in day-to-day

operations within every industry.

Microtext sources include point to point instant

messaging via any protocol (such as XMPP), Multi-

User Chatrooms or MUCs (such as IRC), SMS

(Short Message Service) common on mobile phones,

transcriptions of voice conversations, and micro-

blogging which has been popularized by Twitter and

similar services.

In section 2 this paper will introduce a working

definition for microtext, and characterize some

common microtext examples. Section 3 will survey

some NLP and AI papers that work on microtext

sources, with varying degrees of acknowledgement

or adjustment to the problem domain. This includes

examples focused on classification, clustering,

information extraction, sentiment analysis, etc.

Section 4 briefly illustrates some counterexamples,

and section 5 concludes the survey.

2 MICROTEXT DEFINITION

A definition of microtext is required for future

research efforts. The definition is not strict, in the

sense that it will not be defining an API or a

protocol, but a solid definition will certainly help

provide a ‘stake in the ground’ for future discussion

of work. A definition will serve two purposes

primarily. First, if adopted and utilized as a term or

keyword, it will greatly aid scientists and engineers

in locating similar research. Second, the definition

will assist future researchers by serving as the

delineation between microtext and long form text.

Ellen J.

ALL ABOUT MICROTEXT - A Working Definition and a Survey of Current Microtext Research within Artificial Intelligence and Natural Language Processing.

DOI: 10.5220/0003179903290336

In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence (ICAART-2011), pages 329-336

ISBN: 978-989-8425-40-9

2011 SCITEPRESS (Science and Technology Publications, Lda.)

Dalli, Xia, and Wilks (2004) presented a

summary of the “unique characteristics of email”

which consisted of essentially:

• Short messages between 2-800 words.

• Unconventional grammar & style (frequently).

• A cross between informal and traditional.

• Threading characteristics.

This type of definition is what is necessary for

microtext; however, 800 words is too long for many

of the microtext sources. For a point of reference,

there are approximately 700 words on this page.

Additionally, since microtext is encountered in

multiple media, not all of which include threading,

this definition cannot be used as is.

2.1 Working Definition

Microtext is considered to have three main

characteristics that separate it from the traditional

documents used in text categorization:

• Individual author contributions are very brief, consisting of as little as a single word, and almost always less than a paragraph. Frequently the contribution is a single sentence or less.

• The grammar used by the authors is generally informal and unstructured, relative to the pertinent domain. The tone is conversational, and frequently unedited; therefore, errors and abbreviations are more common.

• The text is ‘semi-structured’ by traditional NLP definitions since it contains some meta-data in proportion to some free-text. At a minimum, all microtext has a minute-level timestamp and a source attribution (author).

This definition is subject to change with respect to

precision. Through experimental validation, these

definitions can be made more concrete.

With regard to length, ‘very brief’ is not

specific. It is suggested that future studies could help

specifically quantify length either explicitly through

experimentation, or implicitly through deriving

where documents consisting of thousands of

characters have different attributes from documents

consisting of dozens of characters. Similarly, it is

difficult to ascertain a quantifiable metric for

grammar. The two most similar widely known

metrics are Flesch Reading Ease and the

Flesch–Kincaid Grade Level. Both are computed

from average sentence length and average syllables

per word, so both are muddled by abbreviations and

acronyms, and both are subject to extreme variance

and outliers when applied to documents of one

sentence or less (Flesch, 1948).
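The sensitivity of these readability formulas to very short texts can be seen in a small sketch. The coefficients below are the standard published ones for the two scores; the syllable counter is a naive vowel-group heuristic and the function names are assumptions made only for this illustration.

```python
import re

def count_syllables(word):
    """Naive estimate (assumption): one syllable per run of vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def readability(text):
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# A two-word message made of abbreviations (invented example).
print(readability("omg lol"))
```

For this two-word message the sketch yields a Reading Ease above 100 and a negative grade level, both outside the ranges the scores were designed for, which is the outlier behaviour described above.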

These three characteristics were selected specifically

because of their importance to the existing NLP

algorithms. Brevity affects the performance of many

NLP measures such as Term Frequency (TF). It is

certainly the most unique aspect of microtext, and is

reflected in the selection of the term itself. The

informal language creates the most difficulty for

NLP. The semi-structured nature of microtext is a

definite advantage to be leveraged in processing, and

is fairly unique. Generally, longer texts such as

websites, newswire articles, etc are not specifically

attributable to a single author or a single time.

Microtext guarantees both. Even if an article has a

single author or a timestamp, that generally covers

hundreds or thousands of words, so the granularity

or pedigree of individual thoughts or statements is

not nearly as fine-grained or accurate as that of

microtext.

Finally, the specific selection of the term

‘Microtext’ seems appropriate. The text is not only

short, but often abbreviated. Most importantly, the

term appears to be free for use. Other than as a name

for very small physical printed text, the only other

academic use of the term was decades ago (Bullen,

1972), describing a finite state search machine. The

way seems clear for microtext to be adopted without

conflict.

2.2 Microtext Characterization

Encoding thoughts into an electronic format

continues to get easier. At first, capture and

encoding was reserved for higher priority items,

such as books, contracts, etc. As the internet

expanded in parallel with computers becoming more

prevalent and less expensive, the barrier was

lowered to include essays, newswire articles, etc.

The barrier continues to be lowered, in at least three

dimensions: cost required to encode, accessibility to

encoded work, and knowledge required to operate

encoding technology. So text representations of

thought and speech are becoming more prevalent

daily, and as the cost goes down, so does the return

on investment, and the messages and thoughts

encoded tend to become more brief and less formal.

The analog equivalent of microtext has always

been a part of our modern, communal society, in the

form of conversations, journal entries, etc. However,

when these expressions occur in spoken dialog,

telephone conversations, and paper notebooks, they

cannot be as easily captured, archived, sorted, or

discussed. There are many

examples of this digitization in the hands of the

general public, including wikis, micro-blogs, SMS

messages, and even voicemail transcriptions

(through ad-supported free consumer technologies

such as Google Voice or Jott).

Although both are encoded as characters,

microtext varies in structure and content from long

form text, as discussed in the introduction. It is not

necessarily the case that microtext is ‘noisier’ than

regular text. From a semantic perspective, the

‘signal’ in microtext is very strong; the difficulty

comes from a lack of context, not from too much ‘noise’.

For example, O’Connor et al. (2010) collected a

corpus of Twitter messages in 2009, and found that

the average message length was eleven words, and

that words rarely occur more than once in a

message. Therefore, some of the standard NLP

metrics such as Term Frequency (TF) and Document

Frequency (DF) will need to be reworked.
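The effect of brevity on these metrics can be illustrated with a minimal sketch. The messages below are invented, the tokenizer is plain whitespace splitting, and the function names are assumptions made only for this example.

```python
from collections import Counter

def term_frequencies(message):
    """Raw term frequency: how often each lowercased token occurs in one message."""
    return Counter(message.lower().split())

def document_frequencies(messages):
    """Document frequency: in how many messages each token occurs at least once."""
    df = Counter()
    for message in messages:
        df.update(set(message.lower().split()))
    return df

# Invented short messages standing in for a microtext corpus.
messages = [
    "running late for the meeting",
    "meeting moved to noon",
    "lunch at noon",
]

tf = term_frequencies(messages[0])
df = document_frequencies(messages)
print(max(tf.values()))            # highest count of any token in the first message
print(df["meeting"], df["noon"])   # number of messages containing each token
```

In this toy corpus no token repeats within a message, so every term frequency is 1 and TF carries no ranking information beyond presence or absence. Any weighting therefore has to come from the document frequency side or from the metadata.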

Although examples are mentioned in several places

throughout this paper, it is useful to consider a

list of available public and commercial technologies,

services, and standards that would be considered

microtext.

• SMS (aka Text Messages)

• Instant Messaging (point to point messages such as XMPP/Google Talk/Jabber, OSCAR/AIM/ICQ, Microsoft Messenger)

• Multi-User Chatrooms (aka MUCs, including IRC chatrooms, and communication within MMORPG and other online communities such as Second Life or World of Warcraft)

• Voicemail Transcriptions (Enterprise or government level, as well as consumer level technologies such as Google Voice or Jott)

• Microblogs (Twitter, Google Buzz, Identi.ca, FriendFeed, and other closed sources such as in-house or enterprise level microblogs such as the United States Department of Defense’s ‘Chirp’ service, or private services such as Facebook)

Some other sources may or may not fit the

definition of microtext, generally depending on the

length of the author’s individual statements. These

include email, wikis, ‘regular’ weblogs, website

‘forums’, UseNet, and RSS feeds.

For illustration, here is a non-comprehensive list

of some sample types of metadata (and specific

values of those types in parentheses):

• Source Attribution (Author, Screen Name, Originating Phone Number or Email Address)

• Timestamp (Almost always with minute-level accuracy)

• Audience (Public, Room or Chat channel for IRC or MUCs, one or more specific recipients of the Source Attribution type)

• URL References (both as a reply/threading mechanism and as a pointer to a longer reference)

• Geo-location information (Either specifically GPS coordinates, or through location tags)

• Other meta-data tags (Self selected topic tags i.e. #hashtags, Author’s Mood, weather, etc. These include both author created and automatically generated)

Note that each of these types can be satisfied with its

own rules including ‘zero or more’. For example, in

a Twitter reply (characterized by starting with

‘@username’), the Audience is both public and a

specific recipient.
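The metadata types listed above can be sketched as a small record type. The class name, field names, and parsing rules here are hypothetical and chosen only for illustration; they do not correspond to any particular service's API.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class MicrotextMessage:
    """One microtext contribution plus its metadata (hypothetical schema)."""
    text: str
    author: str                      # source attribution
    timestamp: datetime              # kept at minute-level precision
    audience: List[str] = field(default_factory=list)   # zero or more
    urls: List[str] = field(default_factory=list)       # zero or more
    geo: Optional[Tuple[float, float]] = None            # (lat, lon) if known
    tags: List[str] = field(default_factory=list)       # e.g. hashtags

def from_tweet(author, timestamp, text):
    """Derive metadata from tweet-like text using simple token rules (assumption)."""
    tokens = text.split()
    audience = ["public"]
    # A reply starts with '@username': the audience is public AND that recipient.
    if tokens and tokens[0].startswith("@") and len(tokens[0]) > 1:
        audience.append(tokens[0][1:])
    tags = [t[1:].rstrip(".,!?") for t in tokens if t.startswith("#") and len(t) > 1]
    urls = [t for t in tokens if t.startswith(("http://", "https://"))]
    minute = timestamp.replace(second=0, microsecond=0)
    return MicrotextMessage(text, author, minute, audience, urls, None, tags)

msg = from_tweet("alice", datetime(2010, 2, 28, 15, 4, 37),
                 "@bob great game! #Olympics")
print(msg.audience, msg.tags, msg.timestamp)
```

Modelling audience, URLs, and tags as lists reflects the ‘zero or more’ rule, and the reply case shows one message carrying both a public and a specific audience.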

3 SURVEY OF CURRENT

MICROTEXT RESEARCH

By far the primary difficulty in conducting a survey

of recent research in microtext is that since there is

no common vocabulary or terminology, locating all

of the research is non-trivial.

One metric is the number of references to the word

‘Twitter’ in peer reviewed publications. In the last

three full years (2007 through 2009), the numbers of

Association for Computing Machinery (ACM)

journal articles containing the word ‘Twitter’ were

11, 84, and 263. Through the first half of 2010, 284

have been published, so it would be reasonable to

expect at least 568 articles to be published in 2010.

In the same time period, the numbers of Association

for Computational Linguistics (ACL) articles were

0, 0, and 4, with 34 in the first half of 2010 (68

projected). The number of INSTICC

papers mentioning Twitter available in the

SciTePress digital portal is 2, both in 2010. Since

Twitter is one specific commercial product,

extrapolation can be dangerous, but it is interesting

to note the rapid and dramatic uptake within the

academic community. Interest is obviously high, and

growing fast.
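The 2010 projections quoted above are a simple doubling of the half-year counts. A minimal sketch of that arithmetic follows; the function and variable names are illustrative, and the doubling assumes a constant publication rate across the year, which probably understates a growing trend.

```python
def project_full_year(first_half_count):
    """Project a full-year total by doubling the first-half count (constant-rate assumption)."""
    return 2 * first_half_count

# First-half-of-2010 counts as reported in the text.
half_year_2010 = {"ACM": 284, "ACL": 34}
projected = {venue: project_full_year(n) for venue, n in half_year_2010.items()}
print(projected)

# Year-over-year growth factors for the ACM counts reported for 2007 through 2009.
acm_counts = [11, 84, 263]
growth = [later / earlier for earlier, later in zip(acm_counts, acm_counts[1:])]
print(growth)
```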

Although ‘Twitter’ is mentioned in many papers,

the intent of the research is widely varied. The

hundreds of papers already published are beyond

cataloguing in this survey paper; however, a

sampling is presented in the following paragraphs.

Note that in almost every paper, it is a single

application that is being considered, rather than

attempting to define or derive a higher purpose or

truth of the medium. This is an industry- or

engineering-centric approach, rather than a scientific

approach.

Figure 1: Twitter’s growth in the academic community.

3.1 Topic Identification/Individual

Summarization

There is some preliminary work (Adams, 2008) in

topic detection within chat. Specifically looking at

IRC chatrooms, these researchers illustrate some of

the types of techniques that can be uniquely

leveraged by microtext research, such as augmenting

a typical TF-IDF based approach with temporal

information.
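One way that timestamps might augment a TF-IDF comparison of chat posts is sketched below. This is not the method of the cited work; the exponential time decay, the `half_life` parameter, and all function names are assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter

def tfidf_vectors(posts):
    """Build one sparse TF-IDF vector (a dict) per post; posts are plain strings."""
    tokenized = [p.lower().split() for p in posts]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def temporal_similarity(u, v, seconds_apart, half_life=120.0):
    """Cosine similarity damped by elapsed time; half_life (seconds) is arbitrary."""
    decay = 0.5 ** (abs(seconds_apart) / half_life)
    return cosine(u, v) * decay

# Invented chat posts.
posts = ["server is down", "server restarted now", "lunch anyone"]
vectors = tfidf_vectors(posts)
print(temporal_similarity(vectors[0], vectors[1], seconds_apart=30))
```

Under this sketch, two posts sharing vocabulary are treated as more likely to belong to the same topic when they are close together in time, which is information a long-form document does not offer.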

Ranganath, Jurafsky, and McFarland (2009)

were able to achieve 71.5% accuracy on a system

designed to detect a speaker’s intent to flirt using a

spoken corpus of speed-dates. They also considered

audible (prosodic) features, and their corpus was

transcribed by humans and heavily annotated with

extra information such as number of laughs, number

of filled pauses, etc. The applicable part of this

research where microtext is concerned is that their

transcription/representation was very accurate as to

the speech that actually occurred, including

interruptions, pauses, laughter, and backchannel

utterances (examples include ‘Uh-huh’, ‘Yeah’, ‘Wow’,

‘Excuse Me’, ‘Um’, and ‘Uh’). These types of attributes are

not part of the formal written grammar that more

traditional NLP approaches consider. Given that the

results of this system were more accurate than

human annotators, it is very notable and exciting to

think that the informal grammar characteristic of

microtext may be leveraged as an advantage over

traditional free text.

Ritter, Cherry, and Dolan (2010) focus on

modelling conversations using an unsupervised

learning algorithm. In their collection of 1.3 million

tweets, they note that Twitter postings tend to be

“highly ungrammatical, and filled with spelling

errors”. They also note that 69% of the

conversations in their data had a length of two. They

find that a modification of Latent Dirichlet

Allocation overcomes the noisiness and brevity of

their tweets, which cause difficulty for named entity

recognizers and noun-phrase chunkers. Although it

was not the focus of their paper, it is precisely these

types of discoveries that need to be re-used by the

community.

3.2 Clustering/Mass Summarization

Examining the larger zeitgeist of the microtext

sources, by considering them in aggregate, is an area

that many commercial entities are pursuing. Their

efforts are proprietary and closed, so

there is no insight into their methods. One of the

more academic approaches is TweetMotif

(O’Connor, 2010). TweetMotif’s website provides

an elegant one sentence summary of the algorithm

that “takes any word or phrase, finds tweets where

people are talking about it, then groups them by

statistically unlikely phrases that co-occur”. This is a

relatively standard NLP approach, and it would be

interesting to compare results on microtext versus

longer text. Also, TweetMotif takes the important

step of recognizing that many tweets are exact

duplicates, or essentially the same, and specifically

“groups messages whose sets of trigrams have a

pairwise Jaccard similarity exceeding 65%.”
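The near-duplicate test quoted above can be sketched as follows. The use of word (rather than character) trigrams, the whitespace tokenizer, and the greedy single-pass grouping are assumptions of this example and may differ from the TweetMotif implementation; only the Jaccard measure and the 65% threshold come from the text.

```python
def trigrams(message):
    """Set of word trigrams in a message (word-level granularity is an assumption)."""
    tokens = message.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_near_duplicates(messages, threshold=0.65):
    """Greedy grouping: a message joins the first group whose first member is similar enough."""
    groups = []  # list of (trigram set of first member, list of messages)
    for message in messages:
        grams = trigrams(message)
        for representative, members in groups:
            if jaccard(grams, representative) > threshold:
                members.append(message)
                break
        else:
            groups.append((grams, [message]))
    return [members for _, members in groups]

# Invented messages: the second is a retweet-style copy of the first.
tweets = [
    "breaking news big storm hits the east coast tonight",
    "RT breaking news big storm hits the east coast tonight",
    "anyone want to grab lunch",
]
print(group_near_duplicates(tweets))
```

In this example the first two messages share seven of eight distinct trigrams (similarity 0.875) and are grouped, while the third stands alone.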

One drawback of clustering approaches in

general, for application in more serious matters, is

that the current implementations do not seem

comprehensive (nor do they claim to be). Their

application would therefore seem to be limited to

more ephemeral uses, rather than rigorous or

exhaustive ones. However, they do serve a useful

purpose.

Another approach is taken by TWinner (Abrol,

2010) to attempt to cluster tweets by physical

location, and then utilize this information to

“improve the quality of web search and predicting

whether the user is looking for news or not.” Twitter

is proposing an automated GPS tagging capability,

and other microblogging services such as Google

Buzz already support automatic or user specified

location information, which will only improve the

accuracy of algorithms such as TWinner. The

TWinner paper also defines a ‘Frequency-Population

Ratio’, which is a ratio of the number of tweets per

geographic location, normalized with respect to

population density.

The ‘Phrase Reinforcement Algorithm’,

developed by Sharifi et al. (2010), utilizes a

different strategy. It provides a machine-generated

summarization of ‘trending topics’ by examining a

quantity of very similar updates, and normalizing

them to produce a best possible summarization. This

does not leverage the microtext aspect, but is

leveraging the massive amount of human thought

put into generating the content.

3.3 Classification

Phan (2008) proposes a “general framework for

building classifiers that deal with short and sparse

text & Web segments by making the most of hidden

topics”. The approach leverages a ‘universal dataset’

to augment the short and sparse text collected. This

is a promising approach, and could be extended

easily to include ontologies or language concepts

and representation in the ‘universal dataset’. The

bottleneck of this approach is essentially the same as

that facing the rest of the natural language

community: the ability of the machine to understand

human-generated text.

Dela Rosa and Ellen (2009) have completed a

series of experiments on classification of military

chat posts. A number of different machine learning

algorithms were evaluated, including SVMs, k-

Nearest Neighbour, Rocchio, and Naive Bayes.

Various feature selection methodologies were also

considered, and Mutual Information (MI) and

Information Gain (IG) were found to perform

relatively poorly. k-NN and SVMs were found to be

the most suitable in a binary and four-way

classification task.

3.4 Sentiment Analysis

Go, Bhayani, and Huang (2010) perform sentiment

analysis of Twitter messages. They are able to leverage

emoticons as noisy labels, a technique first presented

by Read (2005). Difficulties with less formal

grammar constructs are also encountered. They

attempted to use clustering to assist with the

analysis, and found that it unexpectedly hurt results.
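The idea of emoticons as noisy labels can be sketched in a few lines. The emoticon lists, the rule for discarding ambiguous messages, and the function name are assumptions for illustration and are not taken from the cited papers.

```python
# Illustrative emoticon lists (assumption); real systems use longer lists.
POSITIVE = (":)", ":-)", ":D", "=)")
NEGATIVE = (":(", ":-(")

def noisy_label(message):
    """Return (label, text without emoticons), or None if the message is unusable."""
    has_positive = any(e in message for e in POSITIVE)
    has_negative = any(e in message for e in NEGATIVE)
    if has_positive == has_negative:   # no emoticon at all, or conflicting ones
        return None
    text = message
    for emoticon in POSITIVE + NEGATIVE:
        text = text.replace(emoticon, "")
    label = "positive" if has_positive else "negative"
    return label, " ".join(text.split())

# Invented messages.
for m in ["loving the new phone :)", "missed my flight :-(", "meeting at noon"]:
    print(noisy_label(m))
```

Removing the emoticon from the text after labelling is a design choice made here so that a classifier trained on these pairs has to rely on the remaining words rather than on the label source itself.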

Wilson, Wiebe, and Hoffmann (2005) examine

contextual polarity (aka semantic orientation) of

phrases in great detail. Their work attempts to deal

with the paradox that in the English language,

“Positive words are used in phrases expressing

negative sentiments, or vice versa.” One focus of the

research is on feature selection, such as word

features (e.g. what part of speech), the presence of

nearby modifiers or negators, and other proximity

features (e.g. whether the word is preceded by an

adjective). The stated goal of this work is to provide

insight into phrase-level sentiment analysis. Some

microtext is not much more than a phrase in length,

so this type of research is definitely applicable. A

small question to be answered, however, is whether

or not the informal grammar would interfere with

the feature selection methods exploited in their

work.

3.5 Question/Answer

Cong et al. (2008) attempt to leverage existing

knowledge bases of questions and answers (i.e.

website forums) to provide answers for new

questions. While this is not specifically microtext

related, it is interesting because of the implications.

Social Search is a concept being explored by various

companies and pundits (Google, Laporte, 2009); the

idea is to weight search results more heavily toward

authors with whom the searcher has a personal

relationship, on the premise that those

recommendations or answers would be more

appropriate, or authoritative. The majority of those

‘social search’ sources would be considered

microtext, and therefore microtext extraction is

crucial to these technologies succeeding.

3.6 Information Extraction

Marom and Zukerman (2009) study a corpus of

paired question & response help desk emails with

the intention of automating the process. The bulk of

this research is focused on NLP tasks that are not

applicable to microtext, such as meta-learning and

semantic overlap. However, the study does

investigate sentence level granularity for the

purposes of generating hybrid or better tailored

answers through combination. One thing specifically

investigated is sentence cluster cohesion, a measure

of the similarity of sentences to each other. This

metric would be useful in microtext analysis because

some microtext sources have an arbitrary character

limit which forces the author to rapidly cycle

between topics. Classifying the entire microtext

‘document’ will vary greatly depending on whether

or not the individual sentences are cohesive.
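A simple stand-in for a cohesion measure over the sentences of a single message is sketched below. It uses mean pairwise Jaccard overlap of word sets, which is an assumption of this example and not the cohesion metric defined by Marom and Zukerman; the function names are likewise hypothetical.

```python
import re
from itertools import combinations

def sentence_word_sets(text):
    """Split text into sentences and return the non-empty word set of each."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    sets = [set(re.findall(r"[a-z0-9']+", s.lower())) for s in sentences]
    return [s for s in sets if s]

def cohesion(text):
    """Mean pairwise Jaccard overlap between sentences; 1.0 if fewer than two sentences."""
    sets = sentence_word_sets(text)
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Invented messages: one stays on topic, one switches topic mid-message.
print(cohesion("Server is down. Server is back."))
print(cohesion("Server is down. Lunch at noon?"))
```

A low score would suggest that a message packs several topics into its character limit and might be better classified sentence by sentence than as a whole.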

Gruhl et al. (2009) explore “statistical NLP

techniques to improve named entity annotation in

challenging Informal English domains”. They

achieve notably better results through application of

SVMs. This paper illustrates the types of insight that

can be gained through specific focus on microtext

characteristics first and experimental validation

second. The majority of the research (both

referenced in this survey, and otherwise reviewed

and not referenced) centers on an experiment.

3.7 Semi-structured Data Exploitation

One of the most underutilized aspects of microtext

in current research is the semi-structured nature of the

data. Kinsella, Passant, and Breslin (2010) examine

the occurrences of hyperlinks in online message

boards. They observe that not only is the use of

hyperlinks increasing, but the hyperlinks themselves

often reference “resources with associated structured

data”, and they discuss “the potential for using this

data for enhanced analysis of online conversation”.

Wang (2010) provides another example of

utilizing the structure of the data in his research into

identifying spammers on Twitter. He utilizes some

of the relationship information available from Twitter

accounts to construct graphs and examine some

typical directed graph features. Also, Wang makes

the interesting choice of ignoring the NLP aspect of

the tweets completely, instead treating authors’

contributions as strings of symbols and comparing

them using Levenshtein distance, ignoring grammar

and semantic content completely.
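Levenshtein distance, as used in this kind of symbol-level comparison, can be computed with the standard dynamic programming recurrence. The sketch below is a textbook implementation and is not drawn from Wang's paper; the function name is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                        # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Because the comparison operates on raw characters, it is unaffected by misspellings, abbreviations, or informal grammar, which is what makes it attractive for spotting templated messages.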

4 SPURIOUS MICROTEXT

RESEARCH RESULTS

Not all papers that reference microtext sources are

applicable to microtext characterization. For

example, there are many instances where the

microtext is utilized for some other purpose, such as

using SMS to interface with other systems like

FAQs (Kothari, 2009) or yellow pages (Kopparapu,

2007). Similarly, not all papers mentioning a

microtext source are concerned with analyzing the

content in any fashion. Mowbray (2010) presents a

paper on identifying spam in Twitter, similar to the

aforementioned paper by Wang, but unlike Wang

focuses on automated use and abuse of the Twitter

API and functionality, and on other non-NLP, non-AI

techniques.

There are also many interesting sociological

applications and research to be performed on this

type of data (which as stated earlier, used to be

private, non-digital, or more expensive). There are

dozens of papers on how to leverage these new

sources of digital information, such as the influence

of Twitter (Cha, 2010; Lee, 2010), or using Twitter to

predict elections (Tumasjan, 2010), the stock

market, movie results, or the flu (Ritterman,

2009). While interesting and valuable in their own

right, these papers do not provide insight into the

mechanics of microtext, or leverage the

characteristics that define microtext. While these

works have an NLP aspect, it is really the publicness

and ubiquity of the mechanisms that are being

exploited, not the microtext.

Another example of this type of clever

exploitation is Davidov, Tsur, and Rappoport (2010)

who leverage emoticons in conjunction with user

generated tags for sentiment analysis. Emoticons are

by no means required by or limited to microtext

sources, but they tend to appear more frequently.

They examine the phenomenon that sometimes the

user generated tags are overloaded and part of the

grammatical/semantic content, such as “I always

enjoy the #Olympics” and other times simply serve

as metadata, for example “I can’t believe the USA

just won the gold in hockey! #Olympics”. Twitter is

leveraged as a large repository of sentiment, and

“the obtained feature vectors are not heavily

Twitter-specific”. This is more of an exploration of the

English language and the tagging and emoticon

phenomenon than anything specifically about

microtext, although the emoticon/sentiment analysis

feature vectors could be leveraged as would any

other ontology.

5 CONCLUSIONS

There are a growing number of papers being

published on NLP and AI techniques as applied to

brief, poorly formatted, semi-structured text. As

presented in this survey, there are a number of

interesting papers being published in the area. Much

of the current work is more engineering than

scientific in focus; it seeks to provide anecdotal or

experimental evidence about a single use case. So

while not using a common terminology, these papers

are providing the rough foundation for research on

microtext.

There is some past NLP work on sentence and

phrase level types of analysis that is partially

relevant. Although the brevity condition is met,

much of this work relies on correct grammar and

sentence structure, and to a lesser extent on a larger

corpus. So, not all previous NLP work on concise

expressions will translate to microtext.

It is the thesis of this paper that some discussion

and meta-experimentation on the field itself would

lead to greater insights, with a higher level of reuse.

A first step in that direction is defining terminology,

‘Microtext’, so that researchers can have a common

ground for future discussion.

Some of the scattered research surveyed in this

paper has provided interesting insights as to the type

of conclusions and methodologies that would be

discovered and catalogued with a more focused

effort. Some of these include: leveraging an outside

body of knowledge, leveraging non-traditional

language features such as laughs and “uh/ums”, and

treating individual results as less important and

focusing more on less granular trends. Overall, trend

analysis and identification has the most research,

and Information Extraction from microtext is

particularly lacking.

In two different papers, SVMs were a successful

strategy in dealing with informal grammars.

The next step is investigating and more

rigorously quantifying the three attributes in the

microtext definition. This would certainly provide

reusable insights and help catalogue best performing

techniques and unique quirks and advantages of

microtext processing versus text processing. The

goal of this paper is to create an understanding of

microtext within the AI and NLP communities.

ACKNOWLEDGEMENTS

Thanks to the Office of Naval Research and the

Space and Naval Warfare Systems Center Pacific for

their financial support, and Dr. LorRaine Duffy for

inspiration and motivation. This paper is the work of

U.S. Government employees performed in the

course of employment and no copyright subsists

therein.

REFERENCES

Abrol, S. and Khan, L. 2010. TWinner: understanding

news queries with geo-content using Twitter. In

Proceedings of the 6th Workshop on Geographic

information Retrieval (Zurich, Switzerland, February

18 - 19, 2010). GIR '10. ACM, New York, NY, 1-8

Adams, P., and Martell, C., 2008. Topic Detection and

Extraction in Chat. In International Conference on

Semantic Computing, IEEE.

Bullen, R.H. Jr., and Millen, J. K., 1972. Microtext: the

design of a microprogrammed finite state search

machine for full-text retrieval. In Proceedings of the

AFIPS Joint Computer Conferences. ACM.

Cha, M., Haddadi, H., Benevenuto, F., and Gummadi, K.

P. 2010. Measuring user influence in twitter: the

million follower fallacy. In Proceedings of the 4th

International Conference on Weblogs and Social

Media, AAAI, Washington, D.C., 2010.

Chi, E. 2009 "Information Seeking Can Be Social,"

Computer, pp. 42-46, March, 2009. IEEE

Cong, G., et al. (2008). Finding question-answer pairs

from online forums. In SIGIR '08: Proceedings of the

31st annual international ACM SIGIR conference on

Research and development in information retrieval,

pp. 467-474, New York, NY, USA. ACM.

Dalli, A., Xia, Y., and Wilks, Y., 2004. FASIL email

summarisation system. In Proceedings of the 20th

international conference on Computational Linguistics

(COLING '04). ACL, Morristown, NJ, USA, Article

994.

Davidov, D., Tsur, O., Rappoport, A. 2010. Enhanced

Sentiment Learning Using Twitter Hashtags and

Smileys, In Proceedings of the 23rd international

conference on Computational Linguistics (COLING),

2010.

Flesch, R. (1948); A new readability yardstick, Journal of

Applied Psychology, Vol. 32, pp. 221–233.

Go, A., Bhayani, R., and Huang, L. 2010. Exploiting the

Unique Characteristics of Tweets for Sentiment

Analysis. CS224N Project Report, Stanford.

Gruhl, D., Nagarajan, M., Pieper, J., Robson, C., and

Sheth, A. 2009. Context and Domain Knowledge

Enhanced Entity Spotting in Informal Text.

In Proceedings of the 8th international Semantic Web

Conference. 260-276.

Kinsella, S., Passant, A., Breslin, J. 2010. Ten Years of

Hyperlinks in Online Conversations. In Proceedings of

the Web Science Conference 2010. WWW2010.

Lee, C., Kwak, H., Park, H., Moon, S., 2010. Finding

influentials based on the temporal order of information

adoption in twitter. In Proceedings of the 19th

international conference on World wide web (WWW

'10). ACM, New York, NY, USA, 1137-1138.

Laporte, Leo. 2009. [Internet Radio Broadcast] This Week

in Google 13. October 24, 2009.

Kopparapu, S. K., Srivastava, A., and Pande, A. 2007.

SMS based natural language interface to yellow pages

directory. In Proceedings of the 4th international

Conference on Mobile Technology, Applications, and

Systems and the 1st international Symposium on

Computer Human Interaction in Mobile Technology

ACM. Mobility '07. ACM, New York, NY, 558-563.

Kothari, G., Negi, S., Faruquie, T. A., Chakaravarthy, V.

T., and Subramaniam, L. V. 2009. SMS based

interface for FAQ retrieval. In Proceedings of the Joint

Conference of the 47th Annual Meeting of the ACL

and the 4th international Joint Conference on Natural

Language Processing of the AFNLP. Association for

Computational Linguistics. Morristown, NJ, 852-860.

Marom, Y. and Zukerman, I. 2009. An empirical study of

corpus-based response automation methods for an

e-mail-based help-desk domain. Computational

Linguistics 35, 4 (Dec. 2009), 597-635.

Mowbray, M. 2010. The Twittering Machine. In

Proceedings of the 6th International Conference on

Web Information Systems and Technologies (WEBIST

2010). INSTICC. 299-304.

O’Connor, B., Krieger, M., and Ahn, D. 2010.

TweetMotif: Exploratory Search and Topic

Summarization for Twitter. In Proceedings of the

International AAAI Conference on Weblogs and Social

Media. Washington, DC, May 2010

Phan, X.-H., et al. (2008). Learning to classify short and

sparse text & web with hidden topics from large-scale

data collections. In WWW '08: Proceeding of the 17th

international conference on World Wide Web, pp. 91-

100, New York, NY, USA. ACM.

Ranganath, R., Jurafsky, D., and McFarland, D. 2009. It's

not you, it's me: detecting flirting and its

misperception in speed-dates. In Proceedings of the

2009 Conference on Empirical Methods in Natural

Language Processing: Volume 1. ACL.

Read, J. 2005. Using emoticons to reduce dependency in

machine learning techniques for sentiment

classification. In Proceedings of the ACL Student

Research Workshop ACL.

Ritter, A., Cherry, C., and Dolan, B. 2010. Unsupervised

Modeling of Twitter Conversations. In Human

Language Technologies: The 2010 Annual Conference

of the North American Chapter of the Association for

Computational Linguistics. ACL, Los Angeles, CA,

172-180.

Ritterman, J., Osborne, M., and Klein, E. 2009. Using

prediction markets and Twitter to predict a swine flu

pandemic. In 1st International Workshop on Mining

Social Media - 13th Conference of the Spanish

Association for Artificial Intelligence, 2009. AEPIA

(Asociación Española de Inteligencia Artificial).

Rosa, K. D. and Ellen, J. 2009. Text Classification

Methodologies Applied to Micro-Text in Military

Chat. In Proceedings of the 2009 international

Conference on Machine Learning and Applications

(December 13 - 15, 2009). ICMLA. IEEE Computer

Society, Washington, DC, 710-71.

Sharifi, B., et al. (2010). Summarizing Microblogs

Automatically. In Human Language Technologies:

The 2010 Annual Conference of the North American

Chapter of the Association for Computational

Linguistics, pp. 685-688, Los Angeles, CA. ACL.

Tumasjan, A., et al. 2010. Predicting elections with

Twitter: What 140

characters reveal about political sentiment. In

International AAAI Conference on Weblogs and Social

Media, AAAI, Washington, D.C., 2010.

Wang, A. H. 2010. Don’t follow me - Spam Detection in

Twitter. In Proceedings of the International

Conference on Security and Cryptography (SECRYPT

2010). INSTICC. 142-151.

Wilson, T., Wiebe, J., and Hoffmann, P. 2005.

Recognizing contextual polarity in phrase-level

sentiment analysis. In Proceedings of the Conference

on Human Language Technology and Empirical

Methods in Natural Language Processing. ACL,

Morristown, NJ, 347-354.
