
Step 2: In this step, the processed description was subjected to part-of-speech tagging and lemmatization. The Stanford part-of-speech tagger (Toutanova et al. 2003) is used to attach a part-of-speech tag to each token (i.e. word) in the app description. More precisely, the app description is parsed into sentences, which are then processed by the part-of-speech tagger. When supplied with a sentence, the tagger produces an ordered list of part-of-speech tags, one for each word in the sentence (noun, verb, adjective, etc.). For example, the description of the app called “Beer Calculator” contains the following sentence: “By now we all know that alcohol is bad for you, yet most of will still go out to have a beer”. When this sentence was subjected to the part-of-speech tagger, the word ‘By’ was tagged as a preposition, ‘now’ as an adverb, ‘we’ as a personal pronoun, ‘all’ as a determiner, and so on. The overall tagging result is: By/IN now/RB we/PRP all/DT know/VBP that/IN alcohol/NN is/VBZ bad/JJ for/IN you/PRP ,/, yet/RB most/JJS of/IN will/MD still/RB go/VB out/RP to/TO have/VB a/DT beer/NN, where IN, RB, PRP, DT, VBP, NN, VBZ, JJ, JJS, MD, VB, RP and TO stand for preposition, adverb, personal pronoun, determiner, verb (non-third-person singular present), noun, verb (third-person singular present), adjective, superlative adjective, modal, verb (base form), particle and ‘to’, respectively. Once the descriptions were tagged, only the verbs, adverbs and nouns were extracted as the initial features. The extracted features were then lemmatized in order to obtain the root form (e.g. “running” is lemmatized to “run”) of each extracted token.
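As an illustration of this step, the following is a minimal sketch in Python, using NLTK as a stand-in for the Stanford tagger (both emit Penn Treebank tags); the function name extract_initial_features is illustrative, not part of the original pipeline.

import nltk
from nltk.stem import WordNetLemmatizer

# Requires the NLTK 'punkt', 'averaged_perceptron_tagger' and 'wordnet' data packages.
lemmatizer = WordNetLemmatizer()

def extract_initial_features(description):
    # Split the description into sentences, tag each sentence,
    # keep only verbs, adverbs and nouns, and lemmatize them.
    features = []
    for sentence in nltk.sent_tokenize(description):
        tokens = nltk.word_tokenize(sentence)
        for token, tag in nltk.pos_tag(tokens):      # Penn Treebank tags
            if tag.startswith(('VB', 'RB', 'NN')):   # verbs, adverbs, nouns
                wordnet_pos = {'V': 'v', 'R': 'r', 'N': 'n'}[tag[0]]
                features.append(lemmatizer.lemmatize(token.lower(), pos=wordnet_pos))
    return features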
Step 3: Once the initial set of features had been extracted by the above procedure, in step 3 it was checked against the master feature set. The master feature set is a bag of words containing words related to the app domain. The initial master feature set was created by lexicographers based on a bag of words (i.e. a dictionary) related to the app domain.
To build the master feature list, a corpus was created for each category by taking a sample of 100 apps per category; the top 100 high-frequency and high-idf (i.e. rare) tokens for each category were then identified and added to the initial master feature list. If an extracted top word appears in the master feature set, it is considered one of the features for the given app. Thus, for each app selected for training, the features were extracted and kept in a file in the following format:
“<feature1> <feature2> <feature3> … <feature_n>”.
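A minimal sketch of this step is given below, under the assumptions stated above (100 apps sampled per category; top 100 tokens by frequency and by idf); the names category_corpora, build_master_feature_set and app_features are hypothetical.

import math
from collections import Counter

def build_master_feature_set(category_corpora, top_k=100):
    # category_corpora: {category: list of apps, each app a list of tokens}
    master = set()
    for category, apps in category_corpora.items():
        tf = Counter(tok for app in apps for tok in app)       # term frequency
        df = Counter(tok for app in apps for tok in set(app))  # document frequency
        idf = {tok: math.log(len(apps) / df[tok]) for tok in df}
        master.update(tok for tok, _ in tf.most_common(top_k))        # high-frequency tokens
        master.update(sorted(idf, key=idf.get, reverse=True)[:top_k]) # high-idf (rare) tokens
    return master

def app_features(tokens, master):
    # Keep only the extracted tokens that appear in the master feature set.
    return [tok for tok in tokens if tok in master]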
Now that the features have been extracted, the next step is to build the classification model.
b. Building classification model 
Multinomial naïve Bayes, TF-IDF and support vector machines are used as the initial classification approaches for classifying the apps into the possible IAB Tier-2 categories. A brief introduction to these methodologies is given below.
Naïve Bayes:  
Since the training input is the pre-processed app description, a token-based naive Bayes classifier is used to compute the joint probability of the tokens in an app description and a category, by factoring the joint into the marginal probability of the category times the conditional probability of the tokens given the category, defined as follows:
$p(\text{tokens}, \text{cat}) = p(\text{tokens} \mid \text{cat}) \cdot p(\text{cat})$
Conditional probabilities of a category given tokens 
are derived by applying Bayes's rule to invert the 
probability calculation: 
$p(\text{cat} \mid \text{tokens}) = p(\text{tokens}, \text{cat}) / p(\text{tokens}) = p(\text{tokens} \mid \text{cat}) \cdot p(\text{cat}) / p(\text{tokens})$
Since naïve Bayes assumes that the tokens are independent of each other given the category (this is the "naive" step):
$p(\text{tokens} \mid \text{cat}) = p(\text{token}_0 \mid \text{cat}) \cdot \ldots \cdot p(\text{token}_{n-1} \mid \text{cat}) = \prod_{i=0}^{n-1} p(\text{token}_i \mid \text{cat})$
Then, using marginalization, the marginal distribution of the tokens is computed as follows:
$p(\text{tokens}) = \sum_{\text{cat}'} p(\text{tokens}, \text{cat}') = \sum_{\text{cat}'} p(\text{tokens} \mid \text{cat}') \cdot p(\text{cat}')$
In addition, maximum a posteriori (MAP) estimates are calculated for the multinomial distribution $p(\text{cat})$ over the set of categories and, for each category, for the multinomial distribution $p(\text{token} \mid \text{cat})$ over the set of tokens.
Further, the Dirichlet conjugate prior for multinomials is employed, which is straightforward to compute by adding a fixed "prior count" to each count in the training data; this lends the technique its traditional name, "additive smoothing".
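Concretely, with a prior count $\alpha$ and a vocabulary of $V$ distinct tokens, the smoothed estimate takes the standard additive-smoothing form (this formula is implied by the text rather than stated in it):
$\hat{p}(\text{token} \mid \text{cat}) = \dfrac{\text{count}(\text{token}, \text{cat}) + \alpha}{\sum_{\text{token}'} \text{count}(\text{token}', \text{cat}) + \alpha V}$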
After building the naïve Bayes classifier, the extracted features with their respective categories are passed as input to build the classification model.
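The following is a minimal sketch of this step using scikit-learn's MultinomialNB, a swapped-in implementation rather than the one used here; the training documents and IAB Tier-2 labels shown are hypothetical, and the alpha parameter plays the role of the additive prior count.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training data: each string is the space-separated
# feature file produced in step 3, paired with an IAB Tier-2 label.
train_docs = ["beer alcohol drink calculate", "run distance pace track"]
train_labels = ["Food & Drink", "Health & Fitness"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_docs)   # token-count matrix
model = MultinomialNB(alpha=1.0)           # alpha is the additive prior count
model.fit(X, train_labels)

new_app = vectorizer.transform(["beer calculator"])
print(model.predict(new_app))              # predicted IAB Tier-2 category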
TF-IDF: 
This classifier is based on the relevance feedback algorithm originally proposed by Rocchio (Rocchio 1971) for the vector space retrieval model (Salton & McGill 1986). In TF-IDF we considered the app