Automated Analysis of Job Requirements for Computer Scientists in
Online Job Advertisements
Joscha Grüger¹,² (https://orcid.org/0000-0001-7538-1248) and Georg J. Schneider¹ (https://orcid.org/0000-0001-7194-2394)
¹ Computer Science Department, Trier University of Applied Sciences, Main Campus, Trier, Germany
² University of Trier, Department of Business Information Systems II, 54286 Trier, Germany
Keywords:
Data Analysis, Web Mining, Natural Language Processing, Information Retrieval, Machine Learning, Job Ads,
Skills.
Abstract:
The paper presents a concept and a system for the automatic identification of skills in German-language job
advertisements. The identification process is divided into Data Acquisition, Language Detection, Section
Classification and Skill Recognition. Online job exchanges served as the data source. For identification of the
part of a job advertisement containing the requirements, different machine-learning approaches were compared.
Skills were extracted based on a POS-template. For classification of the found skills into predefined skill classes,
different similarity measures were compared. With the pre-trained LinearSVC model, the part of a job advertisement containing
the requirements was identified correctly for 100% of the tested job advertisements. Extracting skills is difficult because
skills can be written in many different ways in German, especially since the language allows the ad-hoc creation of compounds.
For the extraction of skills, POS templates were used; this approach worked for 87.33% of the skills. The combination of a
fasttext model and the Levenshtein distance achieved a correct assignment of skills to skill classes for 75.33% of the
recognized skills. The results show that extracting required skills from German-language job ads is complex.
1 INTRODUCTION
The labor market for IT specialists is very complicated
due to the constantly evolving IT environment. New
programming languages, frameworks and development
concepts appear every year. Some of these technologies
and concepts are trendy only for a certain period, while
others become established and play an important role
for a long time. For a company it is
vital to find IT specialists that are competent with the
strategic technologies used in their enterprise. In a
study by Bitkom in 2015, seven out of ten ICT
companies (70 percent) stated that there was a
shortage of IT specialists; in 2014, only 60 percent of
those surveyed had reported a shortage (Bitkom, 2016).
Of the IT vacancies forecast for the top 1,000 companies
in Germany in 2015, 38.7 percent were considered
difficult to fill due to a lack of fitting qualifications,
and 6.1 percent were expected to remain vacant due
to a lack of suitable candidates (Weitzel et al., 2015).
Hence it is vital for a national economy to educate
students who have not only an excellent qualification
in the theoretical foundations of Computer Science but
also up-to-date skills in current technologies to meet this
urgent need. Likewise, being competent with technologies
in demand helps alumni to find a job easily.
Universities of Applied Sciences in Germany usually
target practice-oriented topics and traditionally have a
close link to the professional field. Therefore, lab courses,
student projects and applied research projects usually
reflect current trends in IT, complementing the theoretical
foundations with practical experience using modern IT
systems. Hence, students can easily find a job with the
competencies acquired during their education. Universities try to
solve the problem of staying up-to-date with their ed-
ucation by regularly reviewing their curricula. They
observe developments in the industry to offer either
specialized courses that complement the sound theo-
retical foundation of their study program or to change
programming languages, frameworks or systems for
the lab courses or projects. Students usually embrace
these changes and also gladly choose new course of-
ferings to get the best possible chances on the labor
market. For both students and universities, the ques-
tion is, what should be taught on the practical side and
what should students learn to best meet the require-
ments of the industry? One possibility would be to
manually review the job advertisements in different
newspapers and online platforms. However, this is a
tedious and time-consuming task. Our research hypothesis
is that this task can be automated in such a way that the
findings are nearly as good as those of a human reviewer.
German, however, is a challenge compared to English, for
which similar approaches already exist, as German has a
rich morphology, umlauts, four cases and far fewer corpora
for training.
This paper describes an approach for an automated
survey of current job requirements for computer sci-
entists on the German labor market. Obviously, this
survey has to be carried out regularly to identify
trends and technologies that are becoming established.
The suggested process to realize the automated
survey is based on technologies of Natural Language
Processing and Machine Learning. The results of the
analysis can be used, for example, to check the content
of university curricula or to identify new technological
trends at an early stage. In addition, the procedure can
be used to implement a skill-based job search. The
next section discusses the current state of research and
various papers regarding the extraction of skills from
job ads. Section 3 introduces our approach to job ad
analysis. Both the processing steps and the evaluation
of each step are described in detail. Finally, Section 4
summarizes the results and discusses future work.
2 FOUNDATIONS AND RELATED
WORK
Some studies conclude that education does not comply
with the requirements of employers in the IT sector.
Kwon Lee and Han (2008) for example concluded
for the US labor market that most universities attach
great importance to hardware and operating systems,
although the employers surveyed are rarely interested
in these skills. They see for example deficits in the
teaching of skills in the economic and social cate-
gory. Yongbeom et al. (2006) also speak of a skill
gap between employers’ requirements and universi-
ties’ IT curricula. Among other things, the lack of
project management, Enterprise-Resource-Planning
(ERP) and information security modules in the cur-
ricula is criticized. Scott et al. (2002) criticize poor
database knowledge and the lack of skills in CASE/-
modeling tools and Business Process Reengineering
(BPR) techniques among graduates. Students also
lacked skills in XML and iterative development.
There are a number of reasons for the skill gap:
one is the rapid technological change, another is the
discordance between the content of the curricula of
the universities and the required competencies of the
industry (Scott et al., 2002; Milton, 2000). In addition,
too long revision cycles of curricula relative to the
speed of technology change and a lack of knowledge
at universities about new and upcoming technology
are cited as reasons for the gaps (Lee et al., 2002).
In order to keep curricula up-to-date, universities
need to know which competencies are currently and
in the long term required by the industry. For iden-
tifying the skills required various approaches exist.
Prabhakar et al. (2005) researched online job adver-
tisements for computer scientists in the US in 2005
with regard to the changing demand for skills over the
year. For this purpose, they examined each job advertisement
to see whether or not it contained one of 59 keywords.
The approach of Gallagher et al. (2010) for identifying
requested skills in the IT sector is interview-based. Their
team interviewed 104 senior IT managers. The questions
were very general, e.g. it was asked whether programming
skills were required, not whether concrete programming
languages like Java were required. Sibarani
et al. (2017) developed an ontology-guided job market
demand analysis process. Their method is based on
the self-defined SARO ontology and a defined set of
skills. Using the predefined skills and ontology, they
perform a named-entity tagging. The identified skills
are linked by a co-word analysis. Litecky et al. (2010)
used web and text mining techniques for retrieving
and analysing a data set of 244,460 job ads. They
scraped the data from online job exchanges to extract
titles and requirements based on predefined keywords.
Wowczko (2015) took a different approach. They anal-
ysed descriptions of vacancies and reduced the words
used in the descriptions until only significant words
remained. Custom word lists, stemming, removing
stopwords, removing numbers, stripping whitespaces,
etc. were used to clean up the data.
The problem with all these approaches except
Wowczko (2015) and Gallagher et al. (2010) is that
they are based on fixed keyword lists. Thus, only abil-
ities contained in the lists are recognized. New tech-
nologies or skills described in any other way cannot be
recognized. Abbreviations and their full forms, such as
AD and Active Directory, are assigned to different classes
or remain unrecognized. Additionally, some processes were
performed manually and are hence rather time-consuming
and only carried out periodically. Wowczko (2015) also
finds false positives like strong, excellent and can.
Moreover, these approaches also search in areas of the job
advertisement where no requirements are described (e.g. in
the company description). Consequently, an automated
procedure that permanently monitors job advertisements
would simplify this task enormously.
This paper describes a concept and a resulting auto-
mated procedure to search for competencies in all
job advertisements. The approach also recognizes
unknown competencies and maps them to the same
class if they are semantically similar or equal. For
the extraction of the skills, only the area in which the
requirements for the applicant are formulated is used.
3 CONCEPT
The focus of the approach is on job advertisements
in German. The target group of the analyzed job ad-
vertisements are computer scientists. Online job ex-
changes such as monster.de and stepstone.de were used
as data sources. The process for identifying skills is
divided into four steps (see figure 1): Data Acquisition,
Language Detection, Section Classification and Skill
Recognition.
3.1 Data Acquisition
A web crawler based on the Scrapy framework was
developed for data retrieval. The web crawler searches
German online job exchanges for jobs in the IT sector.
The online job exchanges monster.de, stepstone.de and
stellenanzeigen.de were used as data sources. The
crawler extracts the HTML job ads found, as well as
metadata such as company name, job title, and work
location. The defined process works on the HTML
files of the job advertisements. The metadata could be
used for extensions such as geographic analysis or job
title analysis.
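The following sketch illustrates what such a crawler could look like with Scrapy; the start URL and CSS selectors are hypothetical placeholders, since the actual selectors depend on the markup of each job exchange.

import scrapy


class JobAdSpider(scrapy.Spider):
    """Sketch of a job-ad crawler; URLs and selectors are placeholders."""
    name = "job_ads"
    # hypothetical search-result page; a real spider would target the result
    # pages of monster.de, stepstone.de and stellenanzeigen.de
    start_urls = ["https://www.example-jobboard.de/jobs?q=informatik"]

    def parse(self, response):
        # follow every link that looks like a job ad (selector is an assumption)
        for href in response.css("a.job-ad-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_ad)
        # simple pagination (selector is an assumption)
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_ad(self, response):
        # store the raw HTML plus the metadata mentioned above
        yield {
            "url": response.url,
            "html": response.text,
            "company": response.css(".company-name::text").get(),
            "title": response.css("h1::text").get(),
            "location": response.css(".job-location::text").get(),
        }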
3.2 Language Detection
The result of the data acquisition are all job ads, di-
rected at computer scientists of the online job ex-
changes mentioned. Seven percent of the extracted
job advertisements are written in English, 93% of the
advertisements are written in German. As described,
the approach is aimed at ads in German, the English-
language job advertisements must be filtered out.
In order to filter out non-German advertisements, a
language recognition according to Shuyo was implemented
(Shuyo, 2010). The basic assumption of the approach is
that certain n-grams occur with different frequencies in
different languages (see table 1). Based on a quantitative
corpus analysis using
Wikipedia pages, all mono-, bi- and trigrams of the
languages to be detected were counted and the proba-
bility with which each n-gram occurs in the respective
language was calculated. Shuyo provides the results
of the quantitative analysis in so-called profile files.
Table 1: Comparison of the probabilities with which the n-grams occur in German and English (based on Shuyo's profile files).

n-gram    German     English
D         0.00648    0.00253
ie        0.01158    0.00259
hum       0.00013    0.00029
To recognize the language, the text to be classi-
fied is fragmented into mono-, bi- and trigrams. The
algorithm randomly selects individual n-grams and
calculates the probability with which these occur in
the languages tested (Shuyo, 2010).
For evaluation, 741 job advertisements were classi-
fied by language. 8.77% of the job ads tested were in
English, 91.23% in German. The algorithm achieved
an accuracy of 100% and examined a maximum of 15
n-grams per text.
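A comparable filter can be built with langdetect, a Python port of Shuyo's library; the sketch below assumes the crawled ad texts are available as plain strings (the example sentences are invented).

from langdetect import DetectorFactory, detect

# make the otherwise randomized n-gram sampling deterministic
DetectorFactory.seed = 0

ads = [
    "Wir suchen einen Softwareentwickler mit guten Java-Kenntnissen.",
    "We are looking for a software engineer with strong Java skills.",
]

# keep only the German-language job ads
german_ads = [ad for ad in ads if detect(ad) == "de"]
print(german_ads)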
3.3 Section Classification
After all English-language job adverts have been fil-
tered out, the position of the requirements within the
job advertisement must be identified. Position here means
the HTML list element in which the requirements are
listed. For this purpose, machine-learning classification
algorithms were trained and compared.
In total, 150 job advertisements were segmented for
training and testing. To this end, all HTML list elements
with more than 20 characters contained in the job
advertisements were assigned to a category. The categories
are Requirements, Offer, Tasks and uncategorized.
Requirements describes the employer's requirements for
the applicant. Elements classified as Tasks contain the
tasks to be performed in the position. Offer sections
contain information about the offers and benefits of the
company. The models were trained on the segmented data
of 50 job advertisements; the test corpus includes the
remaining 100 segmented job advertisements.
The scikit-learn (version 0.20.3, scikit-learn.org)
implementations of RandomForestClassifier, LinearSVC
(Linear Support Vector Classification), MultinomialNB
(Naive Bayes classifier for multinomial models),
LogisticRegression, DecisionTreeClassifier and BernoulliNB
(Naive Bayes classifier for multivariate Bernoulli models)
were trained and compared in the scikit-learn standard
configuration. For tokenization, a word tokenizer was used.
Features were represented by a count matrix, which was
transformed into a normalized tf (term frequency)
representation containing 3349 features. To compare the
algorithms, each algorithm was trained 5 times on 50
segmented job ads and tested with 100 job ads.
Figure 1: System architecture. Data Acquisition (Scrapy spider, scheduler, downloader, item pipelines), Language Detection (filtering out non-German ads using n-grams), Section Classification (support vector machine using nltk; categories: Tasks, Requirements, Offer and uncategorized) and Skill Recognition (POS tagging, resolving detached compounds, chunking, skill cleaning, semantic matching).
Figure 2: Box plot of 5 tests per algorithm. LinearSVC
reached an accuracy of 0.950291.
Table 2: Comparing accuracy, precision and recall. The fea-
tures were tokenized by words and the frequency measured
by TF.
algorithm accuracy precision recall
RandomForestCl. 0.7496 0.7215 0.5933
LinearSVC 0.9502 0.9416 0.9434
MultinomialNB 0.8902 0.8804 0.8103
LogisticRegr. 0.8874 0.8754 0.8043
DecisionTreeCl. 0.8100 0.8115 0.8249
BernoulliNB 0.8690 0.8185 0.7664
Rated by accuracy, LinearSVC and MultinomialNB were the
best algorithms: LinearSVC reached an average accuracy of
0.950291 and MultinomialNB 0.890297 (see figure 2 and
table 2).
These results are achieved because skill/requirements
sections contain words like Kenntnisse (eng. knowledge),
idealerweise (eng. ideally) and gute (eng. good) in high
frequency and almost exclusively (see figure 3). For
optimization, the classification algorithms
were also tested with stemmed features, on a feature
set with and without stop words, using n-grams, us-
ing TF-IDF (inverse document frequency) and with
a pruned feature set. Table 3 shows that LinearSVC
using TF-IDF, without further optimization, gives the
best results in terms of skill section recognition.
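A minimal sketch of this best-performing setting with scikit-learn is shown below: word tokenization and count matrix via CountVectorizer, TF-IDF weighting via TfidfTransformer and LinearSVC in its default configuration. The toy sections and labels are invented stand-ins for the segmented training data.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# invented stand-ins for segmented HTML list elements and their categories
sections = [
    "Gute Kenntnisse in Java und idealerweise Erfahrung mit SQL",
    "Sie entwickeln und testen unsere Webanwendungen",
    "Wir bieten flexible Arbeitszeiten und ein attraktives Gehalt",
]
labels = ["Requirements", "Tasks", "Offer"]

pipeline = Pipeline([
    ("counts", CountVectorizer()),    # word tokenization and count matrix
    ("tfidf", TfidfTransformer()),    # TF-IDF weighting
    ("clf", LinearSVC()),             # default configuration, one-vs-rest
])
pipeline.fit(sections, labels)

print(pipeline.predict(["Sehr gute Deutsch- und Englischkenntnisse"]))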
The confusion matrix in table 4 shows that all requirement
sections were correctly recognized. However, in addition to
actual requirement sections, other areas were also
classified as requirement sections, which leads to a
precision of 0.826 for the Requirements class.
Figure 3: Frequency with which the features (idealerweise, Kenntnisse, oder, gute, Ausbildung, Erfahrungen) occur in requirement sections and in other areas (Tasks, Offer, uncategorized; based on 50 job ads).
Table 3: Comparing algorithms by accuracy (acc), precision (pre) and recall (rec) for different settings. For stemming, the NLTK (Natural Language Toolkit) German Snowball stemmer is used. The stop word list contains 621 German stop words. Pruning means removing tokens that are included in almost all job ads or in almost none (< 0.5% or > 99.5%).

Setting                                   metric   LinSVC   Mult.NB   LogReg
TF-IDF                                    acc      0.9502   0.8902    0.8874
                                          pre      0.9416   0.8804    0.8754
                                          rec      0.9434   0.8103    0.8043
TF-IDF, prune (< 0.5% & > 99.5%)          acc      0.9368   0.8899    0.9421
                                          pre      0.9321   0.9543    0.9247
                                          rec      0.9308   0.9274    0.8163
TF-IDF, stop words                        acc      0.9372   0.9032    0.8821
                                          pre      0.9471   0.8875    0.8799
                                          rec      0.8960   0.8249    0.7963
TF-IDF, stop words, stem                  acc      0.9449   0.9137    0.9006
                                          pre      0.9512   0.9426    0.8895
                                          rec      0.9116   0.8424    0.8193
TF-IDF, stop words, stem, prune           acc      0.9393   0.9000    0.9052
                                          pre      0.9245   0.8724    0.9248
                                          rec      0.9283   0.8205    0.8312
TF-IDF, stop words, stem, n-gram(1,2)     acc      0.9163   0.8902    0.8240
                                          pre      0.9438   0.8820    0.8439
                                          rec      0.8436   0.8103    0.7227
For the further recognition of skills, this would mean that
skills would also be searched in, for example, offer
sections. The recall and precision values of the
Requirements class also show that in some job
advertisements several areas were classified as requirement
sections.
Table 4: Confusion matrix of prediction with LinearSVC and TF-IDF (n = 266). The results show that texts of the classes Unclassified, Task and Offer were also classified as Requirements.

                      Predicted
Actual        Unc.   Offer   Req.   Task   recall
Unc.            14       2      1      0    0.823
Offer            2      57     16     11    0.662
Req.             0       0    100      0    1.000
Task             1      11      4     47    0.746
precision    0.823   0.814  0.826  0.810
Table 5: Confusion matrix of prediction with LinearSVC and TF-IDF under the premise that there is only one section containing requirements per job ad. The requirement section is the one for which the highest probability of containing requirements has been calculated.

                      Predicted
Actual        Unc.   Offer   Req.   Task   recall
Unc.            14       3      0      0    0.823
Offer            2      76      0      8    0.883
Req.             0       0    100      0    1.000
Task             1       9      0     53    0.841
precision    0.823   0.864  1.000  0.869
Therefore, the premise was defined that every online job
advertisement contains exactly one requirement section.
Based on the trained model, P(x) gives the probability with
which a text x is a requirement section. The requirement
section r_i of a job advertisement i, consisting of a set
of section texts S_i, is then defined as:

r_i = arg max_{s ∈ S_i} P(s)    (1)
Table 5 shows the result including the premise. No
false positives are assigned to the requirements class
and the recognition of the other classes also improves.
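The premise can be implemented by selecting, per job advertisement, the section with the highest score for the Requirements class. The sketch below assumes a fitted pipeline like the one above; since LinearSVC exposes decision scores rather than calibrated probabilities, decision_function is used as a stand-in for P(s).

import numpy as np

def requirement_section(pipeline, section_texts):
    """Return the section text with the highest score for the Requirements class."""
    scores = pipeline.decision_function(section_texts)
    req_idx = list(pipeline.classes_).index("Requirements")
    if scores.ndim == 1:
        # binary case: a single score per sample, positive for the second class
        req_scores = scores if req_idx == 1 else -scores
    else:
        # multi-class case: one column of scores per class
        req_scores = scores[:, req_idx]
    return section_texts[int(np.argmax(req_scores))]

# usage (with the pipeline sketched above):
# requirement_section(pipeline, sections_of_one_job_ad)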
3.4 Skill Detection
After the requirement sections have been identified, the
skills they contain must be extracted. POS tagging was
performed for this purpose. For the tagging, the nltk
ClassifierBasedTagger was trained with the Tiger Corpus of
the Institute for Machine Language Processing of the
University of Stuttgart. The Tiger Corpus consists of
90,000 tagged tokens and 50,000 sentences (IMS, 2018). The
analysis of POS-tagged skills shows that they can appear in
requirement descriptions in a very heterogeneous form.
Skills can include nouns, foreign words, compound words,
detached compounds, combinations with cardinal numbers,
etc. Examples of skills in job ads are Hochschulstudium
(university degree), ISO 27001, gute Deutsch- und
Englischkenntnisse (good German and English skills) and
Erfahrung in der Programmierung mit Java (experience in
programming with Java).
A multi-stage process was defined to identify the skills.
First, detached compounds such as Deutsch- und
Englischkenntnisse are resolved. Next, the skills are
chunked using POS templates. False positives are then
filtered out via a blacklist. Finally, in order to count
semantically similar skills within one class, each skill is
assigned to a class via semantic matching.
3.4.1 Resolve Detached Compounds
In German, words can be concatenated into a compound that
means the same as the combination of its parts (e.g.
Administratorpasswort, eng. administrator password). With a
conjunction, several compounds sharing the same ending or
beginning can be shortened without losing their meaning
(e.g. Clientadministrator und Serveradministrator can be
written as Client- und Serveradministrator). In order to
recover the connections expressed by such conjunctions,
these detached compounds must be resolved.
To resolve detached compounds such as those shown in the
examples below, a compound resolver module was developed.
The compoundResolver module searches for tokens that are
classified as TRUNC (truncation) by the POS tagger and end
with a hyphen. If the following token is classified as KON
(conjunction) and the token after that as NN (regular
noun), the regular noun is split into syllables using the
pyphen library (pyphen.org). Examples of resolving detached
compounds and of the POS tagging are:
Deutsch- und Englischkenntnisse
-> Deutschkenntnisse und Englischkenntnisse
Hochschul- oder Universitätsabschluss
-> Hochschulabschluss oder Universitätsabschluss
By merging the syllables, all possible combinations were
generated while keeping the order of the syllables. Based
on a dictionary of German nouns, each combination is
tested. The shortest combination contained in the
dictionary is identified as the first part of the compound.
The rest of the word is merged with the truncated part of
the compound, and the new string replaces the truncated
token. An example of the generated syllable combinations
and the result after resolving is:
[’Eng’, ’Englisch’, ’Englischkennt’,
’Englischkenntnis’, ’Englischkenntnisse’]
[(’Deutschkenntnisse’, ’NN’), (’und’, ’KON’),
(’Englischkenntnisse’, ’NN’)]
For following nouns that are not themselves compounds (as
in Wirtschafts- (Informatik)), the algorithm has been
extended: if the whole word matches an entry of the
dictionary, it is joined directly with the truncated token.
An example of resolving Wirtschafts- (Informatik) is:
[(’Wirtschafts-’, ’TRUNC’), (’)’, ’$(’),
(’Informatik’, ’NN’)]
[(’Wirtschaftsinformatik’, ’NN’), (’)’, ’$(’),
(’Informatik’, ’NN’)]
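A simplified sketch of the resolution step for a single TRUNC-KON-NN pattern is shown below, using pyphen for syllabification; the resolve_trunc helper and the small stand-in dictionary of German nouns are illustrative only, as the real module works on full POS-tagged sentences and a complete dictionary.

import pyphen

dic = pyphen.Pyphen(lang="de_DE")
# tiny stand-in for the dictionary of German nouns used by the module
GERMAN_NOUNS = {"Deutsch", "Englisch", "Hochschule", "Informatik"}


def resolve_trunc(trunc, noun):
    """Resolve e.g. ('Deutsch-', 'Englischkenntnisse') to 'Deutschkenntnisse'."""
    syllables = dic.inserted(noun).split("-")
    # cumulative proper prefixes, e.g. 'Eng', 'Englisch', 'Englischkennt', ...
    prefixes = ["".join(syllables[:i + 1]) for i in range(len(syllables) - 1)]
    for prefix in prefixes:
        if prefix in GERMAN_NOUNS:               # shortest dictionary match = first compound part
            rest = noun[len(prefix):]            # rest of the word, e.g. 'kenntnisse'
            return trunc.rstrip("-") + rest      # 'Deutsch' + 'kenntnisse'
    if noun in GERMAN_NOUNS:                     # following noun is not itself a compound
        return trunc.rstrip("-") + noun.lower()  # e.g. 'Wirtschafts' + 'informatik'
    return trunc.rstrip("-") + noun              # fallback: join unchanged


print(resolve_trunc("Deutsch-", "Englischkenntnisse"))  # Deutschkenntnisse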
3.4.2 Skill Chunking
Job requirements are specified in job advertisements
in various forms. To extract the requirements, the text
is chunked using POS templates and regular expres-
sions. Four structures were identified for matching
skills. Each structure is shown with examples and the
appropriate expression for chunking with the nltk
RegexpParser.
A sequence of nouns, cardinal numbers, proper names and/or
foreign words, such as Deutsch, SAP R3, HTML Kenntnisse,
Windows 10. The pattern matches single or consecutive
occurrences of foreign words, regular nouns, proper names
or cardinal numbers:
<NN|NE|CARD|FM>+
These sequences can be separated by hyphens, such as
FH-Studium, ERP-Systeme, Microsoft-Zertifizierung,
IT-System-Kaufmann, SAP IS-A:
<NN|NE|CARD|FM|TRUNC>+
Within a skill description, articles, prepositions, or
prepositions fused with articles can be included:
Kenntnisse [der] Informatik, Erfahrung [im]
Projektmanagement, Erfahrung [in der] Arbeit [mit] Eclipse:
<NN|NE|CARD|FM|TRUNC>+<ART|APPR|APPRART>+<NN|NE|CARD|FM|TRUNC>+
In addition, the description of a skill can contain
attributive, adverbial or predicative adjectives:
Kenntnisse [in der] technischen Informatik, Erfahrung [in
der] objektorientierten Programmierung:
<NN|NE|CARD|FM|TRUNC>+<ART|APPR|APPRART>+<ADJD|ADJA>*<NN|NE|CARD|FM|TRUNC>+
These regular expressions were tested on a corpus of 60 job
ads. 87.33% of the skills mentioned in the job ads could be
recognized correctly; 12.67% of the skills could not be
extracted. In addition, 21.57% of the extracted tokens are
false positives.
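A minimal sketch of the chunking step with the nltk RegexpParser is shown below, combining the most general pattern with the plain noun-sequence pattern. The input is assumed to be already POS-tagged (in the described system by the ClassifierBasedTagger trained on the Tiger Corpus), so a small hand-tagged example is used here.

import nltk

# two of the patterns described above, combined under one chunk label;
# the more specific pattern is tried first
grammar = r"""
  SKILL: {<NN|NE|CARD|FM|TRUNC>+<ART|APPR|APPRART>+<ADJD|ADJA>*<NN|NE|CARD|FM|TRUNC>+}
         {<NN|NE|CARD|FM|TRUNC>+}
"""
chunker = nltk.RegexpParser(grammar)

# hand-tagged stand-in for the output of the German POS tagger
tagged = [
    ("Erfahrung", "NN"), ("in", "APPR"), ("der", "ART"),
    ("objektorientierten", "ADJA"), ("Programmierung", "NN"),
    ("und", "KON"), ("gute", "ADJA"), ("Englischkenntnisse", "NN"),
]

tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "SKILL"):
    print(" ".join(word for word, tag in subtree.leaves()))
# -> Erfahrung in der objektorientierten Programmierung
# -> Englischkenntnisse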
3.4.3 Skill Cleaning
After the skill chunking, the result contains 21.57% false
positives such as Wort (eng. word) or Fach (eng. subject).
These are not job requirements, and it is necessary to
remove them from the results. For this purpose, the results
are cleaned via a filter. The filter uses a blacklist of
117 tokens that are classified as non-requirements. The
list contains tokens that are often part of a requirement
description but are not requirements in themselves. These
include words like Wissen (eng. knowledge) and Erfahrung
(eng. experience). The tokens were collected from 50 job
ads based on token frequency.
The filter compares the stemmed blacklist tokens with the
stemmed extracted skills; stemming is used so that
inflected forms of the blacklist words are filtered out as
well. If they match, the extracted token is discarded. For
stemming, the German Snowball stemmer is used.
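A minimal sketch of this cleaning step is shown below; the blacklist is only a tiny invented excerpt of the 117 tokens.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")

# tiny invented excerpt of the blacklist of non-requirement tokens
BLACKLIST = {"Wissen", "Erfahrung", "Wort", "Fach"}
blacklist_stems = {stemmer.stem(token) for token in BLACKLIST}


def clean_skills(extracted_tokens):
    """Discard extracted tokens whose stem matches a blacklisted stem."""
    return [token for token in extracted_tokens
            if stemmer.stem(token) not in blacklist_stems]


print(clean_skills(["Java", "Erfahrungen", "Englischkenntnisse", "Wort"]))
# -> ['Java', 'Englischkenntnisse']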
3.4.4 Semantic Matching
After the skill cleaning, the set of extracted tokens
contains only skills. The extracted skills can be
syntactically different but semantically similar or
identical, for example:

IT-Sicherheit [IT-Security, Sicherheitsaspekte der IT]
GWT [Kenntnisse in GWT, Google Web Toolkit]
In order to get an overview of the most frequently required
skills, exact matching is not sufficient; semantically
similar requirements must be assigned to the same class. To
this end, different text distance algorithms were compared.
The evaluation of the algorithms is based on a test data
set containing 200 skills together with their correct
target classes. Two fasttext models (Bojanowski et al.,
2017, fasttext.cc/docs/en/pretrained-vectors; Grave et al.,
2018, fasttext.cc/docs/en/crawl-vectors), one word2vec
model (devmount.github.io/GermanWordEmbeddings) and four
syntactic text distance algorithms were evaluated.
Table 6 shows the result of the comparison. With an
accuracy of 44.00%, the fasttext model of Grave et al.
(2018) achieves the best results, followed by the
Levenshtein distance with 39.33%. The fasttext model
recognizes the semantic relation between tokens like Diplom
and Master, but abbreviations and inflections in particular
are mostly not recognized. Levenshtein provides good
results when large parts of the character strings match.
Table 6: Comparing accuracy for text distance algorithms.

algorithm                                        accuracy
Fasttext (Grave et al., 2018)                    0.4400
Levenshtein                                      0.3933
Editex                                           0.3800
Hamming                                          0.2866
Fasttext (Bojanowski et al., 2017)               0.2133
Burrows-Wheeler transform run-length
encoding (BWT RLE)                               0.1400
word2vec                                         0.1333
If only small parts of the strings match, the Levenshtein
distance is large. This leads to wrong matchings for
extracted tokens that combine a very short keyword like C++
with otherwise meaningless words like knowledge.
In order to take advantage of both the fasttext model and
the Levenshtein distance, the two measures were combined
and supplemented by preprocessing and stemming (see
Algorithm 1). In preprocessing, the set S of 511 skills is
extended by the respective stemmed skills. If the token t
is contained in the set S, it is assumed that this is the
best possible result. If not, a set
W = {(w_0, p_0), ..., (w_n, p_n)} of the 500 semantically
most similar words w_i and their corresponding similarities
p_i is calculated by the model m. It is assumed that
stemmed(w_i) also has the similarity p_i to t (W'). From
the set of all w_i in W', the intersection with S is formed
(W''). The similarity of each element of the intersection
is weighted with the normalized Levenshtein distance, and
if t is a substring of w_i or w_i is a substring of t, p_i
is additionally weighted with wc = 1.3. The algorithm
returns the word w_i with the greatest similarity p_i. This
results in an accuracy of 75.33%.
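A condensed sketch of Algorithm 1 is shown below, using gensim to load the pre-trained fasttext vectors and a plain dynamic-programming Levenshtein distance; the skill list is a stand-in for the 511 predefined skills, and the interpretation of norm(levenshtein) as a similarity in [0, 1] is an assumption.

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("german")
WC = 1.3  # weighting constant from Algorithm 1

# invented stand-in for the 511 predefined skill classes
SKILLS = ["Java", "Python", "Projektmanagement", "Englischkenntnisse", "Master"]


def levenshtein(a, b):
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def norm_levenshtein(a, b):
    # assumption: norm(levenshtein) is read as a similarity in [0, 1]
    return 1.0 - levenshtein(a, b) / max(len(a), len(b), 1)


def match_skill(token, skills, model):
    """Condensed sketch of Algorithm 1: fasttext similarity weighted by Levenshtein."""
    stem_to_skill = {stemmer.stem(s): s for s in skills}
    if token in skills or stemmer.stem(token) in stem_to_skill:
        return token
    best_word, best_score = token, 0.0
    for word, sim in model.most_similar(token, topn=500):
        # keep only candidates whose (stemmed) form is in the skill list
        skill = stem_to_skill.get(stemmer.stem(word))
        if skill is None and word in skills:
            skill = word
        if skill is None:
            continue
        score = sim
        if token in word or word in token:   # substring bonus, wc = 1.3
            score *= WC
        score *= 1.0 + norm_levenshtein(word, token)
        if score > best_score:
            best_word, best_score = skill, score
    return best_word


# usage: load the pre-trained German vectors of Grave et al. (2018); the file is several GB
# from gensim.models.fasttext import load_facebook_vectors
# model = load_facebook_vectors("cc.de.300.bin")
# print(match_skill("Java-Kenntnisse", SKILLS, model))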
3.5 Required Competencies
The algorithm was tested on 491 job ads. Figure
4 shows that professional experience, programming
experience and a university degree are the most de-
manded competencies. The most important soft skills
are the ability to work in a team, communication skills
and a sense of responsibility. The most demanded pro-
gramming languages are Java, C and Python. Linux, SAP and
VMware lead the list of concrete products for which
hands-on experience is requested.
4 CONCLUSIONS
Knowing the demands on employees is important for
universities and students. Due to the lack of instruments
for identifying requirements for employees on the German
labor market, a procedure was developed to extract them
automatically from job advertisements. The procedure is
subdivided into Data Acquisition, Language Detection,
Section Classification and a multi-stage Skill Detection
procedure. For data acquisition, the online job portals
monster.de, stepstone.de and stellenanzeigen.de were used
as data sources. The language detection method according to
Shuyo, based on n-grams, was used for language
identification and reached an accuracy of 100% for the
tested 300 job ads.
Algorithm 1: Semantic Matching.
Input: token to classify t, list of skills S, model m with m(x, n) returning a sequence of n tuples of the most similar words to x and their similarities.
Output: semantically most similar word w_i to the given token t

wc ← 1.3
S ← S ∪ {stemmed(s) | s ∈ S}
if t ∈ S then
    return t
else
    W ← m(t, 500)    // result: {(w_0, p_0), ..., (w_n, p_n)}
    W' ← W ∪ {(stem(w_i), p_i) | (w_i, p_i) ∈ W}
    W'' ← {(w_i, p_i) | (w_i, p_i) ∈ W' ∧ w_i ∈ S}
    result_w, result_p ← t, 0
    for all (w, p) ∈ W'' do
        if w is a substring of t or t is a substring of w then
            p ← p · wc
        end if
        p ← p · (1 + norm(levenshtein(w, t)))
        if p > result_p then
            result_w, result_p ← w, p
        end if
    end for
    return result_w
end if
To classify the sections, different methods of ma-
chine learning were compared. The best results with
an accuracy of 0.9502 were achieved by the linear im-
plementation of a Support Vector Machine using the
one-vs-rest approach for multi-class problems. Under the
premise that exactly one HTML list element per job
advertisement defines the employee requirements, an
accuracy of 100% could be achieved.
The extraction of skills is based on natural lan-
guage processing techniques. Using part of speech
templates, tokens are extracted that correspond to the
pattern of skills. False positives are filtered out via a
blacklist. A synonym dictionary was created to cor-
rectly assign semantically identical or similar skills to
a common class. This dictionary contains descriptions
for skills in relation to semantically similar skills.
Using a text distance algorithm based on fasttext word
embeddings and the Levenshtein distance, the tokens are
assigned to the skill classes. With this method, 75.33% of
the known skills can be assigned correctly.
In the future, the semantic similarity recognition
process could be improved. The inclusion of exter-
nal data sources and training of word embeddings on
Figure 4: Most wanted competences based on 491 job ads (March 28, 2019): professional experience (277), programming (251), university degree (239), team ability (129), communication skills (117), computer science degree (110), German (109), English (105), project management (97), education and training (93), IT specialists (79), sense of responsibility (71), Java (69), process thinking (67), reliability (52), IT security (49), self-reliance (42), Linux (22), databases (20), SAP (18), C (14), Python (13), VMware (9).
job ads could improve the assignment. The procedure
could also be extended to other occupational groups
and, for example, provide up-to-date statistics online
in real-time. In combination with other data in the
job ads, maps showing the required skills in certain
regions could be generated. In addition, the evaluation
of the transferability of the approach to the English
language and a comparison of the results would be
interesting.
REFERENCES
Bitkom (2016). 51.000 offene Stellen für IT-Spezialisten.
Retrieved from bitkom.org/Presse/Presseinformation/51000-offene-
Stellen-fuer-IT-Spezialisten.html.
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017).
Enriching word vectors with subword information.
Transactions of the Association for Computational Lin-
guistics, 5:135–146.
Gallagher, K. P., Kaiser, K. M., Simon, J. C., Beath, C. M.,
and Goles, T. (2010). The requisite variety of skills for
it professionals. Commun. ACM, 53(6):144–148.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov,
T. (2018). Learning word vectors for 157 languages. In
Proceedings of the International Conference on Lan-
guage Resources and Evaluation (LREC 2018).
IMS (2018). Tiger corpus. Retrieved from ims.uni-
stuttgart.de/forschung/ressourcen/korpora/tiger.html.
Kwon Lee, C. and Han, H. (2008). Analysis of skills require-
ment for entry-level programmer/analysts in fortune
500 corporations. Journal of Information Systems Edu-
cation, 19.
Lee, S., Koh, S., Yen, D., and Tang, H.-L. (2002). Percep-
tion gaps between is academics and is practitioners:
an exploratory study. Information & Management,
40(1):51–61.
Litecky, C., Aken, A., Ahmad, A., and Nelson, H. J. (2010).
Mining for computing jobs. IEEE Software, 27(1):78–
85.
Milton, T. (2000). Cross training the answer to e-commerce
staff shortages. Computer Weekly.
Prabhakar, B., Litecky, C. R., and Arnett, K. (2005). It skills
in a tough job market. Communications of the ACM,
48(10):91–94.
Scott, E., Alger, R., Pequeno, S., and Sessions, N. (2002).
The skills gap as observed between is graduates and
the systems development industry–a south african ex-
perience. Informing Science.
Shuyo, N. (2010). Language detection library for java. Re-
trieved from code.google.com/p/language-detection.
Sibarani, E. M., Scerri, S., Morales, C., Auer, S., and Col-
larana, D. (2017). Ontology-guided job market demand
analysis. In Hoekstra, R., Faron-Zucker, C., Pellegrini,
T., and de Boer, V., editors, Proceedings of the 13th In-
ternational Conference on Semantic Systems - Seman-
tics2017, pages 25–32, New York, New York, USA.
ACM Press.
Weitzel, T., Eckhardt, A., Laumer, S., Maier, C., and Stet-
ten, A. v. (2015). Recruiting Trends 2015: Eine em-
pirische Untersuchung mit den Top-1.000-Unternehmen
aus Deutschland sowie den Top-300-Unternehmen aus
den Branchen Finanzdienstleistung, Health Care und IT.
Retrieved from nbn-resolving.de/urn:nbn:de:bvb:473-
opus4-262833.
Wowczko, I. (2015). Skills and vacancy analysis with data
mining techniques. Informatics, 2(4):31–49.
Yongbeom, K., Jeffrey, H., and Mel, S. (2006). An update
on the is/it skills gap. Journal of Information Systems
Education, 17(4):395–402.