
 
Figure 1: System overview. – implies future components. 
chapters. These chapters, which are in PDF 
format, were then converted to text. Finally, the 
metadata were generated from these texts. 
2.1.1  Partition of Books into Chapters 
The documents we have are course books in PDF 
format. Each course book has a different number of 
chapters. To make metadata extraction possible, we 
divided the course books into smaller PDFs, one per 
chapter. About 230 course books had been published 
in PDF format. Some of them had been scanned from 
hard copies of the originals, so we had to eliminate 
them. Of the course books we have, 205 were 
partitioned into smaller PDFs. In the end, we 
obtained 2654 PDFs, one for each chapter of the 205 
course books. 
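The partitioning step can be illustrated with a small sketch. The text does not say how chapter boundaries were located, so the example assumes that the first page of each chapter is already known (for instance from a table of contents). The function name `chapter_page_ranges` and the page numbers are hypothetical.

```python
def chapter_page_ranges(start_pages, total_pages):
    """Return inclusive (first, last) page ranges, one per chapter.

    Assumption: start_pages lists the 1-based first page of every
    chapter in ascending order, and each chapter runs up to the page
    before the next chapter starts.
    """
    if not start_pages:
        return []
    if sorted(set(start_pages)) != list(start_pages):
        raise ValueError("start pages must be strictly ascending")
    if start_pages[0] < 1 or start_pages[-1] > total_pages:
        raise ValueError("start pages must lie inside the book")
    # The last chapter ends at the last page of the book.
    ends = [nxt - 1 for nxt in start_pages[1:]] + [total_pages]
    return list(zip(start_pages, ends))


# Hypothetical book of 120 pages with three chapters.
print(chapter_page_ranges([1, 35, 80], 120))
```

Each resulting range would then be written out as its own PDF by whatever PDF tool is in use.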
2.1.2  PDF to Text Conversion 
In document processing, we converted the PDF 
documents to text to make them easier to process. 
For this step we used PDFBox 
(http://www.pdfbox.org), an open source Java 
library for working with PDF documents. 
While converting the PDFs into text, we faced 
some problems. Most of the PDF documents were 
legacy files. Also, since they are written in Turkish, 
the Turkish-specific characters ‘ı’, ‘İ’, ‘ğ’, ‘Ğ’, ‘ş’, 
and ‘Ş’ were corrupted during conversion. All of 
these except ‘ş’ and ‘Ş’ were corrected by 
replacement, because each of them was represented 
by a single non-Turkish character. However, we 
could not correct ‘ş’ and ‘Ş’ by replacement. In the 
converted text, ‘ş’ becomes ‘fl’ and ‘Ş’ becomes ‘fi’, 
and both ‘fl’ and ‘fi’ can occur in meaningful 
Turkish words. To overcome this issue, we needed a 
spell checker that checks whether a word is spelled 
correctly and replaces a misspelled word with the 
correct one. As a spell checker, we used Zemberek, 
an open source Turkish NLP library 
(http://code.google.com/p/zemberek). Zemberek 
provides basic NLP operations such as spell 
checking, morphological parsing, stemming, word 
construction, word suggestion, converting words 
written using only ASCII characters (the so-called 
'deasciifier'), and extracting syllables.  
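The two kinds of repair described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. The four placeholder characters in `SINGLE_CHAR_FIXES` are assumptions (the text does not list them). The small `LEXICON` set and the `is_valid` helper are hypothetical stand-ins for a real spell checker such as Zemberek, whose API is not reproduced here.

```python
import re
from itertools import product

# Assumed placeholders for the four single-character corruptions; the
# source does not list them, so these pairs are illustrative only.
SINGLE_CHAR_FIXES = str.maketrans({"›": "ı", "‹": "İ", "€": "ğ", "⁄": "Ğ"})

# Two-letter sequences that may stand for a corrupted character.
AMBIGUOUS = {"fl": "ş", "fi": "Ş"}

# Tiny stand-in lexicon; a real system would ask a spell checker instead.
LEXICON = {"başka", "kişi", "kış", "iflas", "film", "Şirket"}


def is_valid(word):
    """Stand-in for a spell-checker lookup (hypothetical helper)."""
    return word in LEXICON


def candidates(word):
    """Yield every variant of word with each 'fl'/'fi' kept or replaced."""
    parts = re.split(r"(fl|fi)", word)
    options = [(p, AMBIGUOUS[p]) if p in AMBIGUOUS else (p,) for p in parts]
    for combo in product(*options):
        yield "".join(combo)


def repair_word(word):
    """Fix a word only when exactly one repaired variant is valid."""
    word = word.translate(SINGLE_CHAR_FIXES)
    if is_valid(word):
        return word  # 'fl'/'fi' are genuine here, e.g. "iflas"
    valid = {c for c in candidates(word) if c != word and is_valid(c)}
    return valid.pop() if len(valid) == 1 else word


for w in ["baflka", "kifli", "iflas", "film", "fiirket", "k›fl"]:
    print(w, "->", repair_word(w))
```

Checking the unchanged word first is what keeps genuine words such as "iflas" and "film" from being rewritten. Words with no valid variant, or with more than one, are left untouched so that they can be handled by another mechanism.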
Although Zemberek overcame many problems and 
was useful, in some cases it failed to make the right 
correction because of proper names and misspelled 
words. To handle these cases, we created a 
correction map file, which contains a list of correct 
spellings of proper names and common words.  
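A correction map of this kind could be applied as in the sketch below. The file format is not described in the text, so the example assumes one tab-separated pair per line (corrupted form, then correct spelling). The function names and the sample entries are hypothetical.

```python
def load_correction_map(lines):
    """Parse 'corrupted<TAB>correct' pairs, skipping blanks and comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        wrong, _, right = line.partition("\t")
        if wrong and right:
            mapping[wrong] = right
    return mapping


def apply_corrections(words, mapping):
    """Replace words found in the map; leave every other word untouched."""
    return [mapping.get(w, w) for w in words]


# Hypothetical entries: a proper name and a common word.
SAMPLE = ["# corrupted\tcorrect", "Eskiflehir\tEskişehir", "", "iflletme\tişletme"]
corrections = load_correction_map(SAMPLE)
print(apply_corrections(["Eskiflehir", "iflletme", "film"], corrections))
```

An explicit map trades coverage for precision: it fixes only the listed words, which suits proper names that a general spell checker does not know.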
2.1.3  Metadata Extraction and Discourse 
Analysis 
At this stage, the full text of each chapter, obtained 
in the PDF to text conversion step, is converted 
into an XML representation.  
Instead of writing the full text under a single tag, 
we first extract metadata such as author, summary, 
keywords, and learning objectives so that we can 
display this information to the user in the result set. 
We followed similar research 
(Yilmazel, Finneran, & Liddy, 2004), as creating 
metadata for digital content manually takes a great 
amount of time and effort. 
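To make the idea of separate metadata tags concrete, the following sketch builds an XML element for one chapter with Python's standard `xml.etree.ElementTree` module. The tag names, the `chapter_to_xml` function, and the sample values are assumptions for illustration; the actual schema is not given in the text.

```python
import xml.etree.ElementTree as ET


def chapter_to_xml(metadata, body):
    """Build a <chapter> element with one child per metadata field."""
    chapter = ET.Element("chapter")
    for name, value in metadata.items():
        if isinstance(value, (list, tuple)):
            # Repeated fields such as keywords get one element per value.
            for item in value:
                ET.SubElement(chapter, name).text = item
        else:
            ET.SubElement(chapter, name).text = value
    # The remaining full text goes under its own tag.
    ET.SubElement(chapter, "content").text = body
    return chapter


# Hypothetical chapter data; tag names are assumptions for illustration.
meta = {
    "author": "A. Yazar",
    "summary": "Short summary of the chapter.",
    "keyword": ["metadata", "XML"],
    "learning_objective": "Explain what metadata is.",
}
element = chapter_to_xml(meta, "Full text of the chapter ...")
print(ET.tostring(element, encoding="unicode"))
```

Keeping each field in its own element lets a result page show the author or the keywords without parsing the chapter body again.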
After the discourse analysis of the chapters, we 
found that chapters were organized as chapter 
number, chapter title, introduction, text body, 
abstract, evaluation tests, and references. However, 
some chapters do not contain evaluation tests or 
references. The introduction parts of the chapters 
also differ: some contain information such as 
chapter author, aim, keywords, and suggestions. 
We implemented a rule-based extraction system to 
extract the metadata of the chapter texts 
automatically.  
We observed that our document collection could 
be separated into six categories according to 
differences in the chapter full texts. We therefore 
designed a chapter parser that determines the 
category of a full text. When a document is sent to 
this parser, it decides the category and extracts the 
metadata of the document.   
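A rule-based parser of this kind might look like the sketch below. The six categories and their rules are not described in the text, so the two categories, the English heading markers, and the field patterns shown here are all hypothetical. They only illustrate the pattern of deciding a category first and extracting fields second.

```python
import re

# Hypothetical rules: each category is recognised by heading markers that
# must all occur in the text. More specific categories are listed first
# because dictionaries keep insertion order.
CATEGORY_RULES = {
    "with_objectives_and_keywords": [r"^Objectives\b", r"^Keywords\b"],
    "with_objectives_only": [r"^Objectives\b"],
}

# Hypothetical field patterns: a heading line followed by its text, which
# runs until the next blank line or the end of the document.
FIELD_PATTERNS = {
    "learning_objective": r"^Objectives\s*\n(.+?)(?:\n\s*\n|\Z)",
    "keywords": r"^Keywords\s*\n(.+?)(?:\n\s*\n|\Z)",
}


def detect_category(text):
    """Return the first category whose markers all match, else None."""
    for category, markers in CATEGORY_RULES.items():
        if all(re.search(m, text, re.MULTILINE) for m in markers):
            return category
    return None


def extract_metadata(text):
    """Return the detected category and the fields found in the text."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, re.MULTILINE | re.DOTALL)
        if match:
            fields[name] = match.group(1).strip()
    return detect_category(text), fields


SAMPLE = (
    "Chapter 1\nIntroduction to Economics\n\n"
    "Objectives\nDefine scarcity.\n\n"
    "Keywords\nscarcity, choice\n\n"
    "Body text."
)
print(extract_metadata(SAMPLE))
```

In a fuller version, each category would carry its own set of field patterns, so that documents with different layouts are handled by different rules.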
Finally, we obtained the following metadata 
elements: Course No, Book Name, Book Author, 
Book ISBN, Chapter No, Chapter Title, Chapter 
Author, Chapter Begin, Foreword, Learning 
Objective, Keywords, Content, Suggestions, 
TURKISH QUESTION ANSWERING - Question Answering for Distance Education Students