Authors: Pedro Curto (1); Nuno Mamede (1) and Jorge Baptista (2)
        
        
Affiliations: (1) Universidade de Lisboa and INESC-ID Lisboa/L2F – Spoken Language Lab, Portugal; (2) Universidade de Lisboa and Universidade do Algarve, Portugal
        
        
        
        
        
Keyword(s): Readability, Readability Assessment Metrics, Automatic Readability Classifier, Linguistic Features Extraction, Portuguese.
        
        
            
Related Ontology Subjects/Areas/Topics: Computer-Supported Education; Information Technologies Supporting Learning; Learning/Teaching Methodologies and Assessment; Metrics and Performance Measurement
                    
            
        
        
            
Abstract: This paper describes a system that assists in selecting adequate reading materials to support the teaching of European Portuguese, especially as a second language, and highlights the key challenges in selecting linguistic features for text difficulty (readability) classification. The system uses existing Natural Language Processing (NLP) tools to extract linguistic features from texts, which are then used by an automatic readability classifier. Currently, 52 features are extracted: parts of speech (POS), syllables, words, chunks and phrases, averages and frequencies, and some extra features. A classifier was trained using these features and a corpus previously annotated by readability level according to an official five-level language classification standard for Portuguese as a Second Language. In the five-level scenario (A1 to C1), the best-performing learning algorithm (LogitBoost) achieved an accuracy of 75.11% with a root mean square error (RMSE) of 0.269. In the three-level scenario (A, B and C), the best-performing learning algorithm (C4.5 grafted) achieved an accuracy of 81.44% with an RMSE of 0.346.
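The relation between the five-level and three-level scenarios can be illustrated with a minimal Python sketch. It assumes, as is conventional but not stated in the abstract, that A1 and A2 collapse into A, B1 and B2 into B, and C1 into C. The labels below are made up for illustration and are not taken from the paper's corpus; the names `FIVE_TO_THREE`, `collapse` and `accuracy` are hypothetical.

```python
# Illustrative sketch only: the level mapping is an assumption and the
# labels are invented; none of this reproduces the paper's classifier.

# Assumed grouping of the five levels into the three coarse levels.
FIVE_TO_THREE = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C"}


def collapse(level: str) -> str:
    """Map a five-level label (A1..C1) to its three-level label (A, B, C)."""
    return FIVE_TO_THREE[level]


def accuracy(gold: list[str], predicted: list[str]) -> float:
    """Fraction of texts whose predicted level equals the annotated level."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Hypothetical annotated levels and classifier outputs for six texts.
gold = ["A1", "A2", "B1", "B2", "C1", "B1"]
pred = ["A2", "A2", "B2", "B2", "C1", "A2"]

acc_five = accuracy(gold, pred)
acc_three = accuracy([collapse(g) for g in gold], [collapse(p) for p in pred])

print(f"five-level accuracy:  {acc_five:.2%}")
print(f"three-level accuracy: {acc_three:.2%}")
```

In this toy example, confusions between sublevels of the same band (such as A1 versus A2) stop counting as errors once the labels are collapsed, which is one plausible reason a three-level scenario can yield higher accuracy than a five-level one.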