Authors: Hadi Mohammadzadeh (1); Thomas Gottron (2); Franz Schweiggert (1) and Gholamreza Nakhaeizadeh (3)
        
        
Affiliations: (1) University of Ulm, Germany; (2) Universität Koblenz-Landau, Germany; (3) University of Karlsruhe, Germany
        
        
        
        
        
             Keyword(s):
            Main content extraction, Information extraction, Web mining, HTML web pages.
        
        
            
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Soft Computing; Symbolic Systems; Web Mining
            
        
        
            
                Abstract: 
                Extracting the main content of web documents with high accuracy is an important challenge for researchers working on the web. In this paper, we present a novel language-independent method for extracting the main content of web pages. Our method, called DANAg, achieves high effectiveness and efficiency in comparison with other main content extraction approaches. The extraction process of DANAg is divided into four phases. In the first phase, we calculate the length of content and the length of code in fixed segments of an HTML file. The second phase applies a naive smoothing method to highlight the segments forming the main content. In the third phase, we use a simple algorithm to recognize the boundary of the main content in the HTML file. Finally, we feed the selected main content area to our parser in order to extract the main content of the targeted web page.
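
The four phases can be illustrated with a minimal Python sketch. It is not the DANAg implementation: the abstract does not say what a segment is, which smoothing is used, or how the boundary is found. The sketch assumes that each line of the HTML file is one segment, that smoothing is a moving average over content length minus code length, and that the boundary is grown outward from the highest-scoring segment while tolerating short gaps. All function names and the parameters `radius` and `max_gap` are hypothetical.

```python
from html.parser import HTMLParser


def segment_lengths(html):
    """Phase 1 (assumed): per-line (content_length, code_length) counts."""
    counts = []
    in_tag = False  # a tag may span several lines, so the state carries over
    for line in html.splitlines():
        content = code = 0
        for ch in line:
            if ch == "<":
                in_tag = True
            if in_tag:
                code += 1          # characters inside <...> count as code
            elif not ch.isspace():
                content += 1       # visible, non-whitespace characters
            if ch == ">":
                in_tag = False
        counts.append((content, code))
    return counts


def smooth(scores, radius=1):
    """Phase 2 (assumed): moving average over neighbouring segments."""
    out = []
    for i in range(len(scores)):
        window = scores[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def main_region(smoothed, max_gap=2):
    """Phase 3 (assumed): grow a region around the best-scoring segment,
    skipping at most max_gap consecutive non-positive segments."""
    if not smoothed or max(smoothed) <= 0:
        return None
    peak = max(range(len(smoothed)), key=smoothed.__getitem__)
    start = end = peak
    gap, i = 0, peak - 1
    while i >= 0 and gap <= max_gap:
        if smoothed[i] > 0:
            start, gap = i, 0
        else:
            gap += 1
        i -= 1
    gap, i = 0, peak + 1
    while i < len(smoothed) and gap <= max_gap:
        if smoothed[i] > 0:
            end, gap = i, 0
        else:
            gap += 1
        i += 1
    return start, end


class _TextCollector(HTMLParser):
    """Phase 4 helper: collects the text nodes of the selected area."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())


def extract_main_content(html):
    """Run the four assumed phases and return the extracted text."""
    lines = html.splitlines()
    scores = [content - code for content, code in segment_lengths(html)]
    region = main_region(smooth(scores))
    if region is None:
        return ""
    start, end = region
    parser = _TextCollector()
    parser.feed("\n".join(lines[start:end + 1]))
    parser.close()
    return " ".join(parser.parts)
```

Calling `extract_main_content(page_source)` on a page whose article paragraphs sit on text-heavy lines between tag-heavy navigation and footer lines should return only the paragraph text. Counting characters rather than words keeps the sketch independent of the page's language, in the spirit of the language-independent goal stated above.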