Supervised Machine Learning and Feature Selection for a Document Analysis Application
James Pope¹, Daniel Powers¹, J. A. (Jim) Connell¹, Milad Jasemi¹, David Taylor² and Xenofon Fafoutis³
¹Stephens College of Business, University of Montevallo, U.S.A.
²University of Memphis, U.S.A.
³DTU Compute, Technical University of Denmark, Denmark
Keywords:
Document Analysis, Supervised Machine Learning, Feature Selection, Optical Character Recognition.
Abstract:
Over the past three decades large amounts of information have been converted from paper documents to image formats. Though in digital form, extracting the information, usually textual, from these documents requires
complex image processing and optical character recognition techniques. The processing pipeline from the
image to information typically includes an orientation correction task, document identification task, and text
analysis task. When there are many document variants the tasks become difficult, requiring complex sub-analyses for each variant, and quickly exceed human capability. In this work, we demonstrate a document
analysis application with the orientation correction and document identification task carried out by supervised
machine learning techniques for a large, international airline. The documents have been amassed over forty
years with numerous variants; they are mostly black and white, typically consist of text and lines, and some
have extensive noise. Low level symbols are extracted from the raw images and separated into partitions. The
partitions are used to generate statistical features which are then used to train the classifiers. We compare the
classifiers for each task (e.g. decision tree, support vector machine, and random forest) to choose the most
appropriate. We also perform feature selection to reduce the complexity of the document type classifiers.
These parsimonious models result in comparable accuracy with roughly 80% fewer features.
1 INTRODUCTION
Automatically identifying text in images has enabled
the extraction of knowledge from vast amounts of ex-
isting data. Optical Character Recognition (OCR) en-
gines, trained for different languages, routinely recog-
nise text in provided image files with remarkable ac-
curacy (Ye and Doermann, 2015). However, these
OCR engines typically assume the images have low noise, correct orientation, and minimal skew. Re-
search continues to ensure the highest accuracy in ex-
tracting text from images, specifically in image pre-
processing techniques.
In this paper we examine a specific text min-
ing application for a large, international airline com-
pany that has amassed hundreds of thousands of
maintenance work order documents since the 1980s. Internally the airline currently maintains electronic records; however, many maintenance tasks are carried out by various suppliers and original component
manufacturers. There are efforts to standardise elec-
tronic exchange between airlines and suppliers, how-
ever, many still use physical documents to inform the
airline of the maintenance performed on the compo-
nent. These physical documents have been converted
to electronic images using less than optimal scanners.
Even as suppliers upgrade to electronic exchange, a
vast amount of the maintenance history of the compo-
nents in operational planes remains electronically inaccessible, requiring tedious manual labour to acquire.
The airline’s goal is to automatically extract as
much information as possible from the work orders,
where each work order consists of numerous docu-
ments. Correctly orienting and identifying the docu-
ment’s type are critical for achieving this goal. Figure
1 depicts the pipeline highlighting the document anal-
ysis process. Unfortunately there are numerous prob-
lems. Each scanned work order contains anywhere
from 3 to 25 images. The images have noise, signif-
icant scan lines, and are arbitrarily oriented. Once
Figure 1: Document Analysis Process.
Figure 2: Example Form Variants. (a) north, (b) east, (c) north, (d) south.
oriented and identified correctly, each document will
require further processing including text area identi-
fication, optical character recognition, and text anal-
ysis. Considering the decades covered, there are nu-
merous variants of each document. For example, we
have identified at least 10 variants of one document of
interest, the FAA Form 8130-3. Figure 2 shows four
of these variants and their orientation. Note that (a)
and (b) are wider than tall. Variants (c) and (d) also
show significant scan line noise. These variants illus-
trate the difficulty of assuming any particular layout
for automated processing. The problem is to automat-
ically extract information from the work orders given
these difficulties.
This paper examines our approach for obtaining
information from these work orders using supervised
machine learning techniques. We specifically focus
on determining the status of the FAA Form, though
the approach has been extended to other documents
and information of interest. There are at least thirty
identified document types. We use a sample of the
work orders to train the classifiers by partitioning the
images and calculating statistical measures for each
partition from low-level symbols. Depending on the
partition, there can be a large number of features gen-
erated. To reduce the complexity of the classifiers, we
consider a shrinkage feature selection technique that
considerably reduces the number of features while
maintaining comparable accuracy. The contributions
of this paper are enumerated as follows:
- Practical approach to orientation and document classification using only symbol features without resorting to an OCR engine.
- Comparison of machine learning algorithms for these symbol features.
- Application of machine learning and feature selection for document analysis.
We further stipulate that our approach makes no assumptions regarding font type or font size.
The rest of the paper is organised as follows. The
related work is discussed followed by the feature ex-
traction process. The orientation and document clas-
sifiers are presented along with their results and anal-
ysis. Next the text analysis and end-to-end results are
presented followed by the conclusion.
2 RELATED WORK
Lu and Tan (Lu and Tan, 2006) propose converting the
documents into vectors using language information,
density, and vertical components. New documents are then compared against these training vectors for orientation detection and document classification. Rangoni, et al. (Rangoni
et al., 2009), propose a method that does not depend
on the specific text symbols used to infer the orienta-
tion. Instead it uses the OCR engine and selected text
Figure 3: Feature Extraction Task.
to determine the orientation. Yang, et al. (Yang et al.,
2017), focus on text recognition from biomedical lit-
erature. They propose a system that performs multi-
orientation text detection, cropped word and end-to-
end recognition. Their approach does not explicitly
model the orientation of the document.
O’Gorman (O’Gorman, 1993) proposed a bottom-
up, nearest neighbour clustering approach for page
layout analysis. This has been a popular technique for
page layout analysis with many variants and heuris-
tics added. Ye, et al. (Ye and Doermann, 2015), sur-
vey existing techniques and problems for text recog-
nition in images highlighting the multiple orientation
issue.
Research in model selection continues and can
roughly be categorised as subset selection, shrinkage
methods, and dimension reduction techniques (James
et al., 2014). The Forward Stepwise subset selection
starts with no features in the model and adds them one
at a time. It can be evaluated using estimates of the
test error (e.g. AIC, BIC, adjusted R²) or directly using a validation set or cross-validation. LASSO (Tib-
ing a validation set or cross-validation. LASSO (Tib-
shirani, 1994) and Elastic Net (Zou and Hastie, 2005)
are shrinkage techniques. LASSO adds an L1 norm
penalty of the coefficient vector to the residual sum
of squares. This norm is multiplied by the weighting
parameter λ. This can also be evaluated using cross-
validation.
3 FEATURE EXTRACTION
This section details how the image is processed from a
raw pixel representation to an annotated instance with
several hundred features. We use the notation N to
represent the number of training instances and D to
represent the number of features (i.e. dimensions).
Each work order typically consists of between four and eight documents stored as images in a PDF.
Figure 4: Original Image.
For this work, we assume that the images have been extracted and make the following assumptions:
- Images are scanned from form style documents (also referred to as tabular documents) primarily consisting of text and lines.
- Images consist mostly of black and white pixels or can easily be converted (e.g. grey scale images).
- Most of the text symbols in the image are separable, i.e. non-cursive.
- Most of the text symbols are similarly aligned; we do not assume mixed aligned text.
The following section details how pixels are con-
verted into symbols. The subsequent sections detail
how the symbol attributes are used to generate the fea-
tures for each image.
3.1 Symbol Extraction
The FAA Form 8130-3 contains black text, lines, and
figures with a white pixel background. Figure 4 shows
an example of the form. Most of the pixels are ei-
ther black (0,0,0) or white (255,255,255). Given the
form image, the file is read in and inverted to a pure
black background with pure white text, lines, and fig-
ures. A connected components algorithm is then run
on the image to group all pixels that are connected
into a symbol. We found that using a weighted union-
find with path compression (Sedgewick and Wayne,
2011) performs well for determining the connected
components and is scalable with a time complexity
near linear in the number of pixels.
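For concreteness, a minimal sketch of this step in Python follows, using a weighted union-find with path compression over 4-connected foreground pixels; the 4-connectivity and the array layout are our assumptions, not details given in the paper.

import numpy as np

class UnionFind:
    """Weighted quick-union with path compression (cf. Sedgewick and Wayne, 2011)."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, p):
        root = p
        while root != self.parent[root]:
            root = self.parent[root]
        while p != root:                              # path compression
            self.parent[p], p = root, self.parent[p]
        return root

    def union(self, p, q):
        rp, rq = self.find(p), self.find(q)
        if rp == rq:
            return
        if self.size[rp] < self.size[rq]:             # weighted: smaller tree under larger
            rp, rq = rq, rp
        self.parent[rq] = rp
        self.size[rp] += self.size[rq]

def connected_components(binary):
    """Group 4-connected foreground (True) pixels of a 2D boolean array into symbols."""
    h, w = binary.shape
    uf = UnionFind(h * w)
    for y in range(h):
        for x in range(w):
            if binary[y, x]:
                if x + 1 < w and binary[y, x + 1]:    # union with right neighbour
                    uf.union(y * w + x, y * w + x + 1)
                if y + 1 < h and binary[y + 1, x]:    # union with lower neighbour
                    uf.union(y * w + x, (y + 1) * w + x)
    symbols = {}
    for y in range(h):
        for x in range(w):
            if binary[y, x]:
                symbols.setdefault(uf.find(y * w + x), []).append((x, y))
    return list(symbols.values())

# e.g. two separate strokes yield two symbols
image = np.array([[1, 1, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 0, 1]], dtype=bool)
print(len(connected_components(image)))  # -> 2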
Table 1: Performance Metrics per Document Image.
Metric                            Value
Mean Time to Partition            0.8107 seconds
Mean Number of Pixels             8498281
Mean Number of Total Symbols      13226
Mean Number of Symbols Filtered   10986
To eliminate noise, we filter the symbols by num-
ber of points. Symbols with fewer than 20 points are
believed to be noise or uninteresting symbols (another
common approach would be to use a Gaussian filter /
threshold).
The remaining symbols are then augmented with
information about their neighbours. The K nearest
neighbours are determined using the distance between
the centroids of the symbols. The approach is per-
formed using a kD-tree to help avoid brute-force cal-
culation against all neighbours.
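As an illustration, the neighbour search can be done with SciPy's kd-tree; this sketch is a stand-in for whatever kD-tree implementation the pipeline actually uses.

import numpy as np
from scipy.spatial import cKDTree

def nearest_neighbours(centroids, k=1):
    """For each symbol centroid, find its k nearest neighbouring centroids.

    centroids: (n, 2) array of symbol centroids; returns (distances, indices),
    each of shape (n, k), with the trivial self-match removed.
    """
    tree = cKDTree(centroids)
    dist, idx = tree.query(centroids, k=k + 1)  # k+1: the nearest hit is the point itself
    return dist[:, 1:], idx[:, 1:]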
To demonstrate the efficiency of the approach, we
ran the process on 452 images (from 113 original doc-
uments rotated in the four orientations) and collected
metrics. The experiments were conducted ten times
on a standard MacBook Pro laptop (2.6 GHz Intel
Core i7). Table 1 shows the results. It takes less than
one second to produce the features for an image. It is
interesting to note the number of filtered noise sym-
bols (10986). This is due to the shaded regions on
many of the images (see Figure 2 (b) and (d)). Each
symbol has the following attributes.
- Number of black pixels
- Number of white pixels
- Density (white pixels / black pixels)
- Bounding box pixel width and height
- Aspect ratio (width / height)
- Neighbour vector (distance and angle between centroids)
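For illustration, these attributes can be collected in a small record type; the field names below are ours, not the paper's.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class Symbol:
    """Per-symbol attributes; names are illustrative."""
    n_black: int                    # black (background) pixels in the bounding box
    n_white: int                    # white (foreground) pixels, after inversion
    width: int                      # bounding box width in pixels
    height: int                     # bounding box height in pixels
    centroid: Tuple[float, float]   # (x, y) centre of the symbol
    neighbour: Tuple[float, float]  # (distance, |cos angle|) to the nearest neighbour

    @property
    def density(self) -> float:
        """White pixels per black pixel, as defined in the list above."""
        return self.n_white / self.n_black if self.n_black else 0.0

    @property
    def aspect(self) -> float:
        """Width / height of the bounding box."""
        return self.width / self.height if self.height else 0.0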
3.2 Symbol Feature Generation
After each image is processed from raw pixels into
a set of symbols, the symbols are partitioned based
on their centroids. The partition bounds are rect-
angular based on the number of columns and rows
specified. For example, if 4 rows and 4 columns are
specified for an image that is 2000 x 3000 pixels, the
partitions are 500 x 750 pixels. For this work we as-
sume the number of rows and columns to be the same
which we denote as P. Thus the number of partitions is P². Note that some symbol area may extend into adjacent partitions - the symbol is placed into the partition that contains its centroid point. Figure 5 shows the result of the partition of the form shown in Figure 4.
Figure 5: Partition P = 4.
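A sketch of the assignment step follows, assuming symbol objects that expose a centroid as in the Symbol sketch above; clamping far-edge centroids into the last cell is our choice.

def partition_symbols(symbols, width, height, P):
    """Assign symbols to a P x P grid of rectangular partitions by centroid."""
    cell_w, cell_h = width / P, height / P
    grid = [[[] for _ in range(P)] for _ in range(P)]
    for sym in symbols:
        cx, cy = sym.centroid
        col = min(int(cx // cell_w), P - 1)   # clamp centroids on the far edge
        row = min(int(cy // cell_h), P - 1)
        grid[row][col].append(sym)
    return grid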
The number of points, neighbour angles, and den-
sity for the symbols are heavily skewed. For this rea-
son we choose to use the median as a central tendency
measure for each partition. For each partition the fol-
lowing features are recorded.
Number of symbols in the partition
Median neighbour angles
Median aspect ratio
Median symbol density
Partition Density
As P increases there will be more and more par-
titions that have no symbols. For these, we record 0
for the number of symbols and 0 for the partition den-
sity and leave the aspect ratio, neighbour angle, and
symbol density as missing values.
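A per-partition feature computation might then look as follows; the exact definition of partition density is not spelled out in the paper, so the one below (foreground pixels over partition area) is an assumption, as are the field names.

import numpy as np

def partition_features(cell, cell_area):
    """Median-based features for one partition of Symbol-like objects.

    Medians are used because point counts, angles, and densities are heavily
    skewed; empty partitions get 0 counts and missing medians.
    """
    if not cell:
        return {"n_symbols": 0, "partition_density": 0.0,
                "median_angle": np.nan, "median_aspect": np.nan,
                "median_density": np.nan}
    return {
        "n_symbols": len(cell),
        # assumed definition: foreground pixels per unit of partition area
        "partition_density": sum(s.n_white for s in cell) / cell_area,
        "median_angle": float(np.median([s.neighbour[1] for s in cell])),
        "median_aspect": float(np.median([s.aspect for s in cell])),
        "median_density": float(np.median([s.density for s in cell])),
    }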
3.3 Aspect Ratio
The aspect ratio is defined as the width / height for
each symbol. For each partition, we take the median
aspect ratio. For these particular documents we find
that a ratio below approximately 0.95 is indicative of
north or south orientation and above 0.95 is indicative
of east or west orientation.
ICPRAM 2020 - 9th International Conference on Pattern Recognition Applications and Methods
418
3.4 Symbol Neighbour Angle
Calculation
The centroid of each symbol is denoted as vector u
and a neighbour’s centroid is denoted as vector v.
Note that we first have to translate v to u by subtracting to produce a new vector a = v − u. We normalise a and compute the dot product with a unit vector b pointing east (1, 0). Because both vectors are unit vectors, we can simply determine cos(θ) = a · b, where θ is the
angle between the vectors. For the rest of the paper
we consider this cos(θ) synonymous with neighbour
angle.
We do not need the actual angle in degrees and
simply use cos(θ) as the feature. We take the abso-
lute value to get values in [0.0, 1.0], where values near 1.0 indicate the neighbour is to the west or east of the symbol and values near 0.0 indicate the neighbour is to the north or south.
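The computation is only a few lines; the sketch below mirrors the derivation above.

import numpy as np

def neighbour_angle(u, v):
    """|cos(theta)| between east and the direction from centroid u to centroid v.

    Values near 1.0 mean the neighbour lies east/west of the symbol;
    values near 0.0 mean it lies north/south.
    """
    a = np.asarray(v, float) - np.asarray(u, float)   # translate v to u
    a = a / np.linalg.norm(a)                         # normalise
    b = np.array([1.0, 0.0])                          # unit vector pointing east
    return abs(float(np.dot(a, b)))

print(neighbour_angle((0, 0), (5, 0)))  # 1.0: neighbour due east
print(neighbour_angle((0, 0), (0, 5)))  # 0.0: neighbour due north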
3.5 Number of Neighbours K
The number of neighbours for the expected neigh-
bour angle was also considered and is used in other
problem domains. O’Gorman suggests using K = 5
for extracting text blocks (O’Gorman, 1993). Figure
6 shows the result for K = 1. Each symbol is sur-
rounded by a green bounding box and red lines are
drawn to each neighbour. We empirically found that
it was sufficient to only consider K = 1 for orientation
detection (results omitted for brevity).
Figure 6: K = 1 Nearest Neighbours.
3.6 Number of Symbols
The number of symbols is also available for each par-
tition. Interestingly, Figures 10 and 11 show that the number of symbols is critical for determining the
document’s type.
3.7 Document Features
In addition to the features for each partition, we cal-
culate the following features for each document.
- Document width
- Document height
- Document number of symbols
- Document median density
- Aspect ratio (width / height)
- Document median neighbour angle
- Orientation
- Document order in the PDF
Figure 7 shows the segment density versus neigh-
bour angle. Similar to the aspect ratio, this feature is
very good at separating north and south from east and
west orientations. Figure 8 shows it being used as the
root, though a better split value would have been 0.5
rather than the selected 0.119145.
It should be noted that there will be some corre-
lation between these document features and partition
features (e.g. the number of document symbols will
be related to the number of symbols in the partitions).
The total number of features D for each document is
given by the following formula.
D = #features/document = (P² × 5) + 8    (1)
For example, a P = 4 partition results in D = 88
features for each image. We repeat this for each im-
age and save as a comma-separated values (CSV) file.
This file is then manually annotated to include the cor-
rect orientation and document type. We designate the
orientation as one of {east, north, west, south} (0, 90,
180, 270 degrees respectively). There are approxi-
mately thirty different document types. In this paper
we simplify and only consider the two classes {faa,
other}. We next analyse the orientation and document
type classifiers built using these features. We use de-
cision trees to assist in the analysis because they are
relatively easy to interpret.
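For illustration, assembling and saving one instance row could look as follows; the column naming mimics the c0r4 segmentDensity notation used later in the paper, and the schema is otherwise our assumption.

import csv

def document_row(grid, doc_features, features_fn):
    """Flatten a P x P grid of partition features plus document features into one row."""
    row = {}
    for r, cells in enumerate(grid):
        for c, cell in enumerate(cells):
            for name, value in features_fn(cell).items():
                row[f"c{c}r{r}_{name}"] = value   # e.g. c0r4_segmentDensity
    row.update(doc_features)                      # the 8 document-level features
    return row

def write_instances(rows, path):
    """Write annotated instances to a CSV file for training."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)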
4 ORIENTATION CLASSIFICATION
We evaluate how well the features can be used to in-
fer the orientation and the effect that P has on the var-
ious classifiers. A failed orientation detection very
likely results in a failed optical character recognition,
thus we consider false positives and false negatives
Table 2: Classifier Settings.
Classifier                Settings
Naive Bayes               NA
Decision Tree             -C 0.25 -M 2
Support Vector Machine    -C 1.0 -K PolyKernel
Random Forest             -P 100
Logistic Regression       -R 1e-8 -M -1
to be equal in cost. Considering this and since each
class is balanced we choose to use the accuracy mea-
sure to compare the various experiments. We use the
Weka data mining software (Hall et al., 2009) (version
3.9.3, 2019) to analyse the data. For the remainder of
this paper we use the classifiers listed in Table 2 using
the typical settings for this library. In several figures
we use the notation c0r4 segmentDensity to indicate
a specific feature for a partition (e.g. segment density
for partition column 0 and row 4). If the partition is
omitted it should be assumed to be a document fea-
ture. A brief analysis of the features with respect to
orientation is first presented followed by a compar-
ison of the accuracy of various classifiers using the
generated features.
4.1 Orientation Classifier Analysis
We consider the partition P = 5 with 133 features. Ex-
amining the scatter plots of the features against each
other it was clear that the document median angle
and certain partition segment densities provide use-
ful information regarding orientation. Figure 7 shows
a plot of the median angle (x-axis) versus partition
c4r4’s segment density. For illustration some jitter has
been added to distinguish the points.
The figure’s results are somewhat expected. For
English text the nearest symbol would be to the left or
right when oriented north or south. Conversely when
oriented east or west the nearest neighbour would be
above or below the symbol. Interestingly, the figure
also shows that the selected partition’s segment den-
sity can be used to distinguish between the north and
south orientation.
Figure 8 shows the corresponding decision tree.
We restricted the size of the tree for illustration pur-
poses (required leaf nodes to have a minimum of 100
instances).
The decision tree classifier takes advantage of the
document median angles and splits at 0.83 (though
0.5 would appear to suffice as well). As Figure
7 shows most north-south orientation instances are
nearer to 1.0 and east-west are nearer to 0.0. The
right half of the tree then uses the partition’s segment
density to differentiate north from south. The classi-
fier uses another partition segment density to differentiate east from west.
Figure 7: Orientation Scatter Plot P = 5 (document median angle vs. c4r4 segment density). Legend: north, south, east, west.
Figure 8: Orientation Decision Tree P = 5.
The document median angle
and specific partition segment densities are important
for orientation classification. We found that the document median aspect was also able to distinguish north-south from east-west, similar to the document median angle. This is reasonable as we would expect English text height to be more than its width on average.
Table 3 shows the confusion matrix for the deci-
sion tree classifier (with the node size restriction re-
moved) trained using 10-fold cross validation. We
would expect there to be errors between east and west
and north and south. We would not expect a north,
south orientation to be confused as east,west and visa
versa. The results confirm this as most errors occur
between north vs south and east vs west.
Table 3: Decision Tree Confusion Matrix P = 5.
Accuracy = 0.972
                Predicted
Actual     east  north  south  west
east        563      0      1    16
north         1    565     14     0
south         1     16    563     0
west         15      0      1   564
Figure 9: Orientation Accuracy 10-Fold Cross Validation (accuracy vs. P for Decision Tree, Random Forest, and Support Vector Machine).
4.2 Orientation Classifier Accuracy
We use the annotated files for each value of P from 2
to 13 with N = 2320 instances to train several orienta-
tion classifiers. Due to limited training instances, we
train each classifier on the file using 10-fold cross validation. We run this 30 times for each classifier and
take the average of the accuracy to produce Figure 9
with 99% confidence intervals. When P = 2 most classifiers perform poorly. We
believe this is because too much aggregation has oc-
curred and obscured differentiating information. As P
increases, we believe the decision tree performs progressively worse because larger P creates more and more irrelevant and correlated features. This has a negative
effect on some classifiers including decision trees and
naive Bayes (specifically correlated features) whereas
other classifiers, such as support vector machines, are
less impacted by these issues (Tan et al., 2019).
The ensemble random forest classifier appears to
do quite well for all values of P including P = 2. The
random forest and support vector machine do well for 3 ≤ P ≤ 6, achieving near 98% accuracy. The random forest's accuracy across all values of P makes it a strong candidate.
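The experiments themselves were run in Weka; the following scikit-learn sketch shows an analogous repeated cross-validation loop, with only approximate counterparts to the Table 2 settings and a normal-approximation confidence interval of our choosing.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# rough scikit-learn counterparts to the Weka settings in Table 2
classifiers = {
    "Decision Tree": DecisionTreeClassifier(min_samples_leaf=2),
    "Support Vector Machine": SVC(C=1.0, kernel="poly"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

def repeated_cv_accuracy(clf, X, y, runs=30, folds=10):
    """Mean accuracy and 99% CI half-width over repeated 10-fold cross validation."""
    scores = []
    for run in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        scores.extend(cross_val_score(clf, X, y, cv=cv, scoring="accuracy"))
    scores = np.asarray(scores)
    half_width = 2.576 * scores.std(ddof=1) / np.sqrt(len(scores))  # z for 99%
    return scores.mean(), half_width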
5 DOCUMENT CLASSIFICATION
Similar to the orientation classifier, we evaluate how
well the features can be used to infer the document
type and the effect that P has on the various classifiers.
We assume the orientation has been determined and therefore only consider each image in its north orientation with N = 580 annotated instances. We first consider a brief analysis of the features and then consider a comparison of the accuracy of various classifiers.
Figure 10: Document Type Scatter Plot P = 5 (document number of symbols vs. order). Legend: {other, faa}.
Figure 11: Document Type Decision Tree P = 5.
5.1 Document Type Classifier Analysis
Similar to the orientation classifier, we examine the
partition P = 5 with 133 features. Figure 10 shows
a plot of the document’s number of symbols (x-axis)
versus the document’s order (with some jitter added).
A relatively simple selection statement would do well, selecting documents with more than 1800 symbols and an order of 2 or 3. Figure 11 shows the corre-
sponding decision tree.
Interestingly, as Figure 11 shows, the decision tree
classifier primarily relies on the number of symbols
features. Other partition sizes do use the order. How-
ever, we also found the document median density fea-
ture to be just as effective as the order. We determined
this by removing the order feature and were able to
obtain similar accuracy results. Note that we did not
restrict the size of the tree. These results provide evi-
dence that many features are either redundant (i.e. cor-
related with other features) or superfluous for docu-
ment type classification. Section 7 will provide more
evidence for this assertion. Table 4 shows the confusion matrix for the decision tree. To remain consistent, the classifier was trained using 10-fold cross validation.
Figure 12: Document Accuracy 10-Fold Cross Validation (accuracy vs. P for Decision Tree, Random Forest, and Support Vector Machine).
We do note that for highly imbalanced classes
the number of folds may need to be adjusted (Raeder
et al., 2012). There are roughly the same number of
false positives as false negatives.
Table 4: Decision Tree Confusion Matrix P = 5.
Accuracy = 0.9776
           Predicted
Actual    other   faa
other       457     7
faa           6   110
5.2 Document Type Classifier Accuracy
We again choose to use the accuracy measure to com-
pare the various experiments and use similar settings
shown in Table 2. We train each document type clas-
sifier using 10-fold cross validation and run this 30
times for each classifier and compute 99% confidence
intervals.
Figure 12 shows the results of the experiments.
For P = 2 the decision tree and random forest per-
form notably better than the support vector machine.
From P = 2 to P = 3 the situation reverses and the de-
cision tree performs progressively worse, as was the
case with the orientation classifier. After P = 3 the
random forest and support vector machine produce
similar results around 99% accuracy.
6 OCR AND TEXT ANALYSIS
Once images have been oriented and the document
identified, selected images are segmented and sent to
the optical character recognition (OCR) engine. After
a qualitative comparison of several tools, it was de-
cided to use Tesseract (Kay, 2007) as the OCR engine
in the pipeline. The north oriented FAA form image
is segmented to form a set of symbols. The symbols
are combined to form a suspected line of text (termed
document streams (O’Gorman, 1993)). Neighbouring
symbols are combined if within some angle to the left
and right and within some distance, both empirically
determined. Once the document streams have been
OCR'ed, the produced text stream, (x, y) coordinates, and bounding box are written to a JavaScript Object Notation (JSON) file.
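A minimal sketch of this step, using pytesseract as a Python front end to Tesseract (Kay, 2007), follows; the JSON field names are illustrative rather than the pipeline's actual schema.

import json
import pytesseract

def ocr_streams_to_json(streams, out_path):
    """OCR each cropped document stream and dump text plus geometry to JSON.

    streams: iterable of ((x, y, w, h), image) pairs, where each image is a
    PIL image of one suspected line of text.
    """
    records = []
    for (x, y, w, h), img in streams:
        text = pytesseract.image_to_string(img).strip()
        records.append({"text": text, "x": x, "y": y,
                        "bounding_box": [x, y, w, h]})
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)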
The final text analysis step in Figure 1 performs
the following relatively simple procedure. The text
streams are filtered to only examine those in the area
of interest (in this case, the upper right quadrant).
Then each text stream is checked against a small set of predetermined words. The Levenshtein distance to each word is computed and the word with the shortest distance is taken as the intended one. Though simple, this final step can also handle false positive document classifications (images that are not FAA forms but were identified as such and OCR'ed). Of course, more extensive text analysis can be performed to reinforce document classification (e.g. using a bag of words approach). The text analysis classifies each form as one of the following statuses: {repaired, tested, inspected, overhauled, other}.
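A sketch of this matching step follows; the edit-distance threshold for falling back to other is an assumed parameter, not a value from the paper.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

STATUSES = ("repaired", "tested", "inspected", "overhauled")

def classify_status(stream_text, max_distance=3):
    """Map an OCR'ed text stream to the nearest status word, else 'other'."""
    word = stream_text.strip().lower()
    best = min(STATUSES, key=lambda s: levenshtein(word, s))
    return best if levenshtein(word, best) <= max_distance else "other"

print(classify_status("repa1red"))  # -> repaired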
7 FEATURE SELECTION
We used the shrinkage method LASSO to select a
subset of the features for the document type classifier.
We compare the accuracy of the LASSO classifiers
to the full model classifiers. We do not necessarily
expect the LASSO classifiers to have a higher accu-
racy but would prefer that they are not significantly
lower. First the training set features for each parti-
tion size are standardised (x̄ = 0.0, σ = 1.0). Miss-
ing values are then set to the mean of 0.0. We then
used the glmnet (Friedman et al., 2010) R package
for the analysis and figure generation. To evaluate
the models we selected cross-validation. With x rep-
resenting all the standardised features and y the re-
sponse variable type, we used the following command: cv.glmnet(x, y, family = "binomial").
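For readers working in Python, a rough scikit-learn analogue of this call is sketched below; note that, unlike the glmnet workflow described next, LogisticRegressionCV has no built-in one-standard-error rule and simply picks the best-scoring regularisation strength.

import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def lasso_select(X, y):
    """Return indices of features with non-zero L1-logistic coefficients.

    X: standardised N x D feature matrix; y: binary document type labels.
    """
    model = LogisticRegressionCV(Cs=50, cv=10, penalty="l1",
                                 solver="liblinear",
                                 scoring="neg_log_loss").fit(X, y)
    return np.flatnonzero(model.coef_[0])   # surviving feature indices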
Figure 13 shows the log of λ versus the deviance
for a 10-fold cross validation to determine a suitable
value for λ for the P = 5 partition. The deviance is an
analogous measure to residual sum of squares but for
more general models and is negative two times the
maximised log-likelihood (lower is better) (James
et al., 2014). For each value of λ the upper and lower standard deviation is also shown.
Figure 13: Binomial Deviance vs. λ.
The first dashed vertical line represents the λ* that resulted in the lowest deviance. Because we prefer a model with fewer features, we use the common heuristic of selecting the matching λ to the right of λ*'s upper standard deviation. This is within one standard deviation of λ* and has fewer features (from roughly 27 to 25). This is shown as the second dashed vertical line. Overall, this went from 133 features (per Equation 1) to just 25, a reduction of ~80%.
This process is repeated for each partition size
from P = 2 to P = 14, where a subset of features is se-
lected from the total set of features. Figure 14 shows
the total number of features versus the number of fea-
tures chosen by LASSO. The number of features for
LASSO rises slightly from P = 2 to P = 5. However,
after approximately P = 5 the number of features se-
lected by LASSO appears to stop increasing (the val-
ues never exceed 34 features). At P = 4, the number
of features for the LASSO model is 15 versus 88 for
the full model, a ratio of 17%. At P = 14, the num-
ber of features for the LASSO model is 28 versus 988
for the full model. The ratio between the number of
features for the LASSO model and the full model is
never more than 20% for P ≥ 5. We believe this
is expected. At some point increasing the number of
features would not help the model as the partitions
become smaller and smaller and no significant aggre-
gation is obtained.
The document number of symbols and document
order from Section 3.7 appear in all LASSO models
except for one, indicating their importance. The FAA
document typically appears second or third in the or-
der suggesting some ordering procedure in the scan-
ning process. The document median density also ap-
pears in more than half the models. The median doc-
ument density is likely correlated with the document
number of symbols. A drawback of LASSO is that it
arbitrarily selects one of a set of correlated features.
Figure 14: Features: Full Model vs. LASSO Model (number of features vs. P).
Figure 15: Document Accuracy Full vs. LASSO Model (accuracy vs. P).
For the same models shown in Figure 14, using the support vector machine classifier, we compute their accuracy results 30 times and plot them in Figure 15
with 99% confidence intervals. The accuracy results
of the LASSO model are comparable to, or indistinguishable from, those of the full model.
8 STATUS CLASSIFICATION RESULTS
To demonstrate the overall efficacy of our approach,
we consider the classification of the FAA status as
shown in Figure 1 for each work order. A cor-
rect classification requires the orientation, document
type, document stream identification, optical charac-
ter recognition, and text analysis to also be successful.
To be clear, this is not an ideal validation. Though we
have been using cross validation approaches through-
out, a significant test set would be more convincing
(future work is to obtain this).
The pipeline was run on the 116 work orders with
1106 images where 117 are FAA documents (one
work order had a duplicate FAA form). All 117
FAA documents were correctly oriented. The doc-
ument type classifier predicted 121 documents to be
FAA where four were false positives (non-FAA im-
ages classified as FAA). The text classifier predicted
three of the four as other and one as tested (i.e. it mit-
igated three of four false positives). The text classifier
correctly predicted the status of 116 FAA documents
with one incorrectly predicted as tested versus actual
repaired.
9 CONCLUSIONS
In this paper we proposed and demonstrated an ap-
proach for document analysis using a combination
of supervised machine learning models for orienta-
tion classification and document classification. The
form style documents were first processed into symbols, which were partitioned to generate features.
The features were then used to train machine learn-
ing algorithms. When the image is oriented and the
document identified, document streams are sent to an
OCR engine to produce a text file from which a simple
match is made to determine the desired form’s status.
We then employed a feature selection approach for the
document type classifier to produce a parsimonious
model and showed that it was as accurate as the full
model. Finally, the end-to-end results were presented
to demonstrate the effectiveness of our approach.
ACKNOWLEDGEMENTS
This work was supported in part by the University
of Montevallo Contract #19-0501-001. The authors
greatly appreciate the support of the airline company
employees involved in the project. Without their ef-
forts this research could not have been conducted.
REFERENCES
Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regular-
ization paths for generalized linear models via coordi-
nate descent. Journal of Statistical Software, 33(1):1–
22.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann,
P., and Witten, I. H. (2009). The WEKA Data Min-
ing Software: An Update. SIGKDD Explor. Newsl.,
11(1):10–18.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2014).
An Introduction to Statistical Learning: With Appli-
cations in R. Springer Publishing Company, Incorpo-
rated.
Kay, A. (2007). Tesseract: An open-source optical character
recognition engine. Linux Journal, 2007(159):2.
Lu, S. and Tan, C. L. (2006). Automatic document orien-
tation detection and categorization through document
vectorization. In Proceedings of the 14th ACM Inter-
national Conference on Multimedia, MM ’06, pages
113–116, New York, NY, USA. ACM.
O’Gorman, L. (1993). The document spectrum for page lay-
out analysis. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 15(11):1162–1173.
Raeder, T., Forman, G., and Chawla, N. V. (2012). Learn-
ing from Imbalanced Data: Evaluation Matters, pages
315–331. Springer Berlin Heidelberg, Berlin, Heidel-
berg.
Rangoni, Y., Shafait, F., Van Beusekom, J., and Breuel,
T. M. (2009). Recognition driven page orientation
detection. In Proceedings of the 16th IEEE Inter-
national Conference on Image Processing, ICIP’09,
pages 1969–1972, Piscataway, NJ, USA. IEEE Press.
Sedgewick, R. and Wayne, K. (2011). Algorithms, 4th Edi-
tion. Addison-Wesley.
Tan, P., Steinbach, M., Karpatne, A., and Kumar, V. (2019).
Introduction to Data Mining, Second Edition. Pear-
son.
Tibshirani, R. (1994). Regression shrinkage and selection
via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.
Yang, C., Yin, X., Yu, H., Karatzas, D., and Cao, Y. (2017).
ICDAR2017 robust reading challenge on text extrac-
tion from biomedical literature figures (DeTEXT). In
2017 14th IAPR International Conference on Docu-
ment Analysis and Recognition (ICDAR), volume 01,
pages 1444–1447.
Ye, Q. and Doermann, D. (2015). Text detection and
recognition in imagery: A survey. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence,
37(7):1480–1500.
Zou, H. and Hastie, T. (2005). Regularization and variable
selection via the elastic net. Journal of the Royal Sta-
tistical Society, Series B, 67:301–320.