Page Boundary Extraction of Bound Historical Herbaria

Krishna Kumar Thirukokaranam Chandrasekar

and Steven Verstockt

IDLab, IMEC-Ghent University, Ghent, Belgium

Keywords:

Page Boundary Detection, Hinge Detection, Border Noise, Digitization, Historical Image Processing,

Herbaria Books.

Abstract:

When digitizing bound historical collections such as herbaria it is important to extract the main page region

so that it could be used for automated processing. The thickness of the herbaria books also gives rise to

deformations during imaging which reduces the efﬁciency of automatic detection tasks. In this work we

address these problems by proposing an automatic page detection algorithm that estimates all the boundaries

of the page and performs morphological corrections in order to reduce deformations. The algorithm extracts

features from Hue, Saturation and Value transformations of an RGB image to detect the main page polygon.

The algorithm was evaluated on multiple textual and herbaria type historical collections and obtains over

94% mean intersection over union on all these datasets. Additionally, the algorithm was also subjected to an

ablation test to demonstrate the importance of morphological corrections.

1 INTRODUCTION

Since the early 1990s, libraries and museums have

conducted multiple digitization initiatives with cul-

tural heritage documents and scientiﬁc resources on

regular basis to ensure restoration and lasting preser-

vation of historical collections. This is to protect

them from further degradation caused by repetitive

handling. Exponential growth in high quality image

capturing devices induced by the enormous amount

of rich historical collections (that are yet to be uncov-

ered) has further led to a raising interest in histori-

cal document image analysis in recent times. Indeed,

an important need has also emerged to develop auto-

mated tools to process and enrich these collections to

facilitate better access to the preserved archives.

In addition to textual documents and records such

as books, student registers or death records that are

normally digitized on a large scale, there are various

types of bound historical herbaria that preserve the

rich horticulture of a region, that should also be dig-

itized. In these herbaria, plants are collected, dried

and stored in often difﬁcult circumstances so that it

could be used as a reference material decades later. It

is important to preserve these records with technical

proﬁciency but at the same time make it available to

readers easily. Though modern scanners offers solu-

https://orcid.org/0000-0003-4385-915X

https://orcid.org/0000-0003-1094-2184

tions for preserving these information, scanned mate-

rials are not always correctly oriented along the co-

ordinate axes. In certain instances, the visual infor-

mation is corrupted by external border noise. Some-

times it also introduces signiﬁcant variations in the

page by inducing noise such as unimportant objects,

slightly rotated, imaged under different viewing an-

gles and perspective or blended with a scene back-

ground. Such noises interrupt the document analy-

sis algorithms, that in turn affect the efﬁciency of the

overall scanning procedure. While removing back-

ground is feasible using simple segmentation tech-

niques, other types of border noises are more chal-

lenging.

The digitization of books and bound herbaria also

suffer from deformation and warping due to the thick-

ness of the collections. This poses a huge problem for

automated detection tasks. For textual books, such

warping effect decreases the text recognition process

to a great extent. A similar problem persists for

herbaria, wherein warping and deformation affects

the original shape and texture of the leaf specimen.

This directly affects historical plant phenotypeing and

experiments involving learning the evolution of shape

of leaves for which these digitized herbaria could be

greatly useful. Therefore it makes it necessary to ac-

curately detect the boundaries of the page, including

the edge that is usually shared by the neighbouring

page. This center portion of the page that is usually

caused due to binding is called hinge in book anatomy

476

Chandrasekar, K. and Verstockt, S.

Page Boundary Extraction of Bound Historical Herbaria.

DOI: 10.5220/0009154104760483

In Proceedings of the 12th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2020) - Volume 1, pages 476-483

ISBN: 978-989-758-395-7; ISSN: 2184-433X

Figure 1: A sample scan image of a historical herbaria dur-

ing the digitization process.

based literature. The following terminology would be

used in the rest of the paper to refer that edge. The

detection of hinge gets increasingly tougher for hand

binded books since the edge would not be straight any

more.

A number of approaches have been developed to

remove border noise (e.g., [(Bukhari et al., 2012),

(Chakraborty and Blumenstein, 2016b), (Fan et al.,

2002), (Shafait and Breuel, 2010a), (Shafait and

Breuel, 2010b)]). However, as noted in (Chakraborty

and Blumenstein, 2016a), most of the prior method-

ologies make ﬁxed assumptions that holds good only

for textual pages. Some of these assumptions are con-

sistent text size, absolute location of border noise,

straight text lines, and distances between page text

and border. These assumption don’t hold good for

herbaria type images (e.g. Figure 1) where such a

consistent pattern is not followed. Therefore it is

necessary to have an algorithm that makes use of

only page and book based features such as colour,

intensity, illumination and texture to detect the page

boundaries.

In order to address the above concerns we pro-

pose a multi step page detection algorithm that de-

tects variations in brightness and the distribution of

the colour in an image to estimate the page bound-

ary. The algorithm is based on the HSV colour model

since it provides additional intensity and colour depth

features that can be utilized to better localize the page

boundaries. A novel hinge detection algorithm is also

proposed that can be used to speciﬁcally localize and

extract the center portion(hinges) of the book.

The remaining paper is organized as follows. Sec-

tion 2 explains the algorithms in detail and reasons

the approaches with examples. The data sets used are

elaborated in Section 3. Section 4 discusses the ini-

tial results, while Section 5 concludes the paper by

proposing possible future work directions.

2 METHODOLOGY

Page detection is considered as the process of ﬁnd-

ing pixels and regions in an image that constitutes a

page. Within the domain of historical digitization,

page detection is predominantly applied for pre pro-

cessing of documents before hand written text detec-

tion and recognition tasks, line and character detec-

tion and segmentation of historical pictures.

As shown in Figure 2, the proposed pipeline for

page detection begins with the pre processing of im-

ages. The images are rotated and aligned such that

the longest edge is maintained as its height. The core

methodology of the proposed algorithm can be sub

divided into two main steps namely book extraction

and hinge region detection. The book extraction step

ﬁlters background noise and extracts the main book

region while the hinge region detection step detects

the hinges and extracts the main page region. Finally,

morphological transformations are performed on the

extracted page in order to reduce deformations.

Since our algorithm is directed primarily towards

historical documents, we would be dealing with

books and herbaria from the 19th and the early 20th

century. Due to the factor of aging, a large major-

ity of these books possess a distinctive texture and

colour which could be used for detection. Addition-

ally, it is also necessary to represent the image using

a colour space (Sanchez-Cuevas et al., 2013; Albiol

et al., 2001) in which some of the colour channels

are invariant or at least insensitive to lighting changes,

such as the H and S channel in HSV (James, 2013).

In HSV colour space, the Hue (H) and Saturation (S)

values represent the colour information of the image.

The Value (V) on the other hand is the measure of

light intensity and denotes the extent of the colour’s

brightness of the image. A range of Hue, Saturation

and Value values of HSV can be used as important

features for page boundary detection as further ex-

plained below.

Page Boundary Extraction of Bound Historical Herbaria

477

Figure 2: The proposed page detection pipeline.

2.1 Book Extraction

The Saturation and Value components of HSV colour

space have been adopted as the major features for

background ﬁltering and initial boundary prediction.

Figure 3 provides a visual comparison of the original

image with their Saturation and Value transformed

images. The transformed images can directly be used

for ﬁltering and segmentation of the page to remove

external and surrounding background noises.

Figure 3: The ﬁrst image from the left is the original image

and the following two images are its corresponding Satura-

tion and Value transformed images. It can be seen that the

hinge is more prominently perceivable from the saturation

image (middle image) which we would use to localize the

hinge edge automatically.

Based on the average histogram of the Hue values for

a random selection of 100 historic images from 6 dif-

ferent collections, it was found that more than 85% of

the pixel’s expected dominant colours lie between 5

◦

and 65

◦

of the Hue colour wheel. This seems to be

logically acceptable since the dominant colours be-

tween the speciﬁed range are red and yellow and the

majority of historic images has a high probability of

having a slight yellow or brown tint due to the ag-

ing factor of the document. The average histogram of

Values shows a much more wider distribution. Yet, on

further introspection of the average histograms, it has

been witnessed that all the historic books taken into

consideration, falls between the range of 30 - 93%.

Based on the observed values for Hue and Value, book

mask B

mask

can be estimated for the image based on

Equation 1.

mask











◦

< H < 65

◦

30% < V < 93%

0, otherwise











(1)

An example of the ﬁltered mask obtained using HSV

ﬁltering is shown in Figure 4. Based on calculating

the largest contour from the mask, the background

and border noise can be eliminated and the book can

be extracted by calculating the convex hull points of

the contour (Goodrich et al., 2009).

ARTIDIGH 2020 - Special Session on Artiﬁcial Intelligence and Digital Heritage: Challenges and Opportunities

478

Figure 4: An example of book extraction with the corre-

sponding segmentation mask generated based on HSV ﬁl-

tering.

2.2 Hinge Region Detection

The HSV based ﬁltering and segmentation results in

the elimination of background noise, yet it is not suf-

ﬁcient enough to predict the hinge of the page. This is

because the hinge is usually shared by the neighbour-

ing page and therefore just colour based techniques

are inadequate.

Based on human perception, even though the

fourth boundary of a page in a book is shared by the

neighbouring pages, it is still possible to locate this

boundary because the area around the boundary tends

to be darker. Logically this is because, the amount of

light that could reach the hinge is lower due to bind-

ing. The Saturation value (S) signiﬁes the amount of

white light that needs to be mixed with the Hue and

therefore could be used as a feature to detect hinges.

In the center image of Figure 3 it is seen that the re-

gion around the hinge is brighter than the other parts

of the page. Therefore, the Saturation values could be

used for detecting the hinge.

Figure 5: The plot shows the mean, mode and the mode

count values for the columns of the image enclosed within

the red bounding box. The orange dashed box shows the

region where the actual boundary lies.

The plot in Figure 5 depicts the correlation between

the mean, mode and modal count values for a selected

region around the boundary of the image. Since our

focus here is predicting a longitudinal boundary, the

mean and modal values are calculated for every col-

umn of the image Saturation values. There is an evi-

dent Gaussian behaviour for the mean and modal val-

ues around the boundary region, which can be further

justiﬁed by the plot in Figure 6.

Figure 6: The plot shows the mean value of the columns for

the image in Figure 4. The orange strips indicate the region

of the actual boundary.

The above mentioned phenomenon is made use of to

localize and detect the hinges with near pixel level

accuracy. To begin with, the peaks and troughs are

estimated using a function that can ﬁnd all local max-

ima by simple comparison of neighbouring values

(Virtanen et al., 2019). The scipy implementation

f ind peaks has been utilized to detect the peaks and

troughs. The parameter prominence had a greater ef-

fect on detecting the peaks and in our experiments

prominence=10 was chosen as it performed the best

for majority of images. Finally hinge regions are esti-

mated based on the following conditions:

• Based on the peaks and trough values of the mean,

the peaks with the steepest slopes are selected.

• For the selected slopes, the mode and mode count

values are cross veriﬁed. Based on the plot in Fig-

ure 5 it can be inferred that the modal count and

mean values are inversely correlated. Thus only

those peaks that satisfy this condition are selected.

• Since the book is initially aligned in such a way

that the hinge is always to the right side of the im-

age, peaks present in the ﬁrst half of the image are

eliminated. It is also possible to automatically se-

lect the peaks based on the location of the contour

but since the image is rotated and aligned during

the pre processing, we follow this methodology.

• From the remaining list of peaks, the peak with

the highest prominence is chosen. The hinge re-

gion is derived as the region between the two local

minima before and after the peak as indicated by

Page Boundary Extraction of Bound Historical Herbaria

479

the orange strip in Figure 6.

Note: Since the dimensions of the page do not change

over a book, it is also possible to obtain them before

the start of the digitization process. In case if the di-

mensions of the page are already known, the peaks

can be selected based on the width/height ratio of the

book and detect the page in the image.

2.3 Page Detection

Based on the obtained book mask and hinge region,

the overall page is detected as follows:

• For the selected hinge region, the Saturation val-

ues (S) are applied a threshold and transformed

into a mask based on the following criteria:

Let h

be the selected hinge region. Then

mean

= mean(S[h

])

max

= max(S[h

])

mask











0, hs

mean

< hs

max

1, otherwise











(2)

• The calculated h

mask

is combined with the original

book mask as follows:

mask

] = h

mask

• The page region P is estimated by applying hull

points to the largest contour (Goodrich et al.,

2009). The largest contour is calculated as fol-

lows.

P = max

area

mask

]

The overall algorithm for page detection is elaborated

in Algorithm 1.

2.4 Morphological Correction

The estimated page region is expected to be a rect-

angle or square for a normal book based on its di-

mensions. Yet, the obtained page suffers a mismatch

due to a number of external factors. In order to com-

pensate for this mismatch, the detected page has to

be interpolated and transformed such that their shapes

match. The morphological correction is performed as

follows:

• For the estimated page region P, enclosing bound-

ing box is calculated. The enclosing bound-

ing box would be the reference bounding box

R. In case if the book dimensions are known,

the dimensions are used to estimate the reference

bounding box. The morphological correction is

performed if the area of P is less than the area of

• In order to learn the shape representations, i equiv

distant points are chosen along every side s of P

and reference bounding box R. The value is de-

pendent on the actual size and resolution of the

image. In our algorithm, 12 points were chosen

on each side. 12 was chosen since for i =12, the

selected points were neither too close nor too far

away resulting in better interpolation results.

• P is interpolated between P

and R

and the re-

sulting co-efﬁcients are re mapped on to R

. In

our algorithm, cubic interpolation was used since

it performed better than linear interpolation.

The implementation of morphological correction is

demonstrated in Figure 9.

Algorithm 1: Page Detection.

Require: An image of a book page

Convert RGB to HSV

Generate book masks using HSV ﬁltering.

if book is not portrait then

Rotate the mask by 90

◦

and straighten the image

else

Straighten the image

end if

f ull page ← rectangle enclosing the largest contour

Mean ← mean of full page[axis=0]

Mode,Count ← Mode of full page[axis=0]

Peaks,troughs ← Find peaks and troughs for mean

and count

Detect Hinge region

if Hinge region then

Threshold the region

Attach the region to the Book mask

end if

Extract Page P based on the largest contour of the

mask

Perform morphological correction

3 DATASET

The main datasets that were used for our experiments

were herbaria books of three different and prominent

Belgian botanist from the late 18th and early 19th cen-

tury, Charles Van Hoorebeke, Aim

e Mac Leod and

Julius Mac Leod . There are 78 books of Charles

Van Hoorebeke with 20-40 single sided specimen per

ARTIDIGH 2020 - Special Session on Artiﬁcial Intelligence and Digital Heritage: Challenges and Opportunities

480

Table 1: Evaluation of the proposed algorithm on multiple datasets.

Dataset Precision Recall IoU

Mac Leod 0.963 0.951 0.966

Charles Van Hoorebeke 0.935 0.948 0.94

cBAD 0.937 0.973 0.942

(a) (b) (c) (d) (e)

(f) (g) (h) (i) (j)

Figure 7: Sample results of the page detection algorithm. (a) is a page from Charles Van Hoorebeke, (b) and (d) are from Mac

leod herbaria. The rest are from cBAD baseline - dataset with different complexity obtained from 7 different archives.

book. All these books possess a similar structure (e.g.

Figure 7(a)). The books of Julius and Aime Mac Leod

are a bit more complex with approximately 200 pages

each and with multiple specimens per page (e.g. Fig-

ure 7(b), 7(d)). The page detection algorithms were

initially developed and parameters were chosen based

on randomly selected pages from these books. The

ﬁnal algorithm was evaluated on the rest of the pages.

In order to evaluate the generalization of the al-

gorithm, the same algorithm was also evaluated on

the ICDAR 2019 Competition Baseline Detection

(cBAD) dataset (Markus et al., 2019). The dataset

consist of documents with varying levels of layout

complexity extracted from 7 archives (bottom row in

Figure 7). There were documents from two tracks and

we used a random subset of documents from both of

the tracks.

3.1 Evaluation

Precision, recall and IoU (Intersection over Union)

metrics are normally used to evaluate the efﬁciency

of the boundary detection algorithms. These metrics

should be used in order to compare performance of

the algorithm with similar models. As proposed in

(Tensmeyer et al., 2017) we also evaluate the page de-

tection algorithm using manually labelled polygons.

Normally a page edge could be a few pixels thick

and it is important to evaluate how much of the edge

actually falls within this region. A pixel level compar-

ison is therefore performed for the detected boundary

to evaluate the closeness of the estimation with the

Page Boundary Extraction of Bound Historical Herbaria

481

(a) (b)

Figure 8: Comparison of text detection and recognition results using Google Cloud Vision API. (a) depicts the recognition

results for original non-corrected page while (b) depicts the results for a morphologically corrected page. It can be seen that

the detection and grouping of text (light blue bounding box), number recognition (the right page number is 146) and text

recognition (la melampirum, jasminum grandiﬂorum, veronice seurellera) results are much improved in (b).

ground truth. For this type of evaluation, the page

boundary regions were manually annotated for 50 im-

ages of variable complexity as shown in Figure 7 bot-

tom row. Finally, the test examples were also manu-

ally evaluated using two human participants.

4 RESULTS AND DISCUSSION

Table 1 shows the precision, recall and IoU scores for

the different datasets. It could be seen that the overall

IoU score is minimum 94% over all the datasets com-

bined. Even though the following results were com-

parable with the models of (Tensmeyer et al., 2017),

as shown in Figure 7, the page boundary mask could

be used for further processing of the page.

Since a precise shape of the page boundary poly-

gon could be obtained, this feature can be used for

reducing the deformation of the page. Figure 9 show-

cases one such use case for ﬂattening of the page to re-

duce deformation using simple morphological trans-

formation. For this task, the numpy implementation

griddata was used to learn the current representation

of the page and was transformed into the original rep-

resentation (before deformation)using the opencv ge-

ometric image transformation function remap. Nor-

mally, text lines would be detected to reduce defor-

mation. But, for herbaria type collections, where text

patterns would be limited, it would be hard to use the

text line features. In those scenarios, the page bound-

aries can be used as features to reduce deformation.

(a) (b)

Figure 9: An example of morphological correction for ﬂat-

tening the page in order to reduce deformation. (a) is the

original image with chosen hull points for ﬂattening and (b)

is the subsequent ﬂattened image.

4.1 Ablation Study

An ablation study was performed to evaluate the im-

portance of morphological correction. To do so, the

page samples before and after morphological trans-

formations of the Mac Load test data set were cho-

sen and text detection / recognition was applied. The

Google Cloud Vision API was used to obtain text de-

tection and recognition. The results of text detection

and recognition were quantitatively observed. The

ARTIDIGH 2020 - Special Session on Artiﬁcial Intelligence and Digital Heritage: Challenges and Opportunities

482

number of right predictions for both text detection and

recognition tasks were manually veriﬁed and counted

for both the samples of the image. Since the im-

age quality between the two images is similar and the

same text recognition model setting is used, it makes

the obtained results comparable. It was observed that

the morphological correction improved both the lo-

calization and prediction of hand written text by 25%

on average for each image. It was also noticed that

page numbers, headings and text that were close to

the boundaries had much better results than before.

An example of the result is shown in Figure 8.

5 CONCLUSIONS

A novel page detection algorithm has been presented

which eliminates border noise by segmenting the

main page region from the rest of the image. The

importance of using HSV colour model for histori-

cal document processing was elaborated. With less

assumptions, it was showed that the page detection

could also work for complex page structures. It

was also demonstrated that the detected page poly-

gon could be used as a feature for reducing deforma-

tion. Finally, the page with reduced deformations was

proved to perform better in automatic text detection

tasks.

ACKNOWLEDGEMENTS

The research activities described in this paper were

funded by The Department of Culture, Youth &

Media, Flanders (Belgium) for the Flore de Gand

project.

REFERENCES

Albiol, A., Torres, L., and Delp, E. J. (2001). Optimum

color spaces for skin detection. Proceedings 2001

International Conference on Image Processing (Cat.

No.01CH37205), 1:122–124 vol.1.

Bukhari, S. S., Shafait, F., and Breuel, T. M. (2012). Bor-

der noise removal of camera-captured document im-

ages using page frame detection. In Iwamura, M. and

Shafait, F., editors, Camera-Based Document Analysis

and Recognition, pages 126–137, Berlin, Heidelberg.

Springer Berlin Heidelberg.

Chakraborty, A. and Blumenstein, M. (2016a). Marginal

noise reduction in historical handwritten documents –

a survey. 2016 12th IAPR Workshop on Document

Analysis Systems (DAS), pages 323–328.

Chakraborty, A. and Blumenstein, M. (2016b). Preserving

text content from historical handwritten documents.

2016 12th IAPR Workshop on Document Analysis Sys-

tems (DAS), pages 329–334.

Fan, K.-C., Wang, Y.-K., and Lay, T.-R. (2002). Marginal

noise removal of document images. Pattern Recogni-

tion, 35(11):2593 – 2611.

Goodrich, B., Albrecht, D., and Tischer, P. (2009). Algo-

rithms for the computation of reduced convex hulls. In

Nicholson, A. and Li, X., editors, AI 2009: Advances

in Artiﬁcial Intelligence, pages 230–239, Berlin, Hei-

delberg. Springer Berlin Heidelberg.

James, S. P. (2013). Face image retrieval with hsv color

space using clustering techniques.

Markus, D., Florian, K., and Basilis, G. (2019). ICDAR

2019 Competition on Baseline Detection (cBAD).

Sanchez-Cuevas, M. C., Aguilar-Ponce, R. M., and

Tecpanecatl-Xihuitl, J. L. (2013). A comparison of

color models for color face segmentation. Procedia

Technology, 7:134 – 141. 3rd Iberoamerican Confer-

ence on Electronics Engineering and Computer Sci-

ence, CIIECC 2013.

Shafait, F. and Breuel, T. (2010a). A simple and effective

approach for border noise removal from document im-

ages. pages 1 – 5.

Shafait, F. and Breuel, T. (2010b). A simple and effective

approach for border noise removal from document im-

ages. pages 1 – 5.

Tensmeyer, C., Davis, B., Wigington, C., Lee, I., and Bar-

rett, B. (2017). Pagenet: Page boundary extraction in

historical handwritten documents. pages 59–64.

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M.,

Reddy, T., Cournapeau, D., Burovski, E., Peterson, P.,

Weckesser, W., Bright, J., van der Walt, S. J., Brett,

M., Wilson, J., Jarrod Millman, N., Nelson, A. R. J.,

Jones, E., Kern, R., Larson, E., Carey, C., Polat,

I.,

Feng, Y., Moore, E. W., Vand erPlas, J., Laxalde,

D., Perktold, J., Cimrman, I., Quintero, E. A., Harris,

C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F.,

van Mulbregt, P., and Contributors, S. . . (2019). SciPy

1.0–Fundamental Algorithms for Scientiﬁc Comput-

ing in Python. arXiv e-prints, page arXiv:1907.10121.

Page Boundary Extraction of Bound Historical Herbaria

483