A Case Study of Genealogical Networks from Network Science Perspective

Imre Varga

Department of IT Systems and Networks, University of Debrecen, 26 Kassai str., Debrecen, Hungary

Keywords:

Genealogy, Networks Analysis, Pedigree Collapse, Social Network, Ancestral Network.

Abstract:

In this paper, the analysis of a genealogical network is presented. The source database was constructed from the birth, marriage and death registers of a medium-sized Hungarian town, covering several centuries. This genealogical network contains ca. 100,000 individuals. The topological features of this directed acyclic graph were analyzed by computer software in order to draw conclusions about the community. The results illustrate how network science can help the social sciences. A new measure is also defined to quantify the degree of pedigree collapse of a person whose ancestor graph is only partially known. The network was analyzed from the point of view of this ancestor-loss coefficient.

1 INTRODUCTION

Social networks, in which different interactions between individuals are described by graphs, have been broadly studied in recent decades. Different aspects have been in the focus of scientific analysis, for instance citation or co-authorship of scientists (Radicchi et al., 2012; Newman, 2004), sexual interactions (McDonald and Pizzari, 2017), membership of terrorist groups (Fellman, 2008), etc. Not just real-world social networks but also online social networks are investigated (Kumar et al., 2010; Howard, 2008). Besides their topological structure (Barabási, 2016), their dynamical properties and roles in spreading processes (Newman, 2010; Kocsis and Varga, 2014) are also examined empirically and theoretically.

Genealogy is an ancillary historical discipline. It means the study of family origins and history, and the tracing of their lineages. The word "genealogy" comes from two Greek words ("family" and "science"); thus it denotes the science of studying family history, i.e. tracing ancestry. Genealogists use historical records, genetic analysis, and other sources to get information about a family and to demonstrate the ancestry and pedigrees of individuals. There are attempts at computer-aided document processing, but this task is very complicated even for artificial intelligence (Malmi et al., 2017; Gellatly, 2015). In the broad sense, genealogy traces the descendants and the ancestors of one person. Genealogy research is performed for historical, scholarly, or forensic purposes as well. The results of such research are often presented in pedigree charts (BCG, 2019).

ORCID: https://orcid.org/0000-0003-3921-2521

Family trees or ancestry charts are usually maintained as a binary tree data structure containing the ancestors of a person. Under a simple assumption, everyone has 2 parents, 4 grandparents, 8 great-grandparents, 16 great-great-grandparents, and so on. Thus the number of ancestors in a given generation can be expressed by the powers of two. For example, in the 30th generation there are theoretically more than one billion people, which can be more than the total population of the Earth at that time. This conflict is resolved by the fact that not all ancestors are unique. In genealogy, this phenomenon is called pedigree collapse (Wikipedia, 2020). It describes the situation caused by reproduction between two individuals who share an ancestor. It is very rare in the short-term oral history of a family, but it is unavoidable in huge pedigree charts covering centuries. Due to pedigree collapse, genealogists have to use general graphs instead of tree data structures. Pedigree collapse is quite frequent in royal families. A good example can be found among the ancestors of Charles II, the last Habsburg King of Spain. There were three uncle-niece marriages and three first-cousin marriages, besides other unions, in his immediate ancestry. Between Charles II and his ancestor Philip I of Castile, there are 14 different lineage relationships.
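The doubling argument can be checked with a few lines of Python. This is only a minimal sketch; the function name `theoretical_ancestors` is introduced here for illustration and does not come from the software used in the study.

```python
def theoretical_ancestors(generation: int) -> int:
    """Number of ancestor positions `generation` steps back,
    assuming that no ancestor appears more than once."""
    if generation < 0:
        raise ValueError("generation must be non-negative")
    return 2 ** generation


# Parents, grandparents, great-grandparents, ... and the 30th generation.
for n in (1, 2, 3, 4, 30):
    print(n, theoretical_ancestors(n))
```

For the 30th generation the function returns 1,073,741,824, i.e. slightly more than one billion ancestor positions.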

When not just a family but a whole community is in the focus of genealogical research, not only the size of the data source but also its structure changes (Rannala, 1997; Kingman, 1982). Marriages and childbirths connect families (Koylu et al., 2021). The ancestry chart of a small community cannot be represented by a forest of family trees; it is a general directed acyclic graph. In a small settlement, especially in bygone years, society was more closed than nowadays, thus families were densely interconnected. People who live in small communities (villages, ethnic or religious minorities) often choose wives/husbands from the same community.

Varga, I.
A Case Study of Genealogical Networks from Network Science Perspective.
DOI: 10.5220/0011723800003485
In Proceedings of the 8th International Conference on Complexity, Future Information Systems and Risk (COMPLEXIS 2023), pages 47-52
ISBN: 978-989-758-644-6; ISSN: 2184-5034
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

The "loss of lineage" (also called implex) can be characterized by a genealogical coefficient of a given genealogical tree, defined as the relative difference between the number of theoretical ancestors of a person and the number of his/her real ones in a given generation. For example, procreation between first cousins means a 25% loss in the generation of great-grandparents of the offspring (6 distinct individuals instead of 8). This measure is not so useful in the case of marriages between different generations or when some ancestors are unknown (Pattison, 2001; Pattison, 2007).
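The classical implex of a single generation can be illustrated with a short Python sketch. The pedigree representation (a dictionary mapping each person to a tuple of known parents) and the names `ancestors_in_generation` and `implex` are assumptions made for this example only.

```python
def ancestors_in_generation(parents, person, generation):
    """Return the set of distinct known ancestors `generation` steps above `person`."""
    level = {person}
    for _ in range(generation):
        level = {p for child in level for p in parents.get(child, ())}
    return level


def implex(parents, person, generation):
    """Relative loss of ancestors in one generation: (theoretical - real) / theoretical."""
    theoretical = 2 ** generation
    real = len(ancestors_in_generation(parents, person, generation))
    return (theoretical - real) / theoretical


# Hypothetical pedigree: the parents of "child" are first cousins,
# because "gf1" and "gm2" are siblings sharing the parents "x" and "y".
parents = {
    "child": ("father", "mother"),
    "father": ("gf1", "gm1"),
    "mother": ("gf2", "gm2"),
    "gf1": ("x", "y"),
    "gm1": ("a1", "a2"),
    "gf2": ("b1", "b2"),
    "gm2": ("x", "y"),
}

print(implex(parents, "child", 2))  # grandparents: no loss
print(implex(parents, "child", 3))  # great-grandparents: 6 of 8 are distinct
```

Note that this simple count treats every unknown ancestor as a lost one, which is exactly why the measure becomes misleading for partially known pedigrees.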

In genetic genealogy, DNA analysis can be used to reveal pedigree collapse (Tetushkin, 2011; Buffalo et al., 2016). On average, children inherit 50% of their DNA from each of their parents, 25% from each of their 4 grandparents, and so on. Although the actual amount of DNA inherited is random, the average amount inherited from an individual ancestor is halved with each generation going back. Due to this random inheritance, DNA analysis is an effective way of finding shared ancestors only within a few generations. Moreover, this kind of research is quite expensive and involved.

Our goal is to build a directed network of people based only on registry records (without genetic test results) and then to determine different metrics of the network (Newman, 2010; Barabási, 2016), such as the in-degree and out-degree distributions, average clustering coefficient, size of the giant component, average path length, etc. In this system, these metrics have a social meaning as well. The characterization of pedigree collapse also requires network analysis. Since the dataset is not complete, a novel quantity is defined to illustrate the scale of pedigree collapse.

2 METHOD OF INVESTIGATIONS

Our research is based on a public dataset created by a Hungarian genealogist (Szepesi, 2020). He processed the available (civil and parish) birth, marriage and death registers of a town (Hajdúböszörmény, Hungary) and other historical documents (census, burial records, etc.) of the archives. The database contains different data fields that appear in the registry records: an ID, the name, the dates of birth, marriages and death of the given person, the names (and IDs) of his/her parents and the name (and ID) of his/her spouse(s), etc. More than 100,000 individuals appear in the dataset, mainly (but not exclusively) from the last three centuries.

Of course, the dataset is not complete due to the nature of the problem and the accuracy of the sources. Each person has two parents, but the source is restricted in time and space. Ancestors who lived too early and descendants who were born too recently are unknown, and migration is also not followed. In the 18th century, only the fathers were recorded in the registers.

The IDs of people and their parents were extracted from a dataset in Personal Ancestral File format and used to build up the genealogical chart, i.e. a directed acyclic graph of depersonalized IDs. (Those few people who are not connected to any other individual were eliminated.) A special graph analyzer program (Bordán, 2019) and a web application (Széll et al., 2020) were applied to analyze the topology of this special social network. Although only one community was investigated in this case study, we believe that the results and conclusions may be general.

2.1 New Characterisation of Pedigree Collapse

As highlighted in Section 1, pedigree collapse cannot be properly characterised by the simple loss of ancestors in a given generation. That is why we propose a new quantity to measure the degree of pedigree collapse. First, an auxiliary measure $\alpha_j$ is assigned to the given person $i$ and to each of his/her known ancestors $j$ according to a recursive definition. For the given person, i.e. where $j = i$, $\alpha_j = 1$. For the ancestors, the value of $\alpha_j$ is given by the following form

$$\alpha_j = \frac{1}{2N} \sum_{k=1}^{N} \alpha_k , \qquad (1)$$

where $k$ runs over all the $N$ children of person $j$ who are ancestors of person $i$ (or person $i$ himself/herself). An example is shown in Figure 1. The definition was motivated by inheritance: although it is a random process, approximately half of the genome comes from the father and the other half comes from the mother.

In order to define the new ancestor-loss coefficient of a person $i$ (denoted by $\lambda_i$), the auxiliary measures have to be summed according to the following restrictions:

• If ancestor $j$ has no known parents, his/her full auxiliary measure $\alpha_j$ is taken into account in the summation.
• If ancestor $j$ has only one known parent, then half of his/her $\alpha_j$ is added.
• If both parents of ancestor $j$ are known, then his/her $\alpha_j$ value is disregarded.

Figure 1: An example of the auxiliary measures of the known ancestors of the bottom person. The sum of the values in gray nodes and half of the values in gray-white (bicolour) nodes determines the ancestor-loss coefficient λ = 0.1875. It indicates incest, an uncle-niece marriage in the grandparent generation. (Colors represent the number of known ancestors.)

Thus, if ancestor $j$ of person $i$ has $P_j$ unknown parents ($P_j \in \{0, 1, 2\}$), then the ancestor-loss coefficient $\lambda_i$ of person $i$ is defined as

$$\lambda_i = 1 - \sum_{j} \frac{P_j}{2} \alpha_j , \qquad (2)$$

where $j$ runs over all ancestors of person $i$. Since the genome can originate only from the starting points of lineages, just the red and bicolour nodes of Fig. 1 are considered.

The λ can be interpreted as an extension of the common implex. In simple situations (e.g. reproduction between first cousins, where all great-grandparents are known), $\lambda = A_i / 2^i$, where $A_i$ is the number of repeated (identical) real ancestors in the $i$th ancestor generation.

The definition of this quantity assumes that there is no pedigree collapse in the branches of unknown ancestors; thus the λ coefficient gives only a lower bound for the given person, and it can increase when more of the family history is explored. According to the definition, 0 ≤ λ < 1. λ = 0 indicates a pure bloodline in the investigated genealogical network, while λ > 0 quantifies the rate of incest (pedigree collapse). For instance, in the well-known case of the Habsburg Charles II, λ = 0.830295.
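The definition of the ancestor-loss coefficient can be illustrated by a small Python sketch. It is not the author's implementation. It rests on two assumptions: the auxiliary measure of an ancestor is half of the average of the measures of his/her relevant children, and the summation also covers person i (so that a person with missing parents but no collapse gets λ = 0). With this reading the sketch reproduces λ = 0.5 for the child of full siblings and λ = 0.1875 for a pedigree with an uncle-niece marriage among the grandparents; the exact topology of Figure 1 is assumed, not known. The function name and the dictionary layout are hypothetical.

```python
from functools import lru_cache


def ancestor_loss(parents, person):
    """Ancestor-loss coefficient of `person`.

    `parents` maps each individual to a sequence of 0, 1 or 2 known parents.
    The pedigree is assumed to be acyclic.
    """
    # 1. Collect the person and all of his/her known ancestors.
    members = {person}
    stack = [person]
    while stack:
        node = stack.pop()
        for p in parents.get(node, ()):
            if p not in members:
                members.add(p)
                stack.append(p)

    # 2. For every member, list the children that belong to this pedigree.
    kids = {m: [] for m in members}
    for m in members:
        for p in parents.get(m, ()):
            kids[p].append(m)

    # 3. Recursive auxiliary measure (assumed form: half of the children's average).
    @lru_cache(maxsize=None)
    def alpha(j):
        if j == person:
            return 1.0
        return sum(alpha(k) for k in kids[j]) / (2 * len(kids[j]))

    # 4. Weighted sum: full alpha for 2 unknown parents, half for 1, nothing for 0.
    total = 0.0
    for j in members:
        unknown = 2 - len(parents.get(j, ()))
        total += unknown / 2 * alpha(j)
    return 1.0 - total


# Child of full siblings "a" and "b".
siblings = {"c": ["a", "b"], "a": ["x", "y"], "b": ["x", "y"]}
print(ancestor_loss(siblings, "c"))

# Uncle-niece marriage ("u" and "n") among the grandparents of "b";
# "s" is the sibling of "u", and "d", "t", "g1", "g2" have no known parents.
uncle_niece = {
    "b": ["c", "d"],
    "c": ["u", "n"],
    "u": ["g1", "g2"],
    "n": ["s", "t"],
    "s": ["g1", "g2"],
}
print(ancestor_loss(uncle_niece, "b"))
```

Because every value in these examples is a sum of powers of two, the floating-point results are exact; for large real pedigrees `fractions.Fraction` could be used instead if exact arithmetic matters.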

3 RESULTS

It was found that our genealogical network contains 100,273 nodes (people) and 156,062 directed links (parent-child relations). If it were a set of independent families, we should see a forest of many trees of approximately the same size. Instead, we found only 3840 independent clusters in the network; most of them are relatively small, but there is one dominant cluster. This giant component contains 85.3% of the nodes (and 91.7% of the links). In the remaining 14.7% of the system, most clusters contain no more than 4 nodes (S ≤ 4). The cluster size distribution, excluding the giant component, is presented in Figure 2. As one can see, it can be well fitted by a power-law form.

Figure 2: The cluster size distribution of the investigated genealogical network without the giant component. The solid gray line illustrates a power-law function with an exponent of −2.5. The dotted line refers to the average cluster size excluding the giant component.

In the case of small clusters, the available source documents probably did not contain enough information to identify the individuals behind the registry records, so no relationship to other people was found. In the remaining part of the paper, the investigation is restricted to the giant component, so the interconnected pedigree of 85,536 people is in the focus of the study.
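The clusters mentioned above are the weakly connected components of the directed graph, i.e. link directions are ignored. A minimal union-find sketch is given below; `cluster_sizes` is a hypothetical helper and not the analyzer program used in the study, and every node that appears in a link is assumed to be listed in `nodes`.

```python
from collections import Counter


def cluster_sizes(nodes, links):
    """Sizes of the weakly connected components, largest first."""
    root = {n: n for n in nodes}

    def find(x):
        # Follow the pointers to the representative, halving the path on the way.
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            root[ra] = rb

    return sorted(Counter(find(n) for n in nodes).values(), reverse=True)


# Hypothetical toy network: one cluster of 5, one of 2 and one single node.
nodes = list(range(1, 9))
links = [(1, 3), (2, 3), (3, 4), (3, 5), (6, 7)]
sizes = cluster_sizes(nodes, links)
giant_fraction = sizes[0] / len(nodes)
print(sizes, giant_fraction)
```

The fraction of nodes in the largest cluster corresponds to the 85.3% reported for the real network.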

Since it is a directed graph, the in-degree and out-degree distributions can be important. In the case of genealogical networks, the in-degree $k_{in}$ means the number of known parents of a person. In this sample, 74.8% of the population has 2 known parents and 6.0% has only one known parent. (Old registers contain just the name of the father, and in the case of a child born out of wedlock just the name of the mother is documented.) Nodes with $k_{in} = 0$ refer to individuals whose parents are not known, most likely due to missing documentation.

The out-degree $k_{out}$ of a node denotes the number of known children of a person. (It must be mentioned that the real number of children can be greater than the number of known ones.) The out-degree distribution is presented in Figure 3. The results show that an average parent has 3 children, but some have many more (max($k_{out}$) = 20). Almost half of the nodes have no outgoing edges ($k_{out} = 0$). This can be explained by several things. On the one hand, young adults may have migrated, mainly due to their marriages. On the other hand, child mortality was high in former times. Last but not least, for the youngest people in the dataset the first childbirth occurred later than the last public documents.
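Both degree distributions follow directly from the list of parent-child links. The sketch below is illustrative only; `degree_distributions` and the toy data are assumptions and do not come from the tools used in the study.

```python
from collections import Counter


def degree_distributions(nodes, links):
    """Return (P(k_in), P(k_out)) as dictionaries mapping degree -> fraction of nodes."""
    k_in = {n: 0 for n in nodes}
    k_out = {n: 0 for n in nodes}
    for parent, child in links:
        k_out[parent] += 1  # one more known child
        k_in[child] += 1    # one more known parent
    total = len(k_in)
    p_in = {k: c / total for k, c in sorted(Counter(k_in.values()).items())}
    p_out = {k: c / total for k, c in sorted(Counter(k_out.values()).items())}
    return p_in, p_out


# Hypothetical toy network: 1 and 2 are the parents of 3 and 4; 3 is a parent of 5.
nodes = [1, 2, 3, 4, 5]
links = [(1, 3), (2, 3), (1, 4), (2, 4), (3, 5)]
p_in, p_out = degree_distributions(nodes, links)
print(p_in)
print(p_out)
```

In a genealogical network the in-degree can only be 0, 1 or 2, whereas the out-degree has no fixed upper limit.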

Figure 3: The out-degree distribution P($k_{out}$) of the sample network. The solid gray line illustrates an exponential form on a semi-log scale.

There are two special subsets of the population. One of them includes people whose in-degree is $k_{in} = 0$. Since the age of the source documents is limited, they are probably the oldest people in the sample; they are the forefathers. The other group covers the $k_{out} = 0$ subset of nodes. They are either in the youngest generations or at the end of lineages. From nodes of the former group to nodes of the latter one we can find multiple paths (along lineages). In order to characterize the network, these paths were enumerated. They are the longest paths in the sense that there is no known ancestor of the oldest person along the path and no known descendant of the youngest person along the path. The length distribution of these maximal paths is shown in Figure 4; their average length is 6.46 generations.
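The maximal paths do not have to be listed one by one: in an acyclic graph their length distribution can be counted by dynamic programming in topological order. The following sketch shows one possible way of doing this; it is not necessarily how the cited software works. The length of a path is taken as the number of people (generations) on it, and every node appearing in `links` is assumed to be listed in `nodes`.

```python
from collections import Counter, defaultdict, deque


def maximal_path_lengths(nodes, links):
    """Count lineages from nodes without known parents to nodes without
    known children, grouped by the number of generations they contain."""
    children = defaultdict(list)
    remaining = {n: 0 for n in nodes}  # unprocessed parents of each node
    for parent, child in links:
        children[parent].append(child)
        remaining[child] += 1

    # paths[v][g]: number of lineages that start at a forefather,
    # end at v and contain g generations.
    paths = {n: Counter() for n in nodes}
    queue = deque(n for n in nodes if remaining[n] == 0)
    for n in queue:
        paths[n][1] = 1

    result = Counter()
    while queue:
        v = queue.popleft()
        if not children[v]:
            result.update(paths[v])  # v ends a lineage
        for c in children[v]:
            for g, count in paths[v].items():
                paths[c][g + 1] += count
            remaining[c] -= 1
            if remaining[c] == 0:
                queue.append(c)
    return result


# Hypothetical toy network: 1 and 2 are the parents of 3 and 4; 3 is a parent of 5.
nodes = [1, 2, 3, 4, 5]
links = [(1, 3), (2, 3), (1, 4), (2, 4), (3, 5)]
lengths = maximal_path_lengths(nodes, links)
average = sum(g * c for g, c in lengths.items()) / sum(lengths.values())
print(dict(lengths), average)
```

In the toy network there are two lineages of 2 generations and two lineages of 3 generations, so the average is 2.5 generations.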

Figure 4: The number of maximal paths containing the given number of generations. As one can see, most paths cover 6-7 generations. Some of the lineages are much longer. These belong to noble families, because only they have such old documents (public registration started only in the 18th century).

The distribution of our ancestor-loss coefficient λ is shown in Figure 5. One can see that the majority of the people can be characterized by λ = 0.0; thus, in their case, no pedigree collapse can be detected based on the available registry records. This is consistent with the average ancestor-loss coefficient ⟨λ⟩ = 0.002986 ± 0.021524.

Figure 5: Distribution of the ancestor-loss coefficient λ. Pure bloodlines are very frequent, but there are some people with really high λ as well.

The most interesting finding of the research is the relatively large number of individuals affected by pedigree collapse. In the studied population, 3943 people have an ancestor-loss coefficient greater than 0.0; thus they can be reached from another node of the graph along at least two distinct paths. This is 3.93% of the investigated population, which is quite high if we consider that the known lineages are on average only 6 or 7 generations long. At least 7.59% of the ancestors of these people are lost, thus only 92.41% of the ancestors are unique in the last few generations. The highest λ found was 0.5, in the case of 7 people (in 3 families). Some of them were children of a marriage between full siblings. If two brothers marry two sisters and their children also marry (each other), then the grandchild also has a coefficient of λ = 0.5. A real genealogical chart is illustrated in Figure 6 as an example of a significant pedigree collapse.

Figure 6: An example of the pedigree of a person with serious incest (pedigree collapse). The person represented by the bottom node (ID: 15486) has λ = 0.475261. Although his parents are not full siblings, his ancestor-loss coefficient is close to 0.5.

In network science, the average clustering coefficient of the network is an important topological measure. The clustering coefficient gives the probability of triangles, since it determines how often two neighbors of a node are also connected. In genealogical networks, this kind of "friend of my friend is my friend" situation refers to incest. In the investigated network, the average clustering coefficient is ⟨C⟩ = 0.0. This can be interpreted as the total absence of procreation in father-daughter or mother-son relationships.
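Why a triangle signals parent-child procreation can be seen in a small example. The sketch below computes the average local clustering coefficient after ignoring link directions; whether the cited tools use exactly this undirected definition is an assumption, and `average_clustering` is a hypothetical helper. Every node appearing in `links` is assumed to be listed in `nodes`.

```python
from itertools import combinations


def average_clustering(nodes, links):
    """Average local clustering coefficient of the undirected version of the graph."""
    neighbors = {n: set() for n in nodes}
    for a, b in links:
        if a != b:
            neighbors[a].add(b)
            neighbors[b].add(a)

    total = 0.0
    for nbrs in neighbors.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute zero
        closed = sum(1 for u, v in combinations(nbrs, 2) if v in neighbors[u])
        total += closed / (k * (k - 1) / 2)
    return total / len(neighbors)


# Ordinary family: 1 and 2 are the parents of 3 and 4 (no triangle).
ordinary = average_clustering([1, 2, 3, 4], [(1, 3), (2, 3), (1, 4), (2, 4)])

# Father-daughter procreation: 1 and 2 are the parents of 3, and 4 is
# the child of 1 and 3, which closes the triangle 1-3-4.
incest = average_clustering([1, 2, 3, 4], [(1, 3), (2, 3), (1, 4), (3, 4)])
print(ordinary, incest)
```

A triangle requires two people who are linked as parent and child to also share a child, so a vanishing coefficient excludes such cases among the documented relations.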

4 CONCLUSIONS

In this work, a case study is presented in order to demonstrate how graph theory and computer science can be used in genealogical studies. A large dataset was created from the (civil and parish) birth, marriage and death registers of a town. It contains the known parents of inhabitants over a few centuries; thus a very complex genealogical network of individuals, including several generations of many families, serves as the object of analysis. Naturally, it is not complete for the given time period, because missing or inaccurate records do not allow the full exploration of kinship. Nevertheless, the dataset is sufficient to discover huge interconnected pedigree charts. Their graph analysis can provide interesting information for the social sciences about the population (e.g. the distribution of the number of children).

We created a directed acyclic unweighted graph and then investigated its features. The system is dominated by a giant component; the other clusters are negligible. The out-degree distribution of nodes reflects the number of children in families. It can be roughly fitted by an exponential distribution. By exploring the directed paths, we found that most lineages include 5–7 generations. We introduced a new quantity to measure the degree of pedigree collapse in an incomplete ancestor chart. The distribution of this ancestor-loss coefficient shows the prevalence of incest within the given community, even though the dataset covers just a few more generations than the oral history of an average family. These results cannot be obtained without the tools of network science.

ACKNOWLEDGEMENTS

The author would like to express his sincere gratitude to Imre Szepesi for his valuable registry research and the creation of the genealogical database used (Szepesi, 2020). The author expresses great appreciation to Imre Bordán for technical assistance.

REFERENCES

Barabási, A.-L. (2016). Network Science. Cambridge University Press.

BCG (2019). Genealogy Standards. Turner Publishing Company.

Bordán, I. (2019). Genealógiai hálózatok számítógépes elemzése. Master's thesis, University of Debrecen, Faculty of Informatics.

Buffalo, V., Mount, S. M., and Coop, G. (2016). A genealogical look at shared ancestry on the X chromosome. Genetics, 204(1):57–75.

Fellman, P. V. (2008). The complexity of terrorist networks. In Proc. of 12th International Conference Information Visualisation, pages 338–340. IEEE.

Gellatly, C. (2015). Population Reconstruction, chapter Reconstructing Historical Populations from Genealogical Data Files, pages 111–128. Springer.

Howard, B. (2008). Analyzing online social networks. Communications of the ACM, 51(14-16):11.

Kingman, J. F. C. (1982). On the genealogy of large populations. Journal of Applied Probability, 19:27–43.

Kocsis, G. and Varga, I. (2014). Investigation of spreading phenomena on social networks. Infocommunications Journal, 6(3):45–51.

Koylu, C., Guo, D., Huang, Y., Kasakoff, A., and Grieve, J. (2021). Connecting family trees to construct a population-scale and longitudinal geo-social network for the U.S. International Journal of Geographical Information Science, 35(12):2380–2423.

Kumar, R., Novak, J., and Tomkins, A. (2010). Link Mining: Models, Algorithms, and Applications, chapter Structure and Evolution of Online Social Networks, pages 337–357. Springer.

Malmi, E., Rasa, M., and Gionis, A. (2017). AncestryAI: A tool for exploring computationally inferred family trees. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 257–261.

McDonald, G. C. and Pizzari, T. (2017). Structure of sexual networks determines the operation of sexual selection. PNAS, 115(1):E53–E61.

Newman, M. E. J. (2004). Coauthorship networks and patterns of scientific collaboration. PNAS, 101(1):5200–5205.

Newman, M. E. J. (2010). Networks: An Introduction. Oxford University Press.

Pattison, J. E. (2001). New method of estimating inbreeding in large semi-isolated populations with application to historic Britain. Homo, 52(2):117–134.

Pattison, J. E. (2007). Estimating inbreeding in large, semi-isolated populations: effects of varying generation lengths and of migration. American Journal of Human Biology, 19(4):495–510.

Radicchi, F., Fortunato, S., and Vespignani, A. (2012). Models of Science Dynamics, chapter Citation Networks, pages 233–257. Springer.

Rannala, B. (1997). Gene genealogy in a population of variable size. Heredity, 78:417–423.

Szepesi, I. (2020). Hajdúböszörményi családerdő. https://gw.geneanet.org/szepesi.

Széll, M. C., Becsei, M., and Kocsis, G. (2020). Introduction to DINA: An extendable web-application for directed network analysis. In Proceedings of the 5th International Conference on Complexity, Future Information Systems and Risk, pages 129–135. SciTePress.

Tetushkin, E. (2011). Genetic aspects of genealogy. Genetika, 47(11):1451.

Wikipedia (2020). Pedigree collapse. https://en.wikipedia.org/wiki/Pedigree_collapse.
