Cold Start of Enterprise Knowledge Graph Construction

Rong Duan and Kangxing Hu

Huawei Technology, Shenzhen, China

Keywords:

Enterprise Knowledge Graphs, Cold Start, Link Prediction.

Abstract:

An Enterprise Knowledge Graph (EKG) is a powerful tool for Enterprise Knowledge Management (EKM). Most EKG construction efforts suffer from the cold start problem. In practice, EKG construction is an interactive process in which domain experts provide a small seed graph and data-driven methods are applied to expand it. This paper proposes a framework that addresses the EKG cold start problem by integrating expert knowledge in graph form with a corpus in non-graph form. The framework uses expert knowledge to guide unsupervised learning and, at the same time, cross-checks the quality of that expert knowledge. A coarser cluster-level and a finer entity-level vectorization are proposed to predict links between graph nodes and cluster words. In addition, a combined strategy measures the importance of each predicted link so that links can be presented to the expert for incremental evaluation. The framework reduces the "labor intensive" effort of EKG cold start construction and uses expert knowledge efficiently. A simulation illustrates the properties of the defined measurements, and a real-world application is discussed to show the challenges in practice.

1 INTRODUCTION

There are two basic types of knowledge graph (KG) construction techniques: data-driven and manpower-driven. Most open-domain KGs are constructed with data-driven methods. They either start from a large knowledge base or have a large corpus from which knowledge can be extracted. For example, Yago is rooted in the large knowledge bases Wikipedia, WordNet and GeoNames (Suchanek et al., 2007). NELL (Carlson et al., 2010) extracts knowledge from hundreds of millions of web pages. For a specific domain, KG construction depends heavily on domain experts, especially when solving the cold start problem. Expert-driven methods are efficient in a relatively narrow area or in a field with a relatively complete ontology. Successful applications include the Amazon product graph (Dong, 2018) and medical knowledge graphs (Ramaswami, 2007)(Ferrucci and Brown, 2011).

An Enterprise Knowledge Graph (EKG), as a tool for Enterprise Knowledge Management (EKM), is far more complicated than a specific-domain application. Open-domain KGs, specific-domain KGs and EKGs are compared in the tutorial (Duan and Xiao, 2019). An EKG usually covers several knowledge domains, and the ontology design is time consuming and labor intensive, which makes it the bottleneck of EKM. Even though many industries have collected more and more data through digital transformation, the data size is still not comparable with that of the open domain.

A big challenge for modern EKM is how to utilize the data and extract knowledge from it. Data-driven knowledge graphs have been receiving more and more attention in enterprises. Existing research studies the EKG framework (Galkin et al., 2017) and its components and construction steps (Pan et al., 2017), but the cold start problem of knowledge acquisition has not yet been addressed in detail. This issue holds enterprises back from adopting the knowledge graph approach in EKM.

In practice, an EKG starts with a small expert-constructed seed graph, and a moderate-size corpus can be used to expand the graph. The problem can be decomposed into four subproblems:

1. Validate the correctness of the expert-designed seed KG.
2. Link the words in clusters with nodes in the graph.
3. Integrate expert knowledge with data.
4. Design the evaluation criteria for predicted links.

This paper proposes a framework to answer the above questions. Our main contributions are threefold. First, dispersion and reachability are defined to measure the relation between graph structure and cluster

Duan, R. and Hu, K.

Cold Start of Enterprise Knowledge Graph Construction.

DOI: 10.5220/0010055301530160

In Proceedings of the 12th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2020) - Volume 3: KMIS, pages 153-160

ISBN: 978-989-758-474-9


results. This is the first work to use graph-form and non-graph-form data to learn from each other, which provides a way to integrate the expert's prior knowledge with unsupervised learning. Second, the entities are vectorized by the cluster results, and the link prediction between graph nodes and cluster words is decided by similarities at the coarser cluster level and the finer entity level. Third, an importance-based incremental method is proposed to evaluate the predicted links.

The rest of the paper is organized as follows: Section 2 introduces the related work and the problem statement. Section 3 provides the definitions and methodology. Section 4 discusses the issues in a real-world application. Section 5 concludes the paper and gives future directions.

2 RELATED WORK

Graph-based methods have been used widely in classification and clustering problems. If the data is in graph form, graph properties are applied directly (Rogers and Hahn, 2010)(Shervashidze et al., 2009)(Geng et al., 2012)(Fortunato, 2009). If the data is in non-graph form, a graph is constructed first and graph properties are then applied (Wang et al., 2005)(Wu et al., 2018)(Zhang and Lee, 2007)(Zhu et al., 2003).

This is different from our problem setting. We have two data sets: one is in graph form and constructed by experts, and the other is in non-graph form and extracted from a corpus.

Another line of related work is link prediction, especially cold start link prediction. Traditional link prediction methods are based on graph structure or on labeled data similarity (Lü and Zhou, 2011)(Martínez et al., 2017)(Taskar et al., 2004)(Menon and Elkan, 2011). Our problem setting is to predict links between the nodes in the graph and the words in clusters. Because these entities are not in the graph yet, link prediction based on graph structure does not fit. Likewise, methods based on attribute similarity or labeled data do not fit our corpus, which is short of labels and lacks attributes.

Many steps are involved in constructing a KG from scratch. Here we focus only on the stage where experts have manually constructed an initial KG, $G' = \{V', E'\}$, and want to expand $G'$ with the corpus. $V'$ represents the nodes and $E'$ represents the edges. $V'$ has only a name as its attribute, and $E'$ indicates the existence of a relation. The corpus has been segmented into words or phrases (Shang et al., 2017)(Liu et al., 2015), denoted $W$. The nodes that are in $V'$ but not in $W$ form the set $V' \setminus W$, and the words that are in $W$ but not in $V'$ form the set $W \setminus V'$; the words in $W \setminus V'$ are the candidate graph nodes. The objective is to expand the original $G'$ with $W \setminus V'$ to form a new $G = \{V, E\}$. In general, $|V'| \ll |W|$ in the initial state, where $|V'|$ is the number of nodes in $V'$. A clustering technique is applied to $W$ to form multiple clusters $C_k$, where $k \in [1, K]$, and $|C_k|$ represents the number of words or phrases in cluster $C_k$.

3 DEFINITIONS AND METHODOLOGY

3.1 Definitions

Given a graph $G = \{V, E\}$ and cluster results $C_k$, where $k \in [1, \cdots, K]$, dispersion $\zeta$ and reachability $\rho$ are defined to measure the relation between graph $G$ and clusters $C_k$. A cluster-based node vector representation is defined to predict links.

3.1.1 Dispersion

Node dispersion $\zeta$ measures how a node's neighbours are distributed over the clusters. Let $\Gamma(V)$ denote node $V$'s 1-degree ego-centric network, and let $|\Gamma(V)|$ represent $V$'s degree. The dispersion of node $V$ is defined as:

$$\zeta_V = \tanh\left(-\frac{\sum_{c} p_{\Gamma(V)\in c}\,\log p_{\Gamma(V)\in c}}{\log(|\Gamma(V)| + 1)}\right) \tag{1}$$

where $p_{\Gamma(V)\in c} = \frac{|\Gamma(V)\in c|}{|\Gamma(V)|}$, and $\log(|\Gamma(V)| + 1)$ is a normalization factor that adjusts $\zeta_V$'s concentration tendency for lower-degree nodes.

$\zeta_V$ is a tanh-transformed normalized entropy. It measures the distribution of $V$'s neighbours over the clusters. The properties of $\zeta$ are:

1. $\zeta \in [0, 1]$, due to the tanh transformation.
2. $\zeta_V = 0$ when there is only one cluster, $K = 1$.
3. $\zeta_V \to 0$ means that node $V$'s direct neighbours are relatively synchronized with the cluster results.
4. $\zeta_V$ is a monotonically increasing function of the number of clusters that $\Gamma(V)$ is spread into, when the clustering method has a hierarchical structure.

Properties (1), (2) and (3) are straightforward. The proof of property (4) reduces to the relation between entropy and conditional entropy, which is equivalent to the statement that information gain is nonnegative and is easy to prove.

Dispersion $\zeta$ measures the synchronization between graph structure and cluster results from the graph's point of view. In general, if a node's neighbours are spread over different clusters, its dispersion is high. If

KMIS 2020 - 12th International Conference on Knowledge Management and Information Systems


the dispersion of most nodes in the graph is high, the cluster result is not in sync with the graph structure: neighbouring nodes in the graph are spread over different clusters, and the clusters should be combined. The expected dispersion $E(\zeta)$ decreases as the number of clusters decreases, until it becomes 0 when all clusters are combined into one.

Figure 1: A graph with 5 nodes and 5 edges.

Suppose we have the graph shown in Figure 1, and the clustering method separates the data into 3 clusters: $C_1 = \{A, B, C\}$, $C_2 = \{E\}$, and $C_3 = \{D\}$. For nodes A and C, the 1-degree ego-centric networks are both $\{A, B, C\}$, and all of these nodes are in $C_1$: $\Gamma(A) = \Gamma(C) = \{A, B, C\}$. According to Eq. 1, $\zeta_A = \zeta_C = 0$. For node B, the 1-degree ego-centric network is $\Gamma(B) = \{A, B, C, E\}$; 3 of its 4 nodes are in $C_1$, 1 of 4 is in $C_2$, and $|\Gamma(B)| = 3$, so $\zeta_B = 0.38$. Similarly, for nodes D and E the dispersions are $\zeta_D = 0.78$ and $\zeta_E = 0.76$ respectively. Comparing these 5 nodes, the ego-centric nodes of A and C are all in one cluster, so their dispersion is 0. The ego-centric nodes of B and D are in two clusters, but D has degree 1 and B has degree 3. Adjusted by degree, $\zeta_B < \zeta_D$. Now consider what happens when the number of clusters decreases. After combining clusters 2 and 3, the dispersions change to $\zeta_A = \zeta_C = \zeta_D = 0$, $\zeta_B = 0.38$, and $\zeta_E = 0.52$. The dispersion of every node becomes 0 if all clusters are further combined into one.
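The dispersion values of this example can be checked with a short Python sketch. It assumes that the edges of Figure 1 are A-B, A-C, B-C, B-E and D-E (inferred from the ego networks quoted in the text), that the cluster proportions are taken over the ego network including the node itself, and that the normaliser uses the node degree with natural logarithms. The function and variable names are illustrative, not the authors'.

```python
import math

# Figure 1 as inferred from the text: 5 nodes, 5 edges (an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("D", "E")]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def dispersion(node, adj, clusters):
    """Eq. 1: tanh of the ego network's cluster entropy over log(degree + 1)."""
    degree = len(adj[node])
    if degree == 0:
        return 0.0  # assumption: an isolated node has no dispersion
    ego = adj[node] | {node}  # 1-degree ego network, node included
    entropy = 0.0
    for members in clusters:
        p = len(ego & members) / len(ego)
        if p > 0:
            entropy -= p * math.log(p)
    return math.tanh(entropy / math.log(degree + 1))

adj = adjacency(EDGES)
three_clusters = [{"A", "B", "C"}, {"E"}, {"D"}]
two_clusters = [{"A", "B", "C"}, {"D", "E"}]
zeta3 = {v: round(dispersion(v, adj, three_clusters), 2) for v in sorted(adj)}
zeta2 = {v: round(dispersion(v, adj, two_clusters), 2) for v in sorted(adj)}
print(zeta3)
print(zeta2)
```

Under these assumptions the sketch gives 0.38 for B in both settings and 0.52 for E after the merge. With three clusters, D and E both evaluate to tanh(1) ≈ 0.76.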

On its own, $\zeta$ tends to favour combining clusters. To counter this tendency, cluster reachability $\rho_k$ and expected cluster reachability $\rho$ are introduced.

3.1.2 Reachability

Cluster Reachability. $\rho_k$ is defined as the average shortest path length over all pairs of nodes in the graph $G$ that overlap with cluster $k$. Nodes that are in $G$ but not in any cluster are considered when forming the shortest paths.

$$\rho_k = \frac{\sum_{i,j \in (C_k \cap G)} SP(i, j)}{|i, j|} \tag{2}$$

where $SP(i, j)$ is the shortest path length between nodes $i$ and $j$, and $|i, j|$ is the number of node pairs. Cluster reachability $\rho_k$ compares the cluster result with the graph structure from the cluster's point of view. It measures the shortest path in the graph between each pair of words in cluster $k$. The smaller $\rho_k$ is, the closer the words are in the graph under this clustering method.

Nodes that are in the graph but not in any cluster are considered while calculating reachability, which avoids the situation of a disconnected graph. For example, in Figure 1, suppose node E is not in $W$ and the cluster results are $\{\{A, C\}, \{B, D\}\}$. The reachability of clusters $\{A, C\}$ and $\{B, D\}$ is 1 and 2 respectively, where E is considered in calculating $\rho_{\{B,D\}}$ even though E is not in the cluster result. In practice, these nodes are also the semantic nodes created by domain experts, which help in understanding the semantic meaning of related nodes.

Expected cluster reachability $\rho$ is defined to compare cluster results that have different numbers of clusters.

Expected Reachability. $\rho$ is the weighted sum of the cluster reachability $\rho_k$ over all clusters. The weight is the proportion of the number of words in each cluster to the number of all words.

$$\rho = \sum_{k=1}^{K} \frac{|C_k|}{|W|}\,\rho_k \tag{3}$$

The value of $\rho$ represents the average number of steps separating the cluster words in the graph. The smaller $\rho$ is, the closer the words are in the graph. Referring back to Figure 1, if the clusters are $\{\{A, B, C\}, \{E\}, \{D\}\}$, the cluster reachabilities are $\{1, 0, 0\}$ and the weights are $\{\frac{3}{5}, \frac{1}{5}, \frac{1}{5}\}$ respectively, so $\rho$ is 0.6. If the cluster result is $\{\{A, B, C\}, \{E, D\}\}$, $\rho = 1$. If all the nodes are in one cluster, $\rho = 1.7$. The ideal situation is that $\rho$ is close to 1, which means the words in each cluster are within a 1-degree ego-centric network in the graph.
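The reachability figures quoted for Figure 1 can be re-derived with the following Python sketch of Eq. 2 and Eq. 3. It assumes the same inferred edge list for Figure 1 (A-B, A-C, B-C, B-E, D-E), a connected graph, non-overlapping clusters so that $|W|$ is the sum of the cluster sizes, and a reachability of 0 for clusters with fewer than two graph nodes. All names are illustrative.

```python
from collections import deque
from itertools import combinations

# Figure 1 as inferred from the text (an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("D", "E")]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs_distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def cluster_reachability(adj, cluster):
    """Eq. 2: mean shortest-path length over pairs of cluster members in the graph.
    Paths may pass through graph nodes that are outside the cluster."""
    members = sorted(set(cluster) & set(adj))
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0  # assumption: fewer than two graph nodes gives 0
    return sum(bfs_distances(adj, i)[j] for i, j in pairs) / len(pairs)

def expected_reachability(adj, clusters):
    """Eq. 3: cluster reachabilities weighted by |C_k| / |W|."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_reachability(adj, c) for c in clusters)

adj = adjacency(EDGES)
rho_three = round(expected_reachability(adj, [set("ABC"), {"E"}, {"D"}]), 2)
rho_two = round(expected_reachability(adj, [set("ABC"), set("DE")]), 2)
rho_one = round(expected_reachability(adj, [set("ABCDE")]), 2)
print(rho_three, rho_two, rho_one)
```

The same `cluster_reachability` function also covers the earlier example in which E is not in $W$: the path between B and D still runs through E, because the full graph is used for the shortest paths.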

Properties of $\rho$:

1. $\rho \in (0, d(g)]$, where $d(g)$ is the diameter of the graph.
2. $\rho = d(g)$ when there is only one cluster, $K = 1$.
3. $\rho = 1$ when each cluster is a clique and there are no isolated single nodes.
4. $\rho$ decreases monotonically as the number of clusters increases, when the clustering method has a hierarchical structure.

Proof of property (4): Since the clustering result is hierarchical, we can assume that the cluster being split is $s$, with $c_s = c_{s_1} + c_{s_2}$. To prove that $\rho$ decreases monotonically with the number of clusters, we only need to prove $\frac{|C_s|}{|W|}\rho_s \ge \frac{|C_{s_1}|}{|W|}\rho_{s_1} + \frac{|C_{s_2}|}{|W|}\rho_{s_2}$. The whole proof is laid out below.


≥ ρ

+ ρ

|W |

∑

|i, j|

SP(i, j)

|i, j|

≥

|W |

∑

, j

SP(i

, j

)

, j

|W |

∑

, j

SP(i

, j

)

, j

∑

|i, j|

SP(i, j)

|i, j|

≥ |C

∑

, j

SP(i

, j

)

, j

+ |C

∑

, j

SP(i

, j

)

, j

∑

|i, j|

SP(i, j)

|i, j|

≥

∑

, j

SP(i

, j

)

, j

∑

, j

SP(i

, j

)

, j

∑

, j

SP(i

, j

)

, j

∑

, j

SP(i

, j

)

, j

≤

∑

|i, j|−1

SP(i, j)

|i, j| − 1

− 1

∑

|i, j|

SP(i, j)

|i, j|

≥

∑

|i, j|−1

SP(i, j)

|i, j| − 1

− 1

∑

|i, j|

SP(i, j)

|i, j|

≥

∑

, j

SP(i

, j

)

, j

∑

, j

SP(i

, j

)

, j

As illustrated above, the cluster reachability $\rho$ equals 1 when the cluster words are directly connected to each other in the graph. On its own, $\rho$ tends to separate the data into too many clusters. Neither $\rho$ nor $\zeta$ can be used alone. For example, in the graph shown in Figure 2, if the cluster result is $\{\{A, B\}, \{C, F\}, \{E, D\}\}$, $\rho$ is 1 and $\zeta$ equals 0.46. If the cluster result is $\{\{A, B, C, F\}, \{E, D\}\}$, $\rho$ still equals 1, but $\zeta$ becomes 0. The reachability of these two results is the same, but the dispersion of the second cluster result is smaller. The second cluster result is better than the first one: every word in a cluster is directly linked to the others in the graph, and the 4-node clique is not split across clusters.

Figure 2: A graph with a 4-node clique and a 2-node clique.

Simulation data is generated to further illustrate the properties of $\zeta$ and $\rho$. A random Erdős–Rényi graph with 1000 nodes is generated with edge probability 0.005. Edge betweenness is adopted as the hierarchical clustering method, and the number of clusters is set from 1 to 100. The experiment is repeated 50 times. The changes of $\zeta$ and $\rho$ with the number of clusters are shown in Figure 3: $\zeta$ increases monotonically as the number of clusters increases, and $\rho$ decreases.

3.1.3 Entity Vectorization

To predict the links between the nodes in the graph and the words in clusters, we define an entity's cluster-based vector representation. Let $K$ represent the number of clusters, and let $\nu_W$ represent the vector of word $W$.

Figure 3: (a) Dispersion increases monotonically with the number of clusters; (b) reachability decreases monotonically with the number of clusters. The solid black line is the mean and the dashed red line is the variance.

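$|\nu_W|$ is the length of the vector, with $|\nu_W| = K$, and $\nu_W^k$ is the value of the $k$th bit of $W$:

$$\nu_W^k = \begin{cases} 1, & \text{when } W \in C_k \\ 0, & \text{when } W \notin C_k \end{cases}$$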

With this vector representation, the link prediction algorithm is designed as a two-step approach:

1. Cluster-Subgraph Linkage: each cluster $C_k$ is linked to a node $V_i$'s ego-centric network $\Gamma(V_i)$, where $i = \arg\max(JS(C_k, \Gamma(V_i)))$ and $JS$ is the Jaccard similarity.
2. Cluster Member - Subgraph Node Linkage: each cluster member is linked to the graph node given by $\arg\max(\text{cosine}(\nu_p, \nu_{\Gamma(V_i)}))$. $\nu_p$ is the vector representation of entity $p$ in cluster $C_k$, and $\nu_{\Gamma(V_i)}$ represents the vector of a node $V_j$ that is a direct neighbour of node $V_i$, where $V_i$ is identified in the first step.

3.1.4 Link Validation

Link validation is another challenging problem in KG expansion, especially during the cold start phase. In an EKM system, the initial graph $G'$ is far from complete, and knowledge consumption based on this graph is poor. The quality of new links is therefore validated by experts.

To use expert resources efficiently when $|V'| \ll |W \setminus V'|$, this paper proposes a score $\kappa$ to measure the importance of each node. The score is the average of degree centrality, betweenness centrality and closeness centrality. The nodes with high $\kappa$ scores are recommended to experts. The validated graph is sent back to replace the seed graph, and the candidate words in $W \setminus V'$ are added incrementally.
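As an illustration of the $\kappa$ score, the sketch below averages the three centralities on the small graph of Figure 1. The normalisations are assumptions, since the text does not specify them: degree is divided by $n - 1$, closeness is $(n - 1)$ over the sum of distances, and betweenness is divided by $(n-1)(n-2)/2$. The graph is assumed to be connected with at least three nodes, and the edge list is the one inferred for Figure 1.

```python
from collections import deque

# Figure 1 as inferred from the text (an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("D", "E")]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def bfs(adj, s):
    """Return (distances, shortest-path counts) from s to every node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:
                sigma[w] += sigma[u]
    return dist, sigma

def kappa(adj):
    """Average of normalised degree, closeness and betweenness centrality."""
    nodes = sorted(adj)
    n = len(nodes)
    info = {s: bfs(adj, s) for s in nodes}
    scores = {}
    for v in nodes:
        dist_v, sigma_v = info[v]
        degree = len(adj[v]) / (n - 1)
        closeness = (n - 1) / sum(dist_v.values())
        between = 0.0
        for i, s in enumerate(nodes):
            for t in nodes[i + 1:]:
                if v in (s, t):
                    continue
                dist_s, sigma_s = info[s]
                # v lies on a shortest s-t path when the distances add up
                if dist_s[v] + dist_v[t] == dist_s[t]:
                    between += sigma_s[v] * sigma_v[t] / sigma_s[t]
        between /= (n - 1) * (n - 2) / 2
        scores[v] = (degree + closeness + between) / 3
    return scores

scores = kappa(adjacency(EDGES))
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, {v: round(s, 3) for v, s in scores.items()})
```

On this toy graph B and E come out on top, which matches the intuition that they bridge the triangle {A, B, C} and the leaf D. In the framework itself the score would be computed for candidate nodes after link prediction rather than for seed nodes.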

3.2 Methodology

This section describes how to use the previously defined measurements to expand $G'$ to $G$ with $W$. The framework is illustrated in Figure 4. Three key modules are


shown in thicker solid rectangular boxes: Cluster Number Determination and Seed Graph Validation, Link Prediction, and Incremental Validation. Other NLP techniques, such as Word Segmentation, Phrase Finding and Named Entity Recognition, are important but outside the scope of this paper; all of these steps are placed in the Preprocess box. The framework only considers the stage where the corpus is ready for clustering and a small seed graph exists.

3.2.1 Cluster Number Determination

Cluster Number Determination is the critical module in this framework. Both the Link Prediction and Incremental Validation modules are based on its result. As discussed in Section 2, we have an initial $G'$ from experts and cluster results from a clustering method. The largest possible number of clusters is $|N'|$, where $|N'|$ is the number of nodes in the initial $G'$. As illustrated in Section 3.1, as the number of clusters is reduced from this maximum, $\zeta$ decreases monotonically and $\rho$ increases monotonically. The cross point of $\zeta$ and $\rho$ gives the best number of clusters. In practice, it is not necessary to start from $|N'|$; it is enough to pick a starting number that does not miss the cross point of $\zeta$ and $\rho$. The advantage of using the balance of $\zeta$ and $\rho$ to determine the cluster number is that it integrates the domain expert's knowledge with the data. The two sources complement each other with missing information, and at the same time the approach addresses the lack of labels, the missing attributes and the insufficiently large corpus.
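One simple way to locate the cross point is to scan the candidate cluster numbers from large to small and stop at the first $K$ where the dispersion curve no longer exceeds the reachability curve. The sketch below assumes that the two curves have already been computed for each candidate $K$ and that $\zeta$ starts above $\rho$. The numbers are hypothetical and serve only to show the mechanics.

```python
def crossing_k(ks, zeta, rho):
    """Return the first K, in scan order, at which zeta <= rho, or None.

    ks   -- candidate cluster numbers, scanned from large to small
    zeta -- mean dispersion for each K
    rho  -- expected reachability for each K
    """
    for k, z, r in zip(ks, zeta, rho):
        if z <= r:
            return k
    return None

# Hypothetical curves: zeta falls and rho rises as K is reduced.
ks = [50, 40, 30, 20, 10]
zeta = [0.90, 0.85, 0.70, 0.50, 0.20]
rho = [0.30, 0.40, 0.60, 0.80, 1.50]
best_k = crossing_k(ks, zeta, rho)
print(best_k)
```

When the curves are noisy, a smoothed trend of each curve can be passed in instead of the raw values.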

3.2.2 Extra Nodes in Graph

In some situations, not all nodes in $G'$ are in clusters, which means that an entity specified by the expert might not appear in the corpus. For example, a semantic-level entity might not exist in the corpus directly. These nodes form the set $V' \setminus W$. To exploit this extra information, each such node $v \in V' \setminus W$ needs to be added to clusters. There are two scenarios for adding $v$:

1. Scenario 1: Add $v$ to the cluster $C_k$ for which $v$ is on the shortest path when calculating $\rho_k$.
2. Scenario 2: If $v$ has not been added in Scenario 1, add $v$ to the clusters that satisfy either of the following two conditions:
(a) the cluster has the most common neighbours with $v$;
(b) the normalized common neighbour measure (Jaccard) is larger than a threshold (we set 50% here).

In the first scenario, $v$ is a critical node for the cluster. In the second scenario, it complements the cluster information. The first condition picks the cluster that has the most common neighbours with $v$, and the second condition makes sure that all clusters containing a large portion of $v$'s neighbours are included. For scenario 1, $v$ is added to the clusters directly while calculating $\rho$. For scenario 2, $v$ is added after the cluster number is determined.
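Scenario 1 can be sketched as a shortest-path membership test: a graph node that is absent from the corpus joins a cluster if it lies on some shortest path between two cluster members that are in the graph. The sketch assumes the inferred Figure 1 edge list, a connected graph, and the example corpus {A,B,C,D,F,G,H,I,K} used later in Section 3.2.4. The function names are illustrative.

```python
from collections import deque
from itertools import combinations

# Figure 1 as inferred from the text (an assumption).
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "E"), ("D", "E")]

def adjacency(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def distances(adj, source):
    """Hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def bridging_extra_nodes(adj, cluster, corpus_words):
    """Graph nodes missing from the corpus that sit on a shortest path
    between two cluster members present in the graph."""
    dist = {v: distances(adj, v) for v in adj}
    members = sorted(set(cluster) & set(adj))
    extra = set(adj) - set(corpus_words)
    found = set()
    for s, t in combinations(members, 2):
        for v in extra:
            if dist[s][v] + dist[v][t] == dist[s][t]:
                found.add(v)
    return found

adj = adjacency(EDGES)
corpus = set("ABCDFGHIK")
added_to_cdhi = bridging_extra_nodes(adj, set("CDHI"), corpus)
added_to_abkh = bridging_extra_nodes(adj, set("ABKH"), corpus)
print(added_to_cdhi, added_to_abkh)
```

Here E is picked up for the cluster {C,D,H,I} because the only shortest path between C and D runs through B and E, and E is the one node on it that the corpus lacks.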

3.2.3 Seed Graph Validation

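To evaluate the quality of the seed graph, every node that shares no cluster with its neighbours is selected for re-evaluation by experts. Using Figure 1 as an example, if the cluster result is $\{\{A, D\}, \{B, C\}, \{B, E\}, \{D, E\}\}$, A's neighbours $\Gamma(A)$ are B and C. B and C are in clusters $\{B, E\}$ and $\{B, C\}$, and A is in neither of these two clusters. A's relations with all its neighbours will therefore be re-examined by experts.

$$\forall V'_i \in G':\ \text{if } |\mathcal{C}(V'_i) \cap \mathcal{C}(\Gamma(V'_i))| = 0,\ \text{check } V'_i \Leftrightarrow \Gamma(V'_i)$$

where $\mathcal{C}(\cdot)$ denotes the set of clusters that contain the given node or any of the given nodes.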
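The validation rule can be sketched in a few lines of Python: a node is flagged when none of the clusters that contain it also contains one of its graph neighbours. The adjacency map below encodes the inferred Figure 1 graph, and the cluster result is the one from the example. Note that under this reading a node that appears in no cluster at all is flagged as well, which is an assumption.

```python
# Figure 1 as inferred from the text (an assumption).
ADJ = {
    "A": {"B", "C"},
    "B": {"A", "C", "E"},
    "C": {"A", "B"},
    "D": {"E"},
    "E": {"B", "D"},
}

def nodes_to_recheck(adj, clusters):
    """Nodes that share no cluster with any of their neighbours."""
    flagged = []
    for node in sorted(adj):
        own_clusters = [c for c in clusters if node in c]
        shares = any(c & adj[node] for c in own_clusters)
        if not shares:
            flagged.append(node)
    return flagged

clusters = [{"A", "D"}, {"B", "C"}, {"B", "E"}, {"D", "E"}]
flagged = nodes_to_recheck(ADJ, clusters)
print(flagged)
```

For the example cluster result only A is returned, so only A's relations with B and C would be sent back to the experts.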

3.2.4 Link Prediction

After the number of clusters is determined, the next step is link prediction between graph nodes and cluster members. This is straightforward if only one entity from a cluster exists in the graph, but in most situations several entities from the same cluster are in the graph. Still using Figure 1 as an example, suppose the corpus includes {A,B,C,D,F,G,H,I,K}, grouped into four clusters: $C_1$: {A,F,G}, $C_2$: {A,B,K,H}, $C_3$: {C,D,H,I} and $C_4$: {B,F}. The relation between cluster members and the graph can be summarized into 4 scenarios:

• Scenario 1: Only one cluster member exists in the graph, and the other non-graph members are associated only with that specific cluster.
• Scenario 2: Only one cluster member exists in the graph, and the other non-graph members might also exist in different clusters.
• Scenario 3: Multiple cluster members exist in the graph, and the other non-graph members are associated only with that specific cluster.
• Scenario 4: Multiple cluster members exist in the graph, and the other non-graph members might also exist in different clusters.

In the above example, G belongs to scenario 1, F belongs to scenario 2, K and I belong to scenario 3, and H belongs to scenario 4.
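The four scenarios can be expressed as a small classification function. The sketch below treats a word as "multi-graph" if any cluster containing it has more than one member in the graph; that tie-breaking rule for mixed cases is an assumption, as the text does not cover them. The graph nodes are taken to be A to E, as in Figure 1.

```python
def scenario(word, clusters, graph_nodes):
    """Classify a non-graph word into scenarios 1-4 of Section 3.2.4."""
    containing = [c for c in clusters if word in c]
    in_several_clusters = len(containing) > 1
    # assumption: one cluster with several graph members is enough
    several_graph_members = any(len(c & graph_nodes) > 1 for c in containing)
    if not several_graph_members:
        return 2 if in_several_clusters else 1
    return 4 if in_several_clusters else 3

graph_nodes = set("ABCDE")
clusters = [set("AFG"), set("ABKH"), set("CDHI"), set("BF")]
labels = {w: scenario(w, clusters, graph_nodes) for w in "GFKIH"}
print(labels)
```

With the example clusters this reproduces the assignment given in the text: G is scenario 1, F is scenario 2, K and I are scenario 3, and H is scenario 4.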

To solve the link prediction problem across these 4 scenarios, we follow the method in Section 3.1.3, which finds the coarser cluster-level linkage first and the finer individual entity-level linkage second. The result of the cluster-level linkage is shown in Table 1.

Node E is first added to the cluster {C,D,H,I} according to Section 3.2.2, since E is on the shortest path between C and D. Table 1 provides the


Figure 4: Framework components: rectangular boxes are modules, and rounded boxes are results. Solid dots represent entities from experts, and empty dots represent entities from the corpus. Thicker solid rectangular boxes are the key components. After the important predicted links are validated, the validated graph is sent back to replace the Seed Graph, and the graph is expanded incrementally.

Table 1: Link Prediction: Cluster Alignment.

Cluster Members   New Clusters   Ego-centric Networks   JS
AFG               AFG            ABC                    1/5
ABKH              ABKH           ABC                    2/5
ABKH              ABKH           ABCE                   1/3
CDHI              CDHIE          ABC                    1/7
CDHI              CDHIE          DE                     2/5
CDHI              CDHIE          BED                    1/3
BF                BF             ABCE                   1/5

detailed calculation. Both cluster {A,F,G} and cluster {A,B,K,H} match A's ego-centric network {A,B,C}, cluster {C,D,H,I} matches E's ego-centric network {B,D,E}, and cluster {B,F} matches B's ego-centric network {A,B,C,E}. The sub-graph and cluster matching results are shown in Figure 5. The solid nodes are the nodes in the graph, and the dashed nodes are the candidate nodes to be added to the graph. The color indicates the coarser-level cluster alignment.
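The Jaccard similarities listed in Table 1 can be re-derived with a few lines of Python. The sketch only recomputes the cluster and ego-network pairs that appear in the table, using exact fractions. The set contents are read off the table, with E already added to the cluster {C,D,H,I}.

```python
from fractions import Fraction

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| as an exact fraction."""
    return Fraction(len(a & b), len(a | b))

# (cluster after adding extra nodes, ego-centric network) pairs from Table 1
ROWS = [
    ("AFG", "ABC"),
    ("ABKH", "ABC"),
    ("ABKH", "ABCE"),
    ("CDHIE", "ABC"),
    ("CDHIE", "DE"),
    ("CDHIE", "BED"),
    ("BF", "ABCE"),
]

table = [(c, e, jaccard(set(c), set(e))) for c, e in ROWS]
for cluster, ego, score in table:
    print(f"{cluster:6s} {ego:5s} {score}")
```

Each printed value can be compared directly with the JS column of Table 1.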

Figure 5: Solid circles are nodes in the graph, and dotted circles are candidate entities. The same color represents the same alignment. (a) Cluster {AFG} aligns with A's ego network {ABC}; (b) cluster {ABKH} aligns with A's ego network {ABC}; (c) cluster {CDHI} aligns with E's ego network {BDE}; (d) cluster {BF} aligns with B's ego network {ABCE}.

Table 2: Link Prediction: Entity Linkage.

            A      B      C      D      E
            1100   0101   0010   0010   0010
H-0110      0.5    0.5    0.71   0.71   0.71
K-0100      0.71   0.71   0      -      -
I-0010      -      0      -      1      1
F-1001      0.5    0.5    0      -      0
G-1000      0.71   0      0      -      -

After the cluster alignment, the second step of link prediction is to link the individual entities. H is in both clusters {A,B,K,H} and {C,D,H,I,E}, and these two clusters link to B's and E's ego-centric networks, {A,B,C,E} and {B,E,D} respectively, so H needs to be compared for similarity with all the existing nodes A, B, C, D and E. K exists only in cluster {A,B,K,H}, and {A,B,K,H} has a larger Jaccard index with A's ego-centric network {A,B,C} than with B's ego-centric network {A,B,C,E}, so K only needs its cosine similarity computed with nodes A, B and C. Following the same logic, I needs to be compared with B, E and D, and F needs to be compared with A, B, C and E. The computed cosine similarities are shown in Table 2, and the graph after link prediction is shown in Figure 6.
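The entity-level step can be illustrated by building the cluster membership vectors and computing cosine similarities for the pairs discussed above. The sketch assumes the cluster order {A,F,G}, {A,B,K,H}, {C,D,H,I,E}, {B,F}, with E counted in the third cluster because it was added there per Section 3.2.2, so its vector is 0010. The helper names are illustrative.

```python
import math

# Clusters in the order used for the bit positions (E added to the third).
CLUSTERS = [set("AFG"), set("ABKH"), set("CDHIE"), set("BF")]

def membership_vector(entity, clusters):
    """K-bit vector: bit k is 1 when the entity belongs to cluster k."""
    return [1 if entity in c else 0 for c in clusters]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vec = {e: membership_vector(e, CLUSTERS) for e in "ABCDEFGHIK"}

def sim(word, node):
    """Rounded cosine similarity between a candidate word and a graph node."""
    return round(cosine(vec[word], vec[node]), 2)

h_row = [sim("H", n) for n in "ABCDE"]
f_row = [sim("F", n) for n in "ABCE"]
print(vec["H"], h_row, f_row)
```

The rows for H and F agree with Table 2 only when E's vector is 0010, which is consistent with E having been added to the cluster {C,D,H,I}.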

Figure 6: Dashed lines are predicted links.

3.2.5 Incremental Validation

After the candidate words $W \setminus V'$ are added to $G'$, expert evaluation is needed to confirm the expansion. When $|V'| \ll |W \setminus V'|$, the criterion for recommending candidate words for evaluation is important. We adopt an incremental method to recommend them to


Table 3: Procedure to extend $G'$ with $W$.

Procedure to expand $G'$ with $W$
Initial:
1: Input the expert-designed graph $G' = \{V', E'\}$
2: Initial number of clusters $K = |V'|$
Find the best cluster number $K$:
3: While $K \ge 2$
4:     Cluster $W$ into $C_k$, $k \in [1, K]$, with a hierarchical clustering method
5:     for each node $V'_i \in G'$
6:         calculate dispersion $\zeta_{V'_i}$
7:     for each topic $C_k$, $k \in [1, K]$
8:         calculate reachability $\rho_k$
9:         if a node $v \in V' \setminus W$ is on the shortest path while calculating $\rho_k$
10:            add $v$ to $C_k$
11:    if $\mu(\zeta) \le 0.2$ and $\rho \approx 1$, exit;
12:    else $K = K - \text{skip}$, go back to Step 3
13: The best $K$ is the cross point of $\zeta$ and $\rho$
14: if a node $v \in V' \setminus W$ is not in $C$
15:    addV($v$, $C$)
Seed Graph Validation of $G'$:
16: for each $V'_i \in V'$
17:    if $V'_i$ shares no cluster with $\Gamma(V'_i)$
18:        output $V'_i$ and $\Gamma(V'_i)$ for re-evaluation
Expand $G'$ with $W$:
19: vectorize $V'$ and $W$ based on the cluster result from Step 15
20: for each $V'_i \in V'$
21:    calculate $J_{i,k}$ between $\Gamma(V'_i)$ and $C_k$
22:    find $J_{i,k} > 0.5$ or $\arg\max_{k \in [1,K]}(J_{i,k})$
23: for each $k$ in Step 22
24:    calculate $\text{cosine}(\nu_{V_j}, \nu_{W_p})$, where $V_j \in \Gamma(V'_i)$, $W_p \in C_k$
25:    find $\arg\max$ of the cosine, add a link between $W_p$ and $V_j$
Evaluate $W$:
26: for each $W_p$ added to the graph
27:    if it forms a cycle for a relation that forbids cycles, output $W_p$
28:    else calculate $\kappa_{W_p}$
29: Sort $\kappa$ from high to low for validation.
30: Form the new graph $G$ after validation and send it back to Step 1.

function addV($v$, $C$): add $v$ to clusters:
1: for each $v \in V' \setminus W$
2:    calculate $J_{v,k}$ between $\Gamma(v)$ and $C_k$
3:    find $J_{v,k} > 0.5$ or $\arg\max_{k \in [1,K]}(J_{v,k})$
4:    add $v$ to $C_k$

experts. First, degree centrality, closeness centrality and betweenness centrality are combined to rank the candidate words. The candidates whose combined score is larger than a threshold are recommended to experts. After expert validation, the new graph is sent back to replace the seed graph. The whole process repeats until all candidate words have been added to $G$ or the combined score of every remaining candidate is smaller than the threshold.

The summarized procedure is illustrated in Table 3.

4 EXPERIMENTS

The proposed framework has been tested on an Enterprise Finance Knowledge Management System. The corpus is collected from official documents and internal community discussion blogs. It covers rules, regulations, policies, and manually collected Q&A regarding accounting, tax, capital, etc. The objective is to construct multiple intelligent bots that can answer questions or serve as search engines. Details are withheld to protect the company's proprietary information, and the procedure is discussed through the graph properties without loss of generality. The corpus is a few gigabytes in size, and the initial expert-designed seed graph has a few hundred nodes, most of them in hierarchical relations. The density of the seed graph is 0.031. Hierarchical agglomerative clustering (HAC) could be used to obtain monotonically increasing and decreasing $\zeta$ and $\rho$ here, but a clustering method based on character similarity lacks latent topics, which does not fit our situation well. We adopt the latent Dirichlet allocation (LDA) model in practice. Even though LDA is not a strict hierarchical clustering method, the cluster result is hierarchical-like if the corpus has a good hierarchical structure. $\zeta$ and $\rho$ then do not increase or decrease strictly monotonically, but the increasing and decreasing trends are good enough. We set the maximum topic number to half the number of seed graph nodes and gradually reduce the topic number to one. Dispersion $\zeta$ decreases slowly as the number of topics is reduced, up to a change point. After the change point, $\zeta$ drops quickly, which means the graph structures are more concentrated in clusters. Reachability $\rho$ tends to increase as the number of topics decreases, but it is not monotonic: there are situations where $\rho_k \le \rho_{k+1}$. Globally, $\rho$ increases as the number of topics decreases. In practice, we use the trend of $\rho$ instead of the actual line to find the cross point with $\zeta$. The changes of dispersion and reachability with the number of clusters for the first round are shown in Figure 7.

After validation, the final graph has around a few thousand entities, with a density of 0.23. After the cold start, the base graph serves different bots and is expanded through usage. The whole EKM system has been running for one year and has accumulated close to a million entities.

Figure 7: Dispersion and Reachability for real application.

There are some practical issues we would like to address here. First, in an enterprise the corpus has low frequency and high precision, which is a challenge for frequency-based data-driven methods. Second, the construction of the seed graph can adopt the Enterprise Architecture, which provides a comprehensive view of the business. Last, expert knowledge needs crowd sourcing too, especially when the knowledge crosses different domains, and the validation of expert-driven crowd sourcing is another challenge.

5 CONCLUSION

This paper proposes a framework to solve the enterprise knowledge graph cold start problem. The framework takes advantage of graph-form expert knowledge and uses it to guide the clustering method, which integrates expert knowledge with data-driven clustering.

Coarser cluster-level linkage and finer entity-level linkage are adopted based on entity vectorization. Incremental validation is used to gradually add the words to the graph. The framework is validated on an enterprise finance knowledge graph that supports reasonable knowledge consumption. Search engines and Q&A apps will be constructed based on this graph, and currently available data-driven techniques will be used to expand and fine-tune the graph.

REFERENCES

Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E. R., and Mitchell, T. M. (2010). Toward an architecture for never-ending language learning.

Dong, X. L. (2018). Challenges and innovations in building a product knowledge graph. In 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2869–2869.

Duan, R. and Xiao, Y. (2019). Enterprise knowledge graph from specific business task to enterprise knowledge management. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, pages 2965–2966.

Ferrucci, D. and Brown, E. (2011). Adaptwatson: A methodology for developing and adapting watson technology. IBM, Armonk, NY, IBM Res. Rep., RC25244.

Fortunato, S. (2009). Community detection in graphs. Physics Reports, 486(3):75–174.

Galkin, M., Auer, S., Vidal, M.-E., and Scerri, S. (2017). Enterprise knowledge graphs: A semantic approach for knowledge management in the next generation of enterprise information systems. In International Conference on Enterprise Information Systems (ICEIS), volume 2, pages 88–98, Porto, Portugal. SciTePress.

Geng, L., Semerci, M., Yener, B., and Zaki, M. J. (2012). Effective graph classification based on topological and label attributes. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(4):265–283.

Liu, J., Shang, J., Wang, C., Ren, X., and Han, J. (2015). Mining quality phrases from massive text corpora. In 2015 ACM SIGMOD International Conference on Management of Data, pages 1729–1744. ACM.

Lü, L. and Zhou, T. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150–1170.

Martínez, V., Berzal, F., and Cubero, J.-C. (2017). A survey of link prediction in complex networks. ACM Computing Surveys (CSUR), 49(4):69.

Menon, A. K. and Elkan, C. (2011). Link prediction via matrix factorization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 437–452. Springer.

Pan, J. Z., Vetere, G., Gomez-Perez, J. M., and Wu, H. (2017). Exploiting Linked Data and Knowledge Graphs in Large Organisations. Springer Publishing Company, Incorporated, 1st edition.

Ramaswami, P. (2007). A remedy for your health-related questions: health info in the knowledge graph.

Rogers, D. and Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information & Modeling, 50(5):742–754.

Shang, J., Liu, J., Meng, J., Xiang, R., Voss, C. R., and Han, J. (2017). Automated phrase mining from massive text corpora. IEEE Transactions on Knowledge & Data Engineering, PP(99):1–1.

Shervashidze, N., Vishwanathan, S. V. N., Petri, T., Mehlhorn, K., and Borgwardt, K. M. (2009). Efficient graphlet kernels for large graph comparison. In 12th International Conference on Artificial Intelligence and Statistics, pages 488–495.

Suchanek, F. M., Kasneci, G., and Weikum, G. (2007). Yago: a core of semantic knowledge. pages 697–706.

Taskar, B., Wong, M.-F., Abbeel, P., and Koller, D. (2004). Link prediction in relational data. In Advances in Neural Information Processing Systems, pages 659–666.

Wang, J., Zhang, Z., and Zha, H. (2005). Adaptive manifold learning. In Advances in Neural Information Processing Systems, pages 1473–1480.

Wu, X., Zhao, L., and Akoglu, L. (2018). A quest for structure: Jointly learning the graph structure and semi-supervised classification. In 27th ACM International Conference on Information and Knowledge Management, CIKM '18, pages 87–96. ACM.

Zhang, X. and Lee, W. S. (2007). Hyperparameter learning for graph based semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, pages 1585–1592.

Zhu, X., Ghahramani, Z., and Lafferty, J. D. (2003). Semi-supervised learning using gaussian fields and harmonic functions. In 20th International Conference on Machine Learning (ICML-03), pages 912–919.
