GRANEF: Utilization of a Graph Database for Network Forensics

Milan Cermak (https://orcid.org/0000-0002-0212-6593) and Denisa Sramkova (https://orcid.org/0000-0002-3746-5114)

Institute of Computer Science, Masaryk University, Brno, Czech Republic

Keywords:

Network Forensics, Graph Database, Dgraph, Zeek, Association-based Analysis.

Abstract:

Understanding the information in captured network traffic, extracting the necessary data, and performing incident investigations are principal tasks of network forensics. The analysis of such data is typically performed by tools allowing manual browsing, filtering, and aggregation or tools based on statistical analyses and visualizations facilitating data comprehension. However, the human brain is used to perceiving the data in associations, which these tools can provide only in a limited form. We introduce the GRANEF toolkit, which demonstrates a new approach to exploratory network data analysis based on associations stored in a graph database. In this article, we describe data transformation principles, utilization of a scalable graph database, and data analysis techniques. We then discuss and evaluate our proposed approach using a realistic dataset. Although we are at the beginning of our research, the current results show the great potential of association-based analysis.

1 INTRODUCTION

Network forensics covers a variety of techniques used for cyber-attack investigation, information gathering, and legal evidence using identification, capture, and analysis of network traffic (Khan et al., 2016). The crucial part is the analysis of collected data (e.g., packet data or IP flows) to filter and extract the required information and gain a situational overview. Such analysis can be partly automated using anomaly or intrusion detection tools (Fernandes et al., 2018). However, these tools may not reveal details important to evidence collection, and therefore manual exploratory network data analysis plays an important role, as it allows analysts to verify detected anomalies, examine contexts, or extract additional information.

One of the main challenges of the exploratory analysis of network traffic is the volume of data, which places high demands on computational resources. Besides, forensic analysis requires that the analyst has access to all the data, which limits the use of some automated tools aggregating the data. Such analysis is typically based on two approaches: interactive raw data analysis and statistical analysis. Tools such as Wireshark or Network Miner are commonly used in interactive raw data analysis to filter, aggregate, and extract meaningful information. Their disadvantages are limited visualization, a limited amount of obtained information, high demands on computing resources, and limited automation of analysis queries. In the statistical approach, the significant packet elements are extracted from network traffic and visualized in the form of various statistical charts and overview visualizations. The main advantage of tools such as Arkime or Elastic Stack is the ability to process large amounts of network data and provide an overview via interactive visualizations. Nevertheless, because of the data aggregation, the analyst has limited access to raw data.

Our research aims to combine the advantages of both approaches and enable the analyst to investigate the captured data using interactive visualization. To achieve this goal, we introduce the GRANEF toolkit focused on association-based network traffic analysis. This method is widely used to analyze real-world objects, social networks, or as part of criminal investigation (Atkin, 2011). It also reflects the way people naturally think (Zhang et al., 2020). In contrast to current methods focused only on host relations, we focus on an exploratory analysis of all significant attributes of collected network traffic data, including connection properties and application data. The toolkit is based on the graph database Dgraph (Dgraph Labs, Inc., 2021), capable of storing and analyzing a large volume of logs provided by the Zeek (The Zeek Project, 2020) network security monitor. Unlike interactive raw data analysis, our approach allows analysts to browse, filter, and aggregate all collected information and visualize the results in a relationship diagram providing a broader context to the analyzed data.

Cermak, M. and Sramkova, D.

GRANEF: Utilization of a Graph Database for Network Forensics.

DOI: 10.5220/0010581807850790

In Proceedings of the 18th International Conference on Security and Cryptography (SECRYPT 2021), pages 785-790

ISBN: 978-989-758-524-1

© 2021 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

2 RELATED WORK

Commonly used techniques, analysis methods, and research directions of network forensics are summarized in the survey by Khan et al. (Khan et al., 2016). In addition to a taxonomy proposal, they also summarize the open challenges and discuss possible solutions. A well-arranged insight into the area is also provided by Ric Messier’s book (Messier, 2017) presenting the whole process of network forensics together with commonly used tools. The main emphasis is on the practical use of these tools in a real-world environment, allowing us to better understand the analyst’s needs. Besides analysis approaches used in the network forensics area, our research is also motivated by criminal investigation processes. To solve the crime and maintain an overview of the whole case, criminal investigators typically capture associations between real-world objects and events through link analysis (Atkin, 2011). Thanks to this approach, they can maintain a good overview of the data while preserving all the analysis details, which is also the goal of network forensics.

The utilization of graph databases for network traffic analysis was introduced by Neise (Neise, 2016). He proposed to use Zeek for data extraction and store the data in the Neo4j graph database (Neo4j, 2021). To capture the extracted information, Neise proposed a simple data model, which we further develop in our work. Besides, we propose to utilize Dgraph to efficiently store and analyze large amounts of data, which is difficult to achieve in the Neo4j database. The use of Neo4j is also proposed by Diederichsen et al. (Diederichsen et al., 2019). They, however, focused only on the analysis of connection, DNS, and HTTP logs. They designed a data model that takes into account all attributes in the form of associations. This approach generates many nodes and edges, which places huge demands on storage and computing capacity. Another example of a graph-based network traffic analysis is Sec2graph proposed by Leichtnam et al. (Leichtnam et al., 2020). They have further developed the approach of Neise and proposed automatic detection of attacks and anomalies. They did not store the data in a database for exploratory analysis but transformed them into associations, which they analyzed using machine learning.

3 TOOLKIT DESIGN

The central part of the GRANEF toolkit is the graph database Dgraph, which enables scalable storage and processing of large network traffic captures. The toolkit further consists of tools for data preprocessing as well as exploratory analysis. These data processing and analysis tools are implemented as standalone modules in the form of Docker containers, where one module can implement more than one tool, or one tool can be implemented by more than one module, as shown in Figure 1. For example, the indexing and graph database tools, both working directly with a running instance of Dgraph, use the functionality of one Data handling module. The Transformation module is our custom solution, and the remaining modules are based on already existing tools.

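[Figure 1 diagram: a PCAP file passes through the extraction, transformation, indexing, graph database, and analysis tools, which are implemented by the Extraction module, Transformation module, Data handling module, API module, and Web module.]

Figure 1: Data pipeline of the GRANEF toolkit.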

Separation of data processing into standalone modules allows us to easily replace or update some modules without changing the remaining ones, as long as the compatibility with subsequent modules is preserved. Besides, this approach allows us to store intermediate results and use them in other analysis tools or speed up the data processing for a new analysis.

3.1 Data Extraction

Network traffic captures are initially processed by Zeek, which extracts information from packet headers and application layers (e.g., from HTTP, DNS, TLS, and SSH protocols) and stores it in log files. By default, it aggregates packets to connections and stores their characteristics. Individual records across log files are linked through a unique connection identifier, which makes it easy to link the extracted data as associations. The advantage of Zeek is the variety of data processing settings and especially the possibility of extending it with new extraction methods. This functionality makes it possible to respond to various requirements of network traffic forensics and reflect new trends and applications. One possible extension to the Extraction module would be to add the export of transferred application data or files. Zeek can save captured files in a separate folder, while the reference to these files is retained in the corresponding log. It is also possible to extend packet analysis scripts and extract additional information about the connection that is not available in the default configuration. The modularity of the GRANEF toolkit plays an important role in this case, as it allows us to prepare several containers with various configurations and data processing extensions that reflect the different requirements of the current network forensics case.

3.2 Data Transformation

The Transformation module takes log files produced by Zeek in the previous module and converts them to the RDF triples format (W3C, 2014) accepted by Dgraph. This conversion of log data is performed by a custom script that processes selected log files record by record. Since each log file has a predefined set of attributes, we can manually decide which ones to transfer to the database and how to treat them. This approach makes it very easy to incorporate any changes in the design of the database schema or any information obtained from external sources. Such information can be, for example, an attribute value that indicates that the host with a given IP address has some property that was discovered during forensic analysis. This information can also be added later through a unique external identifier given to the node at the stage of its definition.
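The record-by-record conversion can be illustrated with a minimal Python sketch. It assumes that Zeek writes conn.log as JSON (one object per line) and that the output is a list of RDF triples with blank nodes. The predicates host.ip, host.originated, host.responded, and connection.proto follow the names visible in Figure 3; host.communicated, the remaining connection.* predicates, and all helper names are hypothetical and do not describe the actual GRANEF script.

```python
import json
import re

# Zeek conn.log field -> predicate name. Only connection.proto appears in the
# paper (Figure 3); the other predicate names are hypothetical.
CONNECTION_ATTRIBUTES = {
    "proto": "connection.proto",
    "ts": "connection.ts",
    "duration": "connection.duration",
    "id.orig_p": "connection.orig_p",
    "id.resp_p": "connection.resp_p",
    "orig_bytes": "connection.orig_bytes",
    "resp_bytes": "connection.resp_bytes",
}


def blank(prefix, key):
    """Build a blank-node label that is stable for a given key (e.g. an IP address)."""
    return "_:" + prefix + "_" + re.sub(r"[^A-Za-z0-9]", "_", str(key))


def literal(value):
    """Quote a value as a string literal; JSON escaping covers quotes and backslashes."""
    return json.dumps(str(value))


def conn_to_triples(record):
    """Convert one parsed conn.log record into a list of RDF triples."""
    orig = blank("host", record["id.orig_h"])
    resp = blank("host", record["id.resp_h"])
    conn = blank("conn", record["uid"])

    triples = []
    # Communicating hosts become separate Host nodes.
    for node, ip in ((orig, record["id.orig_h"]), (resp, record["id.resp_h"])):
        triples.append(f'{node} <dgraph.type> "Host" .')
        triples.append(f"{node} <host.ip> {literal(ip)} .")

    # The connection itself becomes a node; selected fields become attributes.
    triples.append(f'{conn} <dgraph.type> "Connection" .')
    for field, predicate in CONNECTION_ATTRIBUTES.items():
        if field in record:  # Zeek omits unset fields in JSON output
            triples.append(f"{conn} <{predicate}> {literal(record[field])} .")

    # Associations between the nodes.
    triples.append(f"{orig} <host.originated> {conn} .")
    triples.append(f"{resp} <host.responded> {conn} .")
    triples.append(f"{orig} <host.communicated> {resp} .")  # hypothetical name
    return triples
```

Calling `conn_to_triples(json.loads(line))` for each line of conn.log and writing the returned strings to a file would produce input for the indexing step. Because a host label is derived from its IP address, a host that appears in many records maps to the same blank node within one load.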

The conversion is done according to a schema whose simplified form is shown in Figure 2. This schema is based on Neise (Neise, 2016) and Leichtnam et al. (Leichtnam et al., 2020), who represent individual logs as separate nodes and connect them with defined associations. The information contained in log records is stored in the database as node attributes, which allows filtering or aggregation to be performed on them. Compared to previously proposed schemas, we add an additional edge, communicated, between individual hosts to facilitate the definition of queries focused only on the connection’s existence and to optimize the query execution. Communicating hosts are extracted from the connection log and represented as separate nodes. We also simplify edge naming to be uniform throughout all logs and make it easier to query the entire schema. The resulting schema is designed to reflect people’s common perception of how a computer network works and simplifies analysis, as queries can be formed at the highest level of abstraction.

[Figure 2 diagram: node types Host, Connection, <Application>, and <Host-data>, linked by the edges originated, responded, communicated, produced, <host-data>, and <host-data/uid>.]

Figure 2: Simplified database schema showing nodes and their associations.

Each node of the schema has an assigned type. Host nodes represent a device on the network with a given IP address. These nodes can be associated with Host-data nodes containing information extracted from application data related to the host. Examples of such data are domain names extracted from DNS, HTTP, or TLS traffic. Further, it can refer to transferred files, certificates, or user-agents. It is also possible to associate external information relevant to the host, such as details from reputation databases. The Connection nodes contain information about the network connection, such as its duration, the number of bytes transferred, relevant ports, and used protocol. The Application nodes contain application data extracted from the Connection and may be mutually connected by an additional edge. The host-data/uid edge is present to preserve which Application node created the associated Host-data node. All edges are directional but allow reverse processing for querying from an arbitrary node regardless of its type.
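How Application and Host-data nodes relate to a connection can be sketched for a transferred file. The example below assumes a parsed Zeek files.log record with the fields fuid, conn_uids, and tx_hosts. The node types Files and File and the edges connection.produced and files.fuid follow Figures 3 and 4; the predicate host.file, the choice of the sending host as the owner of the Host-data node, and the helper names are hypothetical.

```python
import re


def blank(prefix, key):
    """Build a blank-node label that is stable for a given key."""
    return "_:" + prefix + "_" + re.sub(r"[^A-Za-z0-9]", "_", str(key))


def files_to_triples(record):
    """Turn one files.log record into Application and Host-data triples."""
    fuid = record["fuid"]
    app = blank("files", fuid)   # Application node: the log record itself
    data = blank("file", fuid)   # Host-data node: the transferred file

    triples = [
        f'{app} <dgraph.type> "Files" .',
        f'{data} <dgraph.type> "File" .',
        # <host-data/uid> edge: records which Application node created
        # the Host-data node.
        f"{app} <files.fuid> {data} .",
    ]
    # Link the Application node to every connection that carried the file.
    for uid in record.get("conn_uids", []):
        triples.append(f"{blank('conn', uid)} <connection.produced> {app} .")
    # <host-data> edge from the sending host; host.file is a hypothetical name.
    for ip in record.get("tx_hosts", []):
        triples.append(f"{blank('host', ip)} <host.file> {data} .")
    return triples
```

In Dgraph, traversal against the edge direction (written with a `~` prefix, as in `~host.responded` in Figure 3) requires the predicate to be declared with the `@reverse` directive in the schema.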

Thanks to the universal definition of the proposed schema, it is possible to transform other types of data related to network traffic analysis in a similar way as the Zeek logs. An example is IP flows, which may contain information about individual connections and can be extended by information extracted from application data (Velan, 2018). Alternatively, it is possible to transform system logs related to network connections or collected from network devices. These transformations can be represented as separate modules of the toolkit to be easily interconnected according to the network forensics case.

3.3 Data Handling

The core part of data handling is the Dgraph cluster consisting of two types of computational nodes. Dgraph Zero controls the cluster and serves as the main component responsible for the orchestration of the database and analysis. Data processing is performed by Dgraph Alpha nodes containing indexed data. At least one Zero node and one Alpha node are needed to handle stored data. Additional details about the database and its data analysis abilities can be found in its documentation (Dgraph Labs, Inc., 2021).

The Data handling module consists of indexing and graph database components, working directly with an instance of Dgraph. The indexing component uploads and indexes RDF triples and stores them in an internal database structure. The main part of the component is Dgraph Bulk Loader, which operates on the MapReduce concept and utilizes the available computational resources. In addition, the component allows us to specify the number of Alpha nodes that will be utilized in the following graph database component. Large volumes of data can thus be distributed within the cluster while maintaining the ability to perform fast analysis over stored data. Results of the indexing component are binary files storing both the data and indexes. The advantage of this approach is a reduction of data processing time when the data are reloaded. Besides, it is possible to use the generated index within another instance of Dgraph deployed on a more powerful computation node.
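As an illustration of the indexing step, the following sketch assembles a Dgraph Bulk Loader command line in Python. The flags `-f`, `-s`, `--map_shards`, `--reduce_shards`, `--zero`, and `--out` belong to the `dgraph bulk` command; the file names, the default Zero address, and the choice of one map shard per Alpha group are assumptions made for the example and are not taken from the GRANEF implementation.

```python
def bulk_loader_command(rdf_path, schema_path, alpha_groups,
                        zero="localhost:5080", out_dir="out"):
    """Return the argument list for a Dgraph Bulk Loader run.

    alpha_groups sets the number of reduce shards, i.e. the number of
    Alpha groups the resulting data files are split into.
    """
    if alpha_groups < 1:
        raise ValueError("at least one Alpha group is required")
    return [
        "dgraph", "bulk",
        "-f", rdf_path,                        # RDF triples from the transformation step
        "-s", schema_path,                     # schema with types and indexes
        "--map_shards", str(alpha_groups),     # must not be lower than reduce shards
        "--reduce_shards", str(alpha_groups),  # one output directory per Alpha group
        "--zero", zero,                        # address of a running Dgraph Zero
        "--out", out_dir,                      # where the binary data files are written
    ]
```

The returned list could be passed to `subprocess.run(cmd, check=True)` on a machine where a Dgraph Zero node is already running. The directories written under `out_dir` are the reusable binary files mentioned above.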

The graph database component takes care of managing Dgraph nodes and their communication. Data provided by the indexing component are loaded to Alpha nodes. The exposed Dgraph user interface allows, among other things, basic queries to be performed over the data. However, it is not suitable for exploratory analysis as it has only a limited degree of interaction. The analyst must also know the specifics of the query language, which complicates the adoption of the proposed network forensics approach.

3.4 Data Analysis

Data stored in Dgraph are queried using Dgraph Query Language (DQL), which is based on GraphQL. An example of such a query is provided in Figure 3, which selects TCP connections and transferred files from a local network. A DQL query finds nodes based on search criteria, matches patterns in the graph, and returns a graph in JSON format (Dgraph Labs, Inc., 2021). Queries are composed of nested blocks; their evaluation starts by finding the initial set of nodes specified in the query root, against which the graph matching is applied. In addition to filtering, DQL allows variable definition and data aggregation. Thanks to the pre-defined schema, results are predictable. A disadvantage is that DQL is not widespread yet, and the analyst must devote some time to learning it before performing advanced queries. To overcome this issue, we have created an additional analysis module providing an abstraction layer over DQL.

The GRANEF analysis tool consists of two modules: the Application interface (API) module and the Web user interface module. This approach supports greater versatility of the entire solution, as it is possible to connect other systems to the API without the need to use a web user interface. The API implements querying and processing of data stored in Dgraph, while only filter properties or immersion rates are required as input. The provided API functions reflect common tasks of exploratory analysis and are based on both our experience and the steps typically performed by analysts within our CSIRT.

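{
  # Root: hosts whose IP address falls within 10.10.0.0/16.
  getConn(func: allof(host.ip, cidr, "10.10.0.0/16")) {
    name : host.ip
    # Connections originated by the host, restricted to TCP.
    host.originated @filter(eq(connection.proto, "tcp")) {
      expand(Connection)
      # Application data produced by the connection, including transferred files.
      connection.produced {
        expand(_all_)
        files.fuid { expand(File) }
      }
      # Reverse edge: the host that responded to the connection.
      ~host.responded { responded_ip : host.ip }
    }
  }
}

Figure 3: Selection of local network TCP connections and transferred files using DQL.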
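An abstraction layer over DQL can be pictured as a function that accepts only filter properties and returns the query text. The Python sketch below builds a reduced form of the query from Figure 3 for a given CIDR range and protocol. The function name, the list of accepted protocols, and the input validation are assumptions made for illustration; they do not describe the GRANEF API.

```python
import ipaddress

# Transport protocols accepted by this sketch (an assumption).
ALLOWED_PROTOCOLS = {"tcp", "udp", "icmp"}

# Reduced form of the Figure 3 query with two placeholders.
QUERY_TEMPLATE = """{
  getConn(func: allof(host.ip, cidr, "%(cidr)s")) {
    name : host.ip
    host.originated @filter(eq(connection.proto, "%(proto)s")) {
      expand(Connection)
      ~host.responded { responded_ip : host.ip }
    }
  }
}"""


def connections_query(cidr, proto="tcp"):
    """Return DQL text selecting connections originated from the given network."""
    # Parsing the range rejects anything that is not a valid network, so
    # arbitrary text cannot be injected into the query.
    network = ipaddress.ip_network(cidr)
    if proto not in ALLOWED_PROTOCOLS:
        raise ValueError(f"unsupported protocol: {proto!r}")
    return QUERY_TEMPLATE % {"cidr": str(network), "proto": proto}
```

The resulting string could then be sent to the HTTP `/query` endpoint of a Dgraph Alpha node, and the JSON response passed on to the visualization.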

The web user interface utilizes the API and represents its user-friendly extension that allows performing defined queries and supports exploratory analysis. The query results are displayed in an interactive relationship visualization, which uses a force-directed graph layout and allows node aggregation to show large relationship diagrams while preserving a simple overview of the data. Based on our experience, this layout seems to be the most comprehensible. However, we plan to verify other variants in the future. An example of such a visualization is shown in Figure 4, containing one specific connection from the response to the query from Figure 3. This approach supports interactivity, as the analyst can select nodes or edges, see all attributes, and perform another analytical query over them, while the result is added to the same visualization or displayed in a new analysis tab. As part of the exploratory analysis, it is possible to browse through the associations between information extracted from network traffic and observe a context that would otherwise remain hidden.

[Figure 4 diagram: nodes of types Host, Connection, HTTP, Files, and File, linked by the edges host.originated, ~host.responded, connection.produced, http.resp_fuid, and files.fuid.]

Figure 4: Visualization of one connection between hosts.

4 DISCUSSION

To evaluate the toolkit capabilities, we use network traffic datasets containing realistic scenarios with small captures and larger ones with a size in the order of gigabytes. Analysis of large network traffic captures, in particular, is a typical use case of network forensics, so we pay more attention to it. In this case, however, the analyst expects that preprocessing of such data places considerable computational demands, increasing the processing time. Therefore, greater emphasis is on the subsequent analysis, which must be sufficiently interactive without delays.

4.1 Computational Requirements

To test data processing speed, we have prepared a virtual machine with Debian OS, 4 vCPUs, and 16 GB RAM, which corresponds to today’s ordinary hardware performance. The data processing speed of a small capture file (Digital Corpora, 2020) with the size of several megabytes was affected mostly by container startup. Nevertheless, the processing took tens of seconds on average. To test the processing of a larger network capture, we selected a capture from the second day of the CyberCzech exercise (Tovarňák et al., 2020), which is approximately 6 GB in size and contains 330,564 connections. The average processing time for this file was approximately 7 minutes, with extraction taking approximately 120 seconds, transformation 50 seconds, and indexing 250 seconds. The transformed dataset resulted in 718,475 nodes and 397,632 edges, with an index size of approximately 820 MB. Although this data processing time is not critical for network forensics, it is possible to achieve further improvements by parallelizing the extraction using multiple Zeek runs or using a bigger cluster for the data indexing task.

Once the data are indexed, analytical queries are performed fast, and the results are typically returned in one or two seconds. However, the main challenge is to render the results in the form of relationship visualization. It is necessary to spread nodes in a suitable layout to reasonably support the visual analysis. Besides, a larger number of nodes places great computational demands on rendering and makes the resulting graph less clear. For this reason, it is necessary to allow the grouping of similar nodes so that the overall visualization remains sufficiently responsive. We perceive this visualization requirement as a crucial factor of the toolkit, which we plan to focus on more in future work.
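One simple way to group similar nodes before rendering is to collapse leaf nodes of the same type that hang off the same neighbor. The Python sketch below shows this idea on plain lists and dictionaries. It is an illustrative heuristic with hypothetical names, not the aggregation used in the GRANEF web interface.

```python
from collections import defaultdict


def group_leaf_nodes(node_types, edges, min_group=3):
    """Collapse same-type leaf nodes sharing one neighbor into a single node.

    node_types maps node id -> type name; edges is a list of (a, b) pairs.
    Returns the reduced edge list and a mapping from group label to members.
    """
    if min_group < 2:
        raise ValueError("min_group must be at least 2")

    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # A leaf has exactly one neighbor; collect leaves per (neighbor, type).
    candidates = defaultdict(list)
    for node, adjacent in neighbors.items():
        if len(adjacent) == 1:
            (hub,) = adjacent
            candidates[(hub, node_types[node])].append(node)

    replacement = {}
    groups = {}
    for (hub, node_type), members in candidates.items():
        if len(members) >= min_group:
            label = f"{node_type} x{len(members)} @ {hub}"
            groups[label] = sorted(members)
            for member in members:
                replacement[member] = label

    # Rewrite edge endpoints; the set removes edges that became identical.
    grouped_edges = sorted(
        {(replacement.get(a, a), replacement.get(b, b)) for a, b in edges}
    )
    return grouped_edges, groups
```

A host with hundreds of short connections would then be drawn as the host plus one aggregate node, and the member list would let the interface expand the group on demand.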

4.2 Exploratory Analysis

The main benefit of graph-based network forensics is the support of exploratory analysis. The general queries that are part of the API follow the analyst’s typical behavior. In the beginning, it is essential to restrict the set of nodes we want to focus on. To do so, we need to understand the nature of as many hosts and connections as possible to distinguish unusual network traffic. Examples of such queries are "return all connections and protocol types between two specific hosts" or "return the number of all specified connections for hosts that fall within a given CIDR range". We have also taken advantage of DQL and defined queries utilizing aggregation functions, allowing us, for example, to group all host connections according to the number of transferred bytes.
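The grouping of connections by transferred bytes can be pictured with a small client-side stand-in. The paper performs this aggregation inside DQL; the Python sketch below instead buckets connection records taken from a JSON response. The attribute names connection.orig_bytes and connection.resp_bytes and the bucket boundaries are hypothetical.

```python
from collections import Counter


def bucket_label(total_bytes):
    """Map a byte count to a coarse size bucket."""
    if total_bytes < 1_000:
        return "< 1 kB"
    if total_bytes < 1_000_000:
        return "1 kB - 1 MB"
    if total_bytes < 1_000_000_000:
        return "1 MB - 1 GB"
    return ">= 1 GB"


def group_by_transferred_bytes(connections):
    """Count connections per size bucket; missing byte counts are treated as 0."""
    counts = Counter()
    for conn in connections:
        total = int(conn.get("connection.orig_bytes", 0)) + int(
            conn.get("connection.resp_bytes", 0)
        )
        counts[bucket_label(total)] += 1
    return dict(counts)
```

A host whose connections fall mostly into the smallest bucket but include a single very large transfer would stand out immediately in such a summary.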

The result of a query focused on a subset of outgoing TCP connections of one host can be seen in Figure 5. An advantage of such visualization is that it often allows the analyst to distinguish regular network traffic from suspicious traffic at first glance, based solely on the resulting pattern. In the provided example, it would be relevant to pay attention to the communication with the left node. In the subsequent analysis step, the analyst can select nodes or a group of nodes, further explore their associations, and go into the graph’s depth to explore observed connections.

Figure 5: TCP connections in the National Gallery DC Scenario dataset (Digital Corpora, 2020).

Besides the mentioned advantages, our experience has also shown the challenges that need to be faced with the proposed graph-based network forensics approach. Fast relationship visualization is crucial as it directly affects the exploratory analysis. Another challenge we have encountered is taking time perception into account. Associations of individual connections are created independently of the time context. This approach allows the analyst to overview events that have occurred over a longer time. On the other hand, it is necessary to consider the continuity of individual network connections in certain cases. This can be achieved through appropriate attribute filtering, but a challenge is how to make both of these methods accessible to the analyst. Another challenge associated with graph analysis is the need for a mindset change, as analysts are used to other approaches. However, our experience shows that they can naturally analyze the data provided in this way after a while. This observation requires a more detailed verification, which we plan to perform in future work.

5 CONCLUSION

Graph-based network forensics is a new approach to analyzing network traffic data utilizing modern database technologies capable of storing large amounts of information based on their associations. It follows the typical way of human thinking and perception of the characteristics of the surrounding world. Its main advantage is the connection of exploratory analysis of network traffic data with results visualization, allowing analysts to easily go through the acquired knowledge and visually identify interesting network traffic. Our experience also shows that this approach is not only a new method of data storage and querying but also a shift of mindset that allows us to perceive network data in a new way.

In this paper, we introduced the GRANEF toolkit utilizing the Dgraph database, which stores transformed information from network traffic captures extracted by the Zeek network security monitor. The stored data are presented to the user via a web-based user interface that provides an abstraction layer above the database query language and allows the user to efficiently query data, visualize results in the form of a relationship diagram, and perform exploratory analysis.

The aim of the provided toolkit description was to introduce a new approach to network forensics and incident investigation and describe this solution’s specifics. As part of future work, we want to further compare this approach with other typically used analytical methods, both in terms of functionality and analyst’s behavior. Furthermore, we plan to focus on the definition of new methods for automatic analysis of network traffic based on the associations provided by our proposed data model. We also see great potential in connecting various data types and sources, which could create a unified analytical environment allowing us to analyze the data obtained from hosts and network traffic in one place. The first evaluation results of the proposed approach demonstrate its great potential for network forensics and generally for exploratory analysis of network traffic data.

ACKNOWLEDGEMENTS

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 833418.

REFERENCES

Atkin, H. (2011). Criminal Intelligence: Manual for Analysts. UNODC Criminal Intelligence Manual for Analysts. United Nations Office on Drugs and Crime (UNODC).

Dgraph Labs, Inc. (2021). Native GraphQL Database: The Best Graph DB | Dgraph. https://dgraph.io/. Accessed: 2021-01-21.

Diederichsen, L., Choo, K.-K. R., and Le-Khac, N.-A. (2019). A Graph Database-Based Approach to Analyze Network Log Files. In Network and System Security, pages 53–73. Springer International Publishing.

Digital Corpora (2020). The 2012 National Gallery DC Scenario. https://digitalcorpora.org/corpora/scenarios/national-gallery-dc-2012-attack. Accessed: 2021-01-21.

Fernandes, G., Rodrigues, J. J. P. C., Carvalho, L. F., Al-Muhtadi, J. F., and Proença, M. L. (2018). A comprehensive survey on network anomaly detection. Telecommunication Systems.

Khan, S., Gani, A., Wahab, A. W. A., Shiraz, M., and Ahmad, I. (2016). Network forensics: Review, taxonomy, and open challenges. Journal of Network and Computer Applications, 66:214–235.

Leichtnam, L., Totel, E., Prigent, N., and Mé, L. (2020). Sec2graph: Network Attack Detection Based on Novelty Detection on Graph Structured Data. In Detection of Intrusions and Malware, and Vulnerability Assessment, pages 238–258. Springer International Publishing.

Messier, R. (2017). Network Forensics. John Wiley & Sons, Ltd.

Neise, P. (2016). Intrusion Detection Through Relationship Analysis. Technical report, SANS Institute.

Neo4j (2021). Neo4j Graph Platform - The Leader in Graph Databases. https://neo4j.com. Accessed: 2021-01-30.

The Zeek Project (2020). The Zeek Network Security Monitor. https://zeek.org/. Accessed: 2021-01-21.

Tovarňák, D., Špaček, S., and Vykopal, J. (2020). Traffic and log data captured during a cyber defense exercise. Data in Brief, 31.

Velan, P. (2018). Application-Aware Flow Monitoring. Doctoral theses, dissertations, Masaryk University, Faculty of Informatics, Brno.

W3C (2014). RDF 1.1 N-Triples. https://www.w3.org/TR/n-triples/. Accessed: 2021-01-21.

Zhang, H., Zeng, H., Priimagi, A., and Ikkala, O. (2020). Viewpoint: Pavlovian Materials—Functional Biomimetics Inspired by Classical Conditioning. Advanced Materials, 32(20).
