Anomaly Detection in Communication Networks of Cyber-physical
Systems using Cross-over Data Compression
Hubert Schölnast
, Paul Tavolato
and Philipp Kreimel
Institute of IT Security Research, St. Pölten UAS, Matthias-Corvinus-Straße 15, St. Pölten, Austria
Limes Security GmbH, Hagenberg, Austria
{hubert.schoelnast, paul.tavolato},
Keywords: Anomaly Detection, Industrial Security, Substation Security, Cross-over Data Compression CDC.
Abstract: Anomaly detection in operational communication data of cyber-physical systems is an important part of any
monitoring activity in such systems. This paper suggests a new method of anomaly detection named cross-
over data compression (CDC). The method belongs to the group of information theoretic approaches and is
based on the notion of Kullback-Leibler Divergence. Data blocks are compressed by a Sequitur-like algorithm
and the resulting grammars describing the compression are applied cross-over to the all the other data blocks.
Divergences are calculated from the length of the different compressions and the mean values of these
divergences are used to classify the data in normal and anomalous. The paper describes the method in detail
and shows the results derived from a real-world example (communication data from a substation).
Systems where embedded computing devices sense,
monitor, and control physical processes through
networks, usually with feedback loops in which
physical processes affect computations and vice versa
(Lee, 2008) are called cyber-physical systems (CPS).
These systems are becoming literally ubiquitous, and
our society and economy depends in a high degree on
the precise and stable operation of these systems.
Therefore, it is of course necessary to build and
install such systems according to rules of safety and
security. Besides that – which is a big challenge of its
own – it is necessary to monitor the operational
system continuously. External influence from the
environment might seriously disturb the system’s
operation, which could lead to unwanted
malfunctioning and in case of safety-critical system
even have disastrous consequences. Such external
influences may have their causes in abnormal changes
in the environment, in malfunctions of the interfaces
between the environment and the CPS, or in attacks
against the system. Especially the threat of cyber-
attacks legitimately gets more and more attention as
their number is increasing rapidly. The majority of
applications makes use of IP-based technology and
standard computing devices, hence opening points of
exposure and increasing the attack surface of CPSs in
a way that cannot be neglected any more. Moreover,
the complexity of the systems is growing rapidly as
they become smarter, make use of advanced
technologies, and consist of a high number of devices.
So the protection of these systems is a challenging
One measure to meet these challenges are so-
called intrusion prevention and intrusion detection
systems. But, these defence mechanisms were
designed for common IT systems and often are not
applicable in smart CPS environments. Moreover,
there is no guarantee that any intrusion will be
detected in time. To ensure the protection of these
environments, a second line of defence is needed:
Certain security controls that monitor system
communication and operation in real-time, or at least
close-to-real-time are necessary. One possibility for
such defence systems is the implementation of an
anomaly detection system. Anomaly detection
systems consist of a formal model of normal system
behaviour and a monitoring component that compares
in real time the actual behaviour of the system with
the formal model. Too large deviations of the
system’s behaviour from the model forecasts are
identified as anomalies and will raise an alarm.
As of today, various formal models are used in
connection with anomaly detection: many of
statistical nature, such as outlier detection, cluster
analysis, or hidden Markov models; others are of
structural nature: neural networks, association rules
and syntactic pattern matching (Chandola et al.,
This paper suggests a new method for defining the
behaviour model of a CPS belonging to the family of
information theoretic models.
2.1 Anomaly Detection in General
For the field of anomaly detection there exists a still
valid survey article from 2009 by Chandola et al.
(Chandola et al., 2009) that gives a comprehensive
overview of methods and applications of anomaly
detection covering most of the field. With respect to
anomaly detection in (computer) networks there are
three more survey articles from recent years (Bhuyan
et al., 2013), (Ahmed et al., 2016), (Fernandes et al.,
2019) with thorough overviews.
The generally accepted definition of an anomaly
from (Chandola et al., 2009) reads as follows:
Anomalies are patterns in data that do not conform to
a well-defined notion of normal behaviour.
(Fernandes et al., 2019) gives a structured
overview of methods used for anomaly detection:
Statistical Methods
Clustering Methods
Finite State Machines
Classification Methods
Information Theory Models
2.2 Statistical Methods
Statistical methods are based on stochastic models
and assume that normal “events” are found in regions
predicted by the model with high probabilities, while
anomalies are located in regions with low probability.
The use of wavelet analysis for anomaly detection
is described by (Hamdi et al., 2007) in detail.
Principal Component Analysis (PCA) is a method
based on dimension reduction and was first
introduced by (Lakhina et al., 2004). Several
improvements of the approach have been suggested.
(Yeung et al., 2007) introduced covariance matrices
to filter for variables having a high discriminatory
2.3 Clustering Methods
The 𝑘-means algorithm (MacQueen, 1967) assumes
that there are 𝑘 given clusters to group the elements
and is mainly based on comparing distances. The
method 𝑘-Nearest Neighbour first published in 1967
(Cover et al., 1967) clusters training data in an 𝑛-
dimensional space and uses these clusters to assign
new instances to the best-fitting cluster.
(Agrawal et al., 1998) describes a method for
dimensional reduction and calls it Subspace
Clustering; the approach assumes that projecting data
into a space with fewer dimensions may facilitate
2.4 Finite State Machines
Finite state machines are an often-used mechanism to
detect anomalies. Normal data sequences are
modelled by regular expressions and the
corresponding finite state machine is used to verify
normal behaviour. Any data sequence not accepted by
the automaton is rejected as anomaly. This method
often comes along together with Markov chains
(Estevez-Tapiador et al., 2003). Intrusion detection
and prevention systems for conventional IT systems
often use this approach.
2.5 Classification Methods
Classification methods play an important role in
machine learning. The most important representatives
are Bayesian Networks (Jensen, 1997) and (Nielsen
et. al., 2007), Support Vector Machines (SVM)
(Schölkopf et al., 2001) and Neural Networks
(Haykin, 1994).
2.6 Information Theoretic Models
Shannon Entropy (Shannon, 1948) measures the
amount of uncertainty involved in the value of a
random variable.
Kullback-Leibler Divergence (Kullback et al.,
1951) is a measure for the difference between two
probability distributions. It can be used when
comparing two segments of data that represent the
behaviour of a system. The suggestion for a method
of anomaly detection described in this paper is
inspired by Kullback-Leibler Divergence. For further
details, see below.
Lee et al. (Lee et al., 2001) suggest a variety of
entropy-based methods to detect anomalies in data
used by intrusion detection systems. They analyse
Shannon Entropy, the entropy of dependent
probability distributions, the relative entropy of two
probability distributions, and information gain of the
attributes of a set of data. Bereziński (Bereziński et
al., 2015) and Martos (Martos et al., 2018) published
work using entropy-based methods, too.
2.7 Other Methods
Other methods used for anomaly detection are
derived from evolution theory (Kar, 2016), Artificial
Immune Systems (Castro et al., 2002), (Hooks et al.,
2018), Genetic Algorithms (Aslahi-Shahri et al.,
2016), (Hamamoto et al., 2018), Particle Swarm
Optimization (Bamakan et al., 2016), (Wahid et al.,
2019), Differential Evolution (Storn et al., 1997),
(Elsayed et al., 2015), and some hybrid approaches
combining two or more of the methods mentioned.
We define features relevant in describing the (normal)
operation of the CPS and an observation interval 𝑖.
We then collect the relevant data transmitted during
this interval in the network giving one block of data.
We do this with 𝑛 consecutive intervals. Each
interval yields a data block 𝑏
. Now we compress
each data block separately as described later on. Each
compression results in a compressed file 𝑐
and a
grammar (substitution table) 𝑔
describing the
compression; we assume 𝑔
is contained in 𝑐
. In the
next step we use all grammars 𝑔
instead of 𝑔
compress the block 𝑏
which results in the
compressed files 𝑐
. We do this for all combinations
of 𝑖 and 𝑗. Since 𝑔
is not optimal to compress 𝑏
, the
compressed file 𝑐
will be larger than 𝑐
. The
difference of the lengths of 𝑐
and 𝑐
is called the
divergence 𝑑
. It is a measure for the degree of
similarity of the block 𝑏
(which was compressed)
compared to the block 𝑏
(from which the grammar,
used in the compression algorithm, was extracted).
3.1 Data Acquisition and Features
We consider a communication network in a CPS. We
collect the data transmitted over the network splitting
it into segments. These segments or blocks will be the
units of analysis later on. There are two main features
we have to take into consideration:
a) Whether the data stream is encrypted or not.
b) Whether the protocol used in the network is
synchronous (like Modbus, HDLC and others) or
asynchronous (like Canbus or IEC 61850 or IEC
If the data stream is encrypted and we do not have
the possibility to access the decrypted information,
anomaly detection can be based on the available
metadata only, like packet frequency or roundtrip
The type of protocol (synchronous or
asynchronous) has influence on the collection
intervals of the data. If the protocol is synchronous,
we can construct data segments by simple time
slicing: In case of an asynchronous protocol, time
slicing is not an appropriate method as data packets
arrive in arbitrary intervals. In this case, we can
construct segments by counting the number of
packets of a certain important type: a segment is
defined as the interval necessary to transmit a
predefined number of packets.
3.2 Anomaly Detection
The method so far yields a sequence of features from
each block of data analysed. The question now is how
to compare these blocks to decide whether the data
describes normal or abnormal operation of the
We suggest a method that measures a special form of
“distance” between the blocks by looking at the
amount of redundancy contained in a block. The
method is inspired by the notion of Kullback-Leibler
Divergence (also known as “relative entropy”). This
is a non-negative real number that can be calculated
from two probability distributions 𝑃 and 𝑄, where 𝑄
in many cases is a predicted distribution (a
hypothesis) and 𝑃 is a measured distribution
(empirical data). The formula is:
The alphabet 𝑋 is the set of all characters 𝑥 that
may appear in both distributions. 𝑃
is the
probability that the character 𝑥 will appear at any
arbitrary position within the distribution 𝑃;
analogously for 𝑄
The formula above can be converted to:
The part in the first pair of big brackets is the cross
entropy of 𝑃 and 𝑄, the part in the second pair of
brackets is the well-known Shannon Entropy 𝛨 of 𝑃.
The Shannon Entropy is a property of a character-
source that emits characters from the alphabet 𝑋 with
probability distribution 𝑃. 𝛨 is the amount of
information per character emitted by this source.
Often 𝛨 is interpreted as a compression factor: A
string of 𝑛 characters, which is emitted from this
source can (so says theory), be compressed to a binary
string with a length of 𝑛∙𝛨 bits.
In accordance with this interpretation of Shannon
Entropy, Kullback-Leibler Divergence can be inter-
preted as the average number of bits wasted per
character when a string, emitted by a source having
probability distribution 𝑃 is not compressed by a
method optimized for its own distribution 𝑃 but using
a method that is optimized for the probability
distribution 𝑄.
The problem with the value calculated using the
formula shown above is that within probability
distributions the effective order of the characters is
irrelevant. Take these 3 strings as an example:
A: 00000000000000001111111111111111
B: 01010101010101010101010101010101
C: 01101001000110010001111101110100
They all consist of 16 zeros and 16 ones, so for all
3 examples we have 𝑃
and therefore
all three have the same Shannon Entropy 𝛨
=1. However, compressing those strings
optimally (make them as short as possible), one finds
that A and B will result in shorter compressed strings
than C.
If you calculate the Kullback-Leibler Divergence
for any pair of those strings, you will always get
𝐷=0, which is correct because all three probability
distributions are equal, so there is no difference.
However, if you find the optimal method to compress
string C and use this very method to compress string
A, the compression result is worse than if you had
taken the method optimal for A. Therefore, if you
really compress strings or files, you get different
results than what the common interpretation of
Kullback-Leibler Divergence suggests.
However, we assume that the effective number of
wasted bits you get when compressing a string with a
method optimized for another string might be a good
candidate to measure how different two strings are.
Therefore, we developed a method to do exactly what
corresponds to the common interpretation of
Kullback-Leibler Divergence.
Practically the method works as follows: Let the
data from each interval 𝑖 be a block 𝑏
and the
number of blocks be 𝑛. To each block belongs a
grammar (a set of replacement rules) which at the
beginning is empty. 𝑏
is compressed by a Sequitur-
like algorithm (Nevill-Manning et al., 1997) yielding
a compressed file 𝑐
and a grammar 𝑔
the compression. The compression method searches
for the most frequent bigram in 𝑏
(a bigram is a
group of two subsequent characters) and substitutes
each instance of this bigram with a single character,
which did not appear in any 𝑏
before. The new
replacement character and the two characters forming
the replaced bigram together build one replacement
rule that is added to the grammar. So, each rule in the
grammar consists of exactly 3 characters. The data
block wherein the bigrams have been replaced plus
the grammar together build a compressed version of
(i.e. 𝑏
). This procedure is repeated until no more
improvement in the compression ratio is achieved, i.e.
as long as 𝑏
becomes shorter from round to round.
The minimum-length version of 𝑏
(compressed data
plus grammar) is the compressed file 𝑐
and the
grammar 𝑔
is part of 𝑐
In the next step each data block 𝑏
is compressed
using all other grammars 𝑔
. This means: You don’t
search for the most frequent bigram, but take the
bigram that is contained in the grammar and replace
each instance of it with the character from the
grammar. You repeat this for each rule in the
grammar. Compressing 𝑛 data blocks with 𝑛
grammars gives 𝑛
compressed files 𝑐
Now we subtract the length of the optimal
compression 𝑐
from the length of 𝑐
where the
length 𝑙 is the number of characters. The difference is
the divergence 𝑑
∶= 𝑙𝑐
These lengths form an 𝑛×𝑛 divergence matrix ℒ.
Note: the elements in the main diagonal of the
matrix are always 0: 𝑑
=0 for all 𝑖. 𝑑
≥0 for all
𝑖≠𝑗, which means that ll other values are non-
negative integers (equal only when 𝑏
and 𝑏
are two
identic instances of the same string).
for most 𝑖,𝑗. The matrix is not
symmetric. By accident 𝑑
and 𝑑
can be equal, but
generally they are not.
A column of represents the length differences
of all blocks compressed with one specific grammar.
A row of ℒ represents the length differences of all
compressions of one specific block using different
Now we calculate for each column and each row
the average value of non-diagonal elements. For
comparison we can look either at the averages of
columns or at the averages of rows. Both should give
a reasonable measure of the information distance
between the blocks. The smaller the difference
between the averages of two rows 𝑖 and 𝑗, the smaller
is the information theoretic distance between the
blocks 𝑏
and 𝑏
Whether the column averages or the row averages
show a better correspondence seems to depend on the
specific application where the data comes from. This
topic still needs further investigations. For our
analysis we use both.
To the end of anomaly detection, we must
calculate the matrix from a number of data blocks
taken from a system showing normal behaviour. For
the row averages (and the column averages), we can
then calculate a mean value (which by definition is
equal for rows and columns) and the standard
deviation (which is not equal). These two values can
be used to detect a data block that shows a value
differing more than say 3 times the standard deviation
from the mean value derived from normal behaviour.
4.1 Topology of the Example CPS
To demonstrate the viability of the proposed anomaly
detection method we chose a component from the
distribution network for electrical energy: a
substation. Generally, substations transform electric
current changing its voltage. As part of the
distribution network for electrical energy, substations
are an eminent part of the critical infrastructure.
Figure 1: Configuration of a substation.
We used a testbed of a (small) substation. Figure
1 shows the main elements of the configuration of the
testbed resembling an automation network of a
typical substation. The RTUs (remote terminal units)
are connected to a switch. For the experiments, the
testbed we used was equipped with 4 RTUs. The
switch connects the protection zone to an engineering
zone in the substation and further on to the outside
In this testbed, we simulated the operation of a
substation used in a solar plant by a software
developed for testing purposes. The protocol used
was IEC 60870-5-104. This protocol is TCP/IP-based
and by definition does not provide up-to-date security
mechanisms. For example, the protocol transmits
messages in clear text without any form of
authentication. Therefore, such systems are very
susceptible for network-based attacks such as Man-
in-the-Middle and protocol-specific attacks.
4.2 Data Collection
We captured the network traffic from a mirror port of
the switch. The protocol IEC 60879-5-104 is an
asynchronous protocol and the number of messages
defined the collection interval. For each
measurement, we selected the following data:
rtt: round-trip-time of the packet
length: packet length
wsize: TCP window size
ioa89: information object at address 89 containing
the voltage of the input current from the solar
We collected 200 data blocks during normal
operation of the substation. The data was stored in
form of a csv-file for each block and the following
compression algorithm worked solely on the text
(characters) in the file (even for the numbers). Here
we use the label “valid” for these valid data.
Furthermore, we performed cyber-attacks against
the system and collected the data from the system
under attack. The following attacks have been
Man-in-the-Middle Filter Attack (labelled as
“filter”): overwrites the transmitted measurement
data with a constant value.
Man-in-the-Middle Increment Attack (label:
“incr”): changes the transmitted measurement
data by a small amount (+ 0,1-1,0). This can result
in unknown system states.
Man-in-the-Middle Drop Attack (label: “drop”):
packets containing a certain value are dropped.
The data collected during the attacks was the same as
during normal operation: rtt, length, wsize, ioa89. The
files are structurally identical with files gathered
during normal operation.
In both cases (data describing valid behaviour of
the system and data gathered from the system under
attack), we have discarded measurements about
packages that did not transmit a voltage, as they were
not significant for the behaviour of the system and
could be considered outliers.
4.3 Anomaly Detection
To describe the normal behaviour of the system we
started with 𝑛=200 data files collected during
normal operation of the system. Every file is a data
block 𝑏
and was compressed with a Sequitur-like
algorithm giving 200 compressed files 𝑐
and 200
grammars 𝑔
. In the next step each (original) file is
compressed by using all other grammars 𝑔
, yielding
40.000 compressions 𝑐
. From this, we calculate the
divergence matrix ℒ. From this matrix we calculate
the row and column averages. At last, we calculate
the mean value and standard deviation of these
averages. These values (one total average and two
standard deviations) describe the normal behaviour of
the system. To facilitate further comparisons, we
consolidated the two standard deviations into one
single number by taking the square root from the sum
of the squares of the two standard deviations.
Now we carry out the same procedure with the
data we collected from the system during each of the
attacks. This leads to a mean value and a consolidate
standard deviation for each attack allowing a
comparison of these values to check whether
anomalies can be detected with appropriate
an anomaly can be assigned to the right attack.
4.4 Example Calculation
The numbers in the following example are taken from
the data collected from the testbed (during normal
operation and under attack).
Figure 2: Excerpt of the complete divergence matrix.
There is the divergence matrix (showing a
small section of the large 271x271 matrix). The 6
rows and columns with valid data define the normal
behaviour of our system. The values in the row and
the column labelled “filter” come from an attack. We
want to find out, if those values are sufficiently
different from valid data. So considering valid data
only, we calculate the average of the non-diagonal
values for each row and each column. This leads to
the following table:
Figure 3: Averages for valid data.
Next we calculate the total average, which is
122.5, and the standard deviations which are 44.79 for
the column “avg row and 45.30 for the row “avg
col”. The consolidated standard deviation is 𝜎=
Then we calculate the average of the values in the
row “filter” and the column “filter” shown in the first
matrix. This average value is 871.2. From this value
we subtract the average value for the block of valid
values (122.5) and we get:
871.2 – 122.5 = 748.6
Dividing this number by 𝜎 gives:
We now know that the values from the file filter”
are 11.75 times 𝜎 away from the average of the valid
data. As this is much more than the usually used limit
in statistics of 3𝜎, we can conclude, that the data in
the row and column “filter” are anomalous.
4.5 Results
Calculations were carried out on the 200 files
collected during normal (valid) operation, and on 71
additional files collected from the system under
attack. There were three different attack classes (34
“filter”, 17 “drop” and 20 “incr”). Figure 4 shows a
colour-encoded picture of the complete 271x271
divergence matrix as. The 0’s in the diagonal are
shown as white pixels.
Bright green stands for low divergence (i.e. high
similarity), bright yellow and medium bright brown
depict medium values and dark violet stands for high
values of 𝑑
Figure 4: Colour-encoded 271x271 div-matrix.
We see from this picture, that the filter attacks are
all very homogenous (bright green square in the upper
left corner), but they differ strongly from the rest. The
highest divergence exists between the attack types
“filter” and “drop”. The divergence of “drop” and
“valid” can be separated by sight, but “incr” is hard
to distinguish from “valid”.
By visual inspection one can find four quadrants
(green and yellow) within the block of valid data. The
reason for this is that the data was collected from two
different RTUs (remote terminal units), and hence
contain slightly different voltages.
To distinguish between “incr” and “valid” using
cross-over data compression we have to take a closer
look at the numbers:
The method used above to determine whether the
data from a specific file corresponds to normal
operation (is valid) or not can also be applied to test
the membership of the data to the attack types “filter”,
“drop” or “incr” (or any other class).
To do so, we compare the differences between the
file of interest and the classes, measured in units of 𝜎.
We allocate each file to the class that produces the
lowest 𝜎- distance for this file.
Using cross-over data compression 268 files are
allocated to the correct class. Only three files
belonging to the attack class “incr” were
misrecognized. They were classified as “valid” by the
algorithm. However, attacks from the class “incr”
have proven to be hard to detect by other methods,
too. Some of these attacks changed the voltage value
only insignificantly, making the distinction from
valid data tricky.
Figure 5: Summary of assignments.
This paper presents a novel way of anomaly detection,
which we call Cross-over Data Compression (CDC).
The key characteristic of the method is the calculation
of differences between the lengths of compressed files
where different grammars were used for compression.
In an information theoretic sense one could summarize
the method by saying that we use the divergence of
redundancies between data blocks to define similarities
of the blocks. By observing a reasonable number of
data blocks collected during normal operation of the
system we can find a mean value and standard
deviation of the compressions be it in terms of the
data blocks compressed with the grammars of the all
the other blocks or be it in terms of the grammars
applied to all blocks. Which of these two versions
yields better results seems to depend on the application
where the original data was collected. In case of the
data from the substation network, it did not show big
differences. But this is not necessarily the case in any
situation, as we have observed. This question sure
needs further investigation.
The results of the experiments we conducted on
data from substations show that the method of cross-
over data compression is a suitable possibility for
anomaly detection in network data from industrial
communication networks.
The project is funded by the KIRAS program and the
Energy Research program of the Austrian Research
Promotion Agency (FFG).
Lee, Edward A. (2008) "Cyber physical systems: Design
challenges." Technical Report UCB/EECS-2008-8.
EECS Department, UC California, Berkeley.
CS-2008-8.html [online; Oct 24th 2019]
Chandola, V., Banerjee, A. and Kuma, V. (2009) "Anomaly
detection: A survey" ACM computing surveys (CSUR),
41(3), 15.
Bhuyan, M. H., Bhattacharyya, D. K., and Kalita, J. K. (2013)
"Network anomaly detection: methods, systems and
tools" Ieee communications surveys & tutorials, 16(1),
Ahmed, M., Mahmood, A. N., and Hu, J. (2016) "A survey
of network anomaly detection techniques" Journal of
Network and Computer Applications, 60, 19-31.
Fernandes, G., Rodrigues, J. J., Carvalho, L. F., Al-Muhtadi,
J. F., and Proença, M. L. (2019) "A comprehensive
survey on network anomaly detection"."
Telecommunication Systems, 70(3), 447-489.
Hamdi, M., and Boudriga, N. (2007) "Detecting Denial-of-
Service attacks using the wavelet transform" Computer
Communications, 30(16), 3203-3213.
Lakhina, A., Crovella, M., and Diot, C. (2004) "Diagnosing
network-wide traffic anomalies" ACM SIGCOMM
computer communication review (Vol. 34, No. 4, pp.
219-230). ACM.
Yeung, D. S., Jin, S., and Wang, X. (2007) "Covariance-
matrix modeling and detecting various flooding attacks"
IEEE Transactions on Systems, Man, and Cybernetics-
Part A: Systems and Humans, 37(2), 157-169.
MacQueen, J. (1967) "Some methods for classification and
analysis of multivariate observations" Proceedings of the
fifth Berkeley symposium on mathematical statistics and
probability (Vol. 1, No. 14, pp. 281-297).
Cover, T., and Hart, P. (1967) "Nearest neighbor pattern
classification" IEEE transactions on information theory,
13(1), 21-27.
Agrawal, R., Gehrke, J., Gunopulos, D., and Raghavan, P.
(1998) "Automatic subspace clustering of high
dimensional data for data mining applications" (Vol. 27,
No. 2, pp. 94-105). ACM.
Estevez-Tapiador, J. M., Garcia-Teodoro, P., and Diaz-
Verdejo, J. E. (2003) "Stochastic protocol modeling for
anomaly based network intrusion detection" First IEEE
International Workshop on Information Assurance,
2003. IWIAS 2003. Proceedings. (pp. 3-12). IEEE.
Jensen F. V. (1997) "An introduction to Bayesian networks"
Springer, ISBN 9780387915029
Nielsen T. D., Jensen F. V. (2007) "Bayesian Networks and
Decision Graphs" Springer, ISBN 9780387682815
Schölkopf B., and Smola A. J. (2001) "Learning with kernels:
support vector machines, regularization, optimization,
and beyond" MIT press, ISBN 9780262256933
Haykin S. (1994) "Neural networks: a comprehensive
foundation". Prentice Hall New York, ISBN
Shannon C. E. (1948) "A mathematical theory of
communication" The Bell System Technical Journal
27(3), 379-423
Kullback, S., and Leibler, R. A. (1951) "On information and
sufficiency" The annals of mathematical statistics, 22(1),
Lee, W., and Xiang, D. (2001) "Information-theoretic
measures for anomaly detection" Proceedings 2001 IEEE
Symposium on Security and Privacy. S&P 2001 (pp.
Bereziński, P., Jasiul, B., and Szpyrka, M. (2015) "An
entropy-based network anomaly detection method"
Entropy, 17(4), 2367-2408. [online;
Oct 24th 2019]
Martos, G., Hernández, N., Muñoz, A., and Moguerza, J.
(2018) "Entropy measures for stochastic processes with
applications in functional anomaly detection". Entropy,
20(1), 33.
[online; Oct 24th 2019]
Kar, A. K. (2016) "Bio inspired computing a review of
algorithms and scope of applications". Expert Systems
with Applications, 59, 20-32.
Castro, L. N., De Castro, L. N., and Timmis, J. (2002)
"Artificial immune systems: a new computational
intelligence approach" Springer.
Hooks, D., Yuan, X., Roy, K., Esterline, A., and Hernandez,
J. (2018) "Applying artificial immune system for
intrusion detection" 2018 IEEE Fourth International
Conference on Big Data Computing Service and
Applications (BigDataService) (287-292).
Aslahi-Shahri, B. M., Rahmani, R., Chizari, M., Maralani,
A., Eslami, M., Golkar, M. J., and Ebrahimi, A. (2016)
"A hybrid method consisting of GA and SVM for
intrusion detection system" Neural computing and
applications, 27(6), 1669-1676.
Hamamoto, A. H., Carvalho, L. F., Sampaio, L. D. H., Abrão,
T., and Proença Jr, M. L. (2018) "Network anomaly
detection system using genetic algorithm and fuzzy
logic" Expert Systems with Applications, 92, 390-402.
Bamakan, S. M. H., Wang, H., Yingjie, T., and Shi, Y. (2016)
"An effective intrusion detection framework based on
MCLP/SVM optimized by time-varying chaos particle
swarm optimization" Neurocomputing, 199, 90-102.
Wahid, A., and Rao, A. C. S. (2019) "A distance-based
outlier detection using particle swarm optimization
technique" Information and Communication Technology
for Competitive Strategies (pp. 633-643). Springer,
Storn, R., and Price, K. (1997) "Differential evolution a
simple and efficient heuristic for global optimization
over continuous spaces" Journal of global optimization,
11(4), 341-359.
Elsayed, S., Sarker, R. and Slay, J. (2015) "Evaluating the
performance of a differential evolution algorithm in
anomaly detection" 2015 IEEE Congress on
Evolutionary Computation (CEC) (pp. 2490-2497).
Nevill-Manning, C.G. and Witten, I.H. (1997) "Linear-Time,
Incremental Hierarchy Inference for Compression"
Proceedings DCC '97. Data Compression Conference.
pp. 3–11