using self-adjusting data structures, and (ii) the
heuristic according to which a container is burst is
based on how skewed the accesses are among the elements
of the container, then at any point in time the
elements in the same container have similar access
characteristics, are grouped under a dominant element,
and can be represented by their common prefix, while the
topmost element in the container is the one with the
largest share of the accesses.
In our algorithmic application, since we need
to capture frequent patterns of accesses and hence
represent strings whose frequency of access
must be taken into account, we have chosen to
implement the containers as splay trees.
3.1 Description of the Algorithm
In order to detect a “burst” of visits, we utilize a
two-dimensional matrix $B_{n \times m}$, with n denoting the number
of Web pages or data units and with m representing
the number of timestamps that are maintained for
each data unit. We assign a number $w \in [0, m-1]$ to
every Web page or data unit of the Website at
random. Thus, each Web page or data unit
corresponds to a row of the array $B_{n \times m}$. We define
that, for the data unit i, a “burst” has occurred if and
only if the difference between the timestamps stored
in positions (i, m-1) and (i, 0) is less than or equal
to T, where T is the time window within which the
visits must occur in order to constitute a “burst”.
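The burst test above can be illustrated with a minimal sketch. It assumes
that rows are indexed 0..n-1, that each row keeps the m most recent
timestamps ordered from oldest (position 0) to newest (position m-1), and
that a row is not tested until it holds m timestamps. The class and method
names (BurstDetector, record_visit) and the shifting policy are hypothetical
and are not taken from the original description.

```python
class BurstDetector:
    """Keeps the last m visit timestamps for each of n data units."""

    def __init__(self, n, m, T):
        self.m = m
        self.T = T
        # B[i] is row i of the n x m matrix; None marks an empty slot.
        self.B = [[None] * m for _ in range(n)]

    def record_visit(self, i, timestamp):
        """Store a visit to data unit i and report whether a burst occurred."""
        row = self.B[i]
        # Assumed policy: drop the oldest entry and append the newest,
        # so position 0 is the oldest and position m-1 the newest.
        row.pop(0)
        row.append(timestamp)
        if row[0] is None:
            # Fewer than m visits recorded so far: no decision yet.
            return False
        # Burst iff the m most recent visits fit within a window of length T.
        return row[self.m - 1] - row[0] <= self.T
```

For example, with m = 3 and T = 10, visits to the same data unit at times
100, 105 and 108 would be reported as a burst on the third visit, since
108 - 100 = 8 does not exceed T.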
We should mention that the algorithm
described in (Antoniou et al., 2011) used a splay tree
to store the Web pages, with a stack of
timestamps attached to each node of the splay tree. In our
present approach, we simplify the employed data
structures and save both time and space: we
have replaced the splay tree with a simple
two-dimensional array, which also removes the need for
the stacks that stored the timestamps of the various
users’ accesses.
In our algorithm, we use IP addresses
represented in dot decimal notation and we represent
every IP address as a string of bits. We store the
visiting IPs by employing the burst trie data
structure. In order to take into account the different
Internet Protocols we utilize two different burst tries,
depending on whether the visiting IP is defined as a
32-bit number (Internet Protocol Version 4 – IPv4)
or as a 128-bit number (Internet Protocol Version 6
– IPv6). An Internet Protocol Version 4 (IPv4)
address consists of 32 bits, which may be divided
into four octets. These four octets are written in
decimal numbers, ranging from 0 to 255, and are
concatenated as a character string with full stop
delimiters between the numbers. Similarly, for IPv6,
the 128 bits of the address
are split into 16 octets, and each pair of octets is
written as a four-digit hexadecimal number.
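As an illustration of the bit-string representation, the following sketch
converts a textual IP address into a fixed-width string of bits and reports
the protocol version, which could then be used to select the IPv4 or the
IPv6 burst trie. It relies on the standard Python ipaddress module; the
function name ip_to_bits is hypothetical.

```python
import ipaddress


def ip_to_bits(text):
    """Return (version, bit string) for a textual IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(text)
    # max_prefixlen is 32 for IPv4 addresses and 128 for IPv6 addresses.
    width = addr.max_prefixlen
    # Zero-pad so that every address of a given version has the same length.
    bits = format(int(addr), "0{}b".format(width))
    return addr.version, bits
```

Under this sketch, ip_to_bits("192.168.0.1") yields version 4 together with
the 32-bit string 11000000101010000000000000000001, one octet per group of
eight bits.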
As already mentioned, we use splay trees in
order to implement the containers of the burst tries.
Using the splaying technique, the most popular
nodes of the tree, which represent parts of the
visiting IP addresses, are rearranged so that they are
located near the root of the container. Thus, the node
that is being splayed to the root of the container
could be quickly accessed in the future. For efficient
splaying, a record in a splay tree requires three
pointers, two pointing to its children and one to its
parent; splay trees therefore use the most space of any
of the container structures considered in (Sleator and
Tarjan, 1985). However, the splay tree is a natural choice in our
application since we want frequent IP accesses to be
stored near the root of the container, in order to
apply the burst heuristics that accompany the
specific structure.
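A minimal sketch of such a splay-tree container is given below, with the
three pointers per record (left child, right child, parent) mentioned above.
It is a simplified illustration and not the implementation used in our
system: the names Node, SplayContainer and access, as well as the per-node
hits counter, are assumptions introduced for the example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None    # pointer to the left child
        self.right = None   # pointer to the right child
        self.parent = None  # pointer to the parent
        self.hits = 0       # hypothetical access counter


class SplayContainer:
    def __init__(self):
        self.root = None

    def _rotate(self, x):
        """Rotate x above its parent, preserving the search-tree order."""
        p, g = x.parent, x.parent.parent
        if p.left is x:
            p.left, x.right = x.right, p
            if p.left is not None:
                p.left.parent = p
        else:
            p.right, x.left = x.left, p
            if p.right is not None:
                p.right.parent = p
        p.parent, x.parent = x, g
        if g is None:
            self.root = x
        elif g.left is p:
            g.left = x
        else:
            g.right = x

    def _splay(self, x):
        """Move x to the root with zig, zig-zig and zig-zag steps."""
        while x.parent is not None:
            p, g = x.parent, x.parent.parent
            if g is not None:
                zig_zig = (g.left is p) == (p.left is x)
                # Zig-zig rotates the parent first; zig-zag rotates x twice.
                self._rotate(p if zig_zig else x)
            self._rotate(x)

    def access(self, key):
        """Insert key if absent, count the access, and splay it to the root."""
        node, parent = self.root, None
        while node is not None and node.key != key:
            parent = node
            node = node.left if key < node.key else node.right
        if node is None:
            node = Node(key)
            node.parent = parent
            if parent is None:
                self.root = node
            elif key < parent.key:
                parent.left = node
            else:
                parent.right = node
        node.hits += 1
        self._splay(node)
        return node
```

After every call to access, the accessed key sits at the root of the
container, so keys that are accessed repeatedly tend to remain near the top.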
The general principle for maintaining a burst trie
is to locate inefficient containers and burst them. In
particular, three heuristics are proposed in (Heinz et
al., 2002): the ratio, the limit and the trend
heuristics. In the ratio heuristic, a container is burst
when the ratio of the number of accesses to the root
of the container to the total number of accesses is
less than a threshold, and simultaneously the number
of accesses to the container is large enough. In the
limit heuristic, the container is burst when the
number of the elements in the container exceeds a
threshold, and finally in the trend heuristic the
container is burst when its potential is exhausted;
during each access to the container the potential is
incremented by a fixed amount when the root of the
container is accessed, otherwise it is decremented by
another amount. In our application we have chosen
to follow a different set of heuristics, since our main
aim is to guarantee that IPs with common prefixes
and similarly distributed access characteristics
will be grouped together in the same container under
the same dominant IP; moreover, the root of the
container will correspond to the dominant IP of the
group. Hence, storing IPs and employing these
heuristics should guarantee that IPs that exhibit
similar characteristics will be distributed in nearby
containers, and thus it will be easy for our
application to locate IPs that have the same prefixes
and similar access characteristics, in order to
efficiently locate the subnetwork responsible for a
specific “burst” of visits.
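For reference, the three heuristics of (Heinz et al., 2002), as described
above, can be sketched as simple predicates over per-container statistics.
This sketch covers only those three heuristics and not the modified set used
in our application. The ContainerStats record, the function names and all
threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ContainerStats:
    size: int = 0       # number of elements stored in the container
    accesses: int = 0   # total number of accesses to the container
    root_hits: int = 0  # accesses that were served by the root element
    potential: int = 0  # remaining potential for the trend heuristic


def ratio_burst(s, min_ratio=0.8, min_accesses=100):
    """Ratio heuristic: burst when root hits are a small share of accesses."""
    if s.accesses < min_accesses:
        # Not enough accesses yet for the ratio to be meaningful.
        return False
    return s.root_hits / s.accesses < min_ratio


def limit_burst(s, max_size=35):
    """Limit heuristic: burst when the container holds too many elements."""
    return s.size > max_size


def trend_update(s, root_hit, bonus=1, penalty=4):
    """Trend heuristic: adjust the potential; burst once it is exhausted."""
    s.potential += bonus if root_hit else -penalty
    return s.potential <= 0
```

With the illustrative defaults, a container with 200 accesses of which 100
hit the root would be burst by the ratio heuristic, since 0.5 is below the
assumed threshold of 0.8.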
In more detail, our algorithm is as follows: