
 
In summary, the implementation of the Fingerprint
Generator must have three properties: the ability to
identify data blocks, a small fingerprint size, and fast
processing speed. The solution we choose is xxHash (Collet,
2016). xxHash is an extremely fast non-cryptographic
hash algorithm that works at speeds close to RAM
limits. It is widely used in software such as
ArangoDB, LZ4, and TeamViewer. Moreover, it
passes the SMHasher (Appleby,
2012) test suite, which evaluates the collision, dispersion,
and randomness qualities of hash functions.
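To make the choice concrete, the 32-bit variant of xxHash can be written in a few dozen lines. The following is a pure-Python sketch that follows the published XXH32 specification (primes, stripe loop, and final avalanche); it is for illustration only, since in practice one would use the optimized reference C implementation:

```python
# Pure-Python sketch of XXH32, the 32-bit xxHash, per the public spec.
P1, P2, P3, P4, P5 = 2654435761, 2246822519, 3266489917, 668265263, 374761393
MASK = 0xFFFFFFFF  # keep every intermediate result within 32 bits

def _rotl(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK

def xxh32(data: bytes, seed: int = 0) -> int:
    n, i = len(data), 0
    if n >= 16:
        # Four independent accumulators each consume one 4-byte lane
        # of every 16-byte stripe.
        v1 = (seed + P1 + P2) & MASK
        v2 = (seed + P2) & MASK
        v3 = seed & MASK
        v4 = (seed - P1) & MASK
        while i + 16 <= n:
            lanes = [int.from_bytes(data[i + 4*k: i + 4*k + 4], "little")
                     for k in range(4)]
            v1 = (_rotl((v1 + lanes[0] * P2) & MASK, 13) * P1) & MASK
            v2 = (_rotl((v2 + lanes[1] * P2) & MASK, 13) * P1) & MASK
            v3 = (_rotl((v3 + lanes[2] * P2) & MASK, 13) * P1) & MASK
            v4 = (_rotl((v4 + lanes[3] * P2) & MASK, 13) * P1) & MASK
            i += 16
        acc = (_rotl(v1, 1) + _rotl(v2, 7) + _rotl(v3, 12) + _rotl(v4, 18)) & MASK
    else:
        acc = (seed + P5) & MASK
    acc = (acc + n) & MASK
    # Consume the remaining 4-byte words, then single bytes.
    while i + 4 <= n:
        lane = int.from_bytes(data[i:i + 4], "little")
        acc = (_rotl((acc + lane * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < n:
        acc = (_rotl((acc + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    # Final avalanche: spread all input bits across the digest.
    acc ^= acc >> 15
    acc = (acc * P2) & MASK
    acc ^= acc >> 13
    acc = (acc * P3) & MASK
    acc ^= acc >> 16
    return acc

# Known test vector: XXH32 of the empty input with seed 0 is 0x02CC5D05.
assert xxh32(b"") == 0x02CC5D05
```

The speed of the real implementation comes from processing the four stripe lanes in parallel registers; the Python version only shows the logic.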
Although xxHash is powerful and passes the
SMHasher test suite, its 32-bit version
can still collide. Here we provide a simple test to
measure the collision rate of 32-bit xxHash on real-world
data. The data come from a GPS trajectory dataset (Yuan,
2011) that contains one week of trajectories from 10,357
taxis. The dataset holds about 15
million points, and the total distance of the trajectories
reaches 9 million kilometers. We apply some
preprocessing to filter the raw data and gather it
into a handled dataset whose file size is about 410 MB.
Figure 3 illustrates the
repetition rates of the handled dataset under two hash
functions, SHA-1 and xxHash.
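The measurement itself can be sketched as follows. The records below are synthetic placeholders for the preprocessed GPS rows, and we emulate shorter fingerprints by truncating SHA-1 digests; this is only an illustration of the principle (shorter digests can only raise the apparent repetition rate, through collisions), not the paper's exact procedure:

```python
import hashlib

def repetition_rate(records, digest_bits):
    """Fraction of records whose fingerprint was already seen.

    A digest shorter than needed may collide, so two distinct records can
    share a fingerprint, inflating the measured repetition rate. Truncated
    SHA-1 stands in here for a short hash; the paper compares SHA-1,
    xxHash64 and xxHash32 directly.
    """
    seen, repeats = set(), 0
    for rec in records:
        fp = hashlib.sha1(rec).digest()[: digest_bits // 8]
        if fp in seen:
            repeats += 1
        else:
            seen.add(fp)
    return repeats / len(records)

# Hypothetical records standing in for preprocessed taxi-trajectory rows.
records = [f"taxi-{i % 300},2011-02-0{1 + i % 7},{i % 50}".encode()
           for i in range(10_000)]
rate_160 = repetition_rate(records, 160)  # full SHA-1: exact duplicates only
rate_32 = repetition_rate(records, 32)    # 32-bit fingerprints: may collide
```

Any gap between `rate_32` and `rate_160` is due entirely to fingerprint collisions, which is exactly the 0.1% deviation observed for xxHash32 in Figure 3.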
We can see from Figure 3 that the repetition rate
obtained with the 160-bit SHA-1 function (the first
row) can be treated as the ground truth. The repetition
rate of xxHash32 is about 0.1% higher than that of
SHA-1 in the unrestricted Hash Map column; this 0.1%
difference means that 32-bit xxHash collides in
this simple test. In contrast, xxHash64 has the same
repetition rate as SHA-1. The collision rate of
xxHash64 is thus lower than that of xxHash32, but its
longer hash value also makes it more costly for our
scheme. Even though xxHash32 carries a risk of
collision, we still prefer it, for two reasons that
mitigate the influence of collisions. First, the
probability is low: we consider that a 0.1% deviation
cannot affect the result much, and this error can be
further handled in the computing phase by additional
operations. Second, the hash map is implemented as an
LRU hash map, so the capacity limit not only prevents
excessive memory use but also reduces the occurrence
of collisions, at the small cost of a slightly lower
repetition rate: after the least recently used data
blocks are discarded, colliding entries are likely to
be eliminated as well. In summary, the defect of
xxHash32 used in this scheme is negligible.
The memory size of the LRU Cache Map depends on
two factors: the size of the hash value and the
capacity parameter. Table 1 shows that a standard
hash map would store all fingerprints and data
blocks, but this leads to out-of-memory failures;
that is why we pick the LRU hash map. The average
record size in the dataset is about 25 bytes, and
xxHash32 yields the smallest memory footprint for the
LRU hash map.
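A capacity-bounded fingerprint map of the kind described above can be sketched with Python's OrderedDict (the class name and capacity are illustrative, not the paper's implementation):

```python
from collections import OrderedDict

class LRUFingerprintMap:
    """Fingerprint -> data-block map bounded to `capacity` entries.

    When full, the least recently used entry is evicted, which both caps
    memory use and ages out stale fingerprints, shrinking the window in
    which a hash collision can cause a false match.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._map = OrderedDict()

    def lookup(self, fingerprint):
        block = self._map.get(fingerprint)
        if block is not None:
            self._map.move_to_end(fingerprint)  # mark as recently used
        return block

    def insert(self, fingerprint, block):
        if fingerprint in self._map:
            self._map.move_to_end(fingerprint)
        self._map[fingerprint] = block
        if len(self._map) > self.capacity:
            self._map.popitem(last=False)  # evict least recently used
```

The eviction is O(1), so bounding memory adds no per-record cost beyond the map update itself.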
 
 
Figure 3: Repetition rate and LRU Cache Map analysis. 
Table 1: Memory size of each data structure.

            Hash Map   LRU-10^3   LRU-10^4   LRU-10^5   LRU-10^6
  SHA-1     OOM        50 KB      500 KB     5 MB       50 MB
  xxHash64  OOM        35 KB      350 KB     3.5 MB     35 MB
  xxHash32  OOM        30 KB      300 KB     3 MB       30 MB
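The rounded figures in Table 1 follow from simple arithmetic: each entry stores one digest plus one record of roughly 25 bytes (container overheads ignored), so, for example, an LRU-10^6 map of 4-byte xxHash32 fingerprints needs about (4 + 25) x 10^6, or roughly 30 MB. A quick check, under those simplifying assumptions:

```python
RECORD_BYTES = 25  # average record size in the GPS dataset
DIGEST_BYTES = {"SHA-1": 20, "xxHash64": 8, "xxHash32": 4}

def lru_map_bytes(hash_name: str, entries: int) -> int:
    """Rough footprint: one digest plus one record per entry; overheads ignored."""
    return (DIGEST_BYTES[hash_name] + RECORD_BYTES) * entries

for name in DIGEST_BYTES:
    row = [lru_map_bytes(name, 10 ** k) for k in (3, 4, 5, 6)]
    print(name, row)  # e.g. xxHash32 -> 29 KB, ..., 29 MB, rounded to 30 in Table 1
```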
2.4  Data Chunk Preprocess 
In file synchronization systems, the content
difference between the local node and the remote
node is usually small, so file synchronization
methods focus on finding the differing parts between
two files. Note that the data generated by sensors in
a time interval arrives record by record. For
instance, consider the GPS dataset: the average
record size is about 25 bytes. In contrast, the
block-size parameter s in Rsync is at least 300
bytes, let alone the 8 KB average block size in LBFS.
Therefore, a fine-grained chunking method is
essential for our work.
A data block in our scheme corresponds to one record
that a sensor generates in a time interval. Spatial
dependence causes a neighbouring cluster of sensors
to detect similar values, and temporal dependence
causes consecutive records from the same sensor to
vary smoothly. Therefore, we split the raw data so as
to obtain as many duplicated records as possible.
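Record-level chunking of this kind, splitting the raw stream exactly at record boundaries so repeated readings become byte-identical blocks, can be sketched as follows (the delimiter and record fields are assumptions about the stream format, not the dataset's actual layout):

```python
def chunk_records(raw: bytes, delimiter: bytes = b"\n"):
    """Split a raw sensor stream into record-sized data blocks.

    Unlike fixed-size (Rsync-style) or content-defined (LBFS-style)
    chunking, the block boundary is simply the record boundary, so a
    reading that repeats across time or neighbouring sensors yields
    byte-identical blocks that deduplicate.
    """
    return [rec + delimiter for rec in raw.split(delimiter) if rec]

# Hypothetical stream: sensor2 reports the same value twice.
stream = b"sensor1,25.0\nsensor2,25.0\nsensor1,25.1\nsensor2,25.0\n"
blocks = chunk_records(stream)
unique = set(blocks)  # the repeated sensor2 record collapses to one block
```

With 25-byte records, this yields blocks an order of magnitude finer than Rsync's minimum of 300 bytes, which is what lets record-level duplicates surface at all.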
In a sensor network, a cluster head collects real-time
data from many sensors. The data contain so much
noise that the probability of finding duplicated
parts is low. To identify the difference, we require
[Figure 3 data — repetition rates (%) per map configuration:
            LRU-10^3   LRU-10^4   LRU-10^5   LRU-10^6   Hash Map
  SHA-1     28.12      28.57      28.88      28.88      29.10
  xxHash64  28.32      28.77      28.88      28.88      29.10
  xxHash32  28.32      28.78      28.89      28.89      29.20]
Fast Deduplication Data Transmission Scheme on a Big Data Real-Time Platform
159