EFFICIENT N-BYTE SLACK SPACE HASHING IN RETRIEVING
AND IDENTIFYING PARTIALLY RECOVERED DATA
Ireneusz Jozwiak
and Michal Kedziora
Institute of Informatics, Wroclaw University of Technology, ul. Wybrzeże Wyspianskiego 27, 50-370, Wroclaw, Poland
Keywords: Data retrieval, e-Discovery, Hash functions, Data recovery.
Abstract: This paper describes a modification of the slack space block hashing algorithm that improves the performance of identifying recovered data. In our research we relied on the block hashing algorithm and present improvements that increase efficiency by reducing analysis time. N-Byte Slack Space Hashing is especially useful in the data recovery process where, due to file system limitations, it is possible to recover only fragments of data that were erased and partially overwritten. The algorithm is faster than block hashing and allows partially erased files to be identified using modified hash sets.
1 INTRODUCTION
One of the major tasks during the data analysis process is to search for and identify specific files. This task becomes even more complicated when applying data retrieval techniques to applications such as data recovery and computer forensics. The simplest method is to compare files by their names and extensions (Kornblum, 2006). Obviously this method is highly inefficient and can be easily defeated by renaming files (Bunting, 2008). Most forensic analysis software tools (Casey, 2004) can detect that someone changed a file extension to hide evidence by performing file signature analysis, in which the file extension is compared with the file header. A name search and file signature analysis are among the basic steps in computer forensic investigations, but they are not efficient ways of searching for and identifying data. A more advanced method is to search using one-way cryptographic hash functions and specially created hash tables of known files (Henson, 2003). One of the biggest advantages of comparing files by hash is that it gives positive results even if the file name was changed, because the hash value is computed on the data part of the file and not on the directory entry, which keeps the name and other metadata (White, 2005). The condition that has to be fulfilled to properly use hash analysis is possessing the whole data of the logical file in order to create its hash value (Stein, 2005).
Unfortunately, the most interesting files are often deleted, both accidentally and intentionally. For example, Internet activity history is usually deleted by the browser after a fixed time period without the user noticing. On the other hand, we can have a situation where the user intentionally deletes incriminating data. Fortunately, it is not so easy to completely delete data from a hard drive. Put simply, due to efficiency considerations, when a file is deleted it is only removed from the directory of files (Microsoft, 2004). Its content still exists intact, but the operating system marks the space as unallocated. Because the OS does not immediately re-use unallocated space from deleted files, a file can be recovered right after it has been erased, and for a considerable time afterwards. The chances of recovering a whole logical file decrease with time, because sooner or later some or all of that unallocated space will be re-used. Fortunately, in most cases not all unallocated space is overwritten, and we can recover part of the previously deleted data (Breeuwsma, 2007). This is the case we deal with in this work: we are not able to recover the whole file, so we cannot compute its hash and compare it with a hash table. The solution is to compute several hashes from each file so that comparison remains possible. We precede the selection of the proper algorithm with an analysis of how slack space forms and a mathematical model of the data recovery process.
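To make this baseline concrete, the following minimal sketch (our illustration; the KNOWN hash set and file name are hypothetical) shows classic whole-file hash identification, which by definition fails as soon as only a fragment of the file can be recovered:

import hashlib

# Hypothetical hash set of known files (e.g. an NSRL-style list);
# the key below is MD5("hello").
KNOWN = {"5d41402abc4b2a76b9719d911017c592": "hello.txt"}

def identify(path: str) -> str | None:
    """Whole-file hash lookup: succeeds only when the complete
    logical file is available, never for a partial recovery."""
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return KNOWN.get(digest)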
2 THEORY
The file system is essential for storing and organizing computer data. It can be thought of as an index or database containing the physical and logical location of every piece of data on a hard drive. The two most popular file system types today are FAT and NTFS. The basic concept of the FAT file system is that each file and directory on a disk is described by a data structure (Microsoft, 2000) called a directory entry. It contains the file size, name, starting address, and other metadata. File and directory content is stored in data units called clusters (a typical cluster is 4096 bytes, but it can be defined differently). If a file or directory is allocated more than one cluster, the other clusters are found by using a structure called the FAT. The next popular file system is NTFS. The core of NTFS is the Master File Table (MFT), which holds information about every file and directory located on the drive (Berghel, 2007). Important from our point of view are the data unit allocation strategies used to place file content on disk. In the best case, the operating system allocates consecutive data units, but that is not always possible. As files are put on and removed from the drive over time, allocating large files in one piece may become a problem. When a file does not have consecutive data units, it is called fragmented. An operating system can use several different strategies for allocating data units.
2.1 File System Slack Space
The most popular file systems (FAT32, NTFS) have a structure based on blocks (also called sectors), which have a standard length of 512 bytes, and clusters, which typically span 8 sectors (4096 bytes). The smallest space allocated for a file is one cluster. This means that even if a file is only 10 bytes long, a whole 4096-byte cluster is reserved for it. We trace this process using the example in Figure 1.
Figure 1: Formation of the slack space.
Figure 1 shows a scenario with an 8192-byte file X, which occupies two whole clusters. File X is then deleted. During this process the directory entries of file X are removed (the process can differ slightly depending on the file system used). This does not affect any data stored in the clusters; all the data is still present, but the operating system marks the space as unallocated and ready to reuse. In the next phase of the scenario, two files Y1 and Y2 are created. The first has a logical size of 2560 bytes, the second 1536 bytes. According to the file system allocation strategy, each file is placed at the beginning of the next available cluster. File Y1 starts at byte offset 0 and ends at offset 2560. The next cluster begins at offset 4096, therefore slack space is created between offsets 2560 and 4096. The situation is analogous for Y2, which spans offsets 4096 to 5632; because its cluster ends at offset 8192, a slack space of 2560 bytes is created. If file X were forensically valuable, it would be desirable to identify it. Slack space size can vary, but in this research we show that it is enough to properly identify the previous file (Gladyshev, 2005).
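The arithmetic of the Figure 1 scenario can be reproduced with a short sketch (our illustration, using the 512-byte sector and 8-sector cluster stated above):

SECTOR = 512          # bytes per sector (block)
CLUSTER = 8 * SECTOR  # 4096 bytes per cluster, as in Figure 1

def slack_region(start: int, size: int) -> tuple[int, int]:
    """Byte range between the end of a file and the end of its
    last allocated cluster, i.e. the file slack."""
    end = start + size
    cluster_end = -(-end // CLUSTER) * CLUSTER  # round up to a cluster boundary
    return end, cluster_end

print(slack_region(0, 2560))     # Y1 -> (2560, 4096): 1536 bytes of slack
print(slack_region(4096, 1536))  # Y2 -> (5632, 8192): 2560 bytes of slack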
2.2 Mathematical Model
We explain the process of slack space formation with a formal mathematical model based on the work of Gladyshev (2005). The basic cluster model we describe can store data objects of only three possible lengths:

LENGTH = {0, 1, 2} (1)
Here zero length means that the cluster is unallocated (which is not the same as saying that no data is stored on the media: after formatting or wiping, the cluster can be filled with random data or byte patterns such as "00", and we cannot actually prove that it is not part of a valid file or data object, e.g. a cryptographic container). Length 1 means that the cluster contains only one object, the unrelated data, and length 2 means that the cluster contains both an object of data block "x" and one of "y". All other lengths are disallowed in this model. This assumption divides the cluster into two parts, defined in (2): the left part, which in the final state contains the unrelated data "u" and "y", and the right part, which in the final state contains the piece of recovered data "x".

LEFT_PART = {u, x1, y1}
RIGHT_PART = {u, y2, x2} (2)
The advanced cluster model (ACM) we created is not limited to one cluster of data: we take the whole disk space, where n is the number of block series.
 =
,
,

,
,
+
,
,

,
,
+,
(

)
,
(

)
,
()
,
()
+
,
,

,
,
(3)
As we can see, in the final state we will have n parts of our evidence file X. In the best case, n can equal the size of the logical evidence file; this is the situation where the file was deleted but not overwritten. More often we will get a part of file X smaller than 512 bytes in slack space, and other parts in unallocated space. We assume that the border between the left and right parts of each basic cell of the cluster model is the end of a sector.
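As a concrete (and purely illustrative, not from the paper) rendering of the model, each basic cell can be represented as a (left, right) pair whose right part may still hold a sector-aligned fragment of the deleted file X:

from dataclasses import dataclass

@dataclass
class Cell:
    left: str   # final-state content of the left part: "u", "x1" or "y1"
    right: str  # final-state content of the right part: "u", "y2" or "x2"

# One possible final state of three clusters after files Y overwrite
# parts of the deleted file X; "x2" marks a recoverable X fragment.
disk = [Cell("y1", "x2"), Cell("u", "x2"), Cell("y1", "y2")]
recoverable = [i for i, c in enumerate(disk) if c.right == "x2"]
print("clusters still holding fragments of X:", recoverable)  # [0, 1]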
3 N-BYTE HASH
From the mathematical model we can see that standard hashing algorithms will not work when dealing with partially erased files. We cannot predict which part of the file we will be able to recover; that is the main reason why reducing the input data length to less than the length of the file is necessary. There is an algorithm (Kornblum, 2006) which takes whole blocks as input, H(X_p(1...512)); in most cases a file system block has 512 bytes, but this option has disadvantages (Menezes, 1996). First, blocks can have sizes other than 512 bytes depending on the file system used, and it is very hard to convert the algorithm and its correlated hash tables to work with file systems with a different block length (Henson, 2003). The next disadvantage is that we can miss a part of the evidence file in its last block, because we cannot reliably predict the RAM slack data. RAM slack can be explained in the mathematical model: in Figure 1, file Y1 ends exactly at the end of the fifth block of a cluster, but more likely a file ends in the middle of a block. In that case RAM slack is created from the end of the file to the end of the block (to deal with this, most operating systems wipe the remainder of the block; however, in older Microsoft systems it could be filled with random data from Random Access Memory, which is actually why it is called RAM slack). The third disadvantage of using full-block input is performance. Taking 512-byte blocks forces us to hash every byte on the hard disk. Hashing every byte on disk is essential when we use a hash function to preserve evidence, as one of the most important items in creating a chain of custody, but in our application it is unnecessary and inefficient. The performance of the solution depends on two main factors. The first is the number of I/O operations on the hard drive: hard disk read/write operations and the interface for connecting drives are still a bottleneck in computers. The second factor is the computation time of the hash function. Cryptographic hash functions are designed to be fast in both hardware and software implementations, but they obviously still have an impact on performance. That is why we focus on n < 512 versions of block hashing. In computer forensics two cryptographic hash functions are widely used: MD5 (Ronald Rivest, Message-Digest algorithm 5) with a 128-bit hash value, and SHA-1, designed by the National Security Agency (NSA), which creates a 160-bit message digest based on principles similar to those used in MD5. In this research we focus on the Message-Digest algorithm (White, 2005).
The MD5 algorithm first divides the data input into 512-bit blocks (Menezes, 1996). At the end of the last block, 64 bits are inserted to record the length of the original input; if the input is smaller, it is padded up to 448 bits. The padding is performed in the following way: a single "1" bit is appended to the message, and then "0" bits are appended until the length in bits of the padded message becomes congruent to 448, modulo 512.
This makes the minimum meaningful input of N-byte sector hashing 448 bits (56 bytes), corresponding to one full round of the algorithm, and that is why we chose 56 bytes as the input length in our algorithm. We also considered a 120-byte input (two full rounds of MD5). A standard full-block input consists of 512 bytes of data plus the 64-bit length record plus 448 bits of padding, resulting in a 576-byte MD5 input (9 full rounds).
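As a sanity check on this arithmetic, here is a minimal sketch (ours, not part of the paper's implementation) that counts the 512-bit compression rounds MD5 performs for a given input length; strictly, because the "1" padding bit is mandatory, an input of at most 55 bytes fits in a single round, which the 48- and 112-byte inputs tested in Section 4 respect:

import math

def md5_rounds(n_bytes: int) -> int:
    """Rounds (512-bit compression calls) MD5 needs for an n-byte
    message: the message, a mandatory "1" bit and the 64-bit length
    field, padded up to a multiple of 512 bits."""
    return math.ceil((n_bytes * 8 + 1 + 64) / 512)

for n in (48, 112, 512):
    print(n, "bytes ->", md5_rounds(n), "round(s)")  # 1, 2 and 9 rounds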
When creating hash tables compatible with the presented algorithm, one should take into consideration that the same record can be ascribed to several different files, and that there will be several hash records for each file, depending on its length. This characteristic is described further in the research implementation section.
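A minimal sketch of such a table (our illustration; the names and the choice of a Python dict are assumptions) maps the MD5 of each sector-aligned n-byte piece of every known file to the set of files it may belong to:

import hashlib

N = 112  # n-byte input length, one of the values tested in Section 4

def build_table(files: dict[str, bytes]) -> dict[str, set[str]]:
    """One record per sector of each known file; the same record can
    be ascribed to several files, and each file yields several records."""
    table: dict[str, set[str]] = {}
    for name, data in files.items():
        for off in range(0, len(data) - N + 1, 512):
            digest = hashlib.md5(data[off:off + N]).hexdigest()
            table.setdefault(digest, set()).add(name)
    return table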
4 PRACTICAL RESEARCH
IMPLEMENTATION
We have implemented the function h(X_p(1...n)) based on the Message-Digest 5 cryptographic hash function. We performed several tests using the same software and hardware environment with n equal to 48, 112 and 512. The tests were carried out to show the efficiency of each method and were repeated 20 times to determine and reduce the error rate.
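The test harness can be approximated by a sketch along these lines (ours; the in-memory image stands in for a real drive, and exact timings will differ from Figure 2):

import hashlib
import time

SECTOR = 512

def hash_image(image: bytes, n: int) -> list[str]:
    """h(X_p(1...n)): MD5 of the first n bytes of every sector."""
    return [hashlib.md5(image[off:off + n]).hexdigest()
            for off in range(0, len(image), SECTOR)]

image = bytes(64 * 1024 * 1024)  # 64 MiB stand-in for a drive image
for n in (48, 112, 512):
    start = time.perf_counter()
    hash_image(image, n)
    print(n, "byte input:", round(time.perf_counter() - start, 3), "s")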
Figure 2: N-byte hash function comparison.
The test results confirm that processing time is strictly correlated with the input length of the cryptographic hash function. The difference between h(X_p(1...48)) and h(X_p(1...112)) is minimal, but it clearly points to the efficiency gained by reducing the number of Message-Digest rounds. Compared with the h(X_p(1...512)) method, we obtain a time reduction of more than 8%. Such a reduction is especially useful when analyzing terabyte hard drives, which are becoming more and more popular.
When we use short inputs to decrease processing time, we have to be aware of potential threats. The first is an increase in false positive hits. These hits are correlated with the higher probability of collisions explained in the theory section. On a less random data set we should expect a higher number of collisions because of the shorter, more regular input values. This higher probability of collisions is the reason we recommend using an input data length of n equal to 120 bytes: with a relatively small increase in processing time we get a much more collision-free function, which reduces the need for manual checks.
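In practice a hit from the table only nominates candidates, as in this short sketch (ours; build_table is the hypothetical helper from Section 3):

import hashlib

def match_slack(slack: bytes, table: dict[str, set[str]], n: int) -> set[str]:
    """Look up the first n bytes of a recovered slack fragment.
    Because short inputs raise the collision risk, any hit still
    needs manual verification against the candidate files."""
    if len(slack) < n:
        return set()
    return table.get(hashlib.md5(slack[:n]).hexdigest(), set())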
5 CONCLUSIONS
We have proposed an algorithm based on a hash function which generates values from an N-byte input taken from each data sector. N-Byte slack space hashing can be used as a more efficient replacement for block file hashing when identifying partially recovered files during the data retrieval process. In laboratory tests we obtained an 8% time decrease using h(X_p(1...48)) or h(X_p(1...112)) hashing instead of the full-block 512-byte input algorithm. In a real-world environment this can accelerate computer data analysis. The efficiency of N-byte slack space hashing results from the construction and implementation of the MD5 cryptographic hash function, which is widely used in computer forensics. Further work will include research on SHA (Secure Hash Algorithm) performance compared with MD5, and additional tests in real environments. We will also research the most frequently occurring data inputs in order to create predefined n-bit inputs; these will be used to create tables similar to "rainbow tables" to increase efficiency. It should be stated plainly that N-Byte Slack Space Hashing is not a replacement for traditional hash analysis in computer forensics. Our solution focuses on the narrow problem of identifying partially recovered data, an area in which standard hash analysis does not work properly.
REFERENCES
Kornblum, J., 2006. Identifying almost identical files using context triggered piecewise hashing, DFRWS Digital Investigation, Elsevier, 91-97.
Stein, B., 2005. Fuzzy-Fingerprints for Text-Based
Information Retrieval, Bauhaus University Weimar,
Germany, Journal of Universal Computer Science,
572-579, ISSN 0948-695.
Bunting, S., 2008. The Official EnCase Certified
Examiner Study Guide, Wiley Publishing, ISBN: 978-
0-470-18145-4.
Gladyshev, P., 2005. Finite State Machine Analysis of a Blackmail Investigation, International Journal of Digital Evidence, 4(1).
White, D., 2005. NIST National Software Reference
Library. National Institute of Standards and
Technology.
Menezes, A., 1996. Handbook of Applied Cryptography,
CRC Press.
Henson, V., 2003. An Analysis of Compare-by-hash, Ninth
Workshop on Hot Topics in Operating Systems
HotOS-IX, Lihue, Hawaii, USA.
Microsoft, 2004. Description of the FAT32, ID310524,
Microsoft.
Breeuwsma, M., 2007. Forensic Data Recovery from
Flash Memory, Small Scale Digital Device Forensics
Journal, 1(1).
Berghel, H., 2007. Hiding data, forensics, and anti-forensics, Communications of the ACM.
Microsoft, 2000. FAT: General Overview of On-Disk
Format. FAT32 File System Specification, Version
1.03.
Casey, E., 2004. Tool Review—WinHex. Journal of Digital
Investigation, 1(2).