Towards High Performance Big Data Processing
by Making Use of Non-volatile Memory
Shuichi Oikawa
University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, Japan
Keywords:
Non-volatile Memory, Operating Systems, Storage, Big Data Processing.
Abstract:
Cloud computing environments for big data processing require high performance storage. High performance memory storage technologies, such as next generation non-volatile (NV) memory and battery-backed NV-DIMM, are emerging. While their performance is much higher than that of the current block storage devices, such as SSDs and HDDs, they provide only limited capacity. Such limited capacity makes it difficult for memory storage to be adopted as mass storage, and its use in cloud computing environments has been severely limited. This paper proposes a method that combines memory storage with block storage. It makes use of memory storage as a cache for block storage in order to remove the capacity limitation of memory storage. The proposed method inherits the high performance of memory storage and also the large capacity of block storage. Therefore, memory storage can be transparently used as a part of mass storage while its low overhead access accelerates storage performance. The proposed method was implemented as a device driver of the Linux kernel. Its performance evaluation shows that it outperforms a bare SSD drive and achieves better performance in the Hadoop and database environments.
1 INTRODUCTION
The importance of big data processing increases more
than ever before, and it is convincing that its impor-
tance will continue increasing in the future as well.
Cloud computing environments are currently the only
solution that can provide the scalability required by
big data processing since they can scale out their
storage capacity along with necessary computing re-
sources. There is no doubt that cloud computing en-
vironments for big data processing require high per-
formance storage; thus, SSDs were quickly adopted in
such environments, and they are sometimes combined
with HDDs to transparently enhance the performance
and capacity of storage.
Now, high performance memory storage technolo-
gies, such as next generation non-volatile (NV) mem-
ory and battery backed NV-DIMM, are emerging.
These new kinds of storage provide both high per-
formance and persistency, and they are byte address-
able. Since their byte addressability enables them to
be accessed as memory, we call them memory stor-
age. While they provide much higher performance
than the current block storage devices, such as SSDs
and HDDs, their capacities are limited. Such capac-
ity limitation makes it difficult for memory storage to
be adopted as mass storage, and its use in cloud computing environments has been severely limited.
This paper proposes a method that combines
memory storage with block storage. It makes use of
memory storage as a cache for block storage in order
to remove the capacity limitation of memory storage.
Combining block storage with another faster block
storage, which is typically an SSD, for higher ac-
cess performance is a well known technique (Kgil and
Mudge, 2006; Koller et al., 2013; Saxena et al., 2012).
The technique utilizes faster block storage as cache
and stores frequently accessed data in it in order to im-
prove the average time to access data. Its open source
implementation is widely available (Facebook, 2014).
The existing technique employs a software layer that
combines two block storage devices. While it is possible for memory storage to emulate block storage and to use this software layer for combining it with block storage, the emulation sacrifices its performance advantage for compatibility with the block storage interface.
The proposed method directly manages memory
storage in order to make use of its high performance
and byte addressability. The byte addressability of
memory storage enables its direct management with-
out a device driver; thus, the memory storage man-
agement can be integrated into a device driver that com-
bines memory storage with block storage without an
additional software layer as required by the existing
method. It effectively utilizes the high performance of memory storage and also provides the large
capacity of block storage. Therefore, memory storage
can be transparently used as a part of mass storage
while its low overhead access can accelerate storage
performance.
The proposed method was implemented as a de-
vice driver of the Linux kernel, and its performance
evaluation was performed by measuring the file ac-
cess performance on the Hadoop distributed process-
ing environment and also a typical benchmark perfor-
mance on the MySQL database environment. Hadoop
and MySQL were employed for the measurements in
order to evaluate the effectiveness of the proposed
method in realistic environments. The measurements
were performed in a virtualized environment. The
evaluation results show that the proposed method con-
siderably outperforms a bare SSD drive and achieves
better performance on the Hadoop and database envi-
ronments.
The rest of this paper is organized as follows. Sec-
tion 2 describes the background of the work. Section
3 describes the detailed design and implementation of
the proposed method. Section 4 shows the result of
the experiments. Section 5 describes the related work.
Section 6 summarizes the paper.
2 BACKGROUND
This section describes the background of this work,
which includes the overview of the block device
driver layer of the operating system (OS) kernel and
the existing method to combine block storage devices.
2.1 Block Device Driver Layer
The current storage devices, such as SSDs and HDDs,
are block devices, and they are not byte addressable;
thus, CPUs cannot directly access the data on these
devices. A certain size of data, which is typically a multiple of 512 bytes, needs to be transferred between
memory and a block device for CPUs to access the
data on the device. Such a unit to transfer data is
called a block.
The OS kernel employs a file system to store data
in a block device. A file system is constructed on a
block device, and files are stored in it. In order to read
the data in a file, the data first needs to be read from
a block device into memory. If the data in memory is modified, it is written back to the block device. A memory region used to store the data of a block device is called a page cache. Therefore, CPUs access a page cache on behalf of a block device.

Figure 1: The asynchronous access command processing and process context switches.
Since HDDs are orders of magnitude slower than
memory to access data on them, various techniques
were devised to amortize the slow access time. The
asynchronous access command processing is one of
commonly used techniques. Its basic idea is that a
CPU executes another process while a device pro-
cesses a command. Figure 1 depicts how it works.
Process 1 issues a system call to access data on a
block device. The kernel processes the system call
and issues an access command to the corresponding
device. The kernel then looks for the next process to
execute and performs context switching to Process 2.
Meanwhile, the device processes the command, and
sends an interrupt to notify its completion. The ker-
nel handles the interrupt, processes command com-
pletion, and performs context switching back to Pro-
cess 1. Tproc2 is the time left for Process 2 to run. Because HDDs are slow and thus their command processing time is long, Tproc2 is long enough for Process 2 to proceed with its execution.
The I/O request queueing mechanism that imple-
ments the asynchronous access command processing
has been the right choice for block devices. It imposes a high processing cost, but the cost pays off by creating additional processing time made available for other processes. Such justification for the I/O request queueing mechanism and the asynchronous access command processing is, however, no longer true when storage becomes much faster.
2.2 Problems to Combine Memory
Storage with Block Storage
The existing method combines block storage with an-
other faster block storage, which is typically an SSD,
for higher access performance (Kgil and Mudge,
2006; Koller et al., 2013; Saxena et al., 2012). It
utilizes faster block storage as cache and stores fre-
quently accessed data in it in order to improve the
average time to access data. Its open source imple-
mentation is widely available (Facebook, 2014), and
the current Linux kernel includes several implemen-
tations, such as dm-cache and bcache.
CLOSER2015-5thInternationalConferenceonCloudComputingandServicesScience
530
The implementations of the existing method in
the Linux kernel employ the device mapper mecha-
nism to constitute a single storage device. The de-
vice mapper is implemented as a software layer in the
kernel, and provides the mechanism to transfer ac-
cess requests for the constituted device to appropri-
ate underlying devices. The policy part defines how
it transfers requests. There can be multiple policy
implementations, and some of them combine block
storage with faster storage as cache. When an SSD is
used as cache storage by combining it with a HDD,
it is straightforward that the combined storage pro-
vides the block storage interface and is accessed asyn-
chronously since both of them are block storage. As
its extension, it is possible for memory storage to em-
ulate block storage and to have the device mapper combine block storage with memory storage.
The use of the device mapper requires memory
storage to emulate the block storage interface since the device mapper expects that interface. While such
emulation enables the use of the device mapper, it
causes significant software overhead. The device
mapper is basically a block device driver; thus, it re-
ceives access requests from the upper generic block
device driver framework. It then transfers the received
requests to another block storage device. The trans-
ferred requests are processed again by the generic
block device driver framework, and finally the tar-
get block storage device receives them (Ueda et al.,
2007). Therefore, processing in the generic block
device driver framework occurs multiple times, and
such processing causes a software overhead that can
be hidden in the long access latency of block storage
devices but becomes apparent for memory storage.
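To make this double traversal concrete, the following is a minimal sketch of a bio-based device-mapper target's map callback in the style of the 3.x kernels; the remap_ctx structure and its fields are hypothetical, and only the device-mapper calls themselves are real. The target merely redirects the bio to the underlying device and returns it to the generic block layer, which then queues and processes the same request a second time.

#include <linux/device-mapper.h>
#include <linux/bio.h>

/* Hypothetical per-target context: the underlying device and its offset. */
struct remap_ctx {
        struct dm_dev *dev;
        sector_t start;
};

static int remap_map(struct dm_target *ti, struct bio *bio)
{
        struct remap_ctx *ctx = ti->private;

        /* Redirect the bio to the underlying block device ... */
        bio->bi_bdev = ctx->dev->bdev;
        bio->bi_iter.bi_sector = ctx->start +
                dm_target_offset(ti, bio->bi_iter.bi_sector);

        /* ... and hand it back to the generic block layer, which processes
         * it again on behalf of the underlying device. */
        return DM_MAPIO_REMAPPED;
}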
3 DESIGN AND
IMPLEMENTATION OF THE
PROPOSED METHOD
This section first describes the design of the proposed
method. It next describes the implementation in the
Linux kernel.
3.1 Design Overview
The most significant advantage of memory storage is its performance. In order to make use of it as much as possible and not to sacrifice it, the software overhead to access it must be kept to a minimum. The existing
method to combine block storage devices is, however,
inappropriate in this sense because of its inefficiency
that is inherent in its use of the block storage interface
and its asynchronous access. As described in Section 2.1, the overhead of the block storage interface and its asynchronous access is significant, and it must be avoided.

Figure 2: The design overview of the proposed architecture (the device driver of the proposed method, memory storage, the block storage device driver, and block storage, with access paths (a), (b), and (c)).
The proposed method keeps the access overhead to memory storage to a minimum by making use of direct and synchronous access to memory storage. Memory
storage provides the memory interface, which means
that there is no need to use a device driver to ac-
cess it; thus, the device driver of the proposed method
can directly access memory storage. Such direct ac-
cess allows the least access overhead to memory stor-
age. Moreover, because memory storage allows synchronous access, whose software overhead is much lower than that of asynchronous access, the proposed method aggressively makes use of synchronous access.
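As a rough illustration of what direct and synchronous access means here, the following sketch maps a physical memory storage range once and then serves a read with a plain memory copy; the physical address, size, and helper names are assumptions made only for this example, not the actual implementation.

#include <linux/io.h>
#include <linux/errno.h>
#include <linux/types.h>
#include <linux/string.h>

/* Hypothetical physical range reserved for memory storage (e.g., NV-DIMM). */
#define MEMSTG_PHYS_BASE 0x100000000ULL
#define MEMSTG_SIZE      (1ULL << 30)    /* 1GB, as in the experiments */

static void __iomem *memstg_base;

/* Map the byte-addressable storage once; no block device driver is needed. */
static int memstg_init(void)
{
        memstg_base = ioremap_cache(MEMSTG_PHYS_BASE, MEMSTG_SIZE);
        return memstg_base ? 0 : -ENOMEM;
}

/* A read is a synchronous copy; no command is issued to a device and no
 * interrupt or context switch is needed to wait for completion. */
static void memstg_read(void *dst, loff_t offset, size_t len)
{
        memcpy(dst, (void __force *)memstg_base + offset, len);
}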
Figure 2 depicts the design overview of the archi-
tecture of the proposed method. First, we consider
reading data. There are three access paths, which are
shown as (a), (b), and (c) in the figure. When data is
available on memory storage, which is shown as (a),
the device driver of the proposed method provides di-
rect and synchronous access to memory storage. Such
access enables the least overhead; thus it should be
utilized as much as possible. In order to make it possi-
ble, data needs to be read ahead from block storage to
memory storage, which is shown as (b). When read-
ing ahead is successful, data can be continuously read
from memory storage.
When data is not available on memory storage, a
straightforward way is to read in the requested data
from block storage to memory storage. This way, however, can unnecessarily pollute memory storage because the data read into memory storage may never be used again. Therefore, the proposed method reads
the data from block storage bypassing memory stor-
age, which is shown as (c).
Second, we consider writing data. There are also
three data paths, (a), (b), and (c), which are shown
in the figure. Unlike reading, there is no need to
read ahead into memory storage for writing since
TowardsHighPerformanceBigDataProcessingbyMakingUseofNon-volatileMemory
531
valid data is written onto memory storage. Therefore,
data can be written onto memory storage whenever
free spaces are available on memory storage, which
is shown as (a). A free space may still contain clean data that is valid for reading; the unavailable spaces of memory storage are those where dirty data resides. The unavail-
able spaces that contain dirty data become free spaces
when the dirty data is written back to block storage,
which is shown as (b). When there is no free space
available, data can be written to block storage, which
is shown as (c). The path (c) is, however, considered
to be rarely used since writing to memory storage and
writing back to block storage can be processed in par-
allel.
The device driver of the proposed method man-
ages memory storage and also interacts with a block
storage device driver. A block storage device driver is
not a part of the driver of the proposed method. By
separating the management of memory storage and
block storage, there is no restriction on the choice of
block storage, and arbitrary block storage can be com-
bined with memory storage.
3.2 Implementation in the Linux Kernel
The Linux kernel provides the device mapper mech-
anism, which can be used to combine multiple block
storage devices. The existing method uses this mech-
anism as described in Section 2.2. The proposed
method, however, does not use the device mapper
mechanism in order to avoid both the overhead of the device mapper itself and the overhead incurred by having memory storage
emulate block storage.
The proposed method implements its own request handling function, which can provide synchronous access depending upon the location of the requested data:

/* Entry point that receives access requests (bios) for the combined device. */
void memory_make_request(struct request_queue *q,
                         struct bio *bio);
This interface is typically used by the device driver
of ramdisk, which provides synchronous access. The
proposed method makes use of this interface and pro-
vides synchronous access when data is available on
memory storage for reading or when a free space is
available on memory storage for writing. In this case,
memory storage is considered to be working just as
ramdisk. When data is unavailable on memory stor-
age for reading or when no free space is available
on memory storage for writing, however, the access
request is transferred to the device driver of block
storage. Then, the block storage device driver asyn-
chronously processes the request.
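The following is a minimal sketch of how such a function can serve a request synchronously when the data is on memory storage and otherwise pass it on to the block storage driver; the lookup and copy helpers (memstg_find, memstg_copy) and the backing_bdev variable are hypothetical stand-ins for the driver's real cache management, not the actual implementation.

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical helpers: locate cached data on memory storage and copy a
 * bio's payload to or from it. */
extern void *memstg_find(struct bio *bio);
extern void memstg_copy(void *cached, struct bio *bio);

/* Block device of the underlying block storage (opened elsewhere). */
extern struct block_device *backing_bdev;

void memory_make_request(struct request_queue *q, struct bio *bio)
{
        void *cached = memstg_find(bio);

        if (cached) {
                /* Data (or free space) is on memory storage: serve the
                 * request synchronously, like a ramdisk, and complete the
                 * bio immediately; no queueing and no interrupt. */
                memstg_copy(cached, bio);
                bio_endio(bio, 0);
                return;
        }

        /* Otherwise redirect the bio to block storage, whose device driver
         * processes the request asynchronously. */
        bio->bi_bdev = backing_bdev;
        generic_make_request(bio);
}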
The device driver of the proposed method also im-
plements the functions for reading ahead and writing
back data between memory storage and block stor-
age. Because they need to be invoked in parallel with
reading and writing data from/to memory storage, dedicated kernel threads process them. They are invoked at appropriate times in order to improve the efficiency of the proposed method.
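As a sketch of how such a dedicated thread can be structured, the following uses the kernel's kthread interface; the dirty-block helpers and the back-off interval are assumptions made for illustration, since the thread's policy is not described at this level of detail.

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/types.h>

/* Hypothetical helpers of the driver's cache management. */
extern bool memstg_has_dirty_blocks(void);
extern void memstg_writeback_one_block(void);

/* Write dirty data from memory storage back to block storage in the
 * background, in parallel with foreground reads and writes. */
static int memstg_writeback_thread(void *data)
{
        while (!kthread_should_stop()) {
                if (memstg_has_dirty_blocks())
                        memstg_writeback_one_block();
                else
                        msleep(10);     /* nothing dirty; back off briefly */
        }
        return 0;
}

/* Started at driver initialization, for example:
 *   kthread_run(memstg_writeback_thread, NULL, "memstg_wb");
 */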
4 EXPERIMENT RESULTS
First, file I/O throughput was measured by using the
Hadoop TestDFSIO benchmark program to see the performance impact on big data processing. The measured results are compared with those of a bare SSD drive. Second, the performance of database processing was measured by using the TPCC-MySQL benchmark program to see the performance impact on database processing.
4.1 Experiment Environment
Since there is no publicly available system equipped with memory storage, we used DRAM to emulate it. Since MRAM, which is considered to be the best match for the proposed method, performs comparably to DRAM, the difference in results should be negligible.
All measurements described below were performed
on the Linux kernel 3.14.12 that includes the imple-
mentation of the proposed method. Execution times
were measured using the TSC (Time Stamp Counter)
register.
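For reference, a minimal user-space sketch of TSC-based timing is shown below; __rdtsc() is the compiler intrinsic for the RDTSC instruction, and this fragment is only an illustration of the measurement technique, not the actual measurement code.

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
        uint64_t start, end;

        start = __rdtsc();              /* read the Time Stamp Counter */
        /* ... code under measurement ... */
        end = __rdtsc();

        printf("elapsed cycles: %llu\n", (unsigned long long)(end - start));
        return 0;
}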
The system used for this experiment is a PC sys-
tem equipped with the Intel Core i7-4930K 3.4GHz
and 64GB of DRAM. The KVM virtualization soft-
ware of Linux is employed to construct experiment
environments that consist of virtual machines. Each
virtual machine is configured with two CPUs, main memory, and a dedicated block storage device. The main memory sizes differ depending on each virtual machine's role and are described below. The CFD
S6TNHG6Q 128GB SATA SSD is used for a dedi-
cated block storage device, and a whole SSD is as-
signed to a single virtual machine. When the pro-
posed method is used for an experiment, memory
storage consists of 1GB of memory.
4.2 Results of Hadoop TestDFSIO
This section shows the measurement results of the
Hadoop TestDFSIO benchmark program. For this ex-
periment, four virtual machines were configured to be
a Hadoop cluster. One virtual machine becomes the
master node, and the others are slave nodes. The main
memory size of the master node is 8GB, and that of
CLOSER2015-5thInternationalConferenceonCloudComputingandServicesScience
532
slave nodes is 1GB.

Figure 3: Comparison of read performance by Hadoop TestDFSIO (throughput in MB/sec vs. data size in MB; proposed method vs. SATA SSD).

Hadoop employs Hadoop Dis-
tributed File System (HDFS) for the file service of
its applications (Shvachko et al., 2010). The HDFS
servers consist of the name node and data nodes,
which are executed as user processes. The master
node runs the name node, which determines, upon access requests from clients, on which data nodes the requested files reside. The data nodes of the slave nodes manage the data files.
We measured the file I/O throughput of read-
ing files of various data sizes by using the Hadoop
TestDFSIO benchmark program. Larger numbers indicate better I/O throughput. The size of each file created
for measurements was fixed to 100MB, and the num-
ber of files was changed from one to ten in order to
change the total data sizes from 100MB to 1GB. We
first executed the writing program of TestDFSIO to
create files for reading. After flushing the page cache
of the data nodes, we executed the reading program of
TestDFSIO, and measured the costs. HDFS provides
two methods for reading. One receives data from a
data node through remote procedure calls (RPCs), and
the other directly interacts with a local file system.
The latter one is called short circuit read (SCR). Both
methods were used for the measurements. Figures 3 and 4 show the results without and with SCR enabled, re-
spectively.
The measurement results show a significant per-
formance advantage of the proposed method for the
Hadoop TestDFSIO. For reading from 100MB to 1GB
data sizes without SCR, it performs approximately
39.21% to 114.36% better than SSD. For reading with
SCR enabled, it performs approximately 135.32% to
624.10% better than SSD. On average, the proposed
method performs 78.45% better without SCR and
266.08% better with SCR than SSD.
A realistic evaluation with Hadoop shows that the
proposed method provides a significant boost to the
file access throughput of Hadoop. Therefore, it is certain that a wide range of applications that involve a large amount of file access can benefit from it.

Figure 4: Comparison of read performance by Hadoop TestDFSIO with SCR enabled (throughput in MB/sec vs. data size in MB; proposed method vs. SATA SSD).

Figure 5: Comparison of TPCC-MySQL performance (TpmC for the fsync and O_DIRECT flush methods with the none and directsync cache modes; proposed method vs. SATA SSD).
4.3 Results of TPCC-MySQL
This section shows the measurement results of the
TPCC-MySQL benchmark program. For this experiment, a single virtual machine with 8GB of main memory was configured. The number of warehouses is 40, which constitutes a database of approximately 4GB.
The buffer pool size of the InnoDB storage engine
is 4GB. The measurements were performed with the
two flush methods of InnoDB and the two storage
cache modes of KVM; thus, there are four combinations in total. The InnoDB flush methods used for the measurements are fsync and O_DIRECT, and the
KVM storage cache modes are none and directsync.
The none cache mode provides the write buffer while
the directsync cache mode does not. Figure 5 shows
the results.
The performance improvement enabled by the
TowardsHighPerformanceBigDataProcessingbyMakingUseofNon-volatileMemory
533
proposed method is significant. The proposed method
executes the benchmark from 2.39x to 4.60x faster
than SSD. The difference between the proposed
method and SSD is the largest when the combination
of the O_DIRECT InnoDB flush method and the KVM
directsync cache mode is used. Since this combina-
tion provides no buffering of data transfer in the OS
kernel and the KVM virtualization software, the cost
to write data in storage becomes the maximum among
the combinations used for the experiments. The other
combinations provide buffering somewhere in the OS
kernel and the KVM virtualization software; thus, the differences are smaller but still large, ranging from 2.39x to 2.62x.
5 RELATED WORK
A technique to combine block storage with another
block storage for higher access performance existed
before SSDs became widely available and popular. DCD (Hu and Yang, 1996) first stores data in cache storage so that it can make use of sequential access, whose performance is typically much better than that of random access, and thereby improves write performance. The emergence of SSDs stimulated the
research and development of various caching tech-
niques (Kgil and Mudge, 2006; Koller et al., 2013;
Saxena et al., 2012; Facebook, 2014) in order to
make use of their high performance. Because SSDs
are block storage, all of them combine block stor-
age with another block storage, and provide the block
storage interface. The proposed method is different
from them since it combines memory storage with
block storage. Because memory storage allows syn-
chronous access, the proposed method aggressively
makes use of it in order to reduce the total access cost.
The Linux kernel provides the device mapper as
the software layer to combine multiple storage de-
vices and to constitute a single storage device. When
the device mapper is used to combine memory stor-
age with block storage, it requires memory storage
to emulate block storage since the device mapper can
interact only with the block storage interface. It also
causes significant software overhead since the access
requests can be processed by the generic block device
driver framework multiple times (Ueda et al., 2007).
The proposed method does not use the device mapper
mechanism in order to avoid such overheads, and im-
plements its own function that can provide the direct
and synchronous access interface to memory storage.
6 SUMMARY
Memory storage technologies are emerging, and they
should be effectively utilized in cloud computing en-
vironments in order to accelerate storage performance
for big data processing. This paper proposed a
method that combines block storage with memory
storage and makes use of memory storage as a cache for block storage in order to remove the capacity limitation of memory storage. The
proposed method effectively utilizes the high perfor-
mance of memory storage and also provides the large
capacity of block storage. Therefore, memory storage
can be transparently used as a part of mass storage
while its low overhead access can accelerate storage
performance. The proposed method was implemented
as a device driver of the Linux kernel. Its performance
evaluation shows that it outperforms a bare SSD drive
and provides better performance on the Hadoop and
database environments.
REFERENCES
Facebook (2014). Flashcache. https://github.com/facebook/flashcache.
Hu, Y. and Yang, Q. (1996). DCD - Disk Caching Disk: A new approach for boosting I/O performance. In Pro-
ceedings of the 23rd Annual International Symposium
on Computer Architecture, pages 169–178.
Kgil, T. and Mudge, T. (2006). Flashcache: A NAND flash
memory file cache for low power web servers. In
Proceedings of the 2006 International Conference on
Compilers, Architecture and Synthesis for Embedded
Systems, CASES ’06, pages 103–112, New York, NY,
USA. ACM.
Koller, R., Marmol, L., Rangaswami, R., Sundararaman,
S., Talagala, N., and Zhao, M. (2013). Write poli-
cies for host-side flash caches. In Proceedings of the
11th USENIX Conference on File and Storage Tech-
nologies, FAST’13, pages 45–58, Berkeley, CA, USA.
USENIX Association.
Saxena, M., Swift, M. M., and Zhang, Y. (2012). FlashTier:
A lightweight, consistent and durable storage cache.
In Proceedings of the 7th ACM European Conference
on Computer Systems, EuroSys ’12, pages 267–280,
New York, NY, USA. ACM.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R.
(2010). The Hadoop distributed file system. In Pro-
ceedings of the 2010 IEEE 26th Symposium on Mass
Storage Systems and Technologies (MSST), MSST
’10, pages 1–10, Washington, DC, USA. IEEE Com-
puter Society.
Ueda, K., Nomura, J., and Christie, M. (2007). Request-
based device-mapper multipath and dynamic load bal-
ancing. In Proceedings of the Linux Symposium, vol-
ume 2, pages 235–243.
CLOSER2015-5thInternationalConferenceonCloudComputingandServicesScience
534