SAURIDA: CLOUD COMPUTING BASED
Data Mining System in Telecommunication Industry
Qing Ke, Bin Wu
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Yuxiao Dong, Lei Qin
School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China
Keywords: Cloud Computing, Data Mining, Data Flow, Telecommunication.
Abstract: Telecommunication data mining has been often used as a background application to motivate many
technical problems in data mining research. However, traditional mining algorithms face new challenges
which are tremendous amount of data and high time and space complexity of algorithms. Recently, Map-
Reduce parallel computing model has been emerging. In this paper, we combine data mining with Map-
Reduce based cloud computing to meet the challenges and showcase our applied system named Saurida. As
a full functionality system, we provide data flow oriented preprocessing utilities which achieve almost linear
speedup and extensively support for user defined functions, and we also provide many data mining
algorithms. More importantly, we elaborate several application scenarios as real-word requirements of
telecom industry by employing a large volume of data obtained from telecom operator. And we validate our
system has a good scalability, effectiveness and efficiency.
1 INTRODUCTION
Telecommunication data analysis has stimulated
great interests in recent years. Typical application
scenarios are customer churn prediction and
customers’ relationship management.
However, these analysis methods face new
challenges. Firstly, the telecom industry generates
and stores a tremendous amount of data. Secondly,
many data mining algorithms have high time and
space complexity.
Traditional business solutions of data mining are
commercial database or data warehouse systems or
commercial data mining tools. However, these
systems or tools are low scalability and high cost. In
research areas, Wang et.al (Wang, 2009) developed
a working data mining system on real mobile
communication data, but the system mainly focused
on algorithms in research such as sequential patterns
mining and community detection.
Recently, the Map-Reduce (Dean, 2004)
computational model and its open-source
implementation, Hadoop, are widely applied both in
research and industry areas. The model mainly
focuses on share-nothing parallelism, and its storage
system focuses on scalability. These advantages are
very suitable for telecom data mining.
In this paper, we combine data mining with
Map-Reduce based cloud computing to meet the
challenges and introduce our applied system,
Saurida. The system is built on distributed cluster
infrastructure as hardware and Hadoop distributed
computing platform as
fundamental software. As a
full functionality system, we provide data
preprocessing utilities, data mining algorithms
.
More importantly, we elaborate several application
scenarios as real-word requirements of telecom
industry by employing a large volume of data
obtained from telecom operator, we validate our
system from the view of scalability, effectiveness
and efficiency. In summary, Saurida takes the
following challenges as its destination as well as the
contributions to this work:
Data flow oriented and almost linear
speedup of preprocessing.
Extensive support for user defined functions.
Nearly linear speedup of data mining
algorithm.
516
Ke Q., Wu B., Dong Y. and Qin L..
SAURIDA: CLOUD COMPUTING BASED - Data Mining System in Telecommunication Industry.
DOI: 10.5220/0003387905160519
In Proceedings of the 1st International Conference on Cloud Computing and Services Science (CLOSER-2011), pages 516-519
ISBN: 978-989-8425-52-2
Copyright
c
2011 SCITEPRESS (Science and Technology Publications, Lda.)
Figure 1: Saurida architecture.
Real-word application in telecom industry.
The rest of this paper is structured as follows.
Section 2 describes system architecture. Section 3
discusses implementation issues. In Section 4, we
present several applications. Finally, we draw the
conclusion and discuss future work.
2 SYSTEM ARCHITECTURE
The architecture of Saurida is depicted in Figure 1.
As can be seen, the system consists of three layers.
The functions and features of each layer are
described as follows:
Application layer implements business
applications of telecom industry, such as user
behavior analysis, customer churn prediction
and social community detection.
Data mining techniques layer completes
functions mainly including preprocessing, data
mining and results visualization, and a very
important component named Chain Engine
which is responsible for chaining the
preprocessing utilities together and submitting
to Hadoop, we will discuss it in detail in
Section 3.2.
Infrastructure layer consists of Hadoop
platform where every slave node runs an
Expression Evaluation Engine (EEE) instance
which is the essential component to
implement custom processing.
3 IMPLEMENTATION
We set up a cluster environment, composed of one
master node and 21 computing nodes (Intel Xeon
2.50GHz×4, 8GB RAM, 250GByte×4 SATA II
disk, Linux RH4 OS). The cluster is interconnected
through 1000Mbps Switch. And deployed Hadoop
platform version is 0.20.0.
3.1 Preprocessing Utilities
Our system provides many preprocessing utilities
which are mainly categorized into operations on
record and operations on attribute. All of them are
implemented through running Map-Reduce jobs.
We complete a performance benchmark which
processes a terabyte data by running some typical
utilities on 128 nodes. Figure 2 depicts the running
time. We can see that all of them complete duration
1,500 seconds except the Merge because it actually
process total 2 terabyte data and transferring
intermediate data from mapper to reducer also
consumes some time. Figure 3 shows derived
scalability on 32, 64 and 128 nodes. The experiment
results indicate that the parallel data preprocessing
has excellent scalability.
SAURIDA: CLOUD COMPUTING BASED - Data Mining System in Telecommunication Industry
517
Figure 2: Benchmark results of preprocessing MFUs.
Figure 3: Scalability of preprocessing Utilities.
3.2 Chain Engine
We use native APIs provided by Hadoop,
ChainMapper and ChainReducer to develop chain
engine which is working as following steps:
1. Chaining all the user selected utilities together;
2. Changing the logical data flow into Map-
Reduce jobs;
3. Submitting the jobs to Hadoop;
The immediate benefit of chained pattern is a
dramatic reduction in disk I/O because the output of
the first utility becomes the input of the second one,
and so on until the last one, all the intermediate
results do not need to flush to disk. So the execution
time of the data flow dramatic reduced.
3.3 EEE
To accommodate specialized data processing tasks,
Saurida has extensive support for User Defined
Functions (UDFs). Essentially all aspects of
preprocessing utilities in Saurida including Select,
Derive, Replace and so on can be customized
through the use of UDFs.
When the Map-Reduce job is executing, all the
UDFs are put into EEE instance. Every DataNode of
the Hadoop cluster is running an instance, we can
achieve and the engine will output the result of each
UDF. EEE uses traditional Reverse Polish Notation
algorithm to evaluate the expression.
4 APPLICATION SCENARIOS
4.1 Ad-hoc Query
In this Section, we describe a sample ad-hoc data
analysis tasks. The SQL is:
SQL: Select ID, fee_A, fee_B,
case when fee_A>100 then 1
When fee_A>200 then 2
else 0 as fee_A_interval
from fee_info where fee_A>50;
Figure 4 is shown with the number of nodes
increasing from 6, 9 to 17, the chain which process
12GB data has excellent scalability, that is to say,
the speedup ratio increases nearly linearly with the
number of nodes.
Figure 4: Running time and scalability of a sample ad-hoc
query.
4.2 PCA
PCA (Wold, 1987) transforms a number of possibly
correlated variables into a number of uncorrelated
variables called principal components, commonly by
an orthogonal transformation based on variance.
PCA is very useful both in research and industry
area. In many telecom data mining applications, the
training data set may be as many as hundreds of
feature items. However, many features are correlated,
and these relevant features can be removed.
We run the parallel implementation of PCA on a
real-world data set and on different number of nodes
CLOSER 2011 - International Conference on Cloud Computing and Services Science
518
to test performance and scalability. The data is 12
GB and contains 49 fields. We set the number of
principal component to 10. The results are shown in
Figure 5. And we can see that the PCA algorithm
achieves good scalability.
Figure 5: Running time and scalability of PCA.
4.3 NN
We run the parallel implementation of Feed-forward
back-propagation neutral network (BP) (
Williams,
1986) on a real-world data set and on different
number of nodes to test performance and scalability.
The data set is 14 GB, containing total 67 fields, we
choose 55 fields of them as training attributes and
the classification attribute contains 2 values. Figure
6 shows the results of running time on 6, 9 and 17
nodes and corresponding derived scalability. And
the parallel NN algorithm also achieves nearly linear
speedup.
Figure 6: Running time and scalability of NN.
5 CONCLUSIONS AND FUTURE
WORK
Motivated by recently increasing request for the
capability of large scale data computing in
telecommunications industry, in this paper, we
introduce our system, Saurida, and demonstrate the
system has advantages that open-source or
commercial data mining tools do not have. These
advantages include ability to process terabytes scale
of data, high performance, linear speedup, cost-
effective and custom processing. From the industrial
view, we describe several application scenarios over
large scale data as real-word requirements.
Nonetheless, Saurida is an experimental
framework at present. Further development and
improvement is needed at aspects such as
functionality, performance and reliability to meet
telecom industry requirements. Essentially, we hope
our system can serve as a practical data mining
system in telecom industry.
ACKNOWLEDGEMENTS
This work is supported by the National Science
Foundation of China (Grant No.60905025,
90924029, 61074128).
REFERENCES
Wold, S., Esbensen, K., Geladi, P., 1987. Principal
Component Analysis. Chemometrics and Intelligent
Laboratory Systems 2, pp. 37-52.
T. Wang, B. Yang, J. Gao, D. Yang, S. Tang, H. Wu, K.
Liu, and J. Pei, 2009. MobileMiner: A Real World
Case Study of Data Mining in Mobile Communication.
In SIGMOD'09, Proceedings of the 2009 ACM
SIGMOD International Conference on Management of
Data.
Dean, J., Ghemawat, S., 2004. MapReduce: Simplified
data processing on large clusters. In OSDI '04, Sixth
Symposium on Operating System Design and
Implementation.
Williams R. J., Rumelhart D. E., Hinton G. E., 1986.
Learning representation by back-propagating errors.
Nature, vol. 323, pp. 533–536.
SAURIDA: CLOUD COMPUTING BASED - Data Mining System in Telecommunication Industry
519