class of programs according to Internet rules. Python is the most commonly used development language for this technology. A web crawler works by defining crawling rules and specifying the entry (seed) URLs. (Ji, 2017)
Firstly, the developer selects a number of seed pages according to the requirements and saves the corresponding URLs. The algorithm then places these URLs in a queue of URLs to be crawled. Next, the program downloads the content behind each URL, extracts the key information, and moves the processed URLs into a crawled-URL queue. Meanwhile, the DNS resolution data and the downloaded page content produced along the way are saved in the downloaded-webpage database.
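As an illustration, the following minimal Scrapy spider sketches this loop in Python. The site address and CSS selectors are hypothetical placeholders rather than a real page layout; Scrapy itself maintains the URL queues and the deduplication of already-grabbed URLs described above.

import scrapy

class ProductSpider(scrapy.Spider):
    # Minimal crawler sketch: seed URLs, page download, key-information
    # extraction, and link following. Scrapy manages the URL queue and
    # skips URLs it has already crawled.
    name = "products"
    start_urls = ["https://example.com/products"]  # hypothetical seed URLs

    def parse(self, response):
        # Extract key information from the downloaded page
        # (placeholder selectors, not a real site's structure).
        for item in response.css("div.product"):
            yield {
                "title": item.css("h2::text").get(),
                "price": item.css("span.price::text").get(),
                "url": response.url,
            }
        # Enqueue links to further product pages for crawling.
        for href in response.css("a.detail::attr(href)").getall():
            yield response.follow(href, callback=self.parse)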
2.2  Hadoop Processing Platform 
As a development and application ecosystem, the Hadoop platform supports data-intensive applications, and its family of components keeps growing over time. The most important components are the distributed file system HDFS and the parallel programming model MapReduce: HDFS is responsible for the distributed storage of massive data, while MapReduce performs parallel computation over the distributed data, and the two complement each other. Besides Hadoop and MapReduce, the Hadoop ecosystem has many subprojects, including Ambari, Hive, HBase, ZooKeeper, Flume, Mahout, etc. With multiple components cooperating under a clear division of labor, even inexperienced developers can exploit the advantages of clusters to handle big data conveniently and quickly. (Li, 2017)
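As a concrete sketch of the MapReduce model, the following pair of Python scripts implements a word count runnable under Hadoop Streaming (one common way to execute Python code on a Hadoop cluster); the script names and the use of word counting as the example are illustrative assumptions, not the system's actual jobs.

#!/usr/bin/env python3
# mapper.py: Hadoop Streaming feeds an input split on stdin;
# emit one (word, 1) pair per word, tab-separated.
import sys

for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")

#!/usr/bin/env python3
# reducer.py: Hadoop sorts mapper output by key, so all counts
# for a given word arrive contiguously; sum them per word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(current + "\t" + str(total))
        current, total = word, 0
    total += int(count)
if current is not None:
    print(current + "\t" + str(total))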
2.3  J2EE Framework 
J2EE is a platform for simplified Java web development, designed and developed by Sun Microsystems, on which a wide range of application software can be built. To streamline application development in large enterprises, J2EE provides reusable component models that improve development efficiency. It also defines a layered architecture that handles many low-level concerns automatically, thus reducing the skill required of developers building application software. (Ma, 2022)
2.4  Development Environment 
In this paper, the author briefly introduces the technologies related to developing and using the platform. In the big data precision marketing system, Hadoop serves as a big data server cluster that processes the data and stores the results in a MySQL database, and the corresponding application platform is developed with JavaWeb technology.
According to the system's data volume and overall operational requirements, this paper builds a Hadoop 3.3.1 cluster with three nodes. The distributed coordination system ZooKeeper 3.4.1, the distributed file system HDFS 2.6.5, Flume 1.9.0, Hive 0.13.1, and HBase 2.6.5 are then installed and deployed on these three nodes, completing the initial construction of the Hadoop cluster. The cluster is developed under Linux; this paper selects the CentOS 6.5 Server release of the Linux operating system. Scrapy 2.5 is used as the web crawler framework, and Python 3.8 is chosen as the development language. (Lin, 2016)
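As a quick sanity check of such a deployment, one might verify HDFS connectivity from Python with the third-party hdfs (HdfsCLI) package; this is a minimal sketch under assumptions: the NameNode hostname is hypothetical, and the WebHDFS port is 50070 on Hadoop 2.x but 9870 on Hadoop 3.x.

from hdfs import InsecureClient

# Hypothetical WebHDFS endpoint of the cluster's NameNode.
client = InsecureClient("http://namenode:50070", user="hadoop")
print(client.list("/"))  # list the HDFS root directory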
In this system, the front end of the JavaWeb application is built with Bootstrap + jQuery, with JavaScript, HTML, and CSS as the development languages. The back-end Java development tool is IDEA 2021.1.3 (Ultimate Edition), the development environment is JDK 1.8, and the J2EE stack of Tomcat + Spring MVC + Spring + MyBatis is used in the implementation of this system. The development language is Java, and MySQL 8.0.28 is selected to manage the data.
3  OVERALL DESIGN 
According to enterprise needs, the Hadoop-based big data precision marketing system establishes a top-down, one-stop pipeline for data collection, analysis, processing, and visualization. The main functions of data collection, data storage, data cleaning, data query, and data analysis are supported by the Hadoop ecosystem cluster, and visualization is realized with JavaWeb technology.
First of all, data are collected from three sources: local enterprise server data gathered by Flume; URL data collected from product detail pages by the Python web crawler; and shared data from Taobao, Weibo, and other platforms accessed through an external JDBC interface. These data are initially cached in HDFS distributed storage, while the crawler's URL set is stored in Redis. The data computation module is implemented with MapReduce, which analyzes the preliminary data, manages the crawler results, and applies data mining techniques such as association rule algorithms to build consumer portraits. After processing, the data are saved in HDFS and Hive.
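To make the association rule step concrete, the following single-machine sketch computes support and confidence for item pairs over toy purchase records; the transaction data are invented for illustration, and a production version would run the counting as a MapReduce job over the data in HDFS.

from itertools import combinations
from collections import Counter

# Toy purchase records standing in for the processed HDFS data.
transactions = [
    {"phone", "case", "charger"},
    {"phone", "case"},
    {"phone", "charger"},
    {"laptop", "mouse"},
]

min_support, min_confidence = 0.5, 0.6
n = len(transactions)

# Count single items and item pairs across all transactions.
item_counts = Counter(i for t in transactions for i in t)
pair_counts = Counter(
    p for t in transactions for p in combinations(sorted(t), 2)
)

# Emit rules A -> B whose support and confidence clear the thresholds.
for (a, b), cnt in pair_counts.items():
    support = cnt / n
    if support < min_support:
        continue
    for lhs, rhs in ((a, b), (b, a)):
        confidence = cnt / item_counts[lhs]
        if confidence >= min_confidence:
            print(lhs + " -> " + rhs +
                  " (support=%.2f, confidence=%.2f)" % (support, confidence))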