Authors:
Mikhail Hushchyn ¹; Philippe Charpentier ² and Andrey Ustyuzhanin ³
Affiliations:
¹ Yandex School of Data Analysis, Yandex Data Factory and Moscow Institute of Physics and Technology, Russian Federation
² CERN, Switzerland
³ Yandex School of Data Analysis, Yandex Data Factory, Moscow Institute of Physics and Technology, National Research University Higher School of Economics (HSE) and NRC Kurchatov Institute, Russian Federation
Keyword(s):
Structured Data Analysis and Statistical Methods, Machine Learning, Information Extraction, Hybrid Data Storage Systems, Data Management, LHCb.
Related Ontology Subjects/Areas/Topics:
Artificial Intelligence; Computational Intelligence; Evolutionary Computing; Information Extraction; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Machine Learning; Soft Computing; Structured Data Analysis and Statistical Methods; Symbolic Systems
Abstract:
This paper shows how machine learning algorithms and statistical methods can be applied to data
management in hybrid data storage systems. Such systems typically combine two storage types: cheap, slow
storage for rarely used data and fast, expensive storage for frequently used data. Placing data this way
helps achieve an optimal performance/cost ratio for the system. We use classification algorithms to
estimate the probability that the data will be used often in the future. Then, using risk analysis, we
decide where the data should be stored. We show how to estimate the optimal number of replicas of the data
using regression algorithms and a Hidden Markov Model. Based on the probability, the risks and the optimal
number of data replicas, our system finds the optimal data distribution in the hybrid data storage system.
We present simulation results of our method for the LHCb hybrid data storage.
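
The placement step described in the abstract (a usage probability followed by a risk comparison) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `MisplacementCosts`, `expected_risks` and `choose_storage`, the two penalty values, and the example probabilities are all hypothetical, and the probability is assumed to come from some already-trained classifier.

```python
from dataclasses import dataclass


@dataclass
class MisplacementCosts:
    """Hypothetical penalties in arbitrary units (not taken from the paper)."""
    hot_on_slow: float   # cost of leaving frequently used data on slow storage
    cold_on_fast: float  # cost of keeping rarely used data on fast storage


def expected_risks(p_hot: float, costs: MisplacementCosts) -> tuple[float, float]:
    """Return the expected cost of choosing slow storage and of choosing fast storage."""
    risk_slow = p_hot * costs.hot_on_slow            # wrong only if the data turns out hot
    risk_fast = (1.0 - p_hot) * costs.cold_on_fast   # wrong only if the data turns out cold
    return risk_slow, risk_fast


def choose_storage(p_hot: float, costs: MisplacementCosts) -> str:
    """Pick the storage type with the lower expected cost (ties go to slow storage)."""
    if not 0.0 <= p_hot <= 1.0:
        raise ValueError("p_hot must be a probability in [0, 1]")
    risk_slow, risk_fast = expected_risks(p_hot, costs)
    return "fast" if risk_fast < risk_slow else "slow"


# Assumed example: a slow read of popular data is four times as costly
# as wasting fast storage on unpopular data.
costs = MisplacementCosts(hot_on_slow=4.0, cold_on_fast=1.0)

# Assumed classifier outputs: probability that each dataset will be used often.
predicted = {"dataset_a": 0.9, "dataset_b": 0.1, "dataset_c": 0.25}
placements = {name: choose_storage(p, costs) for name, p in predicted.items()}
```

Under these assumed costs the rule reduces to a probability threshold of `cold_on_fast / (hot_on_slow + cold_on_fast)`, here 0.2, so changing the penalty ratio shifts how readily data is moved to fast storage.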