Authors:
Nir Regev; Asaf Shabtai and Lior Rokach
Affiliation:
Dept. of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Beer Sheva, Israel
Keyword(s):
EDA (Exploratory Data Analysis), Neural Network, SQL, Supervised Learning.
Abstract:
In the current landscape of data analytics, data scientists predominantly use in-memory processing tools such as Python’s pandas or big data frameworks like Spark to conduct exploratory data analysis (EDA). These methods, while powerful, often entail substantial trade-offs, including significant consumption of time, memory, and storage, alongside elevated data scanning costs. Considering these limitations, we developed iDAT, a cost-effective interactive data exploration method. Our method uses a deep neural network (NN) to learn the relationship between queries and their results, providing a rapid inference layer for predicting query results. To validate the method, we let 20 data scientists run EDA queries using the system underlying this method. We show that it reduces the need to scan data during inference (query calculation). We evaluated this method using 12 datasets and compared it to the latest query approximation engines (VerdictDB, BlinkDB) in terms of query latency, model weight, and accuracy. Our results indicate that iDAT predicted query results with a WMAPE (weighted mean absolute percentage error) ranging from approximately 1% to 4%, which, for most of our datasets, was better than the results of the compared benchmarks.
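The abstract does not spell out how WMAPE is computed. As a minimal sketch, assuming the commonly used definition (the sum of absolute errors divided by the sum of absolute true values), the metric could be computed as below. The function name `wmape` and the sample values are hypothetical and are not taken from the paper.

```python
def wmape(actual, predicted):
    """Weighted mean absolute percentage error, returned as a fraction.

    Assumed definition: sum(|actual - predicted|) / sum(|actual|).
    The paper may use a different weighting; this is only an illustration.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total_abs_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    total_abs_actual = sum(abs(a) for a in actual)
    if total_abs_actual == 0:
        raise ValueError("WMAPE is undefined when all true values are zero")
    return total_abs_error / total_abs_actual


# Hypothetical true query results and model predictions (made-up numbers).
true_results = [100.0, 250.0, 50.0]
predicted_results = [98.0, 255.0, 51.0]

score = wmape(true_results, predicted_results)
print(f"WMAPE: {score:.2%}")
```

Under this definition, errors on queries with large true results weigh more than errors on queries with small results, which avoids the instability of a plain per-query percentage error when some true values are close to zero.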