Shen, 2018). Similarly, Saadi et al. have integrated a
plethora of external factors (Saadi and Wang, 2017),
such as pricing and meteorological conditions, into
their predictive models, employing a diverse array of
algorithms including decision trees, ensemble
decision trees, and random forests. The introduction
of deep learning to this field has markedly improved
forecasting capabilities, owing to its strong
performance in many tasks (Lin, 2024; Wang, 2019).
Ke et al. have adeptly integrated deep learning with
machine learning techniques, employing Long Short-
Term Memory networks (LSTM) to encapsulate
temporal dependencies and Convolutional Neural
Networks (CNN) to model spatial correlations (Ke
and Zheng, 2017). Li has introduced a Radial Basis
Function (RBF) neural network model, optimized via
quantum particle swarm optimization, which takes
into account an array of influencing factors such as
historical demand, traffic congestion indices, and
meteorological conditions (Li and Wen, 2018).
Despite these methodological advancements, the
ubiquitous challenge of reconciling supply and
demand in taxi services persists, adversely affecting
the operational efficiency of transportation systems
and the commuting experience of urban dwellers.
This paper addresses this challenge by comparing
five taxi OD demand forecasting methods based on data
analytics and machine learning algorithms, and
identifies the LSTM model as the most effective
predictor.
The rest of this paper is organized into three
sections: the method section details the dataset and
analytical methods; the results and discussion
section evaluates the performance of the Decision
Tree (DT), LSTM, and random forest models; and the
conclusion reviews the research, discusses
limitations, and suggests future directions.
2 METHOD
2.1 Dataset Preparation
The dataset used in this study is sourced from the
Microsoft Azure Open Dataset. It encompasses a vast
collection of taxi trip records from 2009 to 2018,
totalling approximately 80 million entries
(TPSearchTool, 2022). Each entry is rich with details
such as precise latitude and longitude of pick-up and
drop-off locations, service dates and times, trip
distances, and fare amounts. The dataset's extensive
temporal and spatial coverage provides a
comprehensive view of taxi travel patterns in New
York City, offering valuable insights for urban
transportation planning and taxi service optimization.
The dataset includes several features such as Pickup
and Dropoff Latitude/Longitude, Passenger Count
and Payment Type.
The target variable for this study is the Taxi OD
demand, which is calculated by grouping data by start
and end points, demand date, and demand time, and
counting the number of orders from the same start
point to the same end point within the same hour.
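The grouping described above can be sketched in Pandas as follows. This is a minimal illustration on toy data, not the study's actual code: the column names (pickup_zone, dropoff_zone, pickup_datetime) and the helper od_demand are hypothetical, and the start and end points are assumed to be already encoded as discrete zone labels.

```python
import pandas as pd

# Toy trip records; column names are assumptions, not the dataset's real schema.
trips = pd.DataFrame({
    "pickup_zone":  [0, 0, 0, 1, 1, 0],
    "dropoff_zone": [1, 1, 2, 0, 0, 1],
    "pickup_datetime": pd.to_datetime([
        "2016-01-01 08:05", "2016-01-01 08:40", "2016-01-01 08:59",
        "2016-01-01 09:10", "2016-01-01 09:15", "2016-01-01 09:30",
    ]),
})

def od_demand(df):
    """Count trips per (start zone, end zone, date, hour) group."""
    with_keys = df.assign(
        demand_date=df["pickup_datetime"].dt.date,
        demand_hour=df["pickup_datetime"].dt.hour,
    )
    return (
        with_keys
        .groupby(["pickup_zone", "dropoff_zone", "demand_date", "demand_hour"])
        .size()
        .reset_index(name="demand")
    )

demand = od_demand(trips)
print(demand)
```

Each row of the result is one OD pair in one hourly slot, and the demand column is the number of orders in that slot.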
This study also applied a series of data
preprocessing steps to this dataset to ensure the
effectiveness of model training and the accuracy of
predictions. First, the data were normalized with
MinMaxScaler, which rescales each feature to a fixed
range suitable for model training, by default between
0 and 1. Next, the dataset was divided into a
training set, which accounts for approximately 65%
of the total data, and a test set, comprising the
remaining 35%. This division is crucial for evaluating
the model's generalization capabilities. Additionally,
this study carried out data cleaning by filtering out
outliers based on geographical location and logical
inconsistencies within the data. For instance, records
with negative values for fare amount or passenger
count were identified and excluded to ensure the
quality and consistency of the dataset. Finally, to
reduce data dimensionality and facilitate model
training, K-Means clustering was employed to group
the geographical coordinates into seven clusters,
streamlining the data structure and making the data
more efficient for the model to process. This enables
the model to predict Taxi OD demand more accurately
and provide robust data support for intelligent
transportation systems and urban planning.
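The preprocessing steps described above (cleaning, K-Means clustering into seven classes, MinMaxScaler normalization, and a 65%/35% split) can be sketched with Scikit-learn as follows. This is only an illustration on synthetic data: the column names, the bounding box used as the geographic filter, the order of the steps, and the random seeds are all assumptions rather than details taken from the study.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for trip records; column names are hypothetical.
raw = pd.DataFrame({
    "pickup_lat": rng.uniform(40.60, 40.90, n),
    "pickup_lon": rng.uniform(-74.05, -73.75, n),
    "fare_amount": rng.uniform(3.0, 60.0, n),
    "passenger_count": rng.integers(1, 6, n).astype(float),
})
raw.loc[0, "fare_amount"] = -5.0      # logically inconsistent record
raw.loc[1, "passenger_count"] = -1.0  # logically inconsistent record
raw.loc[2, "pickup_lat"] = 0.0        # geographic outlier

# Assumed bounding box loosely covering New York City (illustrative only).
LAT_RANGE, LON_RANGE = (40.4, 41.0), (-74.3, -73.6)

# Data cleaning: drop negative fares/passenger counts and out-of-area points.
clean = raw[
    (raw["fare_amount"] >= 0)
    & (raw["passenger_count"] >= 0)
    & raw["pickup_lat"].between(*LAT_RANGE)
    & raw["pickup_lon"].between(*LON_RANGE)
].copy()

# K-Means: replace raw coordinates with one of seven zone labels.
kmeans = KMeans(n_clusters=7, n_init=10, random_state=0)
clean["pickup_zone"] = kmeans.fit_predict(clean[["pickup_lat", "pickup_lon"]])

# Min-max normalization of the numeric features to [0, 1].
features = ["fare_amount", "passenger_count"]
scaler = MinMaxScaler()
clean[features] = scaler.fit_transform(clean[features])

# 65% training / 35% test split.
train, test = train_test_split(clean, test_size=0.35, random_state=42)
print(len(clean), len(train), len(test))
```

Here the scaler is fitted on the full cleaned data, following the order given in the text; fitting it on the training portion only is a common alternative that avoids leaking test-set value ranges into training.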
The preprocessing steps were implemented using
Python, with libraries such as Pandas for data
manipulation, Scikit-learn for scaling and splitting the
dataset, and Matplotlib for visualization. Figures
illustrating the clustering results before and after
applying K-Means provide a visual representation of
the geographic distribution of taxi pick-up and
drop-off points. Figure 1 and Figure 2 below show the
location visualization after K-Means clustering.
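A cluster visualization of this kind could be produced along the following lines. This is a hedged sketch on synthetic coordinates, not the code behind the paper's figures: the point cloud, colour map, and output file name (pickup_clusters.png) are illustrative assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic pick-up coordinates (longitude, latitude); not real trip data.
coords = np.column_stack([
    rng.uniform(-74.05, -73.75, 300),
    rng.uniform(40.60, 40.90, 300),
])

# Assign each point to one of seven clusters, as in the preprocessing step.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(coords)

# Scatter plot with one colour per cluster label.
fig, ax = plt.subplots(figsize=(5, 5))
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=8)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Pick-up points coloured by K-Means cluster")
fig.savefig("pickup_clusters.png", dpi=150)
```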