Survival Analysis of the Titanic Using Random Forests
Tianchong Tang
College of Information Science and Engineering, Ocean University of China, San Sha Street, Qing Dao, China
Keywords: Random Forest, Survival Analysis, Machine Learning.
Abstract: The Titanic disaster is one of the most widely studied maritime tragedies, and the analysis of passenger
survival has become a popular research topic. This study predicts the survival of Titanic passengers
using the random forest algorithm. The dataset used in this analysis includes fields
such as PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, and Embarked.
To improve prediction performance, the author used a random forest, which combines
multiple decision trees. After preprocessing, the dataset was randomly split into a training set (80%)
and a test set (20%), and this random split was repeated 500 times.
Averaged over these splits, the random forest model achieved an accuracy of 0.8013, demonstrating
its effectiveness in predicting the survival of Titanic passengers. This result underscores the
potential of the random forest algorithm for survival analysis.
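The evaluation protocol described in the abstract (repeated random 80/20 splits with the test accuracy averaged over all repetitions) can be sketched as follows. This is a minimal illustration and not the author's code: the function names, the seed, the toy data, and the placeholder model `fit_majority_by_sex` are all assumptions made here, and the placeholder is deliberately not a random forest.

```python
import random
from statistics import mean


def repeated_holdout_accuracy(rows, labels, fit, n_repeats=500,
                              test_fraction=0.2, seed=0):
    """Average test accuracy over repeated random train/test splits.

    `fit(train_rows, train_labels)` must return a function that maps
    one row to a predicted label.
    """
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    n_test = max(1, round(test_fraction * len(rows)))
    accuracies = []
    for _ in range(n_repeats):
        rng.shuffle(indices)  # new random split on every repetition
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predict = fit([rows[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(rows[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / n_test)
    return mean(accuracies)


def fit_majority_by_sex(train_rows, train_labels):
    """Hypothetical placeholder model: majority label per value of 'Sex'."""
    counts = {}
    for row, label in zip(train_rows, train_labels):
        counts.setdefault(row["Sex"], [0, 0])[label] += 1
    overall = [sum(c[0] for c in counts.values()),
               sum(c[1] for c in counts.values())]
    default = int(overall[1] > overall[0])
    table = {sex: int(c[1] > c[0]) for sex, c in counts.items()}
    return lambda row: table.get(row["Sex"], default)


# Made-up toy sample for illustration only (not the real Titanic data).
toy_rows = [{"Sex": "female"}] * 10 + [{"Sex": "male"}] * 10
toy_labels = [1] * 8 + [0] * 2 + [0] * 9 + [1] * 1

avg_acc = repeated_holdout_accuracy(toy_rows, toy_labels, fit_majority_by_sex)
print(f"average accuracy over 500 splits: {avg_acc:.4f}")
```

To reproduce the setting of the paper, the placeholder would be replaced by a function that trains a random forest on the preprocessed training rows and returns its prediction function; any random forest implementation that can be wrapped to match this `fit` interface would do.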
1 INTRODUCTION
The Titanic was not only an engineering marvel of its
time but also a focal point for numerous studies and
discussions. Its sinking killed about 1,500
people. With the aim of improving disaster response,
researchers have applied machine learning
techniques to predict passengers' survival
probabilities. Gleicher and Stevans (2004) employed
logistic regression models to analyze key factors
such as gender and age, and researchers gradually
recognized that social status might also influence
survival chances. For instance, a
passenger's cabin class and royal status could garner
support from fellow travelers in critical moments,
thereby improving survival odds. Furthermore,
generalized linear models and decision tree
algorithms (distinct from the algorithms discussed in
this paper) have been applied successfully
(Durmuş and Güneri, 2020). On the data processing
side, some studies have engineered new features
such as "child", "new_fare", and
"FamilyIdentity" (Datla, 2015), which differ
from the feature engineering in this paper.
Farag and Hassan (2018) reported that a naive Bayes
model reached an accuracy of 92.52%. Other studies
estimate the survival risk of Titanic passengers
with statistical scoring methods (Ligot, 2022), which
are not machine learning approaches and divide
passengers into different risk levels. Historically,
survival analysis has relied on statistical methods,
including the Cox proportional hazards model
(Cox, 1972) and the Kaplan-Meier estimator
(Kaplan and Meier, 1958). This paper instead
analyzes the survival of Titanic passengers using
the random forest method.
The paper first explains the basic principles
of the random forest model, then describes the
preprocessing applied to the original Titanic
dataset, and finally evaluates the accuracy of the
random forest model. This illustrates the potential
of machine learning for analyzing survival rates. The
discussion section first introduces decision trees,
which are the key component of a random forest, then
analyzes the optimal decision tree from the
perspective of information gain, and concludes
by combining decision trees into a unified random
forest framework.
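As a rough illustration of the information gain criterion mentioned above, the sketch below computes the gain of splitting on a categorical feature, assuming the usual Shannon-entropy definition (gain = entropy of the labels minus the weighted entropy of the groups produced by the split). The function names and the four-passenger toy sample are hypothetical and are not taken from the paper's dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Made-up toy sample: four passengers, 1 = survived, 0 = did not survive.
survived = [1, 1, 0, 0]
sex = ["female", "female", "male", "male"]   # separates the labels perfectly
embarked = ["S", "C", "S", "C"]              # carries no information here

gain_sex = information_gain(sex, survived)
gain_port = information_gain(embarked, survived)
print(f"gain(Sex) = {gain_sex:.3f}, gain(Embarked) = {gain_port:.3f}")
```

In this toy sample the first feature removes all uncertainty about the label (a gain of 1 bit) while the second removes none (a gain of 0), so a tree grown by information gain would split on the first feature.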