Kaggle Challenge - What is this challenge all about

The Challenge - The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

reference - Kaggle.com

Go to Kaggle Competition

Kaggle Titanic - Machine Learning from Disaster

Step 1

List all the files in the directory to verify that all three files (train.csv, test.csv and gender_submission.csv) are present.
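A minimal sketch of this check, using only the standard library. The helper name `missing_files` is hypothetical, and the demonstration runs against a throwaway directory rather than the real Kaggle input folder; the expected file names are the ones shipped with the competition.

```python
import os
import tempfile

# File names as distributed with the Kaggle Titanic competition.
EXPECTED_FILES = ["train.csv", "test.csv", "gender_submission.csv"]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]


# Demonstration on a temporary directory that holds only two of the three files.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["train.csv", "test.csv"]:
        open(os.path.join(demo_dir, name), "w").close()
    still_missing = missing_files(demo_dir)

print(still_missing)
```

An empty list from `missing_files` would mean that all three files are in place.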

Step 2

Load Training Data
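Loading is typically a single `pandas.read_csv` call. The sketch below reads a few made-up rows from an in-memory string so that it runs without the real file; the rows and the reduced column set are illustrative assumptions, not actual Titanic records. Loading the test data in Step 3 would follow the same pattern.

```python
import io

import pandas as pd

# Made-up rows using a subset of the train.csv columns.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Doe, Mr. John",male,22
2,1,1,"Roe, Mrs. Jane",female,38
3,1,3,"Poe, Miss. Ann",female,
"""

# With the real file this would be: pd.read_csv("train.csv")
train_data = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(train_data.shape)           # (rows, columns)
print(train_data.isnull().sum())  # missing values per column
```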

Step 3

Load Testing Data

Step 4

Let's analyse features.

We start with the gender-based survival ratio. Women clearly show a higher survival rate than men.
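One way to compute such a ratio is to group by `Sex` and average the 0/1 `Survived` column. This is a sketch on a small made-up table, not the notebook's original code, so the resulting numbers are not the real Titanic rates.

```python
import pandas as pd

# Illustrative data only: 3 women (2 survived) and 5 men (1 survived).
train_data = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the fraction of survivors in each group.
survival_by_sex = train_data.groupby("Sex")["Survived"].mean()
print(survival_by_sex)
```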

Step 5

Fill Null Values.

The Age feature has null values. We want to use this feature, so we replace all the missing values with the median age.
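A minimal sketch of median imputation with pandas, assuming the column is named `Age` as in the Kaggle files; the ages below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative ages with two missing entries.
train_data = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

# Series.median() skips NaN values by default.
median_age = train_data["Age"].median()

# Replace every missing age with that median.
train_data["Age"] = train_data["Age"].fillna(median_age)
print(train_data["Age"].tolist())
```

The same fill would need to be applied to the test data before prediction, since a model cannot be given missing ages there either.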

Step 6

Categorical Age

Divide the Age feature into 5 distinct categories and calculate the survival rate of each. The categorical data helps us understand which age groups have a higher chance of survival.
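A sketch of this binning step, assuming `pd.cut` with five equal-width bins. The `AgeBand` column name and the sample rows are illustrative assumptions, since the notebook's own code is not shown here.

```python
import pandas as pd

# Illustrative data only.
train_data = pd.DataFrame({
    "Age": [2, 10, 20, 25, 30, 40, 45, 50, 60, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Split the observed age range into 5 equal-width intervals.
train_data["AgeBand"] = pd.cut(train_data["Age"], bins=5)

# Survival rate per interval; observed=False keeps empty intervals in the result.
survival_by_band = train_data.groupby("AgeBand", observed=False)["Survived"].mean()
print(survival_by_band)
```

On the real data, whose ages span roughly 0 to 80, equal-width bins should land close to the 16/32/48/64 boundaries listed under Step 13.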

Step 7

Assign a numeric value to people belonging to different age groups.
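The numeric codes can be produced in one step by passing explicit boundaries and labels to `pd.cut`. The sketch below uses the boundaries listed under Step 13 (16, 32, 48, 64); the variable names and sample ages are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative ages, including values that sit exactly on a boundary.
ages = pd.Series([5, 16, 17, 32, 40, 48, 60, 64, 70], name="Age")

# Right-closed intervals: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
edges = [-np.inf, 16, 32, 48, 64, np.inf]

# Label each interval with its group number, then convert to plain integers.
age_group = pd.cut(ages, bins=edges, labels=[0, 1, 2, 3, 4]).astype(int)
print(age_group.tolist())
```

Because `pd.cut` closes intervals on the right by default, an age of exactly 16 falls into group 0, matching the rule `Age <= 16`.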

Step 8

Gender

Convert the text values in the 'Sex' column into numeric values in a new 'Gender_Numeric' column. This helps with plotting the histogram and is cleaner to use for further processing of the data.
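A minimal sketch of the conversion using `Series.map`, with the encoding described later in this write-up (Male=1, Female=0). The sample rows are made up, and the lowercase `male`/`female` values are assumed to match the Kaggle files.

```python
import pandas as pd

# Illustrative rows only.
train_data = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Map each text label to its numeric code; unmapped values would become NaN.
train_data["Gender_Numeric"] = train_data["Sex"].map({"female": 0, "male": 1})
print(train_data)
```

The identical mapping applies to the test data in Step 9, so that both datasets use the same codes.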

Step 9

Gender

Repeat for Test Data.

Step 10

Class

First Class passengers have a higher chance of survival than Second Class passengers, and Second Class passengers a higher chance than Third Class passengers.

Step 11

Class-Histogram

Plot a histogram that displays the total number of surviving passengers in each class.
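A sketch of such a plot, assuming matplotlib and drawing one bar per class. The data is made up, and the `Agg` backend is selected only so the example runs without a display; in a notebook that line can be dropped and `plt.show()` called instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative data only.
train_data = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Keep only survivors, then count them per class.
survivors_per_class = train_data[train_data["Survived"] == 1].groupby("Pclass").size()

fig, ax = plt.subplots()
ax.bar(survivors_per_class.index.astype(str).tolist(), survivors_per_class.values)
ax.set_xlabel("Pclass")
ax.set_ylabel("Number of survivors")
ax.set_title("Survivors per passenger class")
```

Grouping by `Sex` or by an age-group column instead of `Pclass` would give the corresponding plots for the other features.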

Step 12

Female to Male Survival-Histogram

Plot a histogram that displays the total number of surviving passengers by gender.

Step 13

Age-Histogram

We now know that passengers fall into 5 age groups, and we have assigned a unique numeric value to each group:

Age <= 16 = 0

Age > 16 & Age <= 32 = 1

Age > 32 & Age <= 48 = 2

Age > 48 & Age <= 64 = 3

Age > 64 = 4

This histogram shows that younger people, those in the 16-32 age group or below, have a higher chance of survival than the others.

Step 14

Using Random Forest Classifier

We use a Random Forest Classifier: the model is trained on train.csv and then predicts the survivors in test.csv.
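A minimal sketch of the train-then-predict flow with scikit-learn's `RandomForestClassifier`. The tiny tables, the `Age_Group` column name, and the hyperparameters (`n_estimators=100`, `random_state=0`) are illustrative assumptions and not necessarily what the notebook used.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Feature columns assumed to exist after the preprocessing steps above.
features = ["Pclass", "Gender_Numeric", "Age_Group"]

# Illustrative stand-ins for the preprocessed train.csv and test.csv.
train_data = pd.DataFrame({
    "Pclass":         [1, 3, 2, 3, 1, 3, 2, 3],
    "Gender_Numeric": [0, 1, 0, 1, 1, 0, 1, 1],
    "Age_Group":      [1, 1, 2, 0, 3, 1, 2, 1],
    "Survived":       [1, 0, 1, 0, 0, 1, 0, 0],
})
test_data = pd.DataFrame({
    "PassengerId":    [892, 893, 894],
    "Pclass":         [3, 1, 2],
    "Gender_Numeric": [1, 0, 1],
    "Age_Group":      [1, 2, 3],
})

# Fit on the labelled training rows only.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_data[features], train_data["Survived"])

# Predict survival for the unlabelled test rows.
predictions = model.predict(test_data[features])

# Two-column table in the layout Kaggle expects for a submission.
submission = pd.DataFrame({
    "PassengerId": test_data["PassengerId"],
    "Survived": predictions,
})
print(submission)
```

Calling `submission.to_csv("submission.csv", index=False)` would write the file to upload; the output file name is an arbitrary choice.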

Initial Score

This score was obtained before replacing the null values, categorising Age, or converting text features to numerical features.

Improved Score

This score was obtained after replacing the null values, categorising Age, and converting text features to numerical features.

Contributions

Data Visualisation

Data visualisation helps in understanding the dataset a little better by putting the data into picture form. We all want data that is easily readable, and data visualisation helps with that. Using histograms to visualise data that was previously in tabular form helped me identify which features played an important part in predicting survivors. Based on the histograms above, I decided to keep using Sex, Age and Class as key features.

Missing Values

Handling missing values in a large dataset is extremely important, as not doing so may lead to inaccurate inferences from the data. In this case, both the training and testing datasets had a large number of missing values for the "Age" feature. Dropping rows with null values did not improve my score but rather decreased it. So I decided to fill these missing values. I used the mean age value to fill the missing places. This helped me increase the score a bit.

Categorization

I divided the Age feature into 5 categories, or ranges. This helped in grouping the data and made the feature easy to visualise. After categorising all passengers into age groups, I assigned a numerical value to the passengers in each group. By doing so, I was able to bring all the desired features onto the same scale or range.

Text to Numerical

Similar to Age, I also converted the text data of the Sex feature to numeric values by assigning Male=1 and Female=0. Now all the features are on the same scale and level. Applying these data-preprocessing techniques helped increase the model's accuracy a bit. It also taught me which choices to avoid because they decrease the accuracy.

References

https://www.kaggle.com/alexisbcook/titanic-tutorial

https://desktop.arcgis.com/en/arcmap/10.3/manage-data/feature-datasets/an-overview-of-working-with-feature-datasets.htm

https://www.kaggle.com/dansbecker/handling-missing-values

https://www.kaggle.com/brendan45774/titanic-top-solution