Kaggle Challenge - What is this challenge all about

The Challenge - The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).

reference - Kaggle.com

Go to Kaggle Competition
Kaggle Titanic - Machine Learning from Disaster

Step 1

List all the files in the directory to verify that all three files (train.csv, test.csv and gender_submission.csv) are present.

Image1
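The original code for this step is only available as a screenshot, so the following is a minimal sketch of one way to run the check. The helper name `missing_files` and the temporary demo directory are assumptions made for illustration; on Kaggle you would pass the competition's input directory instead. Kaggle ships the sample submission as gender_submission.csv.

```python
import os
import tempfile

# File names expected for the Titanic competition.
EXPECTED_FILES = ["train.csv", "test.csv", "gender_submission.csv"]


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    present = set(os.listdir(directory))
    return [name for name in expected if name not in present]


# Demo on a throwaway directory so the sketch runs anywhere.
# Only two of the three files are created, so one is reported as missing.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["train.csv", "test.csv"]:
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_files(demo_dir)
    print("present:", sorted(os.listdir(demo_dir)))
    print("missing:", demo_missing)
```

An empty list from `missing_files` means all three files are in place.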

Step 2

Load Training Data

Image2
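Loading the data is typically a single `pandas.read_csv` call. The sketch below reads a tiny made-up sample from memory so that it runs without the Kaggle files. The rows are invented and contain only a subset of the real columns. The helper name `load_data` is hypothetical.

```python
import io

import pandas as pd

# Tiny invented sample using a subset of the column names in train.csv.
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age
1,0,3,"Doe, Mr. John",male,22
2,1,1,"Roe, Mrs. Jane",female,38
3,1,3,"Poe, Miss. Ann",female,
"""


def load_data(source):
    """Read a CSV file (path or file-like object) into a DataFrame."""
    return pd.read_csv(source)


# With the real files this would be load_data("train.csv") or load_data("test.csv").
train_data = load_data(io.StringIO(SAMPLE_CSV))
print(train_data.head())
print(train_data.shape)
```

The same helper works for the test data, which has the same columns except Survived.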

Step 3

Load Testing Data

Image3

Step 4

Let's analyse features.

We start with the gender-based survival ratio. Women clearly show a higher survival rate than men.

Image4
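One way to compute this ratio is to group by Sex and take the mean of the 0/1 Survived column. This is a sketch on invented toy rows, not the author's exact code; the helper name `survival_rate_by` is an assumption.

```python
import pandas as pd

# Invented toy rows; the real analysis would use the training data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})


def survival_rate_by(df, column):
    """Mean of the 0/1 Survived column per group, i.e. the survival rate."""
    return df.groupby(column)["Survived"].mean()


rates = survival_rate_by(toy, "Sex")
print(rates)
```

Because Survived is coded as 0 or 1, its mean within a group is the fraction of that group that survived.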

Step 5

Fill Null Values.

The Age feature has null values. We want to use this feature, so we replace all the missing values with the median age.

Image5
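A minimal sketch of median imputation with pandas follows. Reusing the training median for the test data is a design choice assumed here, not something the original states. The helper name `fill_age_with_median` is hypothetical and the ages are invented.

```python
import pandas as pd


def fill_age_with_median(df, median=None):
    """Return a copy with missing Age values replaced by `median`.

    If no median is given, it is computed from the non-missing ages in `df`.
    """
    out = df.copy()
    if median is None:
        median = out["Age"].median()
    out["Age"] = out["Age"].fillna(median)
    return out, median


# Invented toy ages with gaps.
toy_train = pd.DataFrame({"Age": [22.0, None, 38.0, 4.0, None]})
toy_test = pd.DataFrame({"Age": [None, 60.0]})

filled_train, train_median = fill_age_with_median(toy_train)
# Assumption: the test data is filled with the median learned from training.
filled_test, _ = fill_age_with_median(toy_test, median=train_median)

print(train_median)
print(filled_train["Age"].tolist())
print(filled_test["Age"].tolist())
```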

Step 6

Categorical Age

Divide the Age feature into 5 distinct categories and calculate the survival rate of each. The categorical data helps us understand which age groups have a higher chance of survival.

Image6
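The binning can be sketched with `pandas.cut`. The edges 16, 32, 48 and 64 match the age groups used later in this walkthrough. The column name `AgeBand`, the text labels and the toy rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Bin edges: (-inf, 16], (16, 32], (32, 48], (48, 64], (64, inf)
AGE_EDGES = [-np.inf, 16, 32, 48, 64, np.inf]
AGE_LABELS = ["<=16", "16-32", "32-48", "48-64", ">64"]


def add_age_band(df):
    """Return a copy with a categorical AgeBand column (right edge inclusive)."""
    out = df.copy()
    out["AgeBand"] = pd.cut(out["Age"], bins=AGE_EDGES, labels=AGE_LABELS)
    return out


# Invented toy rows.
toy = pd.DataFrame({
    "Age": [4, 16, 25, 30, 40, 55, 70],
    "Survived": [1, 1, 0, 1, 0, 0, 0],
})

banded = add_age_band(toy)
band_rates = banded.groupby("AgeBand", observed=False)["Survived"].mean()
print(band_rates)
```

Missing ages stay unbanded with this approach, so the null values should be filled first.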

Step 7

Assign a numeric value to people belonging to different age groups.

Image7
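The numeric encoding can be sketched with boolean masks and `.loc`, using the group boundaries listed later in this walkthrough (16, 32, 48, 64). The column name `Age_Group` is an assumption, and the sketch expects missing ages to have been filled already.

```python
import pandas as pd


def encode_age_group(df):
    """Return a copy with an integer Age_Group column in the range 0..4."""
    out = df.copy()
    age = out["Age"]
    out["Age_Group"] = 0  # Age <= 16 (rows with a missing Age would also land here)
    out.loc[(age > 16) & (age <= 32), "Age_Group"] = 1
    out.loc[(age > 32) & (age <= 48), "Age_Group"] = 2
    out.loc[(age > 48) & (age <= 64), "Age_Group"] = 3
    out.loc[age > 64, "Age_Group"] = 4
    return out


# Invented ages chosen to sit on and around each boundary.
toy = pd.DataFrame({"Age": [4, 16, 17, 32, 33, 48, 64, 65]})
encoded = encode_age_group(toy)
print(encoded)
```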

Step 8

Gender

Convert the text values in the 'Sex' column into numeric values in a new 'Gender_Numeric' column. This helps with plotting the histogram and is cleaner to use for further processing of the data.

Image8
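A short sketch of this conversion using `Series.map`, with the Male=1 and Female=0 coding described in the Contributions section. The lowercase keys assume the spelling used in the Kaggle files, and the helper name `add_gender_numeric` is hypothetical.

```python
import pandas as pd

# Coding taken from the write-up: Male=1, Female=0.
SEX_TO_NUMERIC = {"male": 1, "female": 0}


def add_gender_numeric(df):
    """Return a copy with a Gender_Numeric column derived from Sex."""
    out = df.copy()
    out["Gender_Numeric"] = out["Sex"].map(SEX_TO_NUMERIC)
    return out


# Invented toy rows; the same helper is applied to training and test data.
toy_train = add_gender_numeric(pd.DataFrame({"Sex": ["male", "female", "female"]}))
toy_test = add_gender_numeric(pd.DataFrame({"Sex": ["female", "male"]}))
print(toy_train)
print(toy_test)
```

Any value not present in the mapping becomes NaN, which makes unexpected spellings easy to spot.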

Step 9

Gender

Repeat for Test Data.

Image9

Step 10

Class

Passengers in First Class have a higher chance of survival than those in Second Class, who in turn have a higher chance than those in Third Class.

Image10
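As a sketch, the per-class comparison can be computed with a single grouped aggregation that returns the survivor count, passenger count and survival rate for each class. The toy rows are invented and the helper name `class_summary` is an assumption.

```python
import pandas as pd

# Invented toy rows; Pclass is 1, 2 or 3 as in the Kaggle data.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})


def class_summary(df):
    """Survivor count, passenger count and survival rate per passenger class."""
    return df.groupby("Pclass")["Survived"].agg(
        survivors="sum", passengers="count", rate="mean"
    )


summary = class_summary(toy)
print(summary)
```

The `survivors` column holds the totals that a per-class bar chart of survivors would display.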

Step 11

Class-Histogram

Plot a histogram that displays the total number of passengers who survived in each class.

Image11

Step 12

Female to Male Survival-Histogram

Plot a histogram that displays the total number of passengers who survived, split by gender.

Image12

Step 13

Age-Histogram

Each passenger now belongs to one of 5 age groups, and each group has been assigned a unique numeric value:

  • Age <= 16 = 0
  • Age > 16 & Age <= 32 = 1
  • Age > 32 & Age <= 48 = 2
  • Age > 48 & Age <= 64 = 3
  • Age > 64 = 4

This histogram shows that people in the 16-32 age group, and younger people generally, have a higher chance of survival than others.

    Image13

    Step 14

    Using Random Forest Classifier

    We use a Random Forest Classifier: the model is trained on train.csv and then used to predict the survivors in test.csv.

    Image14
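The training step can be sketched with scikit-learn's `RandomForestClassifier`. The hyperparameters (`n_estimators=100`, `max_depth=5`, `random_state=1`), the feature column names `Gender_Numeric` and `Age_Group`, and the toy rows are all assumptions for illustration. The original settings are only visible in the screenshot.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Assumed feature columns: class, numeric gender, numeric age group.
FEATURES = ["Pclass", "Gender_Numeric", "Age_Group"]

# Invented toy rows standing in for the preprocessed train.csv and test.csv.
toy_train = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 3, 2, 1],
    "Gender_Numeric": [0, 1, 0, 1, 1, 1, 0, 0],
    "Age_Group": [1, 1, 2, 0, 3, 1, 1, 2],
    "Survived": [1, 0, 1, 0, 0, 0, 1, 1],
})
toy_test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 1, 2],
    "Gender_Numeric": [1, 0, 1],
    "Age_Group": [1, 2, 3],
})


def train_and_predict(train_df, test_df, features=FEATURES):
    """Fit a random forest on the training rows and predict the test rows."""
    model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
    model.fit(train_df[features], train_df["Survived"])
    predictions = model.predict(test_df[features])
    return pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": predictions,
    })


submission = train_and_predict(toy_train, toy_test)
print(submission)
# To create a file for upload: submission.to_csv("submission.csv", index=False)
```

The model only ever sees Survived labels from the training rows. The test rows have no labels, and Kaggle scores the submitted predictions.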

    Initial Score

    This score was obtained before replacing the null values, categorising Age, or converting text features to numerical features.

    Image_score1

    Improved Score

    This score was obtained after replacing the null values, categorising Age, and converting text features to numerical features.

    Improved_score

    Contributions

    Data Visualisation

    Data visualisation helps in understanding the dataset a little better by putting it into picture form. We all want data that is easy to read, and visualisation provides that. Using histograms to visualise data that was previously in tabular form helped me narrow down which features played an important part in predicting survivors. Based on the histograms above, I decided to keep using Sex, Age and Class as key features.

    Missing Values

    Handling missing values in a large dataset is extremely important, because ignoring them can lead to inaccurate inferences from the data. In this case, both the training and testing datasets had a large number of missing values for the "Age" feature. Dropping rows with null values did not improve my score but rather decreased it. So, I decided to fill these missing values. I used the mean age value to fill in the missing places. This helped me increase the score a bit.

    Categorization

    I divided the Age feature into 5 categories, or ranges. This helped me group the data and made this feature easier to visualise. After categorising all passengers into age groups, I assigned a numerical value to the passengers belonging to each group. By doing so, I was able to bring all the desired features into the same scale or range.

    Text to Numerical

    Similar to Age, I also converted the text data of the Sex feature to numeric by assigning Male=1 and Female=0. Now all the features are on the same scale and level. Applying these data-preprocessing techniques helped increase the model's accuracy a bit. It also taught me which steps to avoid because they decrease the accuracy.

    References

  • https://www.kaggle.com/alexisbcook/titanic-tutorial
  • https://desktop.arcgis.com/en/arcmap/10.3/manage-data/feature-datasets/an-overview-of-working-with-feature-datasets.htm
  • https://www.kaggle.com/dansbecker/handling-missing-values
  • https://www.kaggle.com/brendan45774/titanic-top-solution

    ©Prajakta Waikar. All rights reserved.