Mastering Machine Learning: Training Models with the Titanic Dataset

Have you ever wondered how machines learn to make predictions? In this post, we’ll dive into the fascinating world of machine learning model training using the famous Titanic dataset. We’ll explore everything from data preparation to model evaluation, so buckle up and get ready for an exciting journey!

Getting Started: Data Cleaning and Preprocessing

Before we jump into training our model, we need to ensure our data is squeaky clean. Data cleaning and preprocessing form the foundation of any successful machine learning project. Think of it as tidying up your room before starting a big project – it sets the stage for success!

To begin, we’ll use Python with pandas and the Scikit-learn library to prepare our dataset. Here’s a quick snippet to get you started:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the Titanic dataset
titanic_df = pd.read_csv('titanic.csv')

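# Handle missing values
# Assign the result back: calling fillna(inplace=True) on a selected column
# triggers warnings in recent pandas and may not update the DataFrame at all
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(titanic_df['Embarked'].mode()[0])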

# Convert categorical variables to numeric
titanic_df = pd.get_dummies(titanic_df, columns=['Sex', 'Embarked'])

# Scale numerical features
scaler = StandardScaler()
titanic_df[['Age', 'Fare']] = scaler.fit_transform(titanic_df[['Age', 'Fare']])

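This snippet fills in missing values, one-hot encodes the categorical columns, and standardizes the numeric features. One caveat: the mean used for imputation and the scaler’s statistics are computed on the full dataset, so a little information from the rows we later hold out for testing leaks into training. For a quick tutorial that’s acceptable; for a stricter evaluation, split first and fit these steps on the training data only.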
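To see what these three steps actually do, here is a minimal sketch on a hypothetical four-passenger sample. The values are made up for illustration (they are not real Titanic rows), and `toy_df` is just an illustrative name.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical four-passenger sample (made-up values, not real Titanic rows)
toy_df = pd.DataFrame({
    "Age": [22.0, None, 38.0, 30.0],
    "Fare": [7.25, 71.28, 8.05, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Fill the missing age with the column mean and the missing port with the mode
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].mean())
toy_df["Embarked"] = toy_df["Embarked"].fillna(toy_df["Embarked"].mode()[0])

# One-hot encode the categorical columns
toy_df = pd.get_dummies(toy_df, columns=["Sex", "Embarked"])

# Standardize the numeric columns to zero mean and unit variance
toy_df[["Age", "Fare"]] = StandardScaler().fit_transform(toy_df[["Age", "Fare"]])

print(toy_df.columns.tolist())
# ['Age', 'Fare', 'Sex_female', 'Sex_male', 'Embarked_C', 'Embarked_S']
print(toy_df.isna().sum().sum())  # 0
```

Each category becomes its own indicator column, and after scaling both `Age` and `Fare` are centered on zero, so neither feature dominates simply because of its units.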

Splitting the Data: The Train-Test Split

Now that our data is clean and preprocessed, it’s time to split it into training and testing sets. This crucial step allows us to train our model on one portion of the data and test its performance on another, unseen portion.

Let’s use Scikit-learn’s train_test_split function to accomplish this:

from sklearn.model_selection import train_test_split

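# Separate features and target variable
# Keep only numeric/boolean columns: any remaining text columns
# cannot be passed to the model as-is and would raise an error in fit()
X = titanic_df.drop('Survived', axis=1).select_dtypes(include=['number', 'bool'])
y = titanic_df['Survived']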

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

By splitting our data, we’re essentially creating a mock exam for our model. It learns from the training set and then proves its skills on the testing set.
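A small sketch can make the split concrete. The ten samples below are made up, and the `X_demo`/`y_demo` names are only illustrative; the call itself uses the same `test_size` and `random_state` arguments as above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with one feature each, plus alternating 0/1 labels
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.array([0, 1] * 5)

# Same arguments as in the post: 20% held out, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (8, 1) (2, 1)

# Re-running with the same random_state reproduces the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))  # True

# No sample appears in both sets
print(set(X_tr.ravel()) & set(X_te.ravel()))  # set()
```

Fixing `random_state` is what makes the results repeatable from run to run; without it, each run would shuffle the rows differently and the reported scores would move around.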

Training the Model: Introducing Logistic Regression

With our data prepared and split, it’s time to train our model. For this example, we’ll use Logistic Regression, a popular algorithm for binary classification problems like predicting survival on the Titanic.

Here’s how we can train our Logistic Regression model:

from sklearn.linear_model import LogisticRegression

# Initialize and train the model
# max_iter is raised from the default of 100 so the solver has room to converge
logreg = LogisticRegression(max_iter=1000, random_state=42)
logreg.fit(X_train, y_train)

print("Model training complete!")

This process is similar to teaching a student. We provide our model (the student) with examples (the training data) and let it learn the patterns that lead to survival or non-survival.
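What does the model actually learn? Roughly, one weight per feature plus an intercept; a prediction is the sigmoid of their weighted sum. The sketch below checks this on made-up one-feature data (the `X_toy`/`y_toy` values are hypothetical, not taken from the Titanic dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy data: one feature, label is 1 for the larger values
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression(random_state=42)
model.fit(X_toy, y_toy)

# Linear score for each sample: w * x + b
scores = X_toy @ model.coef_.T + model.intercept_

# The sigmoid squashes each score into a probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores.ravel()))

# scikit-learn's own probability for class 1 (second column)
sklearn_proba = model.predict_proba(X_toy)[:, 1]

print(np.allclose(manual_proba, sklearn_proba))  # True
print(model.predict(X_toy))
```

On the Titanic data the same idea applies with more features: a positive weight pushes the predicted survival probability up, a negative one pushes it down.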

Evaluating the Model: How Well Did We Do?

After training our model, we need to assess its performance. We’ll use several evaluation techniques to get a comprehensive view of our model’s capabilities.

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Make predictions on the test set
y_pred = logreg.predict(X_test)

# Print evaluation metrics
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print(f"\nAccuracy Score: {accuracy_score(y_test, y_pred):.2f}")

These metrics provide valuable insights into our model’s performance. The classification report shows precision, recall, and F1-score for each class. The confusion matrix counts correct and incorrect predictions for each class (rows are the true labels, columns are the predicted ones), and the accuracy score is the fraction of test-set passengers the model classified correctly.
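To make these numbers less abstract, here is a small worked example that rebuilds accuracy, precision, and recall from the four cells of a confusion matrix. The labels are made up for ten imaginary passengers; `y_true` and `y_hat` are illustrative names, not output from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Made-up labels for ten passengers (1 = survived, 0 = did not survive)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()  # tn=5, fp=1, fn=1, tp=3
print(cm)

# Recompute the headline metrics from the four cells
accuracy = (tp + tn) / (tp + tn + fp + fn)  # 8 / 10
precision = tp / (tp + fp)                  # 3 / 4
recall = tp / (tp + fn)                     # 3 / 4
print(accuracy, precision, recall)  # 0.8 0.75 0.75

# The library functions agree with the hand calculation
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat), recall_score(y_true, y_hat))
```

Precision answers "of those predicted to survive, how many did?", while recall answers "of those who survived, how many did the model find?". Accuracy alone can hide a model that does well on one class and poorly on the other.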

Conclusion: What Have We Learned?

In this post, we’ve embarked on an exciting journey through the machine learning model training process. We’ve cleaned and preprocessed our data, split it into training and testing sets, trained a Logistic Regression model, and evaluated its performance.

Remember, this is just the beginning! There are many more algorithms to explore and techniques to master. Keep practicing, and soon you’ll be building sophisticated machine learning models with ease.

Want to learn more about machine learning? Check out this comprehensive guide to machine learning algorithms for a deeper dive into various techniques and their applications.

Happy modeling!
