Mastering Data Cleaning with Python: A Deep Dive into FIFA 21 Dataset

Data cleaning is an essential step in the data analysis process, often consuming the bulk of the time spent on data projects. Today, we will explore comprehensive techniques for cleaning a dataset using the Python programming language, specifically focusing on the FIFA 21 players dataset. Our goals include removing unnecessary columns, handling missing values, and converting financial data for better usability.

Pandas is an essential tool in the world of data analysis and engineering, renowned for its powerful data manipulation capabilities. As an open-source library, Pandas offers robust, flexible data structures like DataFrames and Series, making it ideal for handling and analyzing structured data. Whether you’re dealing with CSV files, Excel spreadsheets, SQL databases, or simply large datasets in Python, Pandas can streamline the process significantly.

The library simplifies data handling, allowing for the management of diverse data sets in various formats with ease. Its intuitive interface enables users to perform complex data transformations, clean data efficiently, and pivot tables for advanced data analysis—all with minimal code. Built on NumPy, Pandas not only ensures efficient operations on large datasets but also integrates seamlessly with other libraries in the scientific Python ecosystem, such as Matplotlib for data visualization and Scikit-learn for machine learning, enhancing its utility and versatility in data science projects.

If you want to dive deeper into Pandas and learn how to leverage it in data engineering and analysis, check out the detailed guide on Ruangguru’s AI Bootcamp. This resource gives you comprehensive insights and practical examples that help you kickstart your data engineering journey with Pandas.

Importing Necessary Packages and Reading Data

We start by importing the Python packages used in our data manipulation tasks:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The dataset is loaded from a remote URL into a pandas DataFrame, which allows us to manipulate the data efficiently:

url = 'https://storage.googleapis.com/rg-ai-bootcamp/assignment-1/fifa21_raw_data.csv'
fifa_df = pd.read_csv(url, low_memory=False)

Task 1: Removing Unnecessary Columns

In our first task, we address the ‘Unnamed: 0’ column in the dataset, which is often a residue of data import and does not hold any meaningful information. Removing this column simplifies the dataset:

fifa_df = fifa_df.drop(columns=['Unnamed: 0'])
print(fifa_df.head(1))
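
To see where a column like this usually comes from, here is a minimal sketch on a made-up two-row sample (the names and values are illustrative assumptions, not rows from the FIFA file). It saves a DataFrame with its index, reads it back, and then drops the leftover column:

```python
import io
import pandas as pd

# Hypothetical two-row sample; not rows from the FIFA 21 file.
sample = pd.DataFrame({"Name": ["Player A", "Player B"], "Age": [30, 25]})

# to_csv() writes the index by default, as a first column with an empty header.
csv_text = sample.to_csv()

# Reading that text back yields the leftover 'Unnamed: 0' column.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip.columns.tolist())

# errors="ignore" keeps the drop safe to re-run once the column is gone.
cleaned = round_trip.drop(columns=["Unnamed: 0"], errors="ignore")
print(cleaned.columns.tolist())
```

Passing errors="ignore" is a design choice for notebooks, where a cell may run more than once. Without it, a second run raises a KeyError.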

Task 2: Eliminating Newline Characters

Next, we remove newline characters, which can disrupt parsing and make otherwise identical values compare as different. With regex=True, the replacement matches \n anywhere inside a string value, across all text columns:

fifa_df = fifa_df.replace('\n', '', regex=True)
print(fifa_df['Club'].head())
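
One way to confirm the replacement worked is to count affected cells before and after. The sketch below uses a small invented sample (the club names are placeholders, and the amount of newline padding is an assumption) rather than the real file:

```python
import pandas as pd

# Hypothetical sample with newline padding in a text column.
sample = pd.DataFrame({
    "Club": ["\n\n\n\nFC Example", "\n\n\n\nExample United"],
    "Age": [30, 25],
})

# Count cells in 'Club' that contain a newline before cleaning.
before = int(sample["Club"].str.contains("\n", regex=False).sum())

# Same call as in the post: remove every newline inside string values.
cleaned = sample.replace("\n", "", regex=True)

# Count again after cleaning; numeric columns are left untouched.
after = int(cleaned["Club"].str.contains("\n", regex=False).sum())
print(before, after)
print(cleaned["Club"].tolist())
```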

Task 3: Cleaning Up Star Characters

The rating columns ‘W/F’, ‘SM’, and ‘IR’ carry a star (★) symbol alongside each number. Removing the symbol leaves only the digits, although the values remain strings until they are converted to a numeric type:

fifa_df = fifa_df.replace('★', '', regex=True)
print(fifa_df[['W/F', 'SM', 'IR']].head())
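
Stripping the symbol alone does not change the column type. As a hedged follow-up, the sketch below converts the rating columns to integers with pd.to_numeric. The sample values and their exact spacing are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical ratings; the spacing around the star is an assumption.
sample = pd.DataFrame({
    "W/F": ["4 ★", "3 ★"],
    "SM": ["4★", "2★"],
    "IR": ["5 ★", "1 ★"],
})
rating_cols = ["W/F", "SM", "IR"]

# Remove the star symbol from every string value, as in the post.
cleaned = sample.replace("★", "", regex=True)

# Trim leftover whitespace and convert each rating column to numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
for col in rating_cols:
    cleaned[col] = pd.to_numeric(cleaned[col].str.strip(), errors="coerce")

print(cleaned.dtypes)
```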

Task 4: Handling Missing Values

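Missing data can lead to inaccurate analyses. In the FIFA dataset, we fill missing ‘Loan Date End’ values with ‘Not on Loan’ and missing ‘Hits’ values with ‘Unknown’, so that every row carries an explicit value. Note that filling ‘Hits’ with a text label keeps that column as text rather than numeric: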

fifa_df['Loan Date End'] = fifa_df['Loan Date End'].fillna('Not on Loan')
fifa_df['Hits'] = fifa_df['Hits'].fillna('Unknown')
print(fifa_df[['Loan Date End', 'Hits']].tail())
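
It can help to count the gaps before filling them and to verify that none remain afterwards. The following sketch does this on a three-row invented sample. The dates and hit counts are placeholders, not values from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in both columns.
sample = pd.DataFrame({
    "Loan Date End": ["Jun 30, 2021", np.nan, np.nan],
    "Hits": ["372", np.nan, "1.6K"],
})

# Per-column count of missing cells before filling.
missing_before = sample.isna().sum()

# Same fills as in the post.
sample["Loan Date End"] = sample["Loan Date End"].fillna("Not on Loan")
sample["Hits"] = sample["Hits"].fillna("Unknown")

# Per-column count after filling; both should now be zero.
missing_after = sample.isna().sum()
print(missing_before)
print(missing_after)
```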

Task 5: Converting Financial Data

Financial data often comes with symbols and suffixes that need to be converted for analyses. We clean and convert the ‘Value’, ‘Wage’, and ‘Release Clause’ columns by removing symbols and converting suffixes to actual values:

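def clean_financial_data(value):
    # Strip the currency symbol, then scale by the 'K' or 'M' suffix.
    value = value.replace('€', '')
    if 'K' in value:
        # round() avoids truncating results such as 1099.9999999 down to 1099.
        return round(float(value.replace('K', '')) * 1000)
    elif 'M' in value:
        return round(float(value.replace('M', '')) * 1000000)
    # No suffix: the amount is already in euros.
    return round(float(value))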

fifa_df['Value'] = fifa_df['Value'].apply(clean_financial_data)
fifa_df['Wage'] = fifa_df['Wage'].apply(clean_financial_data)
fifa_df['Release Clause'] = fifa_df['Release Clause'].apply(clean_financial_data)
print(fifa_df[['Name', 'Value', 'Wage', 'Release Clause']].head())
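
The function above expects every cell to be a string. If a column might contain missing entries, a more defensive variant could look like the sketch below. The helper name parse_euro_amount and the sample amounts are hypothetical, introduced only for illustration:

```python
import pandas as pd

def parse_euro_amount(value):
    """Convert text such as '€67.5M' or '€560K' to a whole number of euros."""
    if pd.isna(value):
        return pd.NA  # keep missing entries missing instead of raising
    text = str(value).replace("€", "").strip()
    multiplier = 1
    if text.endswith("M"):
        multiplier, text = 1_000_000, text[:-1]
    elif text.endswith("K"):
        multiplier, text = 1_000, text[:-1]
    # round() rather than int() guards against floating-point results
    # that land just below the intended whole number.
    return round(float(text) * multiplier)

# Hypothetical amounts, including one missing entry.
amounts = ["€67.5M", "€560K", "€0", None]

# The nullable Int64 dtype stores integers alongside missing values.
converted = pd.Series([parse_euro_amount(a) for a in amounts], dtype="Int64")
print(converted)
```

Using the nullable Int64 dtype is one option among several. It keeps the amounts as integers, whereas a plain float column would represent the gap as NaN.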

Conclusion

Data cleaning, though time-consuming, is crucial for making datasets useful and analysis-ready. By using Python and its powerful libraries, we can automate and simplify many of the tedious tasks involved in data cleaning. The cleaned FIFA 21 dataset now serves as a robust foundation for any further analysis or machine learning models.

For more insights and updates on data manipulation techniques, stay tuned to our blog. If you’re looking to enhance your data cleaning skills further, consider exploring more detailed Python packages and their functionalities.

Dataset link: https://www.kaggle.com/datasets/yagunnersya/fifa-21-messy-raw-dataset-for-cleaning-exploring
