Data cleaning is an essential step in the data analysis process, and it often consumes the bulk of the time spent on a data project. In this post, we walk through cleaning the FIFA 21 players dataset with Python and Pandas. Our goals are to remove an unnecessary column, strip stray newline and star characters, handle missing values, and convert financial figures into plain numbers.
Pandas is an essential tool in the world of data analysis and engineering, renowned for its powerful data manipulation capabilities. As an open-source library, Pandas offers robust, flexible data structures like DataFrames and Series, making it ideal for handling and analyzing structured data. Whether you’re dealing with CSV files, Excel spreadsheets, SQL databases, or simply large datasets in Python, Pandas can streamline the process significantly.
The library simplifies working with datasets in many formats. Its interface lets users perform complex transformations, clean data, and build pivot tables with minimal code. Built on NumPy, Pandas operates efficiently on large datasets and integrates with other libraries in the scientific Python ecosystem, such as Matplotlib for data visualization and Scikit-learn for machine learning.
For those looking to dive deeper into the capabilities of Pandas and how it can be leveraged in data engineering and analysis, the detailed guide available on Ruangguru’s AI Bootcamp provides comprehensive insights and practical examples to get you started on your data engineering journey with Pandas.
Importing Necessary Packages and Reading Data
First, we import the Python packages used for data manipulation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
The dataset is loaded from a remote URL into a pandas DataFrame, which allows us to manipulate the data efficiently:
url = 'https://storage.googleapis.com/rg-ai-bootcamp/assignment-1/fifa21_raw_data.csv'
fifa_df = pd.read_csv(url, low_memory=False)
Task 1: Removing Unnecessary Columns
In our first task, we address the ‘Unnamed: 0’ column. A column with this name typically appears when a DataFrame's index was written to the CSV file, and it carries no meaningful information. Removing it simplifies the dataset:
fifa_df = fifa_df.drop(columns=['Unnamed: 0'])
print(fifa_df.head(1))
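If the cleaning script may be rerun, or if some copies of the file lack the column, a more defensive drop can be useful. The following is a minimal sketch on a small made-up DataFrame; the `sample` data is hypothetical and only stands in for the real file:

```python
import pandas as pd

# Hypothetical stand-in for the raw data, including the leftover index column.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Name": ["Player A", "Player B"],
})

# errors="ignore" turns the drop into a no-op when the column is already gone,
# so the same line can be executed more than once without raising KeyError.
sample = sample.drop(columns=["Unnamed: 0"], errors="ignore")
sample = sample.drop(columns=["Unnamed: 0"], errors="ignore")  # second run: no error

print(sample.columns.tolist())
```

Without `errors="ignore"`, the second call would raise a `KeyError` because the column no longer exists.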
Task 2: Eliminating Newline Characters
Next, we remove newline characters, which can disrupt parsing and clutter text fields. With regex=True, the replacement is applied inside every string value in the DataFrame, so each \n is replaced with an empty string:
fifa_df = fifa_df.replace('\n', '', regex=True)
print(fifa_df['Club'].head())
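To see what this replacement does, it can help to try it on a tiny example first. This sketch uses a made-up `sample` DataFrame (the club names and ratings are hypothetical) and shows that string cells are cleaned while numeric columns are left untouched:

```python
import pandas as pd

# Hypothetical rows that mimic club names with leading newlines.
sample = pd.DataFrame({
    "Club": ["\n\n\n\nFC Example", "\n\n\n\nSample United"],
    "OVA": [90, 85],
})

# With regex=True the pattern is matched inside each string cell;
# numeric columns such as OVA are not affected.
cleaned = sample.replace(r"\n", "", regex=True)

print(cleaned["Club"].tolist())
print(cleaned["OVA"].tolist())
```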
Task 3: Cleaning Up Star Characters
Some columns, such as ‘W/F’, ‘SM’, and ‘IR’, contain a star (★) symbol next to the rating. Removing it leaves only the digit, although the values remain strings until they are explicitly converted to a numeric type:
fifa_df = fifa_df.replace('★', '', regex=True)
print(fifa_df[['W/F', 'SM', 'IR']].head())
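Stripping the symbol alone does not change the column type. One possible follow-up, sketched here on made-up values (the exact spacing around the star in the real file is an assumption), is to strip whitespace and convert with `pd.to_numeric`:

```python
import pandas as pd

# Hypothetical star-rating values; the real file may format them differently.
sample = pd.DataFrame({
    "W/F": ["4 ★", "3 ★"],
    "SM": ["5★", "2★"],
    "IR": ["5 ★", "1 ★"],
})

star_cols = ["W/F", "SM", "IR"]
for col in star_cols:
    # Remove the literal star, trim leftover spaces, then parse as numbers.
    stripped = sample[col].str.replace("★", "", regex=False).str.strip()
    sample[col] = pd.to_numeric(stripped)

print(sample.dtypes)
```

After this step the three columns hold integers, so they can be used directly in comparisons and aggregations.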
Task 4: Handling Missing Values
Missing data can lead to inaccurate analyses. In the FIFA dataset, we fill missing ‘Loan Date End’ values with ‘Not on Loan’ and missing ‘Hits’ values with ‘Unknown’ so that no gaps remain in these columns:
fifa_df['Loan Date End'] = fifa_df['Loan Date End'].fillna('Not on Loan')
fifa_df['Hits'] = fifa_df['Hits'].fillna('Unknown')
print(fifa_df[['Loan Date End', 'Hits']].tail())
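Before and after filling, it is worth counting the missing values to confirm that the step did what was intended. A minimal sketch on a made-up `sample` DataFrame (the values are hypothetical) might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one gap in each column.
sample = pd.DataFrame({
    "Loan Date End": [np.nan, "Jun 30, 2021"],
    "Hits": ["372", np.nan],
})

# Count missing values per column before filling.
missing_before = sample.isna().sum()

# A dict passed to fillna assigns a separate fill value to each column.
sample = sample.fillna({"Loan Date End": "Not on Loan", "Hits": "Unknown"})

# Count again to confirm that no gaps remain.
missing_after = sample.isna().sum()

print(missing_before)
print(missing_after)
```

Note that filling ‘Hits’ with a text label keeps the column as strings; if it is needed for arithmetic later, a numeric placeholder or a separate conversion step would be required.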
Task 5: Converting Financial Data
Financial data often comes with currency symbols and suffixes that must be converted before analysis. We clean the ‘Value’, ‘Wage’, and ‘Release Clause’ columns by removing the € symbol and expanding the K (thousand) and M (million) suffixes into full integer amounts:
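def clean_financial_data(value):
    # Strip the currency symbol, then scale by the K or M suffix.
    value = value.replace('€', '')
    if 'K' in value:
        # round() avoids truncation caused by floating-point error
        return round(float(value.replace('K', '')) * 1000)
    elif 'M' in value:
        return round(float(value.replace('M', '')) * 1000000)
    # No suffix: the value is already a plain integer string.
    return int(value)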
fifa_df['Value'] = fifa_df['Value'].apply(clean_financial_data)
fifa_df['Wage'] = fifa_df['Wage'].apply(clean_financial_data)
fifa_df['Release Clause'] = fifa_df['Release Clause'].apply(clean_financial_data)
print(fifa_df[['Name', 'Value', 'Wage', 'Release Clause']].head())
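The function above assumes every value is a string. If a column might contain missing values, a slightly more defensive variant can help. The sketch below is one possible approach, not the method used in this post; the name `parse_euro`, the `MULTIPLIERS` table, and the sample values are all hypothetical:

```python
import pandas as pd

# Hypothetical lookup table mapping each suffix to its multiplier.
MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_euro(value):
    """Convert strings such as '€1.5M' or '€560K' to an integer number of euros."""
    if pd.isna(value):
        return None  # leave missing values as missing
    text = str(value).replace("€", "").strip()
    suffix = text[-1:].upper()
    if suffix in MULTIPLIERS:
        # round() guards against floating-point products landing just below an integer.
        return round(float(text[:-1]) * MULTIPLIERS[suffix])
    return round(float(text))

# Made-up values for illustration only.
sample = pd.Series(["€103.5M", "€560K", "€0"])
converted = sample.map(parse_euro)

print(converted.tolist())
```

Keeping the multipliers in a dictionary means a new suffix, such as B for billions, can be supported by adding one entry rather than another branch.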
Conclusion
Data cleaning, though time-consuming, is crucial for making datasets useful and analysis-ready. By using Python and its powerful libraries, we can automate and simplify many of the tedious tasks involved in data cleaning. The cleaned FIFA 21 dataset now serves as a robust foundation for any further analysis or machine learning models.
For more insights and updates on data manipulation techniques, stay tuned to our blog. If you’re looking to enhance your data cleaning skills further, consider exploring more detailed Python packages and their functionalities.