What interesting insights can we draw from FIFA World Cup Data? Part 1: Data Preparation & Cleaning

An analysis of FIFA World Cup data from 1930 to 2014 to discover interesting insights, such as the players who have appeared most often in the World Cup, the nations with the highest average performance, and how playing at home or away can affect a match result.

Introduction

The FIFA World Cup is a global football competition contested by the football-playing nations of the world. It is held every four years and is the most prestigious and important trophy in football, also known as soccer.

This project aims to explore the FIFA World Cup dataset for interesting insights. We will use Pandas and NumPy to analyse the data, and Seaborn and Matplotlib for the visualization. At the end, we will pose and answer some questions, providing explanations and graphs where necessary.

The dataset used for this analysis was downloaded from Kaggle using this link: https://www.kaggle.com/abecklas/fifa-world-cup. The World Cups dataset contains summary information about every World Cup from 1930 to 2014, the World Cup Matches dataset contains the results of all matches played in those tournaments, and a third dataset contains the players’ data.

I did this project offline on my computer using Jupyter Notebook. You can use any other solution; there are several online options such as Binder, Kaggle Notebooks, and Google Colab. You can download the complete notebook (code) from my GitHub.

In an Exploratory Data Analysis (EDA) project, there are usually three parts:

  1. Data Preparation & Cleaning
  2. Analysis and Visualization
  3. Question Answering

This article covers Part 1: Data Preparation and Cleaning. If you want to read Part 2, which contains all the analysis done to attain the project’s goal, please click here. If you want to check Part 3, which poses and answers some questions and outlines the outcomes and conclusions, click here. And if you want to look at and run the entire project, download it from my GitHub repo.

Let’s get started!


Step 1: Reading the data

The dataset contains three CSV files; we will use the pd.read_csv() function to parse them into Pandas dataframes. Here’s what we’ll do:

  • Load all 3 datasets into Pandas Dataframes and explore relationships.
  • Handle missing, incorrect, and invalid data.
  • Perform any additional steps (parsing dates, creating additional columns, deleting useless columns, merging multiple datasets, etc.)

First, we’ll import all the libraries we need and set Matplotlib to display graphs inline in the notebook rather than as separate images.
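A minimal setup cell might look like this (assuming a Jupyter notebook environment, where the %matplotlib magic is available):

```python
# Core libraries for analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots directly inside the notebook
%matplotlib inline
```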

Let’s import the World Cup dataset and see the info it contains. Remember there are three sets of data. We’ll look at them individually.
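Here is a sketch of the loading step. The filename and the worldcups variable name are assumptions based on the Kaggle download; the notebook may use different names.

```python
# Load the World Cup summary dataset (filename as shipped in the Kaggle download)
worldcups = pd.read_csv('WorldCups.csv')

# Inspect the columns, data types, and a few sample rows
worldcups.info()
worldcups.head()
```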

Here’s what the column headers stand for: 

Year: Year of the world cup 
Country: Host Country 
Winner: The team that won the world cup 
RunnersUp: The team that finished in second place 
Third: The team that was in third place 
Fourth: The team that was in fourth place 
GoalsScored: Total number of goals scored during the world cup 
QualifiedTeams: Total number of teams that qualified for the world cup 
MatchesPlayed: Total number of matches played during the world cup 
Attendance: Total crowd present during the world cup

Let us now look at the Matches dataset.
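As a sketch, assuming the file is named WorldCupMatches.csv as in the Kaggle download (the worldcup_matches variable name is the one used later in this article):

```python
# Load the match-level dataset
worldcup_matches = pd.read_csv('WorldCupMatches.csv')
worldcup_matches.head()
```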

Here’s what the column headers stand for:

Year: Year of the world cup 
Datetime: The date on which the match was played, along with the time in 24-hour format 
Stage: The stage at which the match was played 
Stadium: Stadium name where the match was held 
City: The city where the match was played 
Home Team Name: Home team country name 
Home Team Goals: Total goals scored by the home team by the end of the match 
Away Team Goals: Total goals scored by the away team by the end of the match 
Away Team Name: Away team country name 
Win conditions: Special win condition (if any) 
Attendance: Total crowd present at the stadium 
Half-time Home Goals: Goals scored by the home team until halftime 
Half-time Away Goals: Goals scored by the away team until halftime 
Referee: Name of the first referee 
Assistant 1: Name of the first assistant referee (linesman) 
Assistant 2: Name of the second assistant referee (linesman) 
RoundID: Unique ID of the Round 
MatchID: Unique ID of the match 
Home Team Initials: Home team country’s three-letter initials 
Away Team Initials: Away team country’s three-letter initials

Let us also import the players’ dataset
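Again as a sketch, with the filename and the worldcup_players variable name assumed from the Kaggle download:

```python
# Load the player-level dataset
worldcup_players = pd.read_csv('WorldCupPlayers.csv')
worldcup_players.head()
```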

Here’s what the column headers stand for:

RoundID: Unique ID of the round 
MatchID: Unique ID of the match 
Team Initials: Player’s team initials 
Coach Name: Name and country of the team coach 
Line-up: S=Line-up, N=Substitute 
Shirt Number: Shirt number if available 
Player Name: Name of the player 
Position: C=Captain, GK=Goalkeeper 
Event: G=Goal, OG=Own Goal, Y=Yellow Card, R=Red Card, SY=Red Card by second yellow, P=Penalty, MP=Missed Penalty, I=Substitution In


Step 2: Data Cleaning

This step is pivotal to our analysis, as it often separates good analytical results from bad ones. We have three sets of data and will clean all three. However, the exploration and analysis we have planned do not require perfectly clean data, so we will only perform basic operations: formatting some values, removing null values, deleting some incorrect data, dropping unnecessary columns, and creating new ones. Let us start with the World Cup dataset.

From the output in Step 1 above, we noticed that the Attendance values use full stops as thousands separators; we will remove them so that the values can be stored as integers.
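A minimal sketch of that cleanup, assuming the worldcups dataframe from above:

```python
# Attendance values use full stops as thousands separators;
# strip the dots and cast the column to integers
worldcups['Attendance'] = (
    worldcups['Attendance']
    .astype(str)
    .str.replace('.', '', regex=False)
    .astype(int)
)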

Looks good. Let us now work on the Matches dataset. If we view the bottom rows using worldcup_matches.tail(), we’ll see that the matches dataframe has a large number of NaN values. We want to keep only the rows that have valid entries. Now, the question is, how many of them are there? Let us see…
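One way to count them (a sketch; the variable name `valid` is just for illustration):

```python
# Count the rows that actually contain match data (Year is present)
valid = worldcup_matches['Year'].notna().sum()
print(valid, 'valid rows out of', len(worldcup_matches))
```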

There are 852 rows of valid data out of about 4,572; it is unclear why most of the rows consist of NaN values. For our analysis, we will drop them.

TODO

  • Obtain only rows with valid Year values
  • Convert Year, MatchID from float to int
  • Check for duplicate entries, if any, and drop them
  • Split the DateTime column into Date and Time
  • Create new columns with already existing ones
  • Drop columns that are not needed
  • Fix country names

A. Obtaining Valid Year Values and Converting Float Columns to Int

Here, we selected all rows in the dataset whose Year values were not null, then we cast the Year and MatchID values to integers (they had been parsed as floats because of the NaN values).
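A sketch of this step:

```python
# Keep only rows that describe a real match (Year is present)
worldcup_matches = worldcup_matches[worldcup_matches['Year'].notna()].copy()

# Year and MatchID were parsed as floats because of the NaNs; cast them back to int
worldcup_matches['Year'] = worldcup_matches['Year'].astype(int)
worldcup_matches['MatchID'] = worldcup_matches['MatchID'].astype(int)
```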

B. Check for Duplicate Entries and Drop them

There were 16 duplicate entries — we dropped them. 
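In code, this amounts to something like:

```python
# How many fully duplicated rows are there?
print(worldcup_matches.duplicated().sum())   # 16 in this dataset

# Drop them, keeping the first occurrence of each
worldcup_matches = worldcup_matches.drop_duplicates()
```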

C. Split the DateTime column into Date and Time, Drop the Datetime column

The code looks complex, so let’s break it down. First, we split the Datetime values on the ‘-’ separator, giving us the date and the time as separate values for each row. Next, we merge these with the other dataframe columns and rename the split columns to Date and Time respectively. Finally, we drop the Datetime column since we no longer need it.
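A sketch of the idea, assuming the Datetime values look like ‘13 Jul 1930 - 15:00’:

```python
# Split each Datetime value into its date and time parts
split_dt = worldcup_matches['Datetime'].str.split('-', n=1, expand=True)
split_dt.columns = ['Date', 'Time']
split_dt = split_dt.apply(lambda col: col.str.strip())   # trim spaces left by the split

# Merge the two new columns back in, then drop the original Datetime column
worldcup_matches = pd.concat([worldcup_matches, split_dt], axis=1)
worldcup_matches = worldcup_matches.drop(columns=['Datetime'])
```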

D. Create New Columns

We will now create new columns which we will need for our analysis and inference. First, let us see the columns we already have…

  • We need a Goals column that will show the total number of goals scored per match.
  • We also need a Match Outcome column that says whether the match was a home team win or an away team win.

Let us create these columns now…
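A sketch of how these two columns can be built (the outcome labels and the match_outcome helper are illustrative names, not necessarily those used in the notebook):

```python
# Total goals scored in the match by both teams
worldcup_matches['Goals'] = (
    worldcup_matches['Home Team Goals'] + worldcup_matches['Away Team Goals']
)

# Label each match as a home win, an away win, or a draw
def match_outcome(row):
    if row['Home Team Goals'] > row['Away Team Goals']:
        return 'Home Team Win'
    if row['Home Team Goals'] < row['Away Team Goals']:
        return 'Away Team Win'
    return 'Draw'

worldcup_matches['Match Outcome'] = worldcup_matches.apply(match_outcome, axis=1)
```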

The Goals entries show the total number of goals scored by both teams in a single match. Match Outcome records whether the match ended with the home team winning, the away team winning, or a draw.

E. Fix Country Names

Let us look at the teams which have played in the world cup.

The data is not clean; we have to remove the ‘rn”>’ fragment that appears before some country names.

The wrong_names dataframe contains a list of the wrong country names, and the country_names dataframe contains the corresponding correct names. Before replacing them, we also note that some countries have names with garbled characters; we add those to the list of wrong names and their corrected versions to the list of correct names. We then replace the wrong values in all three World Cup datasets.
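A minimal sketch of the replacement; the mapping below is illustrative only, as the full list in the project is built by inspecting the unique team names in the dataframes:

```python
# Illustrative mapping from bad names to corrected ones
name_fixes = {
    'rn">Bosnia and Herzegovina': 'Bosnia and Herzegovina',
    'rn">Republic of Ireland': 'Republic of Ireland',
}

# Apply the corrections across all three World Cup dataframes
for df in (worldcups, worldcup_matches, worldcup_players):
    df.replace(name_fixes, inplace=True)
```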

The names have been sorted out. There is probably more we could do; for example, some non-English characters in the players’ names appear garbled in the dataframe, and we could adjust those too. However, it is not necessary to “super-clean” the data for our use case.

In the next section, we’ll perform some EDA and visualization on the data; Seaborn and Matplotlib will come in handy. As mentioned earlier, this and the subsequent steps are covered in the second part of this project. Click here to continue to the next section.


References

Throughout this project, these resources came in handy:

  1. Zero to Pandas Data Analysis Course on Jovian.ai: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas
  2. FIFA World Cup dataset: https://www.kaggle.com/abecklas/fifa-world-cup
  3. Pandas Documentation: https://pandas.pydata.org/docs/
  4. Matplotlib Documentation: https://matplotlib.org/
  5. Seaborn Documentation: https://seaborn.pydata.org/
  6. FIFA Teams and their Initials: https://en.wikipedia.org/wiki/List_of_FIFA_country_codes
  7. FIFA World Cup Record and Statistics: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_records_and_statistics
