Data analysis of the FIFA World Cup data from 1930 to 2014 to discover interesting insights, such as the players who have appeared most often in the World Cup, the nations with the highest average performance in the World Cup, and how playing at home or away can affect a match result.
The FIFA World Cup is a global football competition contested by the football-playing nations of the world. It is played every four years and is the most prestigious and important trophy in football (also known as soccer).
This project aims to explore the FIFA World Cup dataset for interesting insights. For this, we will use NumPy to analyse the data and Matplotlib for the visualization. At the end, we will pose and answer some questions, providing explanations and graphs where necessary.
The dataset used for this analysis was downloaded from Kaggle using this link: https://www.kaggle.com/abecklas/fifa-world-cup. The World Cup dataset shows all information about all the World Cups from 1930 to 2014, while the World Cup Matches dataset shows all the results from the matches contested as part of the cups. There is a third dataset that contains the players’ data.
I did this project offline on my computer using Jupyter Notebook. You can use any other solution; a few are available online, such as Binder, Kaggle, and Google Colab. You can download the complete notebook (code) from my GitHub repo.
In an Exploratory Data Analysis (EDA) project, there are usually three parts:
- Data Preparation & Cleaning
- Analysis and Visualization
- Question Answering
This article covers Part 2: Exploratory Data Analysis & Visualization. If you want to read Part 1, which shows how we imported, prepared, and cleaned the data, please click here. If you want to check Part 3, which poses some questions, answers them, and outlines the outcomes and conclusions, click here. And if you want to look at and run the entire project, download it from my GitHub repo.
Excited to do this? Me too!
Exploratory Data Analysis & Visualization
In this section, we will use graphs and some functions to explore and visualize the data. Short explanations of our findings will be presented after each exploration and/or visualization. The aim is to explore relationships between variables (data columns), understand patterns and draw inferences.
All functions used in this section are described in the documentation of the libraries used. I will provide short explanations where needed; however, you should check the documentation for a deeper understanding of the functions (especially those used to plot the graphs).
Let’s begin by setting some general rules for all graphs and using the describe function to see some statistics on the numeric columns in the World Cup dataframe.
Remember that in Part 1 we created three dataframes from the three CSV files in the dataset: the worldcup_data dataframe, the worldcup_matches_data dataframe (obtained after cleaning the worldcup_matches dataframe), and the worldcup_players dataframe. All three dataframes will be used for the analysis.
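As a rough sketch of this first step, the cell below sets a few Matplotlib defaults and calls describe on a small stand-in dataframe. The specific rcParams values, the column names, and the numbers in the toy rows are assumptions made for illustration; in the notebook, describe would be called on the worldcup_data dataframe created in Part 1.

```python
import matplotlib
import pandas as pd

# General rules applied to every figure created afterwards.
# The exact values are a matter of taste (assumed here).
matplotlib.rcParams["font.size"] = 12
matplotlib.rcParams["figure.figsize"] = (10, 5)
matplotlib.rcParams["figure.facecolor"] = "white"

# Toy stand-in for worldcup_data: made-up numbers, assumed column names.
worldcup_data = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "GoalsScored": [60, 80, 100],
    "QualifiedTeams": [12, 16, 16],
    "MatchesPlayed": [18, 18, 18],
})

# describe() summarises every numeric column (count, mean, std, quartiles...).
summary = worldcup_data.describe()
print(summary)
```

Because rcParams are global, they only need to be set once near the top of the notebook.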
As the years went by, the number of qualified teams increased, and so did the number of matches played; hence, many more goals were scored. We can’t say with certainty that the average number of goals scored was higher as the number of games increased, but this is a valid argument which we will look into later. We will explore some of these thoughts in the coming cells.
- Attendance per World Cup has been on a steady rise. However, all graphs show that there was no World Cup in 1942 and 1946. This Wikipedia article explains that it was because of World War II. Explore this article to know more.
- There appear to be four clusters of qualified teams. During the early years, there weren’t as many qualified teams as there are now, and I believe that is because conditions at the time did not favour accommodating as many countries. These conditions include, but are not limited to, fewer stadiums, low standards of living (especially in developing countries), the inability of most countries to afford sending a team to represent them, and the fact that not many people knew about or could play football.
- The number of goals scored does not show a very clear trend. Many goals were scored in the 50s; the number then declined between the 60s and 80s and has been fluctuating since.
- The number of matches played per World Cup has increased roughly every four World Cups. This is most probably due to the increasing capacity to host more teams.
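The four trends described above could be drawn with one small grid of line plots, roughly as follows. This is only a sketch: the column names are assumed to match the cleaned worldcup_data dataframe, and the rows are made-up values that stand in for the real data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for worldcup_data (made-up values, assumed column names).
worldcup_data = pd.DataFrame({
    "Year": [1930, 1934, 1938, 1950],
    "Attendance": [500_000, 400_000, 450_000, 1_000_000],
    "QualifiedTeams": [12, 16, 16, 12],
    "GoalsScored": [60, 80, 100, 90],
    "MatchesPlayed": [18, 18, 18, 22],
})

columns = ["Attendance", "QualifiedTeams", "GoalsScored", "MatchesPlayed"]

# One panel per column, all sharing the Year axis.
fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharex=True)
for ax, column in zip(axes.flat, columns):
    ax.plot(worldcup_data["Year"], worldcup_data[column], marker="o")
    ax.set_title(f"{column} per World Cup")
    ax.set_xlabel("Year")
fig.tight_layout()
```

Plotting against the actual year (rather than the row position) is what makes a gap such as the missing 1942 and 1946 tournaments visible on the x-axis.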
Let us make a few findings…
1. Top Performant Countries in the World Cups
We want to explore the worldcup_data dataframe to see how the teams have performed across the World Cups, specifically the teams that have finished in the top three positions.
Brazil, Italy, and Germany have dominated the World Cups. Most of the other teams have only managed a third-place finish and have never played in a final.
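One possible way to count top-three finishes per country is sketched below. The column names Winner, Runners-Up, and Third are assumed to be those of the worldcup_data dataframe, and only the podiums of the first three tournaments are typed in by hand as sample rows, so the resulting chart is not the full picture.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hand-typed sample: podiums of the first three World Cups only.
worldcup_data = pd.DataFrame({
    "Year": [1930, 1934, 1938],
    "Winner": ["Uruguay", "Italy", "Italy"],
    "Runners-Up": ["Argentina", "Czechoslovakia", "Hungary"],
    "Third": ["USA", "Germany", "Brazil"],
})

# Count how often each country appears in each podium position.
podium_counts = (
    worldcup_data[["Winner", "Runners-Up", "Third"]]
    .apply(pd.Series.value_counts)  # one column of counts per position
    .fillna(0)
    .astype(int)
)

# Put the countries with the most podium finishes first.
order = podium_counts.sum(axis=1).sort_values(ascending=False).index
podium_counts = podium_counts.loc[order]

ax = podium_counts.plot(kind="bar", stacked=True)
ax.set_ylabel("Number of finishes")
ax.set_title("Top-three finishes per country")
```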
2. Goals per Team per World Cup
The aim here is to see how countries have scored throughout their participation in the World Cups. We will use my country, Cameroon, as an example.
It’s interesting to see that Cameroon has scored at least one goal in every World Cup they have played. The team performed best in 1990, when they were knocked out by England in the quarter-final. Read more here. Since then, their performances have been on a decline.
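A team’s goals per tournament have to be collected from two places, since the team may appear as the home side or as the away side in a match. The helper below, goals_per_world_cup, is a hypothetical name introduced for this sketch; the column names are assumed to match worldcup_matches_data, and the matches are made-up rows with placeholder teams.

```python
import pandas as pd

# Toy stand-in for worldcup_matches_data (placeholder teams, made-up scores).
matches = pd.DataFrame({
    "Year": [2010, 2010, 2014, 2014],
    "Home Team Name": ["Team A", "Team C", "Team A", "Team B"],
    "Home Team Goals": [2, 0, 1, 2],
    "Away Team Goals": [1, 3, 1, 0],
    "Away Team Name": ["Team B", "Team A", "Team C", "Team C"],
})

def goals_per_world_cup(matches, team):
    """Return the goals `team` scored in each World Cup year (hypothetical helper)."""
    home = (matches[matches["Home Team Name"] == team]
            .groupby("Year")["Home Team Goals"].sum())
    away = (matches[matches["Away Team Name"] == team]
            .groupby("Year")["Away Team Goals"].sum())
    # fill_value=0 keeps years in which the team only played home or only away.
    return home.add(away, fill_value=0).astype(int)

goals = goals_per_world_cup(matches, "Team A")
ax = goals.plot(kind="bar", title="Goals per World Cup: Team A")
ax.set_ylabel("Goals scored")
```

With the real data, the call would look like goals_per_world_cup(worldcup_matches_data, "Cameroon").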
3. Number of Goals Per Country for all the World Cups they’ve played
We want to see how many goals each team has scored throughout their World Cup appearances.
Germany leads the scoreboard with just a few goals above Brazil. The other teams have had decent goal-scoring streaks too.
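The all-time totals can be obtained the same way, by summing home goals and away goals per team and adding the two results. The sketch below assumes the worldcup_matches_data column names and uses made-up rows with placeholder teams.

```python
import pandas as pd

# Toy stand-in for worldcup_matches_data (placeholder teams, made-up scores).
matches = pd.DataFrame({
    "Home Team Name": ["Team A", "Team C", "Team A", "Team B"],
    "Home Team Goals": [2, 0, 1, 2],
    "Away Team Goals": [1, 3, 1, 0],
    "Away Team Name": ["Team B", "Team A", "Team C", "Team C"],
})

# Goals scored as the home side and as the away side, per team.
home_goals = matches.groupby("Home Team Name")["Home Team Goals"].sum()
away_goals = matches.groupby("Away Team Name")["Away Team Goals"].sum()

# Align on team name; a team missing from one side counts as 0 there.
total_goals = (
    home_goals.add(away_goals, fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)

# Horizontal bars keep long country names readable.
ax = total_goals.head(10).plot(kind="barh", title="Total World Cup goals per team")
ax.invert_yaxis()  # highest scorer at the top
```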
4. Frequency of Goals Scored During a World Cup
We aim to see how often goals are scored in a World Cup, i.e., which number of goals is most likely to be scored during a World Cup match.
Most World Cup matches involve 0 to 5 goals. A few matches have about 6 to 9 goals. Some have had up to 11, and none has had more than that.
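A minimal sketch of this count: add the two goal columns to get the total per match, then tally how many matches ended with each total. The column names are assumed to match worldcup_matches_data, the Total Goals column is introduced here for illustration, and the scores are made up.

```python
import pandas as pd

# Toy stand-in for worldcup_matches_data (made-up scores).
matches = pd.DataFrame({
    "Home Team Goals": [2, 0, 1, 1, 4],
    "Away Team Goals": [1, 3, 1, 0, 1],
})

# Total goals in each match ("Total Goals" is a column added for this sketch).
matches["Total Goals"] = matches["Home Team Goals"] + matches["Away Team Goals"]

# Number of matches for each goal total, ordered by goal total.
goal_frequency = matches["Total Goals"].value_counts().sort_index()
most_likely_total = goal_frequency.idxmax()

ax = goal_frequency.plot(kind="bar", title="Frequency of goals per match")
ax.set_xlabel("Goals in a match")
ax.set_ylabel("Number of matches")
```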
5. Distribution of Home Team and Away Team Goals
We want to see how the teams have been scoring based on whether they were playing at home or away.
At home, most teams manage to score 1 to 3 goals. Some (about 4 or so) have scored more than 8. Away, most teams struggle to score a goal; those that do rarely score more than 3. No team has managed to score more than 7 goals while playing away.
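The two distributions can be compared with side-by-side histograms that use one bin per goal count, as sketched below. The column names are assumed to match worldcup_matches_data, and the scores are made up. Here "home" simply means the team listed in the home column of the dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy stand-in for worldcup_matches_data (made-up scores).
matches = pd.DataFrame({
    "Home Team Goals": [2, 0, 1, 1, 4],
    "Away Team Goals": [1, 3, 1, 0, 1],
})

# One bin per integer goal count, from 0 up to the highest score seen.
max_goals = int(matches[["Home Team Goals", "Away Team Goals"]].max().max())
bins = list(range(0, max_goals + 2))

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(
    [matches["Home Team Goals"], matches["Away Team Goals"]],
    bins=bins,
    label=["Home team", "Away team"],
)
ax.set_xlabel("Goals scored in a match")
ax.set_ylabel("Number of matches")
ax.set_title("Distribution of home team and away team goals")
ax.legend()
```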
6. Match Outcome Based on Whether a Team Is Playing at Home or Away
This helps us to understand how likely it is for a team to win a match when they are playing at home or away.
The chart makes it clear that a team is more likely to win the match if it is playing at home (57%). A team playing away is even more likely to draw (22%) than to win (20%).
In this article, we performed some exploratory analysis and got insights on the top performant countries in the World Cups, goals per team per World Cup, the number of goals per country across all the World Cups they’ve played, the frequency of goals scored during a World Cup, the distribution of home team and away team goals, and match outcome based on whether a team is playing at home or away.
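One way such percentages could be computed is to label every match by comparing the two goal columns and then take the share of each label. This is a simplified sketch with made-up scores: the Outcome column and its labels are introduced here for illustration, and a level score is counted as a draw even if the real match was later decided on penalties.

```python
import numpy as np
import pandas as pd

# Toy stand-in for worldcup_matches_data (made-up scores).
matches = pd.DataFrame({
    "Home Team Goals": [2, 0, 1, 1, 4],
    "Away Team Goals": [1, 3, 1, 0, 1],
})

home = matches["Home Team Goals"]
away = matches["Away Team Goals"]

# Label each match; anything that is not a win for either side is a draw.
matches["Outcome"] = np.select(
    [home > away, home < away],
    ["Home Team Win", "Away Team Win"],
    default="Draw",
)

# Share of each outcome, in percent.
outcome_share = matches["Outcome"].value_counts(normalize=True) * 100

ax = outcome_share.plot(kind="pie", autopct="%1.0f%%", title="Match outcomes")
ax.set_ylabel("")  # hide the default series-name label
```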
There is an unlimited world of possibilities with this data. We will, however, end our exploration here. In the next part, we will conclude by posing and answering some questions related to this analysis. Click here to continue this project.
Throughout this project, these resources came in handy:
- Zero to Pandas Data Analysis Course on Jovian.ai: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas
- FIFA World Cup dataset: https://www.kaggle.com/abecklas/fifa-world-cup
- Pandas Documentation: https://pandas.pydata.org/docs/
- Matplotlib Documentation: https://matplotlib.org/
- Seaborn Documentation: https://seaborn.pydata.org/
- FIFA Teams and their Initials: https://en.wikipedia.org/wiki/List_of_FIFA_country_codes
- FIFA World Cup Record and Statistics: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_records_and_statistics