What interesting insights can we draw from FIFA World Cup Data?… Part 3: The Summary (Questions & Answers)

Data analysis of the FIFA World Cup data from 1930 to 2014 to discover interesting insights, such as the players who have appeared most in the World Cup, the nations with the highest average performance in the World Cup, and how playing at home or away can affect a match result.

Introduction

The FIFA World Cup is a global football (soccer) competition contested by national teams from around the world. It is held every four years and is the most prestigious trophy in the sport.

This project aims to explore the FIFA World Cup dataset for interesting insights. For this, we will use pandas and NumPy to analyse the data, and Seaborn and Matplotlib for the visualization. In the end, we will pose and answer some questions, providing explanations and graphs where necessary.

The dataset used for this analysis was downloaded from Kaggle using this link: https://www.kaggle.com/abecklas/fifa-world-cup. The World Cup dataset shows all information about all the World Cups from 1930 to 2014, while the World Cup Matches dataset shows all the results from the matches contested as part of the cups. There is a third dataset that contains the players’ data.

I did this project offline on my computer using Jupyter Notebook. You can use any other solution; there are a few online options such as Binder, Kaggle, and Google Colab. You can download the complete notebook (code) from my GitHub.

In an Exploratory Data Analysis (EDA) project, there are usually three parts:

  1. Data Preparation & Cleaning
  2. Analysis and Visualization
  3. Question Answering

This article covers Part 3: Questions and Answers. If you want to read Part 1, which shows how we imported, prepared, and cleaned the data, please click here. If you want to check Part 2, which sets out all the analysis made to attain the project’s goal, click here. And if you want to look at and run the entire project, download it from my GitHub repo.

Excited to do this? Same here!


Asking and Answering Questions

In this section, we’re going to answer some questions based on our findings in Part 2, which covered EDA and visualizations. We will plot some more graphs where necessary. Let us dive in right away…

Question 1: Which players have played the highest number of World Cup Matches?

Here, we will use the World Cup data to find the top 10 players with the highest number of World Cup match appearances. An appearance here means the player was on the match sheet, even if they did not play for a second.
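As a rough sketch of how such a count could be produced with pandas (this is not the notebook's code): the toy rows below are invented, and the column names `MatchID` and `Player Name` are assumptions about how the players file is laid out.

```python
import pandas as pd

# Toy stand-in for the players dataframe; all rows are invented.
# Column names are assumed, not confirmed by the article.
players = pd.DataFrame({
    "MatchID": [1, 1, 1, 2, 2, 3, 3],
    "Player Name": ["A", "A", "B", "A", "C", "A", "B"],
})

# Keep one row per (match, player) so repeated rows do not inflate the
# count, then count how many match sheets each name appears on.
appearances = (
    players.drop_duplicates(subset=["MatchID", "Player Name"])["Player Name"]
    .value_counts()
    .head(10)
)
print(appearances)
```

Names alone may not identify a player uniquely, so on real data it may be safer to count per name and team rather than per name only.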

Ronaldo has played more matches than any other player up until World Cup 2014.

Question 2: Which players have the highest number of world cup appearances?

We want to know how many times each player has been selected for a World Cup and who has the highest number of selections.

For this task, we had to merge data from the players dataframe with Year and MatchID columns from the worldcup_matches_data dataframe. Our results show that Ronaldo has played in more World Cups than any other player, followed by Antonio Carbajal. The others (among the top 10) all have 4 appearances.
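The merge described above could look roughly like this. It is a minimal sketch on invented rows, assuming the players data has `MatchID` and `Player Name` columns and that `worldcup_matches_data` has `MatchID` and `Year`.

```python
import pandas as pd

# Invented toy data; column names are assumptions about the CSV layout.
players = pd.DataFrame({
    "MatchID": [10, 11, 20, 20, 30],
    "Player Name": ["A", "A", "A", "B", "A"],
})
worldcup_matches_data = pd.DataFrame({
    "MatchID": [10, 11, 20, 30],
    "Year": [1990, 1990, 1994, 1998],
})

# Attach the tournament year to every player row via the match ID.
# De-duplicating on MatchID first guards against repeated match rows.
year_lookup = worldcup_matches_data[["Year", "MatchID"]].drop_duplicates(subset="MatchID")
merged = players.merge(year_lookup, on="MatchID", how="left")

# Count the distinct tournaments (years) in which each player appears.
tournaments = (
    merged.groupby("Player Name")["Year"]
    .nunique()
    .sort_values(ascending=False)
    .head(10)
)
print(tournaments)
```

Counting distinct years rather than rows is what separates World Cup selections from match appearances.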

Data in this dataset spans from 1930 to 2014, thus, data from the 2018 World Cup is not included.

Question 3: Which coach has won the highest number of World Cup Matches?

Here’s what’s happening:

  • We merged relevant data from the worldcup_matches_data dataframe with data from the players dataframe.
  • We selected the rows for coaches who won at home (home_win_df) and did the same for coaches who won away (away_win_df).
  • We then combined both dataframes: each coach’s away wins in away_win_df were added to that coach’s home wins in home_win_df, matching rows on the coach’s name.
  • We dropped duplicate values and plotted a graph of the top 10 coaches with the highest number of wins.
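The steps above can be sketched as follows. This is one possible implementation, not the notebook's own: the rows are invented, the column names (`Home Team Initials`, `Away Team Initials`, `Home Team Goals`, `Away Team Goals`, `Team Initials`, `Coach Name`) are assumptions, and a win is decided by goals only, so matches settled on penalties are ignored.

```python
import pandas as pd

# Invented toy data; column names are assumptions about the CSV layout.
worldcup_matches_data = pd.DataFrame({
    "MatchID": [1, 2, 3, 4],
    "Home Team Initials": ["AAA", "BBB", "AAA", "CCC"],
    "Away Team Initials": ["BBB", "CCC", "CCC", "AAA"],
    "Home Team Goals": [2, 0, 1, 0],
    "Away Team Goals": [1, 1, 1, 3],
})
players = pd.DataFrame({
    "MatchID": [1, 1, 2, 2, 3, 3, 4, 4],
    "Team Initials": ["AAA", "BBB", "BBB", "CCC", "AAA", "CCC", "CCC", "AAA"],
    "Coach Name": ["Coach A", "Coach B", "Coach B", "Coach C",
                   "Coach A", "Coach C", "Coach C", "Coach A"],
})

# One coach row per (match, team) instead of one per player.
coaches = players[["MatchID", "Team Initials", "Coach Name"]].drop_duplicates()

m = worldcup_matches_data
# Matches won by the home side, joined to the home team's coach.
home_win_df = m[m["Home Team Goals"] > m["Away Team Goals"]].merge(
    coaches, left_on=["MatchID", "Home Team Initials"],
    right_on=["MatchID", "Team Initials"])
# Matches won by the away side, joined to the away team's coach.
away_win_df = m[m["Away Team Goals"] > m["Home Team Goals"]].merge(
    coaches, left_on=["MatchID", "Away Team Initials"],
    right_on=["MatchID", "Team Initials"])

# Add home and away wins per coach; a coach missing on one side counts as 0.
total_wins = (
    home_win_df["Coach Name"].value_counts()
    .add(away_win_df["Coach Name"].value_counts(), fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
    .head(10)
)
print(total_wins)
```

Using `add` with `fill_value=0` keeps coaches who only ever won at home or only away, which a plain inner merge on the coach's name would drop.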

This Wikipedia article shows the country initials and their full names. 

Note: Some of the initials in this dataset are obsolete. The article highlights them.

Germany and Brazil (most especially) have dominated World Cups for years now. Their coaches have seen glory more than any other coaches.

Question 4: Which is the most used Referee in the knockout stages?

The aim is to find the referee who has officiated the most matches during the knockout stages, i.e., the stages after the Group Stage.

To obtain this result, we selected all rows in the worldcup_matches_data dataframe that did not contain the word ‘Group’ in the Stage column. This data is data for the knockout stages. We then obtained the top 10 referees with the highest number of appearances. To answer the question, Belgian referee Langenus Jean and Swedish referee Eklind Ivan are the most used referees in the World Cup knockout stages — they have officiated 5 knockout matches each. 
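The filter described above can be sketched as below. The rows are invented, and the `Stage` and `Referee` column names are assumptions.

```python
import pandas as pd

# Invented toy data; column names are assumptions about the CSV layout.
worldcup_matches_data = pd.DataFrame({
    "Stage": ["Group 1", "Group A", "Round of 16",
              "Quarter-finals", "Final", "Semi-finals"],
    "Referee": ["REF One", "REF Two", "REF One",
                "REF Two", "REF One", "REF Three"],
})

# Drop rows with a missing stage or referee so the string test is well defined.
known = worldcup_matches_data.dropna(subset=["Stage", "Referee"])

# Keep every row whose Stage does not contain the word "Group".
knockout = known[~known["Stage"].str.contains("Group")]

# Count matches per referee and keep the ten busiest.
top_referees = knockout["Referee"].value_counts().head(10)
print(top_referees)
```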

The counts for the top referees are close to one another, which suggests that knockout-stage matches have been shared fairly evenly among referees.

Question 5: Which Countries Hosted the World Cup and Won?

We want to find the countries that hosted the tournament and went on to win it at home.
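A minimal sketch of this lookup, assuming the World Cups table has `Year`, `Country` (the host), and `Winner` columns. Only four tournaments are included here to keep the example short.

```python
import pandas as pd

# A few tournaments only; column names are assumptions about the CSV layout.
world_cups = pd.DataFrame({
    "Year": [1930, 1950, 1966, 2014],
    "Country": ["Uruguay", "Brazil", "England", "Brazil"],
    "Winner": ["Uruguay", "Uruguay", "England", "Germany"],
})

# A host won at home when the host country equals the winner.
host_winners = world_cups.loc[
    world_cups["Country"] == world_cups["Winner"], ["Year", "Country"]
]
print(host_winners)
```

On the real file, host and winner names may not always be spelled identically (West Germany, for instance, may be recorded under a different label than Germany), so some name normalization may be needed before comparing.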

Only six teams have managed to win the World Cup as hosts, and none has done it more than once.

Question 6: Which Matches had the Highest Attendance?

Matches in the 1950s dominate the chart. This is plausible: there had been no World Cup for 12 years before the 1950 tournament, so fans were likely very eager to show up.
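One way to pull out the best-attended matches is `DataFrame.nlargest`. The sketch below uses invented attendance figures and assumes `MatchID` and `Attendance` columns.

```python
import pandas as pd

# Invented toy data; match 2 is deliberately listed twice.
worldcup_matches_data = pd.DataFrame({
    "MatchID": [1, 2, 2, 3, 4],
    "Attendance": [50000, 120000, 120000, 90000, 70000],
})

# Remove repeated match rows, then take the rows with the largest attendance.
top_matches = (
    worldcup_matches_data.drop_duplicates(subset="MatchID")
    .nlargest(3, "Attendance")
)
print(top_matches)
```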

Question 7: Which stadiums are most likely to be populated if there is a match?

By computing the average attendance for each stadium, we can see which stadiums tend to draw the biggest crowds when they host a match.
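The average per stadium can be computed with a group-by. This is a sketch on invented numbers, assuming `Stadium` and `Attendance` columns.

```python
import pandas as pd

# Invented toy data; column names are assumptions about the CSV layout.
worldcup_matches_data = pd.DataFrame({
    "Stadium": ["Stadium X", "Stadium X", "Stadium Y", "Stadium Y", "Stadium Z"],
    "Attendance": [90000, 100000, 40000, 60000, None],
})

# Ignore matches with no recorded attendance, then average per stadium.
avg_attendance = (
    worldcup_matches_data.dropna(subset=["Attendance"])
    .groupby("Stadium")["Attendance"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)
print(avg_attendance)
```

A stadium that hosted a single sell-out match would top such a ranking, so it may be worth requiring a minimum number of matches per stadium.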

Historically, Estadio Azteca has pulled bigger crowds than any other stadium. We can’t say with certainty that this would still be the case today, but it is likely to draw a similarly large crowd if it hosts a match.


Inferences and Conclusion

In this article, we asked and answered interesting questions using the FIFA World Cup Dataset.

Here are a few insights:

  • The FIFA World Cup started in 1930 and was hosted by Uruguay, which went on to win the cup.
  • Brazil has won the highest number of World Cups (5), followed by Germany and Italy with 4 each. Uruguay and Argentina have 2 cups each, followed by England, France and Spain which have each won the cup once.
  • SCHOEN Helmut is the coach who has won the highest number of matches across all World Cups (48). He coached what was then West Germany.
  • Belgian referee, LANGENUS Jean, and Swedish referee, EKLIND Ivan, have officiated the highest number of matches (5) during the knockout stages.
  • Germany has scored the highest number of goals throughout World Cup history (224), followed by Brazil (221), and Argentina (131).
  • Brazilian Ronaldo has appeared in more World Cup tournaments and has played more matches than any other player.
  • Only Uruguay (1930), Italy (1934), England (1966), Germany (1974), Argentina (1978), and France (1998) have managed to win the World Cup at home, and none of them has done so more than once.
  • Throughout World Cup history, teams have been more likely to win a match when playing at home than away. When playing away, a team has even been more likely to draw than to win.
  • Most World Cup matches have involved 1 to 5 goals.

Future Work

  • We could do some more in-depth analysis to answer more questions, such as:
      • Which players have played the highest number of minutes in World Cup history?
      • Which players have influenced the game when substituted? i.e., they either scored or provided an assist which resulted in their team winning.
      • Which players have been substituted the most? Likewise, which players have the most starts?
      • and more…

  • We could also use a more statistically friendly approach to know the most performant teams in World Cup history. That is, by considering performance based on the number of games played, previous history of the team (how many times they’ve qualified, the stages they’ve played in at previous World Cups, etc). Performance here refers to the number of goals scored, highest stage attained, number of matches won, etc.
  • The data ends in 2014, but there has been one more World Cup since then, in 2018, which was won by France. So, we could look for that data and add it to this dataset to make our analysis complete.

References

Throughout this project, these resources came in handy:

  1. Zero to Pandas Data Analysis Course on Jovian.ai: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas
  2. FIFA World Cup dataset: https://www.kaggle.com/abecklas/fifa-world-cup
  3. Pandas Documentation: https://pandas.pydata.org/docs/
  4. Matplotlib Documentation: https://matplotlib.org/
  5. Seaborn Documentation: https://seaborn.pydata.org/
  6. FIFA Teams and their Initials: https://en.wikipedia.org/wiki/List_of_FIFA_country_codes
  7. FIFA World Cup Record and Statistics: https://en.wikipedia.org/wiki/List_of_FIFA_World_Cup_records_and_statistics
