World Cup 2018 is Finally Here
A Quick Analysis of all Teams
It has been a long, long 1,423 days since the conclusion of the last World Cup. Luckily, this long wait is almost over which means it is time to analyze everything and anything World Cup related.
In this post, let’s start with getting all the squads.
We will gather all the team data from the Wikipedia page.
First, load libraries:
# load libraries
library(tidyverse) # for data manipulation
library(rvest) # for web scraping
library(lubridate) # to work with the birthdates
library(DT) # to create an interactive table at the end
After, get the url for the Wikipedia page:
url <- "https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads"
There are 32 teams and each one is in a table and each table has a sequential number in the XPath for that table.
This is also true for the team names but they are in headers which also contain the same set of sequential numbers.
Since we know this, we can first create a vector of sequential numbers from 1 to 32:
table_numbers <- c(1:32)
Next, write a function to get all the player data from each table:
# function to get table for each team
get_teams <- function(tbl_nums) {
# get country names
team_name <- url %>%
read_html() %>%
html_node(xpath = paste0('//*[@id="mw-content-text"]/div/h3[',tbl_nums,']')) %>%
html_text
# get squad info
url %>%
read_html() %>%
html_node(xpath = paste0('//*[@id="mw-content-text"]/div/table[',tbl_nums,']')) %>%
html_table(fill = TRUE) %>%
as.tibble() %>%
mutate(team_name = team_name) # add country names
}
Then, use map
to iterate over vector of numbers and pull player data for each team:
teams <- map_df(table_numbers, get_teams)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
## Warning: The `printer` argument is deprecated as of rlang 0.3.0.
## This warning is displayed once per session.
Data Clean-Up
We will be stacking all the data for 2018 squads on top of all the historical team data for analysis later. As a result we need to make some changes to our extracted data so we can bind_rows
with the historical data.
# clean up the data
teams <- teams %>%
mutate(`No.` = as.character(`No.`), ClubCountry = NA, Year = 2018) %>%
rename(No = `No.`, Pos = `Pos.`, `DOB/Age` = `Date of birth (age)`, Country = team_name) %>%
select(No,Pos,Player,`DOB/Age`,Caps,Club,Country,ClubCountry,Year)
Get Historical Teams
Thanks to this repo by S. Anand we can get all historical teams:
## read in historical data
historic_data <- read_csv("https://raw.githubusercontent.com/sanand0/fifadata/master/squads.csv")
## Parsed with column specification:
## cols(
## No = col_character(),
## Pos = col_character(),
## Player = col_character(),
## `DOB/Age` = col_character(),
## Caps = col_character(),
## Club = col_character(),
## Country = col_character(),
## ClubCountry = col_character(),
## Year = col_integer()
## )
A Little More Clean-Up
We have to change the data type for the Caps
column to match
# change data type for Caps to match
historic_data <- historic_data %>%
mutate(Caps = as.integer(Caps))
## Warning in evalq(as.integer(Caps), <environment>): NAs introduced by
## coercion
Stack Everything
Finally, we can stack the data and we have all teams for 2018 as well as all historical teams in one table.
# stack 2018 data and historical data
all_squads <- bind_rows(historic_data,teams)
The table below contains all the squad details for the 2018 World Cup:
In the next post, we will break up date of birth into more useful pieces and vizualize some of this data.