Published on
19 min read

Web-scraping to get an edge in the NBA fantasy league draft

Authors

Note that the GitHub repository for this article may be found here.

In this project I will be using the rvest package to scrape the web for basketball player data and using this to explore previous performance in the NBA fantasy league. This should help identify the most consistent players, as well as those who may be undervalued in a fantasy draft.

The website I will be getting data from is the Basketball Reference website. I’m most interested in the game logs, as with these we can estimate each player’s consistency throughout last season, and calculate fantasy scores.

After investigating the pages containing each player game logs, we can see that Basketball Reference uses a player code and year to construct the URL of each page containing yearly stats for each player. This URL follows the general formula:

www.basketball-reference.com/players/i/{player_code}/gamelog/{year}

Using this as a template, I have made the below function that can scrape the gamelog table for a given player code and year.

This function utilises the rvest package function html_elements to parse the html code and find the table elements, then html_table reads this into a table format, which can be converted to a data frame.

# scrape game log table for a player in a given year
get_gamelog <- function(player_code, year){
  
  url <- paste0(glue('https://www.basketball-reference.com/players/i/{player_code}/gamelog/{year}'))
  # Scrape Player game log
  gamelog <- url %>%
    read_html() %>% 
    html_elements('table') %>%
    .[8] %>% 
    html_table() %>% 
    as.data.frame()
  
  return(gamelog)

}

get_gamelog("jokicni01", 2023) %>% 
  head() %>% 
  knitr::kable()
RkGDateAgeTmVar.6OppVar.8GSMPFGFGAFG.X3PX3PAX3P.FTFTAFT.ORBDRBTRBASTSTLBLKTOVPFPTSGmScX…
112022-10-1927-242DEN@UTAL (-21)133:251217.70613.333221.000224630342724.5-5
222022-10-2127-244DEN@GSWW (+5)134:24713.53812.50011111.00039121000542624.9-3
332022-10-2227-245DENOKCW (+5)138:41610.600111.00069.667115161311321925.4+18
442022-10-2427-247DEN@PORL (-25)126:5834.75000331.00018990015913.8-10
552022-10-2627-249DENLALW (+11)134:481217.70604.000771.00011213940333134.3+28
662022-10-2827-251DENUTAW (+16)125:28310.30004.00068.7503710611111214.1+8

I would like to scrape the game logs for all players, or at least the top players from the 2023 season. To do this, we need to get all of the player codes. Thankfully, I was able to find this github repo with a function that could be slightly modified to get a table from Basketball Reference containing all the player codes. This table also has average season stats for each player

I also decided to filter this player list, removing players with average minutes played below 25.

# Acknowledgement: code modified from github.com/djblechn-su/nba-player-team-ids/

# Create Function to Scrape Player codes
scrape_nba_main <- function(yr){
  # Create URL
  url <- glue('https://www.basketball-reference.com/leagues/NBA_{yr}_per_game.html')
  webpage <- read_html(url)
  
  # Scrape All Player Links on Page
  links <- webpage %>%
    html_nodes(xpath = "//td/a") %>% 
    html_attr("href")
  links <- links[grepl("/players", links)]
  links <- links[!duplicated(links)]
  
  # Scrape Player Information
  player_table <- webpage %>%
    html_nodes("table") %>%
    .[1] %>%
    html_table(fill = TRUE) %>%
    as.data.frame()
  player_table <- player_table %>% filter(Player != "Player")
  player_table <- player_table[!duplicated(player_table[c('Player', 'Age')]),]
  player_table <- player_table[,c(2,3,4,8)]
  player_table$Link <- links
  BBRefID <- strsplit(player_table$Link, '\\/')
  BBRefID <- sapply(BBRefID, function(x) x[4])
  BBRefID <- gsub(".html", "", BBRefID)
  player_table$BBRefID <- BBRefID

  return(player_table)
}

bbref_player_codes <- scrape_nba_main(2023)
bbref_player_codes <- bbref_player_codes[!duplicated(bbref_player_codes$BBRefID),]
bbref_player_codes <- bbref_player_codes[order(bbref_player_codes$BBRefID),]

bbref_player_codes_filtered <- bbref_player_codes %>%
  mutate(MP = as.numeric(MP)) %>% 
  filter(MP > 25)

bbref_player_codes_filtered %>%
  head() %>% 
  knitr::kable()
PlayerPosAgeMPLinkBBRefID
Steven AdamsC2927.0/players/a/adamsst01.htmladamsst01
Bam AdebayoC2534.6/players/a/adebaba01.htmladebaba01
Grayson AllenSG2727.4/players/a/allengr01.htmlallengr01
Jarrett AllenC2432.6/players/a/allenja01.htmlallenja01
Kyle AndersonPF2928.4/players/a/anderky01.htmlanderky01
Giannis AntetokounmpoPF2832.1/players/a/antetgi01.htmlantetgi01

OK, now that we have all of the basketball reference ID’s its a matter of looping through them to construct the URL for each individual player’s stats, scraping the game log table for each player and combining these into one master table.

player_codes <- bbref_player_codes$BBRefID
year <- 2023

all_game_logs <- list()
for (i in seq_along(player_codes)) {
  player_code <- player_codes[i]
  player_game_log <- get_gamelog(player_code=player_code, year=year) %>% 
    mutate(player_code = player_code) %>% 
    filter(Rk != "Rk") %>%  # remove the extra column-name rows 
    mutate(across(everything(), as.character)) 
    all_game_logs[[player_code]] <- player_game_log
  
  if (i > 1) {
    Sys.sleep(5) # sleep 5 seconds to prevent the website throttling the webscraper
    } 
}

all_game_logs_df <- bind_rows(all_game_logs)

all_game_logs_df %>%
  head() %>% 
  knitr::kable()
RkGDateAgeTmVar.6OppVar.8GSMPFGFGAFG.X3PX3PAX3P.FTFTAFT.ORBDRBTRBASTSTLBLKTOVPFPTSGmScX…player_code
112022-10-1923-030TORCLEW (+3)017:56411.36414.25012.50014500001105.0-4achiupr01
222022-10-2123-032TOR@BRKL (-4)017:2916.16701.00023.667066000324-2.2-8achiupr01
332022-10-2223-033TOR@MIAL (-3)033:3259.55625.40067.8573811100111817.1+16achiupr01
442022-10-2423-035TOR@MIAW (+8)033:43512.41701.00001.00041822201021012.7+10achiupr01
552022-10-2623-037TORPHIW (+10)021:1837.42913.333003364002375.9+14achiupr01
662022-10-2823-039TORPHIL (-22)014:5706.00003.00000213101210-3.5-10achiupr01

Next, we need to clean the data. Rows corresponding to missed games have a number of different string values, so I replaced those with NA. There were rows that contained column names, which I removed. I joined this table to the player codes table, so now we have player names and minutes played, to make the table a bit easier to read.

I’ve additionally calculated number of double doubles, triple doubles, missed points, and a few other stats that are used to calculate fantasy scores, but not included in the original Basketball Reference table.

# these values correspond to when a player did not play in the game
missing_values <- c("Inactive|Did Not Dress|Did Not Play|Not With Team|Player Suspended")

# add in the player name
all_game_logs_df <- all_game_logs_df %>% 
  left_join(bbref_player_codes[c("BBRefID", "Player")],
            by = c("player_code" = "BBRefID")) %>% 
  mutate_all(~str_replace_all(., missing_values, NA_character_)) %>% 
  mutate(across(c(
    "GS",
    "FG",
    "FGA",  
    "FG.",
    "X3P",
    "X3PA",
    "X3P.",
    "FT",
    "FTA",
    "FT.",
    "ORB",
    "DRB",
    "TRB", 
    "AST",
    "BLK",
    "TOV",
    "STL",
    "PTS"),
    ~as.numeric(.))) %>% 
  # make columns for triple doubles and double doubles:
  # DD: (IF: 2/5 OF ASS/BLOCK/STEAL/REB/PTS > 9)
  # TD: (IF: 3/5 OF ASS/BLOCK/STEAL/REB/PTS > 9)
  rowwise() %>% 
  mutate("DD" = if_else(sum(TRB > 9, AST > 9, BLK > 9, STL > 9, PTS > 9) >= 2, 1, 0)) %>%
  mutate("TD" = if_else(sum(TRB > 9, AST > 9, BLK > 9, STL > 9, PTS > 9) >= 3, 1, 0)) %>% 
  mutate("QD" = if_else(sum(TRB > 9, AST > 9, BLK > 9, STL > 9, PTS > 9) >= 4, 1, 0)) %>% 
  # games played - 1 if they played the game or 0 if not
  ungroup() %>% 
  mutate(GP = if_else(is.na(GS), 0, 1)) %>% # if game was played, 1 else zero
  # calculate number missed for field goals, free throws, three pointers
  mutate(FGM = FGA - FG) %>% 
  mutate(FTM = FTA - FT) %>% 
  mutate(X3PM = X3PA - X3P)

all_game_logs_df %>% 
  head() %>% 
  knitr::kable()
RkGDateAgeTmVar.6OppVar.8GSMPFGFGAFG.X3PX3PAX3P.FTFTAFT.ORBDRBTRBASTSTLBLKTOVPFPTSGmScX…player_codePlayerDDTDQDGPFGMFTMX3PM
112022-10-1923-030TORCLEW (+3)017:564110.364140.250120.50014500001105.0-4achiupr01Precious Achiuwa0001713
222022-10-2123-032TOR@BRKL (-4)017:29160.167010.000230.667066000324-2.2-8achiupr01Precious Achiuwa0001511
332022-10-2223-033TOR@MIAL (-3)033:32590.556250.400670.8573811100111817.1+16achiupr01Precious Achiuwa1001413
442022-10-2423-035TOR@MIAW (+8)033:435120.417010.000010.00041822201021012.7+10achiupr01Precious Achiuwa1001711
552022-10-2623-037TORPHIW (+10)021:18370.429130.33300NA3364002375.9+14achiupr01Precious Achiuwa0001402
662022-10-2823-039TORPHIL (-22)014:57060.000030.00000NA213101210-3.5-10achiupr01Precious Achiuwa0001603

Now we need to calculate “fantasy scores” for each player to see how they would have performed in a fantasy league last year. These are calculated using a formula specific to the fantasy league. Each stat is multiplied by a multiplier and then the weighted statistics are summed to get the fantasy score for a given game.

For each player, I calculated total fantasy score, as well as mean, median and standard deviation for the 2023 season. In order to gauge each player’s raw potential and consistency, these stats were calculated for the whole season (i.e. including missed games as zero’s in the calculations) to gauge each players consistency throughout the season - it’s valuable to have players who actually play a lot of games and score highly throughout the season. To examine player potential, I also calculated the same stats with missed games removed - to see how well they play per-game. This metric should highlight high-performing players, even those who missed a lot of games last year. I also have added in player position and minutes played to this table.

Now we have a table which can definitively rank player performance in 2023, and we can export this to a spreadsheet that will be very useful when making picks during the draft.

# the sum of each metric * by a multiplier is used to calc the fantasy score 
# this is specific to the league 
multipliers_vector <- c(
  "GP" = 1,
  "FG" = 2,
  "FGM" = -1,
  "FT" = 1,
  "FTA" = 0.5,
  "FTM" = -1,
  "X3P" = 3.5,
  "X3PM" = -1.5,
  "ORB" = 3,
  "DRB" = 1,
  "TRB" = 1,
  "AST" = 4,
  "STL" = 5,
  "BLK" = 6,
  "TOV" = -2.5,
  "DD" = 10,
  "TD" = 30,
  "QD" = 1000,
  "PTS" = 1)

# calculate the fantasy score per game 
scores <- all_game_logs_df %>% 
  select(c(Player, names(multipliers_vector)))

weighted_scores = list()
for (i in seq_along(multipliers_vector)){
  name <- names(multipliers_vector[i])
  weighted_scores[[name]] <- scores[[name]] * multipliers_vector[name]
}
weighted_scores_df <- as.data.frame(weighted_scores)
agg_weighted_scores <- data.frame(
  player_name = scores$Player,
  weighted_scores = rowSums(weighted_scores_df, na.rm = T)
) %>%
  cbind(weighted_scores_df) %>% 
  mutate(weighted_scores_with_na = if_else(GP == 1, weighted_scores, NA)) # create another weighted score with NA's if they did not play, this will allow to calculate per-game stats 

final_weighted_scores_summarised <- agg_weighted_scores %>% 
  select(player_name, weighted_scores, weighted_scores_with_na, GP) %>% 
  group_by(player_name) %>%
  summarise(season_total_score = sum(weighted_scores),
            n_games_played = sum(GP), 
            season_mean = mean(weighted_scores), # season stats count missed games as zero
            season_median = median(weighted_scores),
            season_stdev = sd(weighted_scores), 
            games_played_mean = mean(weighted_scores_with_na, na.rm = TRUE), # games_played stats do not include missed games in the calculation
            games_played_median = median(weighted_scores_with_na, na.rm = TRUE),
            games_played_stdev = sd(weighted_scores_with_na, na.rm = TRUE)) %>%
  arrange(desc(season_median)) %>% 
  # add in other player stats from Basketball reference 
  left_join(bbref_player_codes, by = c("player_name" = "Player")) %>% 
  select(-Link, -BBRefID)

final_weighted_scores_summarised %>%
  head() %>% 
  knitr::kable()
player_nameseason_total_scoren_games_playedseason_meanseason_medianseason_stdevgames_played_meangames_played_mediangames_played_stdevPosAgeMP
Nikola Jokić9440.569115.12805121.5062.46504136.81884131.5040.45340C2733.7
Joel Embiid7590.56692.56707107.2553.22522115.00758115.7530.16168C2834.6
Luka Dončić7788.56694.98171106.7560.93902118.00758118.0043.22779PG2336.2
Giannis Antetokounmpo7238.56388.27439105.0057.83945114.89683115.0035.53079PF2832.1
Domantas Sabonis8637.579105.33537103.2535.49361109.33544104.5029.41586C2634.6
Trae Young6894.57384.0792792.7537.6462194.4452196.0024.53124PG2434.8

To interactively explore the data, can create some plotly scatter charts to compare the players.

In these plots, the y-axis corresponds to a player’s potential to have high-scoring games, and the x-axis corresponds to their consistency throughout the season.

theme_set(theme_bw())

medians_scatter <- final_weighted_scores_summarised %>% 
  ggplot(aes(x = season_median,
             y = games_played_median,
             colour = Pos,
             text = glue(
               "
               Player: {player_name}
               Position: {Pos}
               Total score for season: {season_total_score}
               Median score for season: {season_median}
               Median score for games played: {games_played_median}
               "
             ))) +
  labs(x = 'Median score for season',
       y = 'Median score for games played')+
  geom_point() +
  scale_color_brewer(palette = 'Set3')

medians_scatter # %>% ggplotly(tooltip = 'text')

Player position is also an important factor when drafting a team, as you need to ensure all positions are filled. We can plot each position separately to see more clearly who are better players for their position.

split_medians_scatter <- final_weighted_scores_summarised %>% 
  ggplot(aes(x = season_median,
             y = games_played_median,
             colour = Pos,
             text = glue(
               "
               Player: {player_name}
               Position: {Pos}
               Total score for season: {season_total_score}
               Median score for season: {season_median}
               Median score for games played: {games_played_median}
               "
             ))) +
  labs(x = 'Median score for season',
       y = 'Median score for games played')+
  geom_point() +
  facet_wrap(~Pos) +
  scale_color_brewer(palette = 'Set3')

split_medians_scatter # %>% ggplotly(tooltip = 'text')

Total score for the previous season is also a useful metric to look at. This metric is often what other less data-savvy drafters will be using to evaluate their draft picks. If we plot season total score against median score of games played, this can help identify under-valued players who played less games, but scored highly for the games that they played.

total_scatter <- final_weighted_scores_summarised %>% 
  ggplot(aes(x = season_total_score,
             y = games_played_median,
             colour = Pos,
             text = glue(
               "
               Player: {player_name}
               Position: {Pos}
               Total score for season: {season_total_score}
               Median score for season: {season_median}
               Median score for games played: {games_played_median}
               "
             ))) +
  labs(x = 'Total score for season',
       y = 'Median score for games played')+
  geom_point() +
  scale_color_brewer(palette = 'Set3')

total_scatter # %>% ggplotly(tooltip = 'text')
split_total_scatter <- final_weighted_scores_summarised %>% 
  ggplot(aes(x = season_total_score,
             y = games_played_median,
             colour = Pos,
             text = glue(
               "
               Player: {player_name}
               Position: {Pos}
               Total score for season: {season_total_score}
               Median score for season: {season_median}
               Median score for games played: {games_played_median}
               "
             ))) +
  facet_wrap(~Pos) +
  labs(x = 'Total score for season',
       y = 'Median score for games played')+
  geom_point()+
  scale_color_brewer(palette = 'Set3')

split_total_scatter # %>% ggplotly(tooltip = 'text')
sessionInfo()

## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Australia/Sydney
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-3 plotly_4.10.2      janitor_2.2.0      glue_1.6.2        
##  [5] XML_3.99-0.14      rvest_1.0.3        lubridate_1.9.2    forcats_1.0.0     
##  [9] stringr_1.5.0      dplyr_1.1.2        purrr_1.0.1        readr_2.1.4       
## [13] tidyr_1.3.0        tibble_3.2.1       ggplot2_3.4.2      tidyverse_2.0.0   
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.3        generics_0.1.3    xml2_1.3.5        stringi_1.7.12   
##  [5] hms_1.1.3         digest_0.6.33     magrittr_2.0.3    evaluate_0.21    
##  [9] grid_4.3.1        timechange_0.2.0  fastmap_1.1.1     jsonlite_1.8.7   
## [13] httr_1.4.6        selectr_0.4-2     fansi_1.0.4       viridisLite_0.4.2
## [17] scales_1.2.1      lazyeval_0.2.2    cli_3.6.1         rlang_1.1.1      
## [21] munsell_0.5.0     withr_2.5.0       yaml_2.3.7        tools_4.3.1      
## [25] tzdb_0.4.0        colorspace_2.1-0  curl_5.0.1        vctrs_0.6.3      
## [29] R6_2.5.1          lifecycle_1.0.3   snakecase_0.11.1  htmlwidgets_1.6.2
## [33] pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.3      data.table_1.14.8
## [37] highr_0.10        xfun_0.39         tidyselect_1.2.0  rstudioapi_0.15.0
## [41] knitr_1.43        farver_2.1.1      htmltools_0.5.5   labeling_0.4.2   
## [45] rmarkdown_2.23    compiler_4.3.1