Tour de France Historical Analysis

Author

Sonoma Miller

Published

October 14, 2025

Research Questions

  1. Which rider has finished on the podium (1st, 2nd, or 3rd) in the most editions of the Tour de France?
  2. Which year had the longest Tour de France (by number of stages and by distance)?

Load and Clean Data (Data Wrangling)

About the dataset: This data contains information about Tour de France stages and riders from 1903 (the year the race began) to 2023. It was compiled on Kaggle by Sujay Kapadnis. The Tour de France is an annual bicycle race in which teams of riders compete over a series of races, called stages. Here, I have sorted the stages by year in ascending order and removed some rider columns we won’t use. Some key columns:

Riders

  1. Rank: The rider’s placing in the race
  2. Rider: The rider’s name
  3. Year: The year of the Tour De France
  4. Distance: The total kilometers the rider rode
  5. TotalSeconds: The total seconds the rider took to finish

Stages

  1. Year: The year the stage occurred
  2. TotalTDFDistance: The total kilometers of all the stages combined
  3. Stage: The stage name containing the stage number and start and finish locations

Riders dataset structure:

Rows: 9,878
Columns: 11
$ Rank             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Rider            <chr> "MAURICE GARIN", "LUCIEN POTHIER", "FERNAND AUGEREAU"…
$ Rider.No.        <int> 1, 37, 39, 33, 12, 9, 28, 2, 45, 71, 21, 62, 50, 14, …
$ Team             <chr> "TDF 1903 ***", "TDF 1903 ***", "TDF 1903 ***", "TDF …
$ Times            <chr> "94h 33' 14''", "97h 32' 35''", "99h 02' 38''", "99h …
$ Gap              <chr> "-", "+ 02h 59' 21''", "+ 04h 29' 24''", "+ 04h 39' 3…
$ Year             <int> 1903, 1903, 1903, 1903, 1903, 1903, 1903, 1903, 1903,…
$ Distance..km.    <int> 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428,…
$ Number.of.stages <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ TotalSeconds     <int> 340394, 351155, 356558, 357164, 358918, 365858, 37478…
$ GapSeconds       <int> 0, 10761, 16164, 16770, 18524, 25464, 34388, 41164, 4…

Stages dataset structure:

Rows: 2,365
Columns: 4
$ X                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Year             <int> 1903, 1903, 1903, 1903, 1903, 1903, 1904, 1904, 1904,…
$ TotalTDFDistance <int> 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428, 2428,…
$ Stage            <chr> "Stage 1 : Paris > Lyon", "Stage 2 : Lyon > Marseille…
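
The wrangling described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the original code: the raw CSV headers, the object names (`riders_raw`, `stages_raw`), and the rename to `Distance` are assumptions. It also shows why names such as `Distance..km.` appear in the output: `read.csv()` rewrites headers into syntactic R names by default.

```r
library(dplyr)

# Toy stand-in for the riders CSV. The headers below are an assumption about
# the raw file, chosen to match the mangled names in the structure output.
riders_csv <- "Rank,Rider,Rider No.,Year,Distance (km),TotalSeconds
1,MAURICE GARIN,1,1903,2428,340394
2,LUCIEN POTHIER,37,1903,2428,351155"

# read.csv() converts headers to syntactic R names by default
# (check.names = TRUE), so "Distance (km)" becomes Distance..km.
riders_raw <- read.csv(text = riders_csv)

# Keep only the columns used later, renaming the distance column on the way.
riders <- riders_raw |>
  select(Rank, Rider, Year, Distance = Distance..km., TotalSeconds)

# Toy stand-in for the stages data, deliberately out of order.
stages_raw <- data.frame(
  Year = c(1904L, 1903L, 1903L),
  TotalTDFDistance = c(2428L, 2428L, 2428L),
  Stage = c("Stage 1 : A > B",            # placeholder stage name
            "Stage 1 : Paris > Lyon",
            "Stage 2 : Lyon > Marseille")
)

# Sort stages by year, ascending.
stages <- stages_raw |>
  arrange(Year)
```

Passing `check.names = FALSE` to `read.csv()` would keep the original headers instead, at the cost of needing backticks around names that contain spaces.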

Visualization 1

Which rider has finished on the podium (1st, 2nd, or 3rd) in the most editions of the Tour de France?

Here, I filter the riders dataset to those who finished in the top 3, then count how many times each rider placed. I then create a bar graph of the top five riders with the most podium finishes (1st, 2nd, or 3rd) in Tour de France history.
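
A minimal sketch of that filter-count-plot pipeline, assuming a data frame with the `Rank`, `Rider`, and `Year` columns described earlier. The toy riders and years are hypothetical, and the `distinct()` step is my assumption for making sure each rider is counted at most once per year.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data with the same Rank, Rider, and Year columns.
riders <- data.frame(
  Year  = rep(c(2001L, 2002L, 2003L), each = 4),
  Rank  = rep(1:4, times = 3),
  Rider = c("A", "B", "C", "D",
            "B", "A", "D", "C",
            "A", "D", "B", "C")
)

podiums <- riders |>
  filter(Rank <= 3) |>                    # keep 1st, 2nd, and 3rd place only
  distinct(Rider, Year) |>                # one row per rider per year
  count(Rider, name = "Podiums") |>       # number of podium years per rider
  slice_max(Podiums, n = 5, with_ties = FALSE)  # top five riders

# Horizontal bar chart, longest bar at the top.
p <- ggplot(podiums, aes(x = reorder(Rider, Podiums), y = Podiums)) +
  geom_col() +
  coord_flip() +
  labs(x = "Rider", y = "Years on the podium")
```

With `with_ties = FALSE`, exactly five riders are kept even if several share the fifth-highest count; dropping that argument would keep all tied riders instead.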

Top 16 riders with the most podium finishes in TDF history; Raymond Poulidor leads with 20 placings.

Visualization 2

Which year had the longest Tour de France (by number of stages and by distance)?

First, the stages dataset needs more cleaning before it can be visualized. I will then plot the number of stages against total distance as a scatterplot, with each point labeled by its year, so that the longest TDF can be identified visually.
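
One way this cleaning step might look, assuming the stages data has one row per stage with the `Year` and `TotalTDFDistance` columns described earlier. The toy values and the output names `NumStages` and `TotalKm` are hypothetical.

```r
library(dplyr)

# Hypothetical toy stages data: one row per stage, distance repeated per year.
stages <- data.frame(
  Year = c(2001L, 2001L, 2001L, 2002L, 2002L),
  TotalTDFDistance = c(3400L, 3400L, 3400L, 3300L, 3300L),
  Stage = c("Stage 1 : A > B", "Stage 2 : B > C", "Stage 3 : C > D",
            "Stage 1 : E > F", "Stage 2 : F > G")
)

# Collapse to one row per year: count the stage rows and keep the
# (repeated) total distance for that year.
tdf_by_year <- stages |>
  group_by(Year) |>
  summarise(
    NumStages = n(),
    TotalKm   = first(TotalTDFDistance),
    .groups   = "drop"
  )
```

Because this counts rows, any stage that the dataset lists in several parts would be counted once per part.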

However, when I plot both variables as lines on a single graph, it is nearly impossible to see how the number of stages changes over the years, because stage counts are so small compared to total distance in km (e.g., 24 stages and 5398 km). So instead, we’ll use a scatterplot with the two variables on separate axes.
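
A sketch of such a scatterplot, assuming a per-year summary table with hypothetical columns `Year`, `NumStages`, and `TotalKm` (the values below are made up). Each variable gets its own axis, so the difference in scale no longer hides the variation in stage counts.

```r
library(ggplot2)

# Hypothetical per-year summary: stage count and total distance in km.
tdf_by_year <- data.frame(
  Year      = c(2001L, 2002L, 2003L),
  NumStages = c(20L, 21L, 22L),
  TotalKm   = c(3400L, 3300L, 3500L)
)

# Stage counts on x, distance on y, and the year printed above each point.
p <- ggplot(tdf_by_year, aes(x = NumStages, y = TotalKm, label = Year)) +
  geom_point() +
  geom_text(vjust = -0.8, size = 3) +
  labs(x = "Number of stages", y = "Total distance (km)")
```

On a plot like this, the longest edition by distance sits highest and the one with the most stages sits furthest right, so the two can be read off separately.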

Line plot where x is year, then one line is the number of stages and the other is total distance.

Scatterplot of the number of stages against total distance, with each point labeled by its year.

Interpretation and conclusion

  1. Raymond Poulidor has finished in the top 3 more times than any other rider in TDF history, with over 20 second- and third-place finishes (but no wins!).
  2. In this dataset, the 1937 TDF had the most stages (31) and the 1928 TDF covered the longest distance (5476 km).
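
The second conclusion can also be checked numerically rather than by eye. A minimal sketch, assuming a per-year summary table with hypothetical columns `Year`, `NumStages`, and `TotalKm` (toy values, not the real data):

```r
library(dplyr)

# Hypothetical per-year summary: stage count and total distance in km.
tdf_by_year <- data.frame(
  Year      = c(2001L, 2002L, 2003L),
  NumStages = c(22L, 21L, 20L),
  TotalKm   = c(3400L, 3300L, 3500L)
)

# The year with the most stages and the year with the longest distance
# are looked up separately, because they need not be the same year.
most_stages  <- slice_max(tdf_by_year, NumStages, n = 1)
longest_dist <- slice_max(tdf_by_year, TotalKm, n = 1)
```

By default `slice_max()` keeps ties, so more than one row comes back if several years share the maximum.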