This data set contains ratings for every episode of each series in the IMDb top 250. Personally, I have heard the “Parks and Rec vs. The Office” debate for years now, and I would like to see how their IMDb ratings stack up against each other over the course of each show’s run.
ratings <- read.csv("C:/Users/ericp/Downloads/archive/imdb_top_250_series_episode_ratings.csv")
favShows <- c("Parks and Recreation", "The Office")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
recVsOffice <- ratings %>%
filter(Title %in% favShows)
head(recVsOffice)
##   X Season Episode Rating      Code      Title
## 1 0      1       1    7.3 tt0386676 The Office
## 2 1      1       2    8.1 tt0386676 The Office
## 3 2      1       3    7.6 tt0386676 The Office
## 4 3      1       4    7.9 tt0386676 The Office
## 5 4      1       5    8.3 tt0386676 The Office
## 6 5      1       6    7.6 tt0386676 The Office
The first step, shown above, filters the data down to just these two shows.
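Because `%in%` silently drops any title that does not match exactly, it can be worth confirming that both shows survived the filter. Here is a minimal sketch on a small made-up data frame; the `toy` rows and their ratings are illustrative placeholders, not values from the real file:

```r
library(dplyr)

# Toy stand-in for `ratings`; these rows are made up for illustration only.
toy <- data.frame(
  Title  = c("The Office", "The Office", "Parks and Recreation", "Some Other Show"),
  Season = c(1, 1, 1, 1),
  Rating = c(7.3, 8.1, 7.0, 9.0)
)
favShows <- c("Parks and Recreation", "The Office")

toyFiltered <- toy %>%
  filter(Title %in% favShows)

# One row per show with its episode count; a missing show hints at a title mismatch.
episodeCounts <- count(toyFiltered, Title)
print(episodeCounts)

# Titles that were requested but not found (empty if everything matched).
missingShows <- setdiff(favShows, unique(toyFiltered$Title))
print(missingShows)
```

If `missingShows` is not empty, the spelling in `favShows` probably differs from the spelling in the `Title` column.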
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(directlabels)
## Warning: package 'directlabels' was built under R version 4.0.5
plt <- ggplot(data = recVsOffice) +
  geom_line(aes(x = X, y = Rating, color = Title)) +
  geom_dl(aes(x = X, y = Rating, color = Title, label = Title), method = "top.points") +
  xlab("Episode Number")
plt
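One caveat: `X` looks like a row index carried over from the original file (it starts at 0 for The Office), so it may not line the two shows up at the same starting point. A hedged sketch, assuming rows can be ordered by `Season` and `Episode` within each show, builds a per-show episode counter instead. `EpisodeNumber` is a name introduced here, and the Parks and Recreation ratings and `X` values in the toy data are placeholders:

```r
library(dplyr)
library(ggplot2)

# Toy data: The Office rows echo the head() output above; the rest is made up.
toy <- data.frame(
  X       = c(0, 1, 2, 500, 501, 502),
  Title   = rep(c("The Office", "Parks and Recreation"), each = 3),
  Season  = 1,
  Episode = rep(1:3, times = 2),
  Rating  = c(7.3, 8.1, 7.6, 6.5, 7.0, 7.2)
)

# Number the episodes 1, 2, 3, ... separately within each show.
toyIndexed <- toy %>%
  arrange(Title, Season, Episode) %>%
  group_by(Title) %>%
  mutate(EpisodeNumber = row_number()) %>%
  ungroup()

plt <- ggplot(toyIndexed, aes(x = EpisodeNumber, y = Rating, color = Title)) +
  geom_line() +
  xlab("Episode Number (within show)")
print(plt)
```

With this counter, both lines start at episode 1, so the shows can be compared at the same point in their runs.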
But how do the two shows compare season by season?
(recVsOfficeSeasons <- recVsOffice %>%
  group_by(Season, Title) %>%
  summarise(Rating = mean(Rating), .groups = "keep"))
## # A tibble: 16 x 3
## # Groups:   Season, Title [16]
##    Season Title                Rating
##     <int> <chr>                 <dbl>
##  1      1 Parks and Recreation   7.17
##  2      1 The Office             7.97
##  3      2 Parks and Recreation   8.02
##  4      2 The Office             8.33
##  5      3 Parks and Recreation   8.41
##  6      3 The Office             8.50
##  7      4 Parks and Recreation   8.21
##  8      4 The Office             8.41
##  9      5 Parks and Recreation   8.10
## 10      5 The Office             8.37
## 11      6 Parks and Recreation   8.01
## 12      6 The Office             8.07
## 13      7 Parks and Recreation   8.3
## 14      7 The Office             8.18
## 15      8 The Office             7.43
## 16      9 The Office             7.73
plt <- ggplot(data = recVsOfficeSeasons) +
  geom_line(aes(x = Season, y = Rating, color = Title)) +
  geom_dl(aes(x = Season, y = Rating, color = Title, label = Title), method = "top.points") +
  xlab("Season")
plt
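To read the season-by-season gap as numbers rather than off the chart, the two shows can be placed side by side. The sketch below retypes the rounded season means from the table above and uses `tidyr::pivot_wider`, which is an extra package not loaded elsewhere in this post; `seasonMeans`, `seasonGap`, and `Gap` are names introduced here for illustration:

```r
library(dplyr)
library(tidyr)

# Rounded season means copied from the summary table above.
seasonMeans <- data.frame(
  Season = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 9),
  Title  = c(rep(c("Parks and Recreation", "The Office"), times = 7),
             "The Office", "The Office"),
  Rating = c(7.17, 7.97, 8.02, 8.33, 8.41, 8.50, 8.21, 8.41,
             8.10, 8.37, 8.01, 8.07, 8.30, 8.18, 7.43, 7.73)
)

# One row per season, one column per show, plus the difference between them.
# Seasons 8 and 9 get NA because only The Office has those seasons.
seasonGap <- seasonMeans %>%
  pivot_wider(names_from = Title, values_from = Rating) %>%
  mutate(Gap = `The Office` - `Parks and Recreation`)
print(seasonGap)
```

A positive `Gap` means The Office had the higher season average. Going by the rounded means, The Office leads in seasons 1 through 6 and Parks and Recreation leads in season 7.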
Finally, let’s look at each series’ distribution of ratings.
plt <- ggplot(data = recVsOffice) +
  geom_bar(stat = "count", aes(x = Rating, fill = Title), position = "dodge") +
  xlab("Rating")
plt
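Since The Office ran for nine seasons and Parks and Recreation for seven, raw counts tilt toward The Office at most rating values. One possible alternative, sketched here on made-up toy ratings, is to plot each rating's share of a show's episodes. `ratingShares` and `Share` are names introduced for this example:

```r
library(dplyr)
library(ggplot2)

# Toy ratings with unequal episode counts per show; values are illustrative only.
toy <- data.frame(
  Title  = c(rep("The Office", 6), rep("Parks and Recreation", 3)),
  Rating = c(7.3, 8.1, 7.6, 7.9, 8.3, 7.6, 7.0, 7.6, 8.1)
)

# Count episodes at each rating, then convert counts to a within-show share.
ratingShares <- toy %>%
  count(Title, Rating) %>%
  group_by(Title) %>%
  mutate(Share = n / sum(n)) %>%
  ungroup()

plt <- ggplot(ratingShares) +
  geom_col(aes(x = Rating, y = Share, fill = Title), position = "dodge") +
  xlab("Rating") +
  ylab("Share of episodes")
print(plt)
```

Each show's bars now sum to 1, so the shapes of the two distributions can be compared directly even though the shows have different numbers of episodes.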