::opts_chunk$set(echo = TRUE, message = FALSE) knitr
Homework 1 - ggplot
R Code
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.1.2
Warning: package 'ggplot2' was built under R version 4.1.2
Warning: package 'tibble' was built under R version 4.1.2
Warning: package 'tidyr' was built under R version 4.1.2
Warning: package 'readr' was built under R version 4.1.2
Warning: package 'dplyr' was built under R version 4.1.2
Warning: package 'stringr' was built under R version 4.1.2
Warning: package 'forcats' was built under R version 4.1.2
library(ggthemes)
library(ggplot2)
library(dplyr)
<- read.csv("~/Desktop/STAA566/Assignment1/ggplot2-gabcla/titles.csv", header = T)
titles
<- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title2 <- title2[!(is.na(title2$tmdb_score) | title2$tmdb_score ==""),]
title2
<- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title3 <- title3[!(is.na(title3$imdb_votes) | title3$imdb_votes ==""),]
title3
<- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title4 <- title4[!(is.na(title4$runtime) | title4$runtime ==""),]
title4
<- titles[!(is.na(titles$runtime) | titles$runtime ==""),]
title5 <- title5[!(is.na(title3$imdb_votes) | title5$imdb_votes ==""),] title5
Warning in is.na(title3$imdb_votes) | title5$imdb_votes == "": longer object
length is not a multiple of shorter object length
Data
This data set was created to list all shows available on Netflix streaming, and analyze the data to find interesting facts. This data was acquired in July 2022 containing data available in the United States. It can be found on Kaggle.
Variables
imdb_score - score on IMDB
imdb_votes - votes on IMDB
tmdb_score - score on TMDB
runtime - the length of the episode (show) or movie
type - type of title (TV show or movie)
Graph 1
First I want to check if there is a positive trend between IMDB scores and TMDB scores since they are both user rating databases & scores.
<- ggplot(data = title2,
p_titles mapping = aes(x = imdb_score,
y = tmdb_score,
color = type)) +
geom_point() +
theme_light() +
theme(legend.position = "none") +
theme_tufte(base_size=12, base_family = "sans") +
geom_smooth(method = "loess") +
scale_x_discrete(name = "IMDB Score", limits = seq(0,10, by = 0.5),
breaks = seq(0,10, by = 1)) +
scale_y_discrete(name = "TMDB Score", limits = seq(0,10, by = 0.5),
breaks = seq(0,10, by = 1)) +
theme(legend.position = c(0.87, 0.25))
Warning: Continuous limits supplied to discrete scale.
Did you mean `limits = factor(...)` or `scale_*_continuous()`?
Continuous limits supplied to discrete scale.
Did you mean `limits = factor(...)` or `scale_*_continuous()`?
# For some reason could not get the labeling working - adding the legend back in
# line_ends <- ggplot_build(p_titles)$data[[2]] %>%
# group_by(colour) %>%
# filter(x==max(x))
# #add type label
# line_ends$titles <- title2 %>% pull(type) %>%
# unique() %>%
# as.character() %>%
# sort()
#
# p_titles <- p_titles + ggrepel::geom_label_repel(data = line_ends,
# aes(x = 10^line_ends$x,
# y = line_ends$y,
# lable = titles),
# nudge_x = 1,
# label.size = NA,
# fill = alpha(c("white"),0))
<- p_titles + scale_color_colorblind()
p_titles
<- p_titles + ggtitle("TMDB Score by IMDB Score by Title Type")
p_titles
p_titles
Seems to match what we expected across type of title. Though it is a little interesting that shows rank on the higher side of both scoring systems.
Graph 2
Taking a deeper look into IMDB, only because it’s an older database and a site I am more familiar with. Maybe there is a relationship between scores and the number of votes.
<- ggplot(data = title3,
p_titles2 mapping = aes(x = imdb_score,
y = imdb_votes,
color = type)) +
geom_point() +
theme_minimal() +
theme(legend.position = "none") +
theme_tufte(base_size=12, base_family = "sans") +
geom_smooth(method = "loess") +
scale_x_discrete(name = "IMDB Score", limits = seq(0,10, by = 0.5),
breaks = seq(0,10, by = 1)) +
scale_y_continuous(name = "IMDB Votes") +
theme(legend.position = c(.3, .4)) +
scale_color_fivethirtyeight() +
ggtitle("IMDB Votes by IMDB Score by Title Type")
Warning: Continuous limits supplied to discrete scale.
Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p_titles2
Looks like there is some relationship - for higher scores there are more votes. Again this trend is the same between the type of titles.
Graph 3
Next hunch, maybe scores are related to the length of title.
<- ggplot(data = title4,
p_titles3 mapping = aes(x = imdb_score,
y = runtime,
color = type)) +
geom_point() +
theme_minimal() +
theme(legend.position = "none") +
theme_tufte(base_size=11, base_family = "sans") +
geom_smooth(method = "loess") +
scale_x_discrete(name = "IMDB Score", limits = seq(0,10, by = 0.5),
breaks = seq(0,10, by = 1)) +
scale_y_continuous(name = "Runtime") +
theme(legend.position = c(.25, .8)) +
scale_color_few() +
ggtitle("Runtime by IMDB Score by Title Type")
Warning: Continuous limits supplied to discrete scale.
Did you mean `limits = factor(...)` or `scale_*_continuous()`?
p_titles3
Interesting, it looks like score isn’t really related to run time both for TV shows and movies. The graph is pretty linear.
Graph 4
Final thought, maybe the number of votes are impacted by runtime. I’m thinking if a title is longer there are going to be less votes.
<- ggplot(data = title5,
p_titles4 mapping = aes(x = runtime,
y = imdb_votes,
color = type)) +
geom_point() +
theme_minimal() +
theme(legend.position = "none") +
theme_tufte(base_size=12, base_family = "sans") +
geom_smooth(method = "loess") +
scale_x_continuous(name = "Runtime") +
scale_y_continuous(name = "IMDB Votes") +
theme(legend.position = c(.15, .7)) +
scale_color_discrete() +
ggtitle("IMDB Votes by Runtime by Title Type")
p_titles4
Warning: Removed 498 rows containing non-finite values (stat_smooth).
Warning: Removed 498 rows containing missing values (geom_point).
It does look like the smaller the runtime the more votes a title has on IMDB.