Homework 1 - ggplot

Author

Gabrielle J Clary

knitr::opts_chunk$set(echo = TRUE, message = FALSE)

R Code

library(tidyverse)
library(ggthemes)
library(ggplot2)
library(dplyr)

titles <- read.csv("~/Desktop/STAA566/Assignment1/ggplot2-gabcla/titles.csv", header = TRUE)

title2 <- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title2 <- title2[!(is.na(title2$tmdb_score) | title2$tmdb_score ==""),]

title3 <- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title3 <- title3[!(is.na(title3$imdb_votes) | title3$imdb_votes ==""),]

title4 <- titles[!(is.na(titles$imdb_score) | titles$imdb_score ==""),]
title4 <- title4[!(is.na(title4$runtime) | title4$runtime ==""),]

title5 <- titles[!(is.na(titles$runtime) | titles$runtime ==""),]
title5 <- title5[!(is.na(title5$imdb_votes) | title5$imdb_votes ==""),]
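The filtering steps above repeat one pattern: drop rows where a column is NA or an empty string. A minimal sketch of that pattern as a reusable function follows. The name `drop_missing` and the toy data frame are hypothetical, introduced only for illustration, and the values are made up rather than taken from titles.csv.

```r
# Hypothetical helper (not part of the original analysis): keep only the rows
# where every column named in `cols` is neither NA nor an empty string.
drop_missing <- function(df, cols) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    x <- df[[col]]
    # is.na(x) | x == "" is TRUE for NA values and for empty strings
    keep <- keep & !(is.na(x) | x == "")
  }
  df[keep, , drop = FALSE]
}

# Made-up toy data standing in for the real titles data
toy <- data.frame(
  imdb_score = c(7.1, NA, 6.4, 8.0),
  imdb_votes = c(1200, 300, NA, 54000),
  runtime    = c(95, 42, 110, NA)
)

toy_scores_votes  <- drop_missing(toy, c("imdb_score", "imdb_votes"))
toy_runtime_votes <- drop_missing(toy, c("runtime", "imdb_votes"))
```

Because the column names are passed once per call, a data frame cannot accidentally be filtered with a logical vector built from a different data frame.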

Data

This data set lists the shows and movies available on Netflix streaming and was created to explore that catalog for interesting facts. The data were acquired in July 2022 and cover titles available in the United States. The data set can be found on Kaggle.

Variables

  • imdb_score - score on IMDB

  • imdb_votes - votes on IMDB

  • tmdb_score - score on TMDB

  • runtime - the length of the episode (show) or movie

  • type - type of title (TV show or movie)

Graph 1

First, I want to check whether IMDB scores and TMDB scores are positively related, since both come from user-rating databases.

p_titles <- ggplot(data = title2,
                   mapping = aes(x = imdb_score,
                                 y = tmdb_score,
                                 color = type)) +
  geom_point() +
  geom_smooth(method = "loess") +
  # theme_tufte() is a complete theme, so it is the only base theme needed
  theme_tufte(base_size=12, base_family = "sans") +
  # Scores are numeric, so use continuous scales
  scale_x_continuous(name = "IMDB Score", breaks = seq(0,10, by = 1)) +
  scale_y_continuous(name = "TMDB Score", breaks = seq(0,10, by = 1)) +
  # Show the full 0-10 range on both axes without dropping any data
  coord_cartesian(xlim = c(0, 10), ylim = c(0, 10)) +
  theme(legend.position =  c(0.87, 0.25))
# For some reason could not get the labeling working - adding the legend back in
# line_ends <- ggplot_build(p_titles)$data[[2]] %>%
#   group_by(colour) %>%
#   filter(x==max(x))
# #add type label
# line_ends$titles <- title2 %>% pull(type) %>%
#   unique() %>%
#   as.character() %>%
#   sort()
#
# p_titles <- p_titles + ggrepel::geom_label_repel(data = line_ends,
#                                                  aes(x = 10^line_ends$x,
#                                                      y = line_ends$y,
#                                                      lable = titles),
#                                                  nudge_x = 1,
#                                                  label.size = NA,
#                                                  fill = alpha(c("white"),0))
p_titles <- p_titles + scale_color_colorblind()

p_titles <- p_titles + ggtitle("TMDB Score by IMDB Score by Title Type")

p_titles

This matches the positive trend we expected for both title types. It is a little interesting, though, that shows tend to score on the higher side in both rating systems.
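The visual trend could also be checked numerically with a correlation per title type. The sketch below uses base R on made-up toy values, not the Netflix data, and it assumes the type labels are "MOVIE" and "SHOW"; the real labels in titles.csv may differ.

```r
# Made-up toy values for illustration only
toy <- data.frame(
  type       = c("MOVIE", "MOVIE", "MOVIE", "SHOW", "SHOW", "SHOW"),
  imdb_score = c(5.0, 6.0, 7.0, 6.5, 7.5, 8.5),
  tmdb_score = c(5.5, 6.1, 7.2, 7.0, 7.4, 8.8)
)

# Split the data by type and compute the Pearson correlation within each group
cor_by_type <- sapply(split(toy, toy$type), function(d) {
  cor(d$imdb_score, d$tmdb_score)
})

cor_by_type
```

On the real data the same call would use `title2` in place of `toy`; a value near 1 for each type would support the positive trend seen in the graph.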

Graph 2

Next I take a deeper look at IMDB, only because it is the older database and the site I am more familiar with. Maybe there is a relationship between a title's score and its number of votes.

p_titles2 <- ggplot(data = title3,
                    mapping = aes(x = imdb_score,
                                  y = imdb_votes,
                                  color = type)) +
  geom_point() +
  geom_smooth(method = "loess") +
  # theme_tufte() is a complete theme, so it is the only base theme needed
  theme_tufte(base_size=12, base_family = "sans") +
  # IMDB score is numeric, so use a continuous scale
  scale_x_continuous(name = "IMDB Score", breaks = seq(0,10, by = 1)) +
  scale_y_continuous(name = "IMDB Votes") +
  # Show the full 0-10 score range without dropping any data
  coord_cartesian(xlim = c(0, 10)) +
  theme(legend.position =  c(.3, .4)) +
  scale_color_fivethirtyeight() +
  ggtitle("IMDB Votes by IMDB Score by Title Type")
p_titles2

There does appear to be some relationship: titles with higher scores tend to have more votes. Again, the trend is the same for both title types.
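If the vote counts turn out to be heavily right-skewed (an assumption, not something verified here), a log-scaled y axis may make the relationship easier to read. The sketch below shows the idea with `scale_y_log10()` on made-up toy values, not the Netflix data; the "MOVIE" and "SHOW" labels are also assumed.

```r
library(ggplot2)

# Made-up toy values for illustration only
toy <- data.frame(
  type       = rep(c("MOVIE", "SHOW"), each = 4),
  imdb_score = c(5.2, 6.1, 7.0, 8.3, 5.8, 6.9, 7.7, 8.9),
  imdb_votes = c(150, 2400, 31000, 520000, 90, 1800, 26000, 410000)
)

# Same aesthetics as the graph above, but with votes on a log10 axis
p_votes_log <- ggplot(toy, aes(x = imdb_score, y = imdb_votes, color = type)) +
  geom_point() +
  scale_y_log10(name = "IMDB Votes (log scale)")

p_votes_log
```

A log axis spreads out the many low-vote titles instead of compressing them along the bottom of the plot.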

Graph 3

Next hunch: maybe scores are related to the runtime of a title.

p_titles3 <- ggplot(data = title4,
                    mapping = aes(x = imdb_score,
                                  y = runtime,
                                  color = type)) +
  geom_point() +
  geom_smooth(method = "loess") +
  # theme_tufte() is a complete theme, so it is the only base theme needed
  theme_tufte(base_size=11, base_family = "sans") +
  # IMDB score is numeric, so use a continuous scale
  scale_x_continuous(name = "IMDB Score", breaks = seq(0,10, by = 1)) +
  scale_y_continuous(name = "Runtime") +
  # Show the full 0-10 score range without dropping any data
  coord_cartesian(xlim = c(0, 10)) +
  theme(legend.position =  c(.25, .8)) +
  scale_color_few() +
  ggtitle("Runtime by IMDB Score by Title Type")
p_titles3

Interestingly, score does not appear to be related to runtime for either TV shows or movies. The smoothed trend for each type is close to a straight line.

Graph 4

Final thought: maybe the number of votes is affected by runtime. My guess is that longer titles receive fewer votes.

p_titles4 <- ggplot(data = title5,
                    mapping = aes(x = runtime,
                                  y = imdb_votes,
                                  color = type)) +
  geom_point() +
  geom_smooth(method = "loess") +
  # theme_tufte() is a complete theme, so it is the only base theme needed
  theme_tufte(base_size=12, base_family = "sans") +
  scale_x_continuous(name = "Runtime") +
  scale_y_continuous(name = "IMDB Votes") +
  theme(legend.position =  c(.15, .7)) +
  scale_color_discrete() +
  ggtitle("IMDB Votes by Runtime by Title Type")

p_titles4

It does look like the shorter the runtime, the more votes a title has on IMDB.
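One way to back up this impression is to compare the median number of votes across runtime bins. The sketch below uses base R on made-up toy values, not the Netflix data, and the bin edges (60 and 120) are arbitrary choices made for illustration.

```r
# Made-up toy values for illustration only
toy <- data.frame(
  runtime    = c(25, 30, 45, 58, 95, 100, 130, 150),
  imdb_votes = c(9000, 15000, 7000, 5000, 3000, 4000, 1200, 800)
)

# Group runtimes into three bins; each interval includes its upper edge
toy$runtime_bin <- cut(toy$runtime,
                       breaks = c(0, 60, 120, Inf),
                       labels = c("0-60", "61-120", "over 120"))

# Median votes within each runtime bin
median_votes <- tapply(toy$imdb_votes, toy$runtime_bin, median)

median_votes
```

On the real data the same steps would use `title5` in place of `toy`. The median is used instead of the mean so that a few titles with very many votes do not dominate a bin.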