This dataset contains information regarding salaries of positions in data science fields around the world.

library(dplyr)
salaries <- read.csv("C:/Users/ericp/Documents/ds_salaries.csv")
summary(salaries)
##        X           work_year    experience_level   employment_type   
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607        
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character  
##  Median :303.0   Median :2022   Mode  :character   Mode  :character  
##  Mean   :303.0   Mean   :2021                                        
##  3rd Qu.:454.5   3rd Qu.:2022                                        
##  Max.   :606.0   Max.   :2022                                        
##   job_title             salary         salary_currency    salary_in_usd   
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859  
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726  
##  Mode  :character   Median :  115000   Mode  :character   Median :101570  
##                     Mean   :  324000                      Mean   :112298  
##                     3rd Qu.:  165000                      3rd Qu.:150000  
##                     Max.   :30400000                      Max.   :600000  
##  employee_residence  remote_ratio    company_location   company_size      
##  Length:607         Min.   :  0.00   Length:607         Length:607        
##  Class :character   1st Qu.: 50.00   Class :character   Class :character  
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character  
##                     Mean   : 70.92                                        
##                     3rd Qu.:100.00                                        
##                     Max.   :100.00

Out of self-interest, I want to know the salaries of entry-level positions in the US, so we will filter the data to US companies and the entry-level experience code ("EN").

entryUS <- salaries %>% filter(company_location == "US" & experience_level == "EN")
head(entryUS)
##    X work_year experience_level employment_type                 job_title
## 1  5      2020               EN              FT              Data Analyst
## 2 28      2020               EN              CT     Business Data Analyst
## 3 31      2020               EN              FT         Big Data Engineer
## 4 37      2020               EN              FT Machine Learning Engineer
## 5 39      2020               EN              FT Machine Learning Engineer
## 6 51      2020               EN              FT              Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  72000             USD         72000                 US          100
## 2 100000             USD        100000                 US          100
## 3  70000             USD         70000                 US          100
## 4 250000             USD        250000                 US           50
## 5 138000             USD        138000                 US          100
## 6  91000             USD         91000                 US          100
##   company_location company_size
## 1               US            L
## 2               US            L
## 3               US            L
## 4               US            L
## 5               US            S
## 6               US            L
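
Before plotting, it can help to check how many rows a filter like this leaves behind. The sketch below is only an illustration: `toy` is a small made-up data frame (not the real ds_salaries.csv) that reuses the column names `company_location` and `experience_level`, and it uses base R so it runs without extra packages.

```r
# Toy stand-in for `salaries`; the values are invented for illustration only.
toy <- data.frame(
  company_location = c("US", "US", "DE", "US", "GB", "US"),
  experience_level = c("EN", "SE", "EN", "EN", "MI", "MI"),
  stringsAsFactors = FALSE
)

# Keep rows for US companies only.
toy_us <- toy[toy$company_location == "US", ]

# Count rows per experience level within the US subset.
level_counts <- table(toy_us$experience_level)
print(level_counts)

# Number of entry-level rows among US companies in the toy data.
n_entry_us <- sum(toy_us$experience_level == "EN")
```

On the real data, the same idea would be `table(entryUS$experience_level)` or simply `nrow(entryUS)`.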

First we will look at the overall distribution of this data, with salary as the response variable.

library(ggplot2)
library(plotly)
histGG <- ggplot(data = entryUS, mapping = aes(x = salary_in_usd/1000)) +
  geom_histogram(bins = 15) +
  theme_minimal() +
  xlab("Salary in USD in Thousands") +
  labs(title = "Distribution of Salary for Entry Level Data Scientists in the US")
ggplotly(histGG)

Clearly, there is not much data in this subset, so we will widen the view to all experience levels at companies in the United States.

# Filter to US companies so the data matches the plot title;
# alpha is a fixed setting, so it belongs in geom_density(), not aes()
densGG <- ggplot(data = salaries %>% filter(company_location == "US"), mapping = aes(x = salary_in_usd/1000, fill = experience_level)) +
  geom_density(alpha = 0.3) +
  theme_minimal() +
  xlab("Salary in USD in Thousands") +
  labs(title = "Distribution of Salaries for Data Scientists in the US")
ggplotly(densGG)

I am also interested in how remote work relates to salary at each experience level. The experience levels are coded as EN: Entry-level / Junior; MI: Mid-level / Intermediate; SE: Senior-level / Expert; EX: Executive-level / Director.

salaryUS <- salaries %>% filter(company_location == "US") %>% arrange(desc(experience_level))
# color must be mapped inside aes() to have any effect
ggRemote <- ggplot(data = salaryUS, mapping = aes(x = remote_ratio, y = salary_in_usd/1000, color = experience_level)) +
  geom_point(alpha = .3) +
  xlab("Percent of Work Done Remotely") +
  ylab("Salary in USD in Thousands") +
  theme_minimal() + 
  facet_wrap(~ experience_level) +
  labs(title = "How do experience level and remote work impact salary?")
ggplotly(ggRemote)
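
A scatter plot with only three distinct x values can be hard to read, so a grouped summary may complement it. The sketch below computes the median salary for each experience level and remote ratio with base R's `aggregate()`. The data frame `toy` and its values are invented for illustration (only the column names come from the dataset), so the numbers say nothing about the real salaries.

```r
# Toy data with the same column names as the dataset; values are made up.
toy <- data.frame(
  experience_level = c("EN", "EN", "EN", "SE", "SE", "SE"),
  remote_ratio     = c(0, 100, 100, 0, 0, 100),
  salary_in_usd    = c(60000, 80000, 90000, 120000, 140000, 150000)
)

# Express salary in thousands of USD, as in the plots.
toy$salary_k <- toy$salary_in_usd / 1000

# Median salary for every (experience level, remote ratio) combination.
med <- aggregate(salary_k ~ experience_level + remote_ratio,
                 data = toy, FUN = median)
print(med)
```

Swapping `toy` for `salaryUS` would give the corresponding table for the US subset used in the plot.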

I’ve also heard a lot of varying advice regarding the size of the company you should work for as a new grad. Let’s take a look at salaries as they relate to company size. S: fewer than 50 employees (small); M: 50 to 250 employees (medium); L: more than 250 employees (large).

salarySize <- salaries %>% filter(company_location == "US")
# Order the sizes small -> medium -> large instead of alphabetically
salarySize$company_size <- factor(salarySize$company_size, levels=c("S","M","L"))
# Plot salarySize (not salaryUS) so the factor ordering above is actually used
ggSize <- ggplot(data = salarySize, mapping = aes(x = company_size, y = salary_in_usd/1000, color = experience_level)) +
  geom_point(alpha = 0.3) +
  theme_minimal() +
  ylab("Salary in USD in Thousands") +
  xlab("Company Size") +
  labs(title = "Company size vs Salary (USD)")
ggplotly(ggSize)
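
The `factor()` call above matters because ggplot2 orders a discrete axis by factor levels, and R sorts levels alphabetically unless told otherwise. A minimal sketch of the difference, using a made-up vector `sizes` rather than the real column:

```r
# Made-up company size codes, for illustration only.
sizes <- c("L", "S", "M", "L", "S")

# Default behaviour: levels are sorted alphabetically.
default_order <- levels(factor(sizes))

# Explicit levels: small, then medium, then large.
explicit_order <- levels(factor(sizes, levels = c("S", "M", "L")))

print(default_order)
print(explicit_order)
```

With explicit levels, the x axis reads from small to large, which is usually the order a reader expects.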