This dataset contains salary information for data science positions around the world.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
salaries <- read.csv("C:/Users/ericp/Documents/ds_salaries.csv")
summary(salaries)
##        X           work_year    experience_level   employment_type
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character
##  Median :303.0   Median :2022   Mode  :character   Mode  :character
##  Mean   :303.0   Mean   :2021
##  3rd Qu.:454.5   3rd Qu.:2022
##  Max.   :606.0   Max.   :2022
##   job_title             salary         salary_currency    salary_in_usd
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726
##  Mode  :character   Median :  115000   Mode  :character   Median :101570
##                     Mean   :  324000                      Mean   :112298
##                     3rd Qu.:  165000                      3rd Qu.:150000
##                     Max.   :30400000                      Max.   :600000
##  employee_residence  remote_ratio    company_location   company_size
##  Length:607         Min.   :  0.00   Length:607         Length:607
##  Class :character   1st Qu.: 50.00   Class :character   Class :character
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character
##                     Mean   : 70.92
##                     3rd Qu.:100.00
##                     Max.   :100.00
Out of self-interest, I want to know the salaries of entry-level positions in the US, so we will filter the data on company location and experience level.
entryUS <- salaries %>% filter(company_location == "US" & experience_level == "EN")
head(entryUS)
##    X work_year experience_level employment_type                 job_title
## 1  5      2020               EN              FT              Data Analyst
## 2 28      2020               EN              CT     Business Data Analyst
## 3 31      2020               EN              FT         Big Data Engineer
## 4 37      2020               EN              FT Machine Learning Engineer
## 5 39      2020               EN              FT Machine Learning Engineer
## 6 51      2020               EN              FT              Data Analyst
##   salary salary_currency salary_in_usd employee_residence remote_ratio
## 1  72000             USD         72000                 US          100
## 2 100000             USD        100000                 US          100
## 3  70000             USD         70000                 US          100
## 4 250000             USD        250000                 US           50
## 5 138000             USD        138000                 US          100
## 6  91000             USD         91000                 US          100
##   company_location company_size
## 1               US            L
## 2               US            L
## 3               US            L
## 4               US            L
## 5               US            S
## 6               US            L
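Before plotting, it can help to check how many rows the filter actually leaves. The following is a minimal sketch, not part of the original analysis: `toy` is a small made-up data frame standing in for `salaries`, and its values are invented purely for illustration.

```r
library(dplyr)

# Hypothetical stand-in for `salaries`; these values are NOT from ds_salaries.csv.
toy <- data.frame(
  company_location = c("US", "US", "US", "GB", "US", "DE"),
  experience_level = c("EN", "SE", "EN", "EN", "MI", "SE"),
  salary_in_usd    = c(72000, 150000, 91000, 60000, 110000, 95000)
)

# Same filter as above, written with comma-separated conditions
# (filter() combines them with a logical AND).
entry_toy <- toy %>%
  filter(company_location == "US", experience_level == "EN")

# Number of rows that survive the filter.
n_entry <- nrow(entry_toy)

# Row counts per experience level among US companies, for comparison.
us_counts <- toy %>%
  filter(company_location == "US") %>%
  count(experience_level)

print(n_entry)
print(us_counts)
```

Running the same two steps on `salaries` instead of `toy` would show how many entry-level US rows there are relative to the other experience levels.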
First we will look at the overall distribution of this data, with salary as the response variable.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.5
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
histGG <- ggplot(data = entryUS, mapping = aes(x = salary_in_usd/1000)) + geom_histogram(bins = 15) + theme_minimal() + xlab("Salary in USD in Thousands") + labs(title="Distribution of Salary for Entry Level Data Scientists in the US")
ggplotly(histGG)
Clearly, there is not much data in this subset, so we will instead look at all experience levels at companies in the United States.
library(ggplot2)
library(plotly)
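# Restrict to US companies so the data matches the plot title.
# alpha is a constant, so it is set in geom_density() rather than mapped inside aes().
histGG <- ggplot(data = filter(salaries, company_location == "US"), mapping = aes(x = salary_in_usd/1000, fill = experience_level)) + geom_density(alpha = 0.3) + theme_minimal() + xlab("Salary in USD in Thousands") + labs(title="Distribution of Salaries for Data Scientists in the US")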
ggplotly(histGG)
I am also interested in how remote work relates to salary at each experience level. The experience codes are: EN: Entry-level / Junior; MI: Mid-level / Intermediate; SE: Senior-level / Expert; EX: Executive-level / Director.
salaryUS <- salaries %>% filter(company_location == "US") %>% arrange(desc(experience_level))
ggTime <- ggplot(data = salaryUS, mapping = aes(x = remote_ratio, y = salary_in_usd/1000, color = experience_level)) +
geom_point(alpha = .3) +
xlab("Percentage of Work Done Remotely") +
ylab("Salary in USD in Thousands") +
theme_minimal() +
facet_wrap(. ~ experience_level) +
labs(title = "How do experience level and remote work impact salary?")
ggplotly(ggTime)
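A scatter plot with only a few distinct `remote_ratio` values can be hard to read, so a grouped summary is a useful companion. The sketch below is an assumption about how one might do this, not part of the original analysis: `toy` and its values are invented, and the seniority ordering EN, MI, SE, EX is taken from the code definitions given above.

```r
library(dplyr)

# Hypothetical stand-in for `salaryUS`; these values are invented for illustration.
toy <- data.frame(
  experience_level = c("EX", "EX", "MI", "MI", "MI"),
  remote_ratio     = c(0, 100, 0, 100, 100),
  salary_in_usd    = c(200000, 220000, 80000, 90000, 110000)
)

# Order the experience codes by seniority instead of alphabetically,
# so that MI is listed before EX.
toy$experience_level <- factor(
  toy$experience_level,
  levels = c("EN", "MI", "SE", "EX")
)

# Group size and median salary (in thousands of USD) for each
# combination of experience level and remote ratio.
remote_summary <- toy %>%
  group_by(experience_level, remote_ratio) %>%
  summarise(
    n = n(),
    median_salary_k = median(salary_in_usd) / 1000,
    .groups = "drop"
  )

print(remote_summary)
```

Reporting `n` next to each median makes it easier to see which groups are too small to support a conclusion.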
I’ve also heard a lot of varying advice regarding the size of company you should work for as a new grad. Let’s take a look at salaries as they relate to company size. S: fewer than 50 employees (small); M: 50 to 250 employees (medium); L: more than 250 employees (large).
salarySize <- salaries %>% filter(company_location == "US")
salarySize$company_size <- factor(salarySize$company_size, levels=c("S","M","L"))
# Plot salarySize (not salaryUS) so the S, M, L factor ordering set above is applied.
ggTime <- ggplot(data = salarySize, mapping = aes(x = company_size, y = salary_in_usd/1000, color = experience_level)) + geom_point(alpha = 0.3) + theme_minimal() + ylab("Salary in USD in Thousands") + xlab("Company Size") + labs(title = "Company size vs Salary (USD)")
ggplotly(ggTime)
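The `factor()` call above matters because R otherwise sorts character values alphabetically, which would put the sizes in the order L, M, S. Here is a small base R sketch of that behaviour, with a median per size as a numeric companion to the plot. The vectors `sizes` and `salary_usd` are made up for illustration and are not taken from the dataset.

```r
# Hypothetical company sizes and salaries; invented values for illustration only.
sizes      <- c("L", "S", "M", "L", "S")
salary_usd <- c(100000, 60000, 80000, 120000, 70000)

# Default factor levels are alphabetical.
default_levels <- levels(factor(sizes))

# Explicit levels give the natural small-to-large order.
size_factor <- factor(sizes, levels = c("S", "M", "L"))

# tapply() returns one median per level, in level order.
median_by_size <- tapply(salary_usd, size_factor, median)

print(default_levels)
print(median_by_size)
```

Any summary or plot built on `size_factor` follows the S, M, L order, which is why the plot needs to use the data frame whose `company_size` column was converted.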