library(tidyverse)

In 2017, Kaggle conducted an industry-wide survey to establish a comprehensive view of the state of data science and machine learning. The survey received over 16,000 responses and we learned a ton about who is working with data, what’s happening at the cutting edge of machine learning across industries, and how new data scientists can best break into the field.

“Most of our respondents were found primarily through Kaggle channels, like their email list, discussion forums and social media channels. The survey was live from August 7th to August 25th. The median response time for those who participated in the survey was 16.4 minutes. We allowed respondents to complete the survey at any time during that window. We received salary data by first asking respondents for their day-to-day currency, and then asking them to write in either their total compensation.”

The data provided here are a subset of the survey results provided by Kaggle, retaining observations with full responses for selected variables.

# From Kaggle: https://www.kaggle.com/datasets/kaggle/kaggle-survey-2017/
datascience <- read_csv("data/kaggle_survey_subset.csv", show_col_types = F) 

Exercises

Exercise 1

How many of the survey respondents are from the US?

datascience %>%
  filter(Country == _____)

Exercise 2

How many people think their title fits them perfectly or fine?

# multiple ways

Exercise 3

Display just the Gender, Age, and JobSatisfcation columns, and display them in descending order of Age (i.e. oldest to youngest).

Exercise 4

For respondents from "United States", create a frequency table of the number of people within each category of FormalEducation. Display the results in descending order so the most common observation is on top. What is the most common formal education level in this data? Are there any surprising results?

Exercise 5

For respondents from "United States", create a frequency table of the number of people within each combination of FormalEducation and WorkDataVisualizations.

Exercise 6

What proportion of respondents fall into each category of FormalEducation?

Exercise 7

Among respondents who are "Employed full-time" and use the "USD", calculate the mean, median, and maximum compensation (CompensationAmount) grouped by Gender. What do you notice? Is anything surprising to you?

Data dictionary

Below is the data dictionary for the subset of the Kaggle data data.

variable class description
Country character Home country of employee
Gender character Selected gender, one of “A different identity”, “Female”, “Male”, “Non-binary, genderqueer, or gender non-conforming”
Age double Age at time of survey
EmploymentStatus character One of “Employed full-time”, “Employed part-time”, or “Independent contractor, freelancer, or self-employed
EmployerIndustry character Industry of current employer
FormalEducation character Highest level of education
Major character College major
CompensationAmount double Total compensation
CompensationCurrency character 3-letter currency code for day-to-day currency
CurrentJobTitle character Job title
TitleFit character Assessment of how well title fits actual duties. One of “Fine”, “Perfectly”, “Poorly”
LanguageRecommendation character Recommended programming language
WorkDataVisualizations character Proportion of job dedicated to creating data visualizations, broken into pre-determined categories
JobSatisfaction character Rating of job satisfcation one scale of 1-10, where 1 is not satisfied and 10 is highly satisfied