By Domeda Duncan, Robert Gastineau, Randeep Singh, Devon Ankar (Contact: devon@devonankar.com)
Questions answered:
We chose the following three job portals to scrape data from: Monster, ZipRecruiter, and SimplyHired.
#imports
import pandas as pd
import numpy as np
from collections import Counter
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
#pull CSVs into dataframes
#main CSVs
monster = pd.read_csv('data/CleanedMonster.csv', sep='|', index_col=False, encoding='latin1')
ziprecruiter = pd.read_csv('data/cleanedziprec.csv', sep='|', index_col=False, encoding='latin1')
simplyhired = pd.read_csv('data/CleanedSimply.csv', sep='|', index_col=False, encoding='latin1')
#skills CSVs
monsterskills = pd.read_csv('data/WordCount_Monster.csv', sep='|', index_col=False, encoding='latin1')
ziprecruiterskills = pd.read_csv('data/WordCount_Zip.csv', sep='|', index_col=False, encoding='latin1')
simplyhiredskills = pd.read_csv('data/WordCount_Simply.csv', sep='|', index_col=False, encoding='latin1')
#check the shape of each portal's dataframe
monster.shape, ziprecruiter.shape, simplyhired.shape
#combine the portal dataframes into one dataframe (pd.concat replaces the removed DataFrame.append)
alldata = pd.concat([monster, ziprecruiter, simplyhired], ignore_index=True)
alldata.shape
Total: 2432 jobs
#combine the skills dataframes into one dataframe
allskills = pd.concat([monsterskills, ziprecruiterskills, simplyhiredskills], ignore_index=True)
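As a sanity check (a minimal sketch), the combined frames should contain exactly the rows of the three source frames:
#verify that no rows were lost or duplicated when combining
assert len(alldata) == len(monster) + len(ziprecruiter) + len(simplyhired)
assert len(allskills) == len(monsterskills) + len(ziprecruiterskills) + len(simplyhiredskills)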
Problems we discovered later in the analysis, corrected here at the beginning:
#city name replacements
alldata['City'] = alldata['City'].replace(['New York City','new york city','NEW YORK CITY'], 'New York')
alldata['City'] = alldata['City'].replace('South San Francisco', 'San Francisco')
#import state code reference table
statecodes = pd.read_csv('data/statecodes.csv', index_col=False)
#replace any full state names with state codes
alldata['State'] = alldata.State.replace(statecodes.set_index('State').Abbreviation.to_dict())
#replace any cut-off state names with state codes
alldata['State'] = alldata.State.replace(statecodes.set_index('StatewoFirst').Abbreviation.to_dict())
#fix State values that lost their first character during scraping, plus one malformed value
alldata['State'] = alldata['State'].replace('ome Based', 'Home Based')
alldata['State'] = alldata['State'].replace('emote', 'Remote')
alldata['State'] = alldata['State'].replace('one', 'None')
alldata['State'] = alldata['State'].replace('nited States', 'United States')
alldata['State'] = alldata['State'].replace('AZ, United States, AZ', 'AZ')
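A quick spot-check (a minimal sketch) confirms that the State replacements took effect:
#the most frequent State values should now be two-letter codes or the special labels above
alldata['State'].value_counts().head(20)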
Assumptions:
titlecounter = Counter(alldata['Title'])
len(titlecounter.keys())
There are 1450 unique job titles among 2432 jobs.
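Many of these titles likely differ only in capitalization or stray whitespace; a quick exploratory check (a sketch that does not modify alldata) is:
#count distinct titles after lowercasing and trimming whitespace
alldata['Title'].str.lower().str.strip().nunique()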
titlecount2 = pd.DataFrame([titlecounter])
titlecount = titlecount2.transpose()
plt.ticklabel_format(style='plain', axis='x')
titlecountplot=titlecount.nlargest(20,0)
sns.barplot(y=titlecountplot.index, x=titlecountplot[0])
plt.title('20 most common job titles within Data Science')
plt.ylabel('Job Titles')
plt.xlabel('Frequency/Count')
titlecountplot.head(10)
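The Counter-to-DataFrame-to-barplot pattern above is repeated below for states, cities, and skills; a small helper like the following sketch could replace the duplicated cells (the cells below keep the original code):
def plot_top_counts(values, n, title, ylabel):
    #count the values, keep the n most frequent, and draw a horizontal bar chart
    counts = pd.DataFrame([Counter(values)]).transpose()
    top = counts.nlargest(n, 0)
    plt.ticklabel_format(style='plain', axis='x')
    sns.barplot(y=top.index, x=top[0])
    plt.title(title)
    plt.ylabel(ylabel)
    plt.xlabel('Frequency/Count')
    return top
#example: plot_top_counts(alldata['Title'], 20, '20 most common job titles within Data Science', 'Job Titles')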
statecounter = Counter(alldata['State'])
len(statecounter.keys())
There are 55 unique values in the State field; the count exceeds 50 because non-state labels such as Remote, Home Based, and None also appear.
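A quick way to list the values that are not official state abbreviations (a sketch using the statecodes table loaded earlier):
#State values that do not match any official state abbreviation
set(alldata['State'].dropna()) - set(statecodes['Abbreviation'])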
Possibilities:
Assumptions:
statecount2 = pd.DataFrame([statecounter])
statecount = statecount2.transpose()
plt.ticklabel_format(style='plain', axis='x')
statecountplot=statecount.nlargest(25,0)
sns.barplot(y=statecountplot.index, x=statecountplot[0])
plt.title('Geographic distribution of data science jobs')
plt.ylabel('States')
plt.xlabel('Frequency/Count')
statecountplot.head(5)
Issues:
#add a city-state field to alldata
alldatawithcitystate = alldata.assign(city_state = alldata['City'] + "-" + alldata['State'])
citycounter = Counter(alldatawithcitystate['city_state'])
citycount2 = pd.DataFrame([citycounter])
citycount = citycount2.transpose()
plt.ticklabel_format(style='plain', axis='x')
citycountplot=citycount.nlargest(20,0)
sns.barplot(y=citycountplot.index, x=citycountplot[0])
plt.title('Geographic distribution of data science jobs')
plt.ylabel('Cities')
plt.xlabel('Frequency/Count')
Next, we cross-reference a publicly available table of city latitudes and longitudes so the postings can be plotted on a map.
#import coordinates csv file
coordinates = pd.read_csv('data/citycoordinates.csv', index_col=False)
#remove duplicates from coordinates
coordinates = coordinates.drop_duplicates(subset=['city','state','county'], keep='first')
#add a city-state field to coordinates
coordinates = coordinates.assign(city_state = coordinates['city'] + "-" + coordinates['state'])
#add coordinates to citycount (outer merge: cities missing from the lookup table keep NaN coordinates)
citycount = citycount.assign(city_state = citycount.index)
citycount = citycount.merge(coordinates, left_on='city_state', right_on='city_state', how='outer')
x2 = citycount['longitude']
y2 = citycount['latitude']
area = citycount[0]
colors = 'b'
alpha2 = 0.7
plt.scatter(x2, y2, s=area, c=colors, marker='o', alpha=alpha2, linewidths=1, edgecolors='face')
plt.title('Geographic distribution of data science jobs by coordinates')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()
Map of the US! Complete with small spots for Alaska (top left corner) and Hawaii (bottom left corner). We see again that jobs are concentrated in the Pacific Northwest, California, and the East Coast (NY, MA, VA, DC).
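To make the dense regions easier to identify, the scatter can be re-drawn with the five cities that have the most postings labeled (a sketch re-using the merged citycount frame above):
#re-draw the map and annotate the five cities with the most postings
plt.scatter(citycount['longitude'], citycount['latitude'], s=citycount[0], c='b', marker='o', alpha=0.7)
for _, row in citycount.nlargest(5, 0).iterrows():
    plt.annotate(row['city_state'], (row['longitude'], row['latitude']), fontsize=8)
plt.title('Geographic distribution of data science jobs by coordinates')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()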
Assumptions:
skillcounter = Counter(allskills['Words'])
len(skillcounter.keys())
skillcount2 = pd.DataFrame([skillcounter])
skillcount = skillcount2.transpose()
#exclude "SED" because it appears as a substring of many other words, which caused false positives
skillcount = skillcount[skillcount.index != 'SED']
plt.ticklabel_format(style='plain', axis='x')
skillcountplot=skillcount.nlargest(25,0)
sns.barplot(y=skillcountplot.index, x=skillcountplot[0])
plt.title('25 most in-demand skills within Data Science')
plt.ylabel('Skills')
plt.xlabel('Frequency/Count')
skillcountplot.head(10)
In this class we learned Python and Hadoop, but not SQL, Java, or Excel. So it is only partially true that the technologies taught in this class are in higher demand than older technologies or technologies suited to smaller datasets. Statistical packages also did not make the top 10 skills, with the exception of SAS at number 9.
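To see the counts behind this comparison, the relevant rows of skillcount can be pulled directly (a sketch that assumes the skill names in the Words column are stored in upper case, as "SED" was):
#counts for the technologies discussed above; NaN means that spelling was not found
skillcount.reindex(['PYTHON', 'HADOOP', 'SQL', 'JAVA', 'EXCEL', 'SAS'])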
df4=allskills[allskills['Words']!='SED']
df4.head()
df6=df4[['key','Words']]
df6.index=df6['key']
df6Dup = df6
# self-join the skills table on the job key, so each row pairs two skills listed in the same posting
df7 = df6.join(df6Dup, lsuffix='_1', rsuffix='_2')
# keep only pairs where Words_1 < Words_2, which drops self-pairs and counts each unordered pair once
df8=df7[df7['Words_1']<df7['Words_2']]
df8= df8.reset_index(drop=True)
df8['freqCombo']=df8.groupby(['Words_1','Words_2'])['Words_1'].transform('count')
df9 = df8[['Words_1','Words_2','freqCombo']]
df9 = df9.sort_values('freqCombo', ascending=False)
ComboFreq=df9.drop_duplicates(subset=['Words_1','Words_2'])
print(ComboFreq.head(20))
Above are the 20 pairs of skills that most often appear together in the same job posting.
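The same pairs can also be viewed as a heatmap (a sketch built from the ComboFreq frame above; limiting to the top 30 pairs is an arbitrary choice):
#pivot the most frequent pairs into a matrix and plot the co-occurrence counts
pair_matrix = ComboFreq.head(30).pivot(index='Words_1', columns='Words_2', values='freqCombo')
sns.heatmap(pair_matrix, annot=True, fmt='g', cmap='Blues')
plt.title('Co-occurrence counts for the most common skill pairs')
plt.show()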
Assumptions:
#create new dataframe with just the salary information
salaryfilter1 = alldata.filter(items=['Title','City', 'State', 'Min_Salary', 'Max_Salary'])
#filter out the nan and 0s
salaryfilter2 = salaryfilter1[salaryfilter1['Min_Salary'].notnull() & (salaryfilter1['Max_Salary'].notnull())
& (salaryfilter1['Min_Salary'] !=0) & (salaryfilter1['Max_Salary'] != 0)]
#& (salaryfilter1['Title'] == 'Data Scientist')
#let's see what the smallest salaries are
#salaryfilter2.sort_values(['Min_Salary'], ascending=True)
Salaries of just 2,000/year are almost certainly incorrect. Let's filter out salaries under 12,000/year (1,000/month), since Data Scientist positions requiring an MS generally pay far more than that.
salaryfilter2 = salaryfilter2[salaryfilter2['Min_Salary'] >= 12000]
#salaryfilter2.sort_values(['Min_Salary'], ascending=True)
#check the high end too
#salaryfilter2.sort_values(['Max_Salary'], ascending=False)
#highest Max_Salary is 300,000/year, which is probably correct
#min values for both min and max salary range
print(salaryfilter2['Min_Salary'].min(), "-", salaryfilter2['Max_Salary'].min())
#max values for both min and max salary range
print(salaryfilter2['Min_Salary'].max(), "-", salaryfilter2['Max_Salary'].max())
#filter on title = data scientist
salaryfilter3 = salaryfilter2[salaryfilter2['Title'] == 'Data Scientist']
#min values for both min and max salary range - for data scientist title only
print(salaryfilter3['Min_Salary'].min(), "-", salaryfilter3['Max_Salary'].min())
#max values for both min and max salary range - for data scientist title only
print(salaryfilter3['Min_Salary'].max(), "-", salaryfilter3['Max_Salary'].max())
For all job titles: the lowest minimum salary in the dataset is 12,000/year and the lowest maximum is 15,000/year; the highest minimum is 200,000/year and the highest maximum is 300,000/year.
For Data Scientist titles only: the lowest minimum is 20,000/year and the lowest maximum is 45,000/year; the highest minimum is 190,000/year and the highest maximum is 230,000/year.
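Because each posting reports a range, it can also help to summarize a single midpoint figure per posting (a sketch; the midpoint is just the mean of the posted minimum and maximum):
#distribution of salary-range midpoints across the filtered postings
salarymid = (salaryfilter2['Min_Salary'] + salaryfilter2['Max_Salary']) / 2
salarymid.describe().round(2)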
Next, we test this hypothesis by checking whether salary varies with location.
salaryfilter4 = salaryfilter2.filter(items=['City', 'State', 'Max_Salary'])
salaryfilter4 = salaryfilter4.assign(city_state = salaryfilter4['City'] + "-" + salaryfilter4['State'])
avgmaxsalaries = salaryfilter4.groupby('city_state', as_index=False).Max_Salary.mean().round(2)
plt.ticklabel_format(style='plain', axis='x')
salaryplot = avgmaxsalaries.nlargest(20,'Max_Salary')
sns.barplot(y=salaryplot['city_state'], x=salaryplot['Max_Salary'])
plt.title('Top 20 cities by salary')
plt.ylabel('Cities')
plt.xlabel('Average Max Salary')
avgmaxsalaries = avgmaxsalaries.merge(coordinates, left_on='city_state', right_on='city_state', how='outer')
x2 = avgmaxsalaries['longitude']
y2 = avgmaxsalaries['latitude']
area = avgmaxsalaries['Max_Salary']/4000 #without this factor, the numbers are too large and it overwhelms the map!
colors = 'g'
alpha2 = 0.5
plt.scatter(x2, y2, s=area, c=colors, marker='x', alpha=alpha2, linewidths=1, edgecolors='face')
plt.title('Geographic distribution of salaries by coordinates')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.show()
Note: We are using average salary for each city, so the number of jobs in each city (number of observations) does not weight the size of each bubble. For example, San Francisco's bubble is not bigger because there are more jobs there; it's bigger because the average salary is higher there.
However, without significance testing, we don't know if these differences in salaries are statistically significant.
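One way to check (a minimal sketch, assuming scipy is available, which is not imported elsewhere in this notebook) is a Kruskal-Wallis test of Max_Salary across states, used here instead of a one-way ANOVA because salary data are typically skewed:
from scipy import stats
#compare Max_Salary across states, keeping only states with at least 5 filtered postings
groups = [g['Max_Salary'].values for _, g in salaryfilter2.groupby('State') if len(g) >= 5]
stat, pvalue = stats.kruskal(*groups)
print('Kruskal-Wallis H =', round(stat, 2), ', p =', round(pvalue, 4))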