Continued from Part 3 - Making Predictions

Right, so we've looked at predicting salary level based on the city the job is listed in, and we were a little better at predicting level than the baseline, but we are sure we can do better.

We did plenty of scraping, so we have plenty of data to use. Let's look at some other features we can think of that might be good predictors of salary level.

Prediction based on Job Title

We can look at words within the job title that may give a good indication of salary level. Let's create a few more predictor columns based on these words:

#flag titles containing each keyword (1 if present, 0 if not), on the same dataframe (df4) used below
df4['senior']=df4['title'].map(lambda x: 1 if 'senior' in x.lower() or 'sr.' in x.lower() else 0)
df4['junior']=df4['title'].map(lambda x: 1 if 'junior' in x.lower() or 'jr.' in x.lower() else 0)
df4['assistant']=df4['title'].map(lambda x: 1 if 'assistant' in x.lower() else 0)
df4['manager']=df4['title'].map(lambda x: 1 if 'manager' in x.lower() else 0)
df4['director']=df4['title'].map(lambda x: 1 if 'director' in x.lower() else 0)
df4['associate']=df4['title'].map(lambda x: 1 if 'associate' in x.lower() else 0)
df4['architect']=df4['title'].map(lambda x: 1 if 'architect' in x.lower() else 0)

seniority = ['senior','junior','assistant','manager','director','associate','architect','intercept']
X2 = df4[seniority]
#split our data into training and testing
X2_train,X2_test,y2_train,y2_test = train_test_split(X2,y,test_size = 0.2,random_state=6,stratify=y)

#perform Logistic Regression using sklearn
logreg = LogisticRegression()
#fit the model on the training split created above
logreg.fit(X2_train, y2_train)
#score the model on the test set
score = logreg.score(X2_test, y2_test)
#print the predictor and the coefficient
print(pd.DataFrame([(b,a) for a,b in zip(logreg.coef_[0],seniority)],columns=['title','coefficient']))
#print the score
print(score)
title       coefficient
senior      -1.516221
junior       0.187472
assistant    1.680311
manager      0.596676
director    -0.710076
associate    2.216260
architect   -1.637794
intercept    0.067391

As before, a negative coefficient indicates a predictor of a 'High' level. We can see that our model treats the presence of 'architect' in the title as the strongest indicator of 'High', and 'associate' as the strongest indicator of 'Low'.
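To make the size of these coefficients easier to read, they can be converted to odds ratios. The sketch below simply reuses the coefficients printed above. It assumes, as the sign convention in the text implies, that the model's positive class is the 'Low' level; that encoding comes from Part 3 and is not shown here.

```python
import math

# Coefficients copied from the table above (title keyword -> log-odds).
coefficients = {
    'senior': -1.516221,
    'junior': 0.187472,
    'assistant': 1.680311,
    'manager': 0.596676,
    'director': -0.710076,
    'associate': 2.216260,
    'architect': -1.637794,
}

# exp(coefficient) is the multiplicative change in the odds of the positive
# class (assumed here to be 'Low') when the keyword is present in the title.
odds_ratios = {word: math.exp(coef) for word, coef in coefficients.items()}

# Sort from "most indicative of 'High'" to "most indicative of 'Low'".
for word, ratio in sorted(odds_ratios.items(), key=lambda item: item[1]):
    print(f"{word:<10} {ratio:.2f}")
```

Under that assumption, an odds ratio below 1 pushes the prediction towards 'High' and one above 1 pushes it towards 'Low'.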

We get a better score (0.64) with this model than with the city alone, so we could have saved ourselves a lot of work there, but I'm still sure we can do better. Let's create some more features from our data.

Prediction by Type of Institution

Let's look at features in the company name that could give us a good prediction: whether the company is a university, a government body or a corporate entity:

#features to look for in company name
schools = ['university','college','uni','school','academy']
govs = ['government','city of','police']
institutions = schools + govs

#add columns to the dataframe using certain key elements to indicate type of institution
df4['edu'] = df4['company'].map(
    lambda x: 1 if any(school in x.lower() for school in schools) else 0)
df4['gov'] = df4['company'].map(
    lambda x: 1 if any(gov in x.lower() for gov in govs) else 0)
df4['corp'] = df4['company'].map(
    lambda x: 0 if any(inst in x.lower() for inst in institutions) else 1)

#create a dataframe of only institution types
X3 = df4[['edu','gov','corp']]

#Split the data into train and test sets
X3_train,X3_test,y3_train,y3_test = train_test_split(
    X3,y,test_size = 0.2,random_state=6,stratify=y)


#perform Logistic Regression using sklearn
logreg = LogisticRegression(random_state=6)
logreg.fit(X3_train, y3_train)

#cross-validate the model on our test set
scores = cross_val_score(logreg,X3_test,y3_test,cv=10)

#print the scores and coefficients
print(scores.mean())
print(logreg.coef_)
institution   coefficient
edu            1.430497
gov            0.836198
corp          -1.282344

Great, we get a score of 0.65. Our model isn't awful: it's still better than a guess, and we can see that working for a company is the best indicator of a 'High' salary level. This fits with our intuition about the job market, but it's always best to check these things.

Prediction based on Payment Period

We could look at the payment period for the job listing to see whether a yearly, monthly or hourly salary is a good predictor of level.

I'll save you the code snippet this time, as it's very similar to both of those above. We get a score of 0.63, and our coefficients are shown in the table below. We can see from the coefficients that an annual salary listing is a better predictor of a 'High' salary level than a monthly or hourly one. Whilst this might be true in this instance, it may be that the higher-paying consultancy jobs, which would generate larger monthly or hourly fees, are not listed on indeed.com.

Period    Coefficient
yearly    -1.029591
monthly    1.411035
hourly     0.367898
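Since the code is omitted here, the following is only a sketch of how such period features might be built, not the code behind the results above. It assumes a hypothetical 'salary' text column containing phrases such as 'a year', 'a month' or 'an hour'; the column name, the phrases and the toy rows are all assumptions.

```python
import pandas as pd

# Toy listings; the 'salary' column name and its wording are assumptions.
listings = pd.DataFrame({
    'salary': ['$120,000 a year', '$4,500 a month', '$35 an hour', '$90,000 a year'],
})

# Hypothetical phrases that mark each payment period.
period_markers = {'yearly': 'a year', 'monthly': 'a month', 'hourly': 'an hour'}

# One 0/1 column per payment period, in the same style as the earlier features.
for period, marker in period_markers.items():
    listings[period] = (
        listings['salary'].str.lower().str.contains(marker, regex=False).astype(int)
    )

print(listings[['yearly', 'monthly', 'hourly']])
```

The three new columns could then be split, fitted and scored in the same way as the institution features.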

More Models

So far we have used logistic regression for our predictions, but there are many (many!) more algorithms we can use to make predictions. Let's get modelling.

It would perhaps be foolish for us to believe we were domain masters, able to choose the best predictors for the most accurate model. We are guilty of prematurely optimising. Donald Knuth said it best in 1974:

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times".

Let's use our computers the way they were meant to be used. They don't get tired doing calculations, so let's throw everything at the problem and trust in algorithms to get the job done.
The best way (that I've been taught so far) is to use a gridsearch to cycle through the possible hyperparameter combinations for a model and keep the combination that gives the best accuracy. Thankfully we have many algorithms to try and plenty of hyperparameters for each. Let's see how accurate we can get.
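To make concrete what a gridsearch cycles through, here is a minimal pure-Python sketch. It is not how sklearn implements GridSearchCV; the small grid and the `evaluate` function are hypothetical stand-ins for a real model and its cross-validated accuracy.

```python
from itertools import product

# Hypothetical grid: every combination of these values gets tried.
param_grid = {
    'n_neighbors': [1, 5, 25],
    'weights': ['uniform', 'distance'],
}

def evaluate(params):
    """Stand-in for 'fit the model and return its cross-validated accuracy'."""
    # Made-up scoring rule, purely so the loop has something to compare.
    score = 0.5 + 0.01 * params['n_neighbors']
    if params['weights'] == 'distance':
        score += 0.02
    return score

# Build one dict of settings per combination of hyperparameter values.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Keep the combination with the highest score, as a gridsearch would.
best_params = max(combinations, key=evaluate)

print(len(combinations), 'combinations tried')
print('best:', best_params)
```

The number of fits grows multiplicatively with each hyperparameter added, which is why the searches in this post can take a while to run.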

K-Nearest Neighbours

I am going to try using a K-Nearest Neighbours classifier on the location to see if this might be a better estimator than logistic regression. I am going to make no assumptions about the best parameters to use to fit the model, so I will use a gridsearch to work through the problem and return the best combination of parameters, and the best performing model. I will then test the trained model on test data and see how well the model performs.

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

#split to train & test sets
X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.2,stratify=y,random_state=6)

#scale the predictors 
ss = StandardScaler()
X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)

#import the gridsearch crossvalidator
from sklearn.model_selection import GridSearchCV
#establish the possible hyperparameters for the model
knn_parameters = {
    'n_neighbors':range(1,101),
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']

}
#cycle through all combinations of hyperparameters
knn_gridsearch = GridSearchCV(KNeighborsClassifier(), 
                              knn_parameters, 
                              n_jobs=1, cv=5, verbose=1)

knn_gridsearch.fit(X_train_ss, y_train)
#return the best model from the search
knn_model = knn_gridsearch.best_estimator_
#cross-validate the best model's settings on our test set
#(note: cross_val_score refits a copy of the model on each fold)
scores = cross_val_score(knn_model,X_test_ss,y_test,cv=5)
#show us the mean accuracy score
scores.mean()
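One thing worth knowing about the last step: cross_val_score refits a fresh copy of the estimator on folds of whatever data it is given, so on the test set it evaluates the chosen hyperparameters rather than the model that was fitted on the training data. A sketch of the alternative, scoring the already-fitted best model directly on the held-out set, is below. It uses synthetic data from make_classification and a much smaller grid, both assumptions made only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scraped data (an assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=6)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=6)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_tr_ss = scaler.fit_transform(X_tr)
X_te_ss = scaler.transform(X_te)

# A deliberately small grid so the sketch runs quickly.
demo_search = GridSearchCV(
    KNeighborsClassifier(),
    {'n_neighbors': [3, 5, 11], 'weights': ['uniform', 'distance']},
    cv=5)
demo_search.fit(X_tr_ss, y_tr)

# Mean cross-validated accuracy of the best combination on the training folds.
cv_accuracy = demo_search.best_score_

# Accuracy of the refitted best model on data it has never seen.
test_accuracy = demo_search.best_estimator_.score(X_te_ss, y_te)

print(demo_search.best_params_, cv_accuracy, test_accuracy)
```

The two approaches answer slightly different questions, so their numbers are not directly comparable.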

The score using K-nearest neighbours for city is 0.60, so we haven't improved much over the logistic regression. Let's try this gridsearching on all of our predictors, and use regularisation to 'punish' predictors that are not useful for predicting the salary level.
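As a rough illustration of what 'punishing' predictors means, the sketch below fits an L1-penalised logistic regression on synthetic data (an assumption; it is not the scraped dataset) with a strong and a weak penalty, and counts how many coefficients are driven to exactly zero. In sklearn a smaller C means stronger regularisation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 predictors, only a few of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=3, n_redundant=0, random_state=6)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for C in (0.01, 10.0):
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
    model.fit(X_demo, y_demo)
    # Count the predictors whose coefficient was shrunk to exactly zero.
    zero_counts[C] = int(np.sum(model.coef_[0] == 0))

print(zero_counts)
```

With the strong penalty most coefficients should be removed entirely, which is the behaviour a gridsearch over C and the penalty type is trading off against accuracy.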

Gridsearch with Logistic Regression on all predictors we have created

#create a train/test split of the predictors and outcomes,
#and then scale the predictors
#(X is assumed to hold all of the predictor columns created so far)

ss = StandardScaler()
X_train,X_test,y_train,y_test = train_test_split(
    X,y,test_size=0.2,stratify=y,random_state=6)

X_train_ss = ss.fit_transform(X_train)
X_test_ss = ss.transform(X_test)


#establish the hyperparameters for a logistic regression model
gs_params = {'penalty':['l1','l2'],
             'solver':['liblinear'],
             'C':np.logspace(-3,1,50)}

#create your gridsearch object
lr_gridsearch = GridSearchCV(LogisticRegression(), 
                             gs_params, 
                             n_jobs=1, cv=5, verbose=1)

#fit the gridsearch objects on the training set
lr_gridsearch.fit(X_train_ss,y_train)

#take the best model from the gridsearch
lr_model = lr_gridsearch.best_estimator_

#cross validate the model on the test set 5 times
scores = cross_val_score(lr_model,X_test_ss,y_test,cv=5)

#take the mean accuracy score from the model as applied to the test sets
scores.mean()

Wow, we get a score of 0.72, the best yet! Looking at the classification report, we see that our precision for predicting 'high' is 70%.

Things are going well. Let's see if we can improve on our accuracy with different algorithms:
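The classification report mentioned above is not shown in the post. As a sketch of how one might be produced with sklearn's classification_report, the example below uses synthetic data and made-up 'high'/'low' labels (both assumptions); with the real data the report would be built from the test-set predictions of the fitted model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the predictors and salary level (an assumption).
X_demo, y_num = make_classification(n_samples=300, n_features=6, random_state=6)
y_demo = np.where(y_num == 1, 'high', 'low')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=6)

model = LogisticRegression(solver='liblinear').fit(X_tr, y_tr)
predictions = model.predict(X_te)

# Precision, recall and F1 for each class in one table.
print(classification_report(y_te, predictions))

# Precision for 'high': of the listings predicted 'high', the share that were.
high_precision = precision_score(y_te, predictions, pos_label='high')
print(high_precision)
```

Precision is worth reading alongside accuracy because two models with the same accuracy can make quite different kinds of mistakes.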

Ensemble Methods

So far we have been training one model on our data at a time and then testing the model on a test set that we split out from the original scraped data.
Ensemble methods build many different models and combine them into a final model. We will try two of them here. The Random Forest classifier builds many models that each have high variance and then, by averaging all of the models it generates, reduces the variance.
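The variance-reduction idea can be seen in a small simulation. This is a sketch only: it treats each 'model' as a noisy guess at a made-up true value and assumes the models are independent, which the trees in a real forest are not, so the reduction in practice is smaller.

```python
import random
import statistics

random.seed(6)

def noisy_estimate():
    """One high-variance 'model': a made-up true value of 0.7 plus noise."""
    return 0.7 + random.gauss(0, 0.2)

def averaged_estimate(n_models):
    """Average the outputs of n_models independent noisy models."""
    return statistics.mean(noisy_estimate() for _ in range(n_models))

# Repeat each experiment many times and compare the spread of the results.
single_variance = statistics.variance(noisy_estimate() for _ in range(2000))
ensemble_variance = statistics.variance(averaged_estimate(25) for _ in range(2000))

print(single_variance, ensemble_variance)
```

Averaging 25 independent estimates should cut the variance by a factor of roughly 25, while the value being estimated stays the same.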

Gridsearching with a Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

gs_params = {
    'n_estimators': range(1,15),
    'criterion': ['entropy','gini'],
    'max_features': ['sqrt','log2'],
    'max_depth': range(1,5),
}

rf_gridsearch = GridSearchCV(RandomForestClassifier(), 
                             gs_params, 
                             n_jobs=1, cv=5, verbose=1)

rf_gridsearch.fit(X_train_ss,y_train)
rf_model = rf_gridsearch.best_estimator_

scores = cross_val_score(rf_model,X_test_ss,y_test,cv=5)
print(scores.mean())

We get another 0.72 accuracy; however, our precision for predicting 'high' drops to 0.68.

Let's just try one more for fun.

Gridsearching with an AdaBoost Classifier

AdaBoost starts by making many highly biased models and then, by iterating and retraining on misclassified samples, converges on a model with lower bias.
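The reweighting step can be sketched for a single round. This follows the textbook form of discrete AdaBoost with +1/-1 labels; sklearn's AdaBoostClassifier differs in its details, and the labels and predictions here are made up for illustration.

```python
import math

# Toy labels (+1 / -1) and one weak model's predictions; both are made up.
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the second sample is misclassified

# Every sample starts with the same weight.
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of this weak model.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The model's say in the final vote: larger when its error is smaller.
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weight of misclassified samples, lower the rest, then normalise.
weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(weights)
weights = [w / total for w in weights]

print(round(error, 3), round(alpha, 3), [round(w, 3) for w in weights])
```

After this round the one misclassified sample carries half of the total weight, so the next weak model is pushed to get it right.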

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier()

ada_params = {
    'n_estimators': range(20,200,10),
    'learning_rate':[0.1,0.4]
}

ada_cv = GridSearchCV(clf,ada_params,n_jobs=1,cv=5,verbose=1)

ada_cv.fit(X_train_ss,y_train)
ada_model = ada_cv.best_estimator_
scores = cross_val_score(ada_model,X_test_ss, y_test)
scores.mean()

This time round, we have an accuracy score of 0.68, and a precision for predicting high of 0.69.

Conclusion

With the simple data we have scraped from indeed.com in the US, we can predict whether a data science salary falls above or below the national median with, at best, an accuracy of 72%. This is not bad going just from looking at the search index pages. We could take this further by looking at many more features from the job listing page itself, which we would hope would improve our model. We could also have scraped other job listing sites and used those results here too, again hoping to improve our model. We might also decide to change our dependent variable (the outcome): here we have used either side of the median, but we could have used the mean, or broken the salary down into a range of outcomes.

We have shown that with a little bit of data we can make many useful inferences and give ourselves a good general understanding of a real world scenario. We can also make predictions with a greater accuracy than just guesswork, and examine the process for making these predictions. There will always be more data, and there will always be plenty of ways to dissect and investigate what the data can tell us.
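The alternative outcome variables mentioned above could be built along these lines. This is a sketch with made-up salaries, not the scraped data, and the band names are arbitrary.

```python
import pandas as pd

# Made-up salaries standing in for the scraped values (an assumption).
salaries = pd.Series([45000, 60000, 72000, 85000, 98000, 110000, 130000, 250000])

# Outcome used in this project: either side of the median.
median_level = (salaries > salaries.median()).map({True: 'High', False: 'Low'})

# Alternative 1: either side of the mean.
mean_level = (salaries > salaries.mean()).map({True: 'High', False: 'Low'})

# Alternative 2: a range of outcomes, here four equally sized bands.
salary_band = pd.qcut(salaries, q=4, labels=['low', 'mid-low', 'mid-high', 'high'])

print(pd.DataFrame({'salary': salaries, 'median': median_level,
                    'mean': mean_level, 'band': salary_band}))
```

A single very large salary pulls the mean upwards, so fewer listings count as 'High' under the mean split than under the median split.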

This was project 4, approximately half way through the Data Science Immersive course at General Assembly in London.