Using supervised learning to classify drug consumption behavior¶
One of the first things that fascinated me about machine learning is the ability to predict the behavior of individuals. There are many uses for this capability, from predicting the likelihood that an individual will click on an ad to recommending follow-up mental health treatment based on risk of self-harm.
Dataset¶
Human behavior is delightfully complex. This complexity is reflected in this drug consumption dataset, which contains demographic info as well as scores from personality inventories.
This dataset is interesting to me because constructs from personality inventories can often seem abstract, so it's informative to explore how they relate to a concrete behavior. This dataset includes scores from the NEO personality inventory (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), the BIS-11 (impulsivity), and ImpSS (sensation seeking).
Last, it includes info on whether each individual had used a number of drugs.
Project Goals¶
In this project, I explored the ability of three supervised learning algorithms to predict whether an individual has used a specific drug. I explored and visualized parameter selection for each model, as well as how metrics can help us select the best model for a problem.
Summary of findings¶
Based on test-set accuracy (0.76) and ROC AUC (0.81), I found that the SVC model performed better than other models in predicting whether or not an individual had used cannabis.
Full project¶
Importing the data¶
import os
import sys
import pandas as pd
from urllib.request import urlretrieve
import warnings
warnings.filterwarnings('ignore')
PROJ_ROOT_DIR = os.getcwd()
DATA_PATH = os.path.join(PROJ_ROOT_DIR, "data")
DATA_DOWNLOAD_ROOT = "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/"
DATA_DOWNLOAD_URL = DATA_DOWNLOAD_ROOT + "drug_consumption.data"
def import_drug_data(data_url=DATA_DOWNLOAD_URL, data_path=DATA_PATH):
"""
A function to save the data to the data directory
"""
    # Create a directory for data if it doesn't exist
if not os.path.isdir(data_path):
os.makedirs(data_path)
# Create a file for the data
data_file = os.path.join(data_path, "drug_consumption.data")
# Save the data
urlretrieve(data_url, data_file)
def load_drug_data():
"""
    A function to load the data into the environment
"""
# Column headers:
header_list=['ID','Age','Gender','Education','Country', 'Ethnicity', 'Nscore','Escore','Oscore','Ascore','Cscore','ImpulsiveScore','SS','Alcohol','Amphet','Amyl','Benzos','Caff','Cannabis','Choc','Coke','Crack','Ecstacy','Heroin','Ketamine','LegalH','LSD','Meth','Mushroom','Nicotine','Semeron','VSA']
# Read in data
drug_df = pd.read_csv(os.path.join(DATA_PATH, "drug_consumption.data"), header=None, names=header_list)
return drug_df
import_drug_data()
drug_df = load_drug_data()
Let's take a look at the imported data
drug_df.head()
Processing the data¶
1. Changing drug consumption to be a binary outcome
I want to do binary classification, so I need to change the drug columns to be binary (0 or 1). In the raw data, each drug column has seven ordered usage categories, from CL0 ('never used') through CL6 ('used in the last day'); I map CL0 to 0 and CL1 through CL6 to 1.
import pandas as pd
def binarize_categorical_variable(df, column, yescat):
"""
Function to turn a categorical variable in a DataFrame into a 0 or 1 response
Arguments:
df = pandas DataFrame,
column = name of variable column as string
    yescat = list of category names (as strings) to map to 1
Returns: the original pandas df with new binarized variable"""
# Change column to category dtype so that we can access it with .categories method
df[column]=df[column].astype('category')
# Create list of categories in column
category_list = []
for cat in df[column].cat.categories:
category_list.append(cat)
# Create dictionary with 1s for yes categories and 0 for no categories
repl_dict = {}
for cat in category_list:
for yes in yescat:
if cat == yes:
repl_dict[cat] = 1
if repl_dict.get(cat) == None:
repl_dict[cat] = 0
# Replace original column in DataFrame
df[column] = df[column].replace(repl_dict)
return df
2. Changing Gender, Ethnicity, Education, and Country to be categorical variables
These features are currently encoded as real numbers. However, there is no ordering between the levels of these features, so they should be treated as discrete.
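For intuition, here's a tiny illustration (with made-up numeric codes standing in for two Gender levels) of how one-hot encoding, which I apply later with OneHotEncoder, treats such a feature as a set of discrete indicators rather than as an ordered number:
# Illustration only: hypothetical numeric codes for two Gender levels.
# One-hot encoding gives each level its own 0/1 column, so no ordering is implied.
example = pd.Series([0.5, -0.5, 0.5], name='Gender')
pd.get_dummies(example, prefix='Gender')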
def new_categories(df, column, newcat):
"""
    Function to replace the categories in a variable with new category names
Arguments:
df = a pandas DataFrame
column = name of variable column as string
newcat = a list of strings that name the new category names
    Output: a pandas DataFrame with the given column's categories renamed"""
# Change column to category dtype so that we can access it with .categories method
df[column]=df[column].astype('category')
# Create list of categories in this variable
category_list = []
for cat in df[column].cat.categories:
category_list.append(cat)
# Create a dictionary of new names for each old category name
repl_dict = {}
for i, cat in enumerate(category_list):
repl_dict[cat] = newcat[i]
df[column] = df[column].replace(repl_dict)
return df
3. The personality scores don't need to be processed
These features are scaled to a mean of 0 and standard deviation of 1 already.
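As a quick sanity check of that claim, I can look at the mean and standard deviation of these columns in the loaded drug_df (a small check, not part of the processing pipeline):
# The personality scores should already be (approximately) standardized
drug_df[['Nscore', 'Escore', 'Oscore', 'Ascore', 'Cscore', 'ImpulsiveScore', 'SS']].agg(['mean', 'std']).round(2)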
Below, I applied the above functions to clean up the data.
def cleanup_drug_df(df):
""" A function to clean up the drug dataframe
Argument: Pandas DataFame
Output: Cleaned up data as pandas DataFrame"""
# Binarize the drug columns
drug_df_clean = df
for column in drug_df.columns[13:]:
        drug_df_clean = binarize_categorical_variable(drug_df_clean, column, yescat=['CL1','CL2','CL3','CL4','CL5','CL6'])
    # Responses from individuals who said they had used an imaginary drug probably aren't reliable,
# so they are removed
drug_df_clean = drug_df_clean[drug_df_clean.Semeron==0]
new_categories_list = [
['18-24','25-34','35-44','45-54','55-64','65-100'],
['Female','Male'],
['Left school before age 16', 'Left school at age 16', 'Left school at age 17', 'Left school at age 18', 'Some college or university','Professional diploma/certificate', 'University degree', 'Masters degree', 'Doctorate degree'],
['USA','New Zealand','Other', 'Australia', 'Republic of Ireland','Canada', 'UK'],
['Black', 'Asian','White','Mixed Black/White','Other','Mixed White/Asian','Mixed Black/Asian']]
columns_to_replace = ['Age', 'Gender', 'Education', 'Country', 'Ethnicity']
for i, column in enumerate(columns_to_replace):
new_categories(drug_df_clean, column, new_categories_list[i])
return drug_df_clean
# Cleanup drug DataFrame
drug_df_clean=cleanup_drug_df(drug_df)
drug_df_clean.head()
Selecting a target variable for classification¶
I want to pick one that isn't too unevenly distributed, so here I checked the response counts for each drug:
for column in drug_df_clean.columns[13:]:
print(drug_df_clean[column].value_counts())
I decided to move forward with predicting Cannabis use because it isn't too unbalanced compared to some of the others. So, I created a new dataframe for classification with Cannabis included as the only drug.
Dealing with age¶
Also, I created two new columns, 'Upper_Age' and 'Lower_Age', which replace the Age column. Right now, the age column is an ordinal feature, so simple one-hot encoding would lose some information. We can retain more of that information by creating one feature for the lower bound of the age range and another for the upper bound (for example, '25-34' becomes Lower_Age = 25 and Upper_Age = 34). This strategy also preserves more information than collapsing each range to a single mean value.
def make_single_drug_df(df_full, drug):
"""
A function to select a single drug as the target variable, process age variable, and isolate features dataset
Arguments:
df_full = the full drug pandas Dataframe
drug = the name of the drug column of interest as a string
Returns:
a tuple that is the single drug pandas dataframe and the features dataframe
"""
cols = [
'ID','Age', 'Gender', 'Education', 'Country',
'Ethnicity', 'Nscore', 'Escore', 'Oscore',
'Ascore', 'Cscore', 'ImpulsiveScore','SS', drug]
df = df_full.loc[:,cols]
    # Create new features for the lower and upper bounds of the age range
df['Lower_Age'], df['Upper_Age'] = zip(*df['Age'].map(lambda x: x.split('-')))
# Convert lower and upper age to floats
df.Lower_Age=df.Lower_Age.astype('float')
df.Upper_Age=df.Upper_Age.astype('float')
# Make dataframe of features
features=df.drop(['ID','Age', drug], axis = 1)
return (df, features)
drug = 'Cannabis'
df, features = make_single_drug_df(df_full=drug_df_clean, drug= drug)
Exploring the dataset¶
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Look at the distribution of the numerical variables
df.drop(columns=['ID', drug]).hist(bins=50, figsize=(20,15));
- They are largely normally distributed
# Pairplot of numerical features
sns.pairplot(df, height = 4, vars=[
'Nscore', 'Escore', 'Oscore',
'Ascore', 'Cscore'], );
- Some appear to be slightly correlated, but none appear to represent the same construct
# Value counts for the categorical variables
col_list = ['Age', 'Gender', 'Education', 'Country', 'Ethnicity']
for feature in col_list:
print("Counts per response:\n", {
n: v for n, v in zip(df.loc[:,feature].value_counts().index, df.loc[:, feature].value_counts())
})
print("Shape of", drug, "data:", df.shape)
print("Counts per response:\n", {
n: v for n, v in zip(df[drug].value_counts().index, df[drug].value_counts())
}, "\nProportion each response:\n", {
n: v for n, v in zip(df[drug].value_counts().index, df[drug].value_counts() / len(df))
})
Because 64% of responses are in class 1, there would be a baseline accuracy of 0.64 if we predicted all responses to be class 1
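As a quick check of that baseline (a small sketch using the df and drug objects created above):
# Baseline accuracy: always predict the majority class
baseline_accuracy = (df[drug] == df[drug].mode()[0]).mean()
print("Baseline accuracy: {:.2f}".format(baseline_accuracy))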
Time to get ready for machine learning¶
First, I split the data into train and test sets for machine learning
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, df[drug], random_state= 42)
Now, for any model we are fitting, there is preprocessing to be done and a number of hyperparameters that need to be adjusted.
For preprocessing, we need to one-hot encode the Gender, Ethnicity, Education, and Country features. For some models, we would also need to scale the numerical features Upper_Age and Lower_Age.
To choose the most appropriate model hyperparameters, it's best to evaluate a number of models trained with various hyperparameters and compare their scores on a chosen metric.
By what metric should I select hyperparameters?¶
Because this data is slightly imbalanced, accuracy may not be the best evaluation metric (as noted, baseline accuracy would be 0.64). Another option is area under the ROC curve (ROC AUC). What does this metric convey?
Classifiers calculate a predicted probability and/or decision-function output for each observation, and observations above some threshold of this value are classified as the positive class. The ROC curve plots the true positive rate against the false positive rate across a range of thresholds, thereby capturing the predictive power of the model without depending on class predictions made at an arbitrary decision threshold. The area under the ROC curve is the probability that a randomly chosen observation from the positive class receives a higher score from the model than a randomly chosen observation from the negative class.
In scenarios where classes are unbalanced, you can't depend on the default decision threshold being appropriate. Therefore, using a metric that considers all thresholds can be useful.
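To make that ranking interpretation concrete, here is a small toy example with made-up labels and scores (not from this dataset):
from sklearn.metrics import roc_auc_score
import numpy as np
# Hypothetical labels and classifier scores
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
# 3 of the 4 positive/negative pairs are ranked correctly, so the AUC is 0.75
print(roc_auc_score(y_true, scores))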
The function below is a wrapper around a GridSearchCV object. It searches for the best parameters for a classifier, reports a number of metrics, and refits on a metric of choice. It takes as arguments the training data, a pipeline object that can include any preprocessing and model desired, and the parameter space to search. It also produces a heatmap of the cross-validation results so we can visualize whether we are searching a good region of the hyperparameter space.
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, average_precision_score, precision_score, roc_auc_score, accuracy_score
import numpy as np
# Wrapper to fit GridSearchCV optimized with different scores:
def grid_search_wrapper(X_train, y_train, pipe, param_grid, refit_score):
"""
Fits a GridSearchCV classifier that optimizes with refit_score and reports ROC AUC, precision score, and accuracy
Arguments:
X_train = training features
    y_train = training target
    pipe = a pipeline object that includes preprocessing steps and the chosen classifier
param_grid = a dict or list of dicts with keys as strings of the
parameter names and values are a list of possible values for the parameter
refit_score = string naming the score to refit by
Returns: tuple of GridSearchCV (for using for predictions) and results pandas DataFrame for evaluating the model
"""
# Scores to report in the results
scorers = {
        'roc_auc' : make_scorer(roc_auc_score, needs_threshold=True),  # score on decision_function output rather than hard class predictions
'precision_score' : make_scorer(precision_score),
'accuracy' : make_scorer(accuracy_score)
}
# Create a GridSearchCV instance optimized for refit_score
grid_search = GridSearchCV(pipe, param_grid, scoring=scorers, refit=refit_score, cv=5, return_train_score=True, n_jobs=-1)
# Fit on training data
grid_search.fit(X_train,y_train)
# Store results of each cross val
results = pd.DataFrame(grid_search.cv_results_)
# Make a list of the parameters to extract from results
subset_results=['mean_test_precision_score', 'mean_test_roc_auc', 'mean_test_accuracy']
# Initialize list of params and list of parameter dimensions for graphing
params = []
dim = []
# SVC has a 'list' type parameter grid (two kernels)
# this extracts the best kernel if the classifier used is SVC
# SVC must be named 'svc' for this to work
if 'svc__kernel' in grid_search.best_params_:
best_kernel = grid_search.best_params_['svc__kernel']
# add params to subset_results. If the classifier is SVC, only add the params for the best kernel
if type(param_grid) == list:
for grid in param_grid:
if grid['svc__kernel']==[best_kernel]:
for param in grid.keys():
# Save length of the param for the dimensions of the graph
dim.append(len(grid[param]))
# Save param for graphing later
if param not in params:
params.append(param)
# take out dimension of kernel
dim = dim[1:]
else:
for param in grid.keys():
# Save param for graphing later
if param not in params:
params.append(param)
# Take out kernel param
params=params[1:]
# Add the param to the list of column names in results
for param in params:
param_str='param_'+str(param)
if param_str not in subset_results:
subset_results.append(param_str)
# Selected results
results= results[results.param_svc__kernel==best_kernel].loc[:,subset_results].round(3).reindex()
else:
for param in param_grid.keys():
# Save param for graphing later
params.append(param)
# Save dimensions of param for graphing
dim.append(len(param_grid[param]))
param_str='param_'+str(param)
if param_str not in subset_results:
subset_results.append(param_str)
# Selected results
results = results.loc[:,subset_results].round(3).reindex()
#Print best params for the chosen score and the best cross-val score
print(
"Parmaeters when refit for {}".format(refit_score),
"\n {}".format(grid_search.best_params_),
"\nBest cross-val roc_auc score: {}".format(np.max(results.mean_test_roc_auc)),
"\nBest cross-val accuracy score: {:.2f}".format(np.max(results.mean_test_accuracy)))
# Store the name of the score to plot on a heat map
plot_score = 'mean_test_' + str(refit_score)
# Store average cross-val score of each combo of parameters in an array the size of the parameter grid space
scores = np.array(results[plot_score]).reshape(dim[0],dim[1])
# Plot the score of each combo of parameters on a heatmap
plt.figure(figsize=(10,6))
heatmap=sns.heatmap(scores, annot=True)
plt.title('Cross-val test scores for ' + str(refit_score))
plt.xlabel(params[1])
plt.ylabel(params[0])
plt.xticks(np.arange(dim[1]), results[subset_results[-1]].unique())
plt.yticks(np.arange(dim[0]), results[subset_results[-2]].unique())
plt.show()
# Return both the grid search object and the results DataFrame
return (grid_search, results)
1. Logistic Regression¶
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
# First, we make a column transformer to scale the numerical features, and one-hot encode the categorical ones
ct = ColumnTransformer([
("num", StandardScaler(), ['Upper_Age', 'Lower_Age']),
("cat", OneHotEncoder(handle_unknown = 'ignore', sparse=False), ['Gender', 'Education', 'Country', 'Ethnicity'])],
remainder='passthrough')
# Next, make a pipeline with the column transformer and the logistic regression instance
pipe = Pipeline([
("preparation", ct),
("logreg", LogisticRegression())
])
# Now, we make a parameter grid for the grid search to search over
# l1 and l2 are types of regularization; they restrict the model and prevent overfitting by keeping the coefficients near zero.
# l1 also forces some coefficients to be exactly 0 so that those features are ignored entirely
# C is the inverse of regularization strength: smaller values of C mean stronger regularization
param_grid = {
'logreg__C': [0.001, 0.01, 1, 100],
'logreg__penalty': ["l1", "l2"]
}
print("parameter grid:\n{}".format(param_grid))
# Run the grid search, refitting with the ROC AUC metric
grid_search, logreg_results = grid_search_wrapper(X_train, y_train, pipe=pipe, param_grid=param_grid, refit_score='roc_auc')
# Assign the full pipeline for the estimator selected by the grid search to its own object
logreg_pipe = grid_search.best_estimator_
An ROC-AUC score of 0.73 is ok, but not awesome. It's definitely worth seeing if another model would produce better results.
2. SVC (support vector classifier)¶
from sklearn.svm import SVC
# We can use the same ColumnTransformer instance as was used for Logistic regression
# We add the SVC to the pipeline
pipe = Pipeline([
("preparation", ct),
("svc", SVC())
])
# This parameter grid is a list of dicts instead of a dict
# The rbf kernel is used to project the data into a higher-dimensional feature space without actually computing the transformations
# Linear kernel keeps the features in linear space
# C is a regularization parameter, corresponding to the importance that all training samples are
# classified correctly, sacrificing the smoothness or margin of the decision function
# gamma defines how much influence a single training sample has on the decision function, i.e. the inverse
# of the radius of the influence of support vectors
param_grid = [
{'svc__kernel': ['rbf'], 'svc__C': [0.001, 0.01, 0.1, 1, 10, 100], 'svc__gamma': [0.001, 0.01, 0.1, 1, 10, 100]},
{'svc__kernel': ['linear'], 'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
# Run the grid search
grid_search, svc_results = grid_search_wrapper(X_train, y_train, pipe, param_grid, refit_score='roc_auc')
Because the largest scores are near the edge of the heat map, we'll adjust the parameter grid to see if better scores lie beyond the search space of the last grid search.
param_grid = [
{'svc__kernel': ['rbf'], 'svc__C': [0.1, 1, 10, 100, 1000, 10000], 'svc__gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]},
{'svc__kernel': ['linear'], 'svc__C': [0.001, 0.01, 0.1, 1, 10, 100]}]
grid_search, svc_results = grid_search_wrapper(X_train, y_train, pipe, param_grid, refit_score='roc_auc')
svc_pipe = grid_search.best_estimator_
This seems like a better parameter grid because the higher cross-val scores now fill the plot more evenly.
But, the best parameters are the same as the ones we found in the first grid search.
The best ROC AUC from cross validation is slightly higher! But still not awesome.
3. Random Forest classifier¶
A random forest (RF) averages the predictions of a number of decision tree classifiers, each built with some randomness injected into the tree-building process.
from sklearn.ensemble import RandomForestClassifier
# For RF, we don't need to scale the numerical features, so we'll use a different column transformer
ct = ColumnTransformer([
("cat", OneHotEncoder(handle_unknown = 'ignore', sparse=False), ['Gender', 'Education', 'Country', 'Ethnicity'])],
remainder='passthrough')
pipe = Pipeline([
("preparation", ct),
("rf", RandomForestClassifier(n_estimators=200, random_state=2))
])
# The parameters that will be searched across are max_depth and max_features
# max_depth limits the number of decisions that can happen in a tree; set lower to avoid overfitting
# max_features is the number of features randomly selected to fit each tree; set lower to include more randomness and avoid overfitting
# usually set to sqrt(n_features)
param_grid = {
'rf__max_depth': [5, 15, 25, 35, 45],
'rf__max_features': [4, 7, 10]
}
grid_search, rf_results = grid_search_wrapper(X_train, y_train, pipe=pipe, param_grid=param_grid, refit_score='roc_auc')
rf_pipe = grid_search.best_estimator_
The best ROC AUC score for cross validation isn't better than SVC.
Therefore, it looks like SVC is the best model.
Using one function to determine the best model¶
Alternatively, we can do a grid search between the models and parameters together, as shown here:
ct = ColumnTransformer([
("num", StandardScaler(), ['Upper_Age', 'Lower_Age']),
("cat", OneHotEncoder(handle_unknown = 'ignore', sparse=False), ['Gender', 'Education', 'Country', 'Ethnicity'])],
remainder='passthrough')
# Initialize pipe with column transformer and Log Reg
pipe = Pipeline([
("preparation", ct),
("classifier", LogisticRegression())
])
# Param grid with all possibilities for all models
param_grid = [
{'classifier': [LogisticRegression(solver='liblinear')],
'classifier__C': [0.001, 0.01, 1, 100],
'classifier__penalty': ["l1", "l2"]},
{'classifier': [LogisticRegression(solver='lbfgs', penalty="l2", max_iter=500)],
'classifier__C': [0.001, 0.01, 1, 100]},
{'classifier': [RandomForestClassifier(n_estimators=200, random_state=2)],
'classifier__max_depth': [15, 25, 35],
'classifier__max_features': [4, 7, 10]},
{'classifier': [SVC()],
'classifier__C': [0.1, 1, 10, 100, 1000, 10000],
'classifier__gamma': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]}
]
# For this grid search between all classifiers, I decided to refit on ROC AUC
clf_grid = GridSearchCV(pipe, param_grid, cv=5, scoring=make_scorer(roc_auc_score, needs_threshold=True))
clf_grid.fit(X_train, y_train)
print("Best classifier and params: \n{}\n".format(clf_grid.best_params_))
print("Best cross-val score: {:.2f}".format(clf_grid.best_score_))
This grid search also came up with the same conclusion: SVC is the best estimator, with the above params
Time to generate predictions!¶
Now that SVC was determined to be the best model, we can generate predictions on the test set and inspect their metrics.
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix, classification_report
print("Test set roc-auc for SVC: {:.2f}".format(roc_auc_score(y_test, clf_grid.decision_function(X_test))))
print("Test set accuracy score for SVC: {:.2f}".format(accuracy_score(y_test, clf_grid.predict(X_test))))
print("Metrics for SVC: \n", classification_report(y_test, svc_pipe.predict(X_test)))
cm=confusion_matrix(y_test, clf_grid.predict(X_test))
plt.figure(figsize=(10,6))
sns.heatmap(cm, annot=True, linewidth=0.5, fmt='g', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels');
One aspect to consider is that the classification is decided by applying a default threshold to the decision function or predicted probability. For example, in the case of a decision function, 0 is the default threshold: observations with positive values are classified as the positive class, those with negative values as the negative class, and the magnitude of the value reflects the model's confidence. As mentioned previously in the discussion of ROC AUC, that default threshold may not be optimal in certain cases, such as imbalanced classes.
The model with the best ROC AUC may still not perform well if this threshold of 0 is in a suboptimal place.
Because this model is still not achieving impressively high performance, I will take a look at the ROC curve to see whether the default threshold of the decision function is in the optimal place (i.e. high TPR with low FPR).
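To illustrate what moving the threshold would look like, here is a minimal sketch that classifies the test set at a non-default decision-function threshold (the value -0.2 is hypothetical, chosen only for illustration):
from sklearn.metrics import accuracy_score
# Classify as positive only when the decision function exceeds a custom threshold (the default is 0)
custom_threshold = -0.2
y_pred_custom = (svc_pipe.decision_function(X_test) > custom_threshold).astype(int)
print("Accuracy at threshold {}: {:.2f}".format(custom_threshold, accuracy_score(y_test, y_pred_custom)))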
Plotting ROC Curve¶
This plots the true positive rate against the false positive rate for every threshold value of the decision function:
from sklearn.metrics import roc_curve
# Extract false positive rate and true positive rate
fpr_svc, tpr_svc, thresholds_svc = roc_curve(y_test, svc_pipe.decision_function(X_test))
# Find the index of the threshold closest to 0 (the default decision threshold)
thresh_svc = np.argmin(np.abs(thresholds_svc))
def plot_roc():
# Plot FPR and TPR
plt.figure(figsize=(10,6))
plt.plot(fpr_svc, tpr_svc, label="ROC Curve")
plt.ylabel('TPR (recall)')
plt.xlabel('FPR')
# Plot threshold
plt.plot(fpr_svc[thresh_svc], tpr_svc[thresh_svc], 'o', markersize = 10, label = "threshold zero svc")
plt.legend(loc=4)
plt.title("ROC Curve SVC")
plt.show()
plot_roc()
Looks like the default threshold for the decision function is in a pretty good spot, because there is high TPR associated with low FPR. So this is probably close to the optimal decision threshold.
Plotting feature importances¶
Although the SVC was the highest performing model, I wanted to extract feature importances from the Random Forest model in order to get an idea of which features were most predictive in that context.
# Isolate the rf model from the pipeline
rf = rf_pipe.named_steps['rf']
# These are the column names for the numpy array produced by the column transformer
ct_index = [
'Gender_Female', 'Gender_Male', 'Education_Doctorate',
    'Education_age16', 'Education_age17', 'Education_age18',
'Education_before16', 'Education_Masters', 'Education_Professional_Diploma/Cert',
'Education_SomeCollege', 'Education_UniDegree', 'Country_Australia',
'Country_Canada', 'Country_NewZealand', 'Country_Other', 'Country_Ireland',
'Country_UK', 'Country_USA', 'Ethnicity_Asian', 'Ethnicity_Black', 'Ethnicity_Mixed_BlackAsian',
'Ethnicity_Mixed_BlackWhite', 'Ethnicity_Mixed_WhiteAsian', 'Ethnicity_Other',
'Ethnicity_White', 'Nscore', 'Escore', 'Oscore', 'Ascore', 'Cscore', 'ImpulsiveScore', 'SS',
'Lower_Age', 'Upper_Age']
# Create a data frame of feature importances for each feature in the transformed dataset
feature_importances = pd.DataFrame(rf.feature_importances_, index=ct_index, columns=['importance']).sort_values('importance', ascending=False)
# Plot feature importances
def plot_rf_feature_importance():
"""
A function to plot feature importances as a horizontal barplot
"""
plt.figure(figsize=(10,8))
n_features = len(ct_index)
plt.barh(np.arange(n_features), rf.feature_importances_, align='center')
plt.yticks(np.arange(len(ct_index)), ct_index, size= 10)
plt.ylabel("Feature")
plt.xlabel("Feature Importance")
plt.title("Feature Importances per Random Forest Classifier")
plt.show()
plot_rf_feature_importance()
It's interesting that all of the most important features are the personality measures rather than the demographic ones; it connects these abstract constructs to a concrete behavior.
Visualizing important features¶
As shown above, the most important features for determining which group each individual belonged to included age and scores on the personality inventories. Below, I plotted the distribution of these features for each group (used cannabis and never used cannabis).
The distribution of age in each group:
import plotly.graph_objects as go
fig = go.Figure()
# Both halves of the split violin share the same x position; y is the numeric lower bound of each age range
fig.add_trace(go.Violin(x=np.ones((df.Cannabis == 1).sum()),
                        y=df.Lower_Age[df.Cannabis == 1],
                        legendgroup='Used Cannabis', scalegroup='Used Cannabis', name='Used Cannabis',
                        side='positive',
                        line_color='blue')
)
fig.add_trace(go.Violin(x=np.ones((df.Cannabis == 0).sum()),
                        y=df.Lower_Age[df.Cannabis == 0],
                        legendgroup='Never used Cannabis', scalegroup='Never used Cannabis', name='Never used Cannabis',
                        side='negative',
                        line_color='purple')
)
fig.update_traces(meanline_visible=True)
fig.update_layout(template="plotly_white", title_text = "Age distribution in each group", violinmode='overlay', width=700, height =600)
fig.update_yaxes(title_text = "Age (years)",
ticktext=['18-24', '25-34', '35-44', '45-54', '55-64', '65-100'],
tickvals = [ 18., 25., 35., 45., 55., 65.,])
fig.update_xaxes(ticktext = [' '],
tickvals = [1])
fig.show()
The distribution of each personality score in each group:
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
scores = [['Nscore', 'Neuroticism score'],
['Escore', 'Extraversion score'],
['Oscore', 'Openness to experience score'],
['Ascore','Agreeableness score'],
['Cscore','Conscientiousness score'],
['ImpulsiveScore', 'Impulsivity score'],
['SS', 'Sensation seeking score']]
colors = ['#13eac9', '#0504aa']
for score in scores:
df_long = df[['ID', 'Cannabis', score[0]]].pivot(columns='Cannabis', values=score[0])
df_long = df_long[[0,1]].rename(columns={0:'Never used Cannabis', 1:'Used Cannabis'})
fig = ff.create_distplot([df_long[c].dropna() for c in df_long.columns],
df_long.columns.astype('str'),
bin_size=.2,
show_rug=False,
colors=colors)
fig.update_layout(template="plotly_white", width=700, height =350)
fig.update_traces(marker={"opacity": 0})
fig.update_xaxes(title_text = score[1])
fig.update_yaxes(title_text = 'Frequency')
fig.show()
These plots show that people who have used cannabis are, on average, younger, and have slightly lower neuroticism scores, slightly higher openness to experience scores, slightly lower conscientiousness scores, slightly higher impulsivity scores, and higher sensation-seeking scores.
Future¶
The dataset includes a multi-class response for each drug. If the target were binarized differently (perhaps including people who had only used the drug once in the negative class), would the model perform better? That is, is the binarization I chose the best one to reflect human personality factors and behavior? What about if the problem were treated as a multi-class classification problem?
This dataset also includes responses on numerous drugs. What is the best model to predict drug use for other drugs? Does SVC perform best for other targets in the same dataset or would I be able to find a higher-performing model?