Imagine having a dataset you need to train a prediction model on, but some of the feature values are missing. The good news is you don’t need to throw data away: you can impute it instead. Below are the steps to build an imputation pipeline. Github link here!
from random import randint
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import mlflow
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
Generate data
Since this is an example and I don’t want to get sued for using my company’s data, synthetic data it is :) This simulates a dataset from different pseudo-regions, each with its own characteristics. Real data will be much more varied, but I make the differences more obvious here so they are easy to see.
def generate_array_with_random_nan(lower_bound, upper_bound, size):
    a = np.random.randint(lower_bound, upper_bound+1, size=size).astype(float)
    mask = np.random.choice([1, 0], a.shape, p=[.1, .9]).astype(bool)
    a[mask] = np.nan
    return a
size = 6000
df_cbd = pd.DataFrame()
df_cbd['bed'] = generate_array_with_random_nan(1, 2, size)
df_cbd['bath'] = generate_array_with_random_nan(1, 2, size)
df_cbd['area_usable'] = np.random.randint(20, 40, size=size)
df_cbd['region'] = 'cbd'
df_suburb = pd.DataFrame()
df_suburb['bed'] = generate_array_with_random_nan(1, 4, size)
df_suburb['bath'] = generate_array_with_random_nan(1, 4, size)
df_suburb['area_usable'] = np.random.randint(30, 200, size=size)
df_suburb['region'] = 'suburb'
df = pd.concat([df_cbd, df_suburb])
df
| | bed | bath | area_usable | region |
|---|---|---|---|---|
| 0 | 2 | 1 | 33 | cbd |
| 1 | 1 | 2 | 23 | cbd |
| 2 | 1 | 2 | 33 | cbd |
| 3 | 2 | 1 | 26 | cbd |
| 4 | 2 | 1 | 28 | cbd |
| 5 | 2 | 2 | 36 | cbd |
| 6 | 1 | 2 | 38 | cbd |
| 7 | 2 | 1 | 23 | cbd |
| 8 | 2 | 1 | 36 | cbd |
| 9 | nan | 2 | 29 | cbd |
Report missing values
I also randomly removed some values to mimic real-world data (read: it’s never ready to use). Here we visualize the missing rate of each column.
def report_missing(df):
    cnts = []
    cnt_total = len(df)
    for col in df.columns:
        cnt_missing = df[col].isna().sum()
        print("col: {}, missing: {}%".format(col, 100.0 * cnt_missing / cnt_total))
        cnts.append({
            'column': col,
            'missing': 100.0 * cnt_missing / cnt_total
        })
    cnts_df = pd.DataFrame(cnts)
    sns.barplot(x=cnts_df.missing, y=cnts_df.column)
    return cnts_df
report_missing(df)
col: bed, missing: 10.266666666666667%
col: bath, missing: 9.616666666666667%
col: area_usable, missing: 0.0%
col: region, missing: 0.0%
Data exploration
Knowing the missing rate isn’t everything; it’s also a good idea to explore the data from other angles.
## missing bed per region
df[df.bed.isna()]["region"].value_counts(dropna=False)
cbd 634
suburb 598
Name: region, dtype: int64
## missing bath per region
df[df.bath.isna()]["region"].value_counts(dropna=False)
suburb 588
cbd 566
Name: region, dtype: int64
## explore region
df.region.value_counts()
suburb 6000
cbd 6000
Name: region, dtype: int64
## explore bed
df.bed.value_counts()
2.0 4050
1.0 4009
4.0 1393
3.0 1316
Name: bed, dtype: int64
## explore bath
df.bath.value_counts()
1.0 4142
2.0 4022
3.0 1393
4.0 1289
Name: bath, dtype: int64
Remove outliers
(wouldn’t want your model to have sub-par performance from skewed data :-P)
## remove outliers here
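The original leaves this step as a placeholder. One common approach is an IQR filter; here is a minimal sketch (the `remove_outliers_iqr` helper and the 1.5 multiplier are my assumptions, not part of the original):

```python
import pandas as pd

def remove_outliers_iqr(df, column, k=1.5):
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# toy frame: the 500 m2 listing is an obvious outlier and gets dropped
demo = pd.DataFrame({'area_usable': [30, 32, 35, 31, 33, 500]})
print(remove_outliers_iqr(demo, 'area_usable'))
```

For real listings data you would tune `k` per column rather than use 1.5 everywhere.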
Create synthetic columns
In this step, we create percentile, mean, and rank columns to give the model more features to learn from, so it can perform better :D
First, we compute aggregate percentiles for each groupby set, then add mean and rank columns.
synth_columns = {
    'bed': {
        "region_bath": ['region', 'bath']
    },
    'bath': {
        "region_bed": ['region', 'bed']
    }
}
for column, groupby_levels in synth_columns.items():
    for groupby_level_name, groupby_columns in groupby_levels.items():
        # percentile aggregates
        for pctl in [20, 50, 80, 90]:
            col_name = 'p{}|{}|{}'.format(pctl, groupby_level_name, column)
            print("calculating -- {}".format(col_name))
            df[col_name] = df[groupby_columns+[column]].fillna(0).groupby(groupby_columns)[column].transform(lambda x: x.quantile(pctl/100.0))
        # mean impute
        mean_impute = 'mean|{}|{}'.format(groupby_level_name, column)
        print("calculating -- {}".format(mean_impute))
        df[mean_impute] = df.groupby(groupby_columns)[column].transform('mean')
        # bed/bath rank
        rank_impute = 'rank|{}|{}'.format(groupby_level_name, column)
        print("calculating -- {}".format(rank_impute))
        df[rank_impute] = df.groupby(groupby_columns)[column].rank(method='dense', na_option='bottom')
calculating -- p20|region_bath|bed
calculating -- p50|region_bath|bed
calculating -- p80|region_bath|bed
calculating -- p90|region_bath|bed
calculating -- mean|region_bath|bed
calculating -- rank|region_bath|bed
calculating -- p20|region_bed|bath
calculating -- p50|region_bed|bath
calculating -- p80|region_bed|bath
calculating -- p90|region_bed|bath
calculating -- mean|region_bed|bath
calculating -- rank|region_bed|bath
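To make the key step above concrete: `groupby(...).transform` computes a group-level statistic and broadcasts it back onto every row of the group. A minimal example (the toy values are mine):

```python
import pandas as pd

toy = pd.DataFrame({
    'region': ['cbd', 'cbd', 'suburb', 'suburb'],
    'bed':    [1.0, 2.0, 3.0, 4.0],
})
# every row receives the median 'bed' of its own region
toy['p50|region|bed'] = toy.groupby('region')['bed'].transform(lambda x: x.quantile(0.5))
print(toy)  # cbd rows get 1.5, suburb rows get 3.5
```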
Coalesce values
In this step we fill in the missing values using the columns created in the previous step – impute time!!
def coalesce(df, columns):
    '''
    Return the first non-null value across `columns`, row by row
    (like SQL's COALESCE).
    Inputs:
        df: reference dataframe
        columns: columns to coalesce, in priority order
    Returns:
        df_tmp: pd.Series with nulls filled from later columns
    Example:
        df_tmp = pd.DataFrame({'a': [1,2,None,None,None,None],
                               'b': [None,6,None,8,9,None],
                               'c': [None,10,None,12,None,13]})
        df_tmp['new'] = coalesce(df_tmp, ['a','b','c'])
        print(df_tmp)
    '''
    df_tmp = df[columns[0]]
    for c in columns[1:]:
        df_tmp = df_tmp.fillna(df[c])
    return df_tmp
coalesce_columns = [
    'bed',
    'p50|region_bath|bed',
    # p50|GROUPBY_LESSER_WEIGHT|bed, ...
]
df["bed_imputed"] = coalesce(df, coalesce_columns)
coalesce_columns = [
    'bath',
    'p50|region_bed|bath',
    # p50|GROUPBY_LESSER_WEIGHT|bath, ...
]
df["bath_imputed"] = coalesce(df, coalesce_columns)
Report missing values (again)
After imputing the values, let’s see how much better we’re doing!
report_missing(df)
col: bed, missing: 10.266666666666667%
col: bath, missing: 9.616666666666667%
col: area_usable, missing: 0.0%
col: region, missing: 0.0%
col: p20|region_bath|bed, missing: 0.0%
col: p50|region_bath|bed, missing: 0.0%
col: p80|region_bath|bed, missing: 0.0%
col: p90|region_bath|bed, missing: 0.0%
col: mean|region_bath|bed, missing: 9.616666666666667%
col: rank|region_bath|bed, missing: 0.0%
col: p20|region_bed|bath, missing: 0.0%
col: p50|region_bed|bath, missing: 0.0%
col: p80|region_bed|bath, missing: 0.0%
col: p90|region_bed|bath, missing: 0.0%
col: mean|region_bed|bath, missing: 10.266666666666667%
col: rank|region_bed|bath, missing: 0.0%
col: bed_imputed, missing: 0.0%
col: bath_imputed, missing: 0.0%
Notice that the imputed columns have no missing values. Yay!
Assign partition
In this step, we partition the data into three sets: train, dev, and test. Normally we only split into train and test sets, but the additional “dev” set lets us tune and check that the model is neither overfitting nor underfitting before touching the test set.
## assign partition: last digit 0-5 -> train, 6-7 -> dev, 8-9 -> test (~60/20/20)
def assign_partition(x):
    if x in [0, 1, 2, 3, 4, 5]:
        return 0
    elif x in [6, 7]:
        return 1
    else:
        return 2
## assign random id
df['listing_id'] = [randint(1000000, 9999999) for _ in range(len(df))]
## hashing
df["hash_id"] = df["listing_id"].apply(lambda x: x % 10)
## assign partition
df["partition_id"] = df["hash_id"].apply(lambda x: assign_partition(x))
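As a quick stdlib-only sanity check (mine, not part of the original pipeline) that hashing on the last digit gives roughly a 60/20/20 split:

```python
from collections import Counter
from random import randint, seed

seed(0)
ids = [randint(1000000, 9999999) for _ in range(100000)]
# same rule as assign_partition: 0-5 -> train, 6-7 -> dev, 8-9 -> test
parts = Counter(0 if i % 10 <= 5 else (1 if i % 10 <= 7 else 2) for i in ids)
fractions = {p: parts[p] / len(ids) for p in (0, 1, 2)}
print(fractions)  # roughly {0: 0.6, 1: 0.2, 2: 0.2}
```

Hashing an id instead of sampling rows also means a listing lands in the same partition every time the pipeline reruns.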
## define columns group
y_column = 'area_usable'
categ_columns = ['region']
numer_columns = [
    'bed_imputed',
    'bath_imputed',
    'p20|region_bath|bed',
    'p50|region_bath|bed',
    'p80|region_bath|bed',
    'p90|region_bath|bed',
    'mean|region_bath|bed',
    'rank|region_bath|bed',
    'p20|region_bed|bath',
    'p50|region_bed|bath',
    'p80|region_bed|bath',
    'p90|region_bed|bath',
    'mean|region_bed|bath',
    'rank|region_bed|bath',
]
id_columns = [
    'listing_id',
    'hash_id',
    'partition_id'
]
## remove missing y
df = df.dropna(subset=[y_column])
## split into train-dev-test
df_train = df[df["partition_id"] == 0]
df_dev = df[df["partition_id"] == 1]
df_test = df[df["partition_id"] == 2]
## split each set into x and y
y_train = df_train[y_column].values
df_train = df_train[numer_columns+categ_columns]
y_dev = df_dev[y_column].values
df_dev = df_dev[numer_columns+categ_columns]
y_test = df_test[y_column].values
df_test = df_test[numer_columns+categ_columns]
Create sklearn pipelines
In this step, we chain a few pipelines together to process the dataset a final time. In this example, we use median imputation followed by a standard scaler for numeric columns, and mode imputation followed by one-hot encoding for categorical columns.
## define pipelines
impute_median = SimpleImputer(strategy='median')
impute_mode = SimpleImputer(strategy='most_frequent')
num_pipeline = Pipeline([
    ('impute_median', impute_median),
    ('std_scaler', StandardScaler()),
])
categ_pipeline = Pipeline([
    ('impute_mode', impute_mode),
    ('categ_1hot', OneHotEncoder(handle_unknown='ignore')),
])
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, numer_columns),
    ("cat", categ_pipeline, categ_columns),
])
## fit and transform
X_train = full_pipeline.fit_transform(df_train)
X_dev = full_pipeline.transform(df_dev)
X_test = full_pipeline.transform(df_test)
X_train
array([[ 0.04673184, 0.06391404, 0. , ..., -0.16000115,
1. , 0. ],
[-0.97000929, -0.97263688, 0. , ..., -1.01065389,
1. , 0. ],
[ 0.04673184, 0.06391404, 0. , ..., -0.16000115,
1. , 0. ],
...,
[-0.97000929, 1.10046497, 0. , ..., 0.69065159,
0. , 1. ],
[ 0.04673184, 1.10046497, 0. , ..., 0.69065159,
0. , 1. ],
[ 1.06347297, 2.13701589, 0. , ..., 1.54130432,
0. , 1. ]])
Hyperparameter tuning
In this step, we try to use different models and parameters to see which performs the best. We utilize mlflow for logging and hyperopt to help with tuning. In this example, we run the trials for 40 iterations, each using a different combination of model and parameters.
## mlflow + hyperopt combo
def objective(params):
    regressor_type = params['type']
    del params['type']
    if regressor_type == 'gradient_boosting_regression':
        estimator = GradientBoostingRegressor(**params)
    elif regressor_type == 'random_forest_regression':
        estimator = RandomForestRegressor(**params)
    elif regressor_type == 'extra_trees_regression':
        estimator = ExtraTreesRegressor(**params)
    elif regressor_type == 'decision_tree_regression':
        estimator = DecisionTreeRegressor(**params)
    else:
        return 0
    estimator.fit(X_train, y_train)
    # mae
    y_dev_hat = estimator.predict(X_dev)
    mae = median_absolute_error(y_dev, y_dev_hat)
    # logging
    with mlflow.start_run():
        mlflow.log_param("regressor", estimator.__class__.__name__)
        mlflow.log_param('n_estimators', params.get('n_estimators'))
        mlflow.log_param('max_depth', params.get('max_depth'))
        mlflow.log_metric("median_absolute_error", mae)
    return {'loss': mae, 'status': STATUS_OK}
space = hp.choice('regressor_type', [
    {
        'type': 'gradient_boosting_regression',
        'n_estimators': hp.choice('n_estimators1', range(100, 200, 50)),
        'max_depth': hp.choice('max_depth1', range(10, 13, 1))
    },
    {
        'type': 'random_forest_regression',
        'n_estimators': hp.choice('n_estimators2', range(100, 200, 50)),
        'max_depth': hp.choice('max_depth2', range(3, 25, 1)),
        'n_jobs': -1
    },
    {
        'type': 'extra_trees_regression',
        'n_estimators': hp.choice('n_estimators3', range(100, 200, 50)),
        'max_depth': hp.choice('max_depth3', range(3, 10, 2))
    },
    {
        'type': 'decision_tree_regression',
        'max_depth': hp.choice('max_depth4', range(3, 10, 2))
    }
])
trials = Trials()
max_evals = 40
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=max_evals,
    trials=trials)
print("Found minimum after {} trials:".format(max_evals))
from pprint import pprint
pprint(best)
100%|██████████| 40/40 [00:19<00:00, 2.11trial/s, best loss: 8.569474762575908]
Found minimum after 40 trials:
{'max_depth2': 1, 'n_estimators2': 1, 'regressor_type': 1}
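Note that fmin reports hp.choice *indices*, not the actual values; hyperopt’s `space_eval(space, best)` decodes them for you. The lookup amounts to indexing into each choice list, which we can mimic by hand:

```python
# decode {'max_depth2': 1, 'n_estimators2': 1, 'regressor_type': 1} manually,
# using the same choice lists as the search space above
regressor_types = ['gradient_boosting_regression', 'random_forest_regression',
                   'extra_trees_regression', 'decision_tree_regression']
n_estimators_choices = list(range(100, 200, 50))  # [100, 150]
max_depth_choices = list(range(3, 25, 1))         # [3, 4, ..., 24]

best = {'max_depth2': 1, 'n_estimators2': 1, 'regressor_type': 1}
decoded = {
    'regressor': regressor_types[best['regressor_type']],
    'n_estimators': n_estimators_choices[best['n_estimators2']],
    'max_depth': max_depth_choices[best['max_depth2']],
}
print(decoded)  # random forest, n_estimators=150, max_depth=4
```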
Evaluate performance
Run “mlflow server” to see the logging dashboard. There, we can see that RandomForestRegressor has the best performance (the lower the MAE, the better) with max_depth=4 and n_estimators=150. Now let’s test the model’s performance against the held-out test set:
## use best params on TEST set
estimator = RandomForestRegressor(max_depth=4, n_estimators=150)
estimator.fit(X_train, y_train)
y_train_hat = estimator.predict(X_train)
train_mae = median_absolute_error(y_train, y_train_hat)
y_dev_hat = estimator.predict(X_dev)
dev_mae = median_absolute_error(y_dev, y_dev_hat)
y_test_hat = estimator.predict(X_test)
test_mae = median_absolute_error(y_test, y_test_hat)
mae = {
    'name': estimator.__class__.__name__,
    'train_mae': train_mae,
    'dev_mae': dev_mae,
    'test_mae': test_mae
}
mae = pd.DataFrame([mae]).set_index('name')
mae
| name | train_mae | dev_mae | test_mae |
|---|---|---|---|
| RandomForestRegressor | 8.930245 | 8.592484 | 8.729826 |
You’ll notice that we use median absolute error to measure performance. Other metrics are available, such as mean squared error, but in some cases it’s more meaningful to use a metric expressed in the data’s own units. Here the error on the dev and test sets is around 8 units; since area is normally measured in square meters, the prediction will typically be off by about 8 square meters.
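As a toy illustration of that units difference (the numbers below are made up): median absolute error stays in the target’s units, while mean squared error is in squared units.

```python
import numpy as np

y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([38.0, 33.0, 58.0, 52.0])

errors = y_true - y_pred
medae = np.median(np.abs(errors))  # same unit as the target: square meters
mse = np.mean(errors ** 2)         # squared units: (square meters)^2
print(medae, mse)  # 8.0 60.25
```

“Off by 8 square meters” is something a domain expert can judge directly; “60.25 squared meters squared” is not.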
PS: We applied the same process to data from https://baania.com/ and it was a success!
Update 2022-07-14: Baania now has opendata! Check it out at https://gobestimate.com/data.