Imagine you have a dataset you need for training a prediction model, but some of the feature values are missing. The good news is you don't have to throw any data away; you can impute the missing values instead. Below are the steps to build an imputation pipeline. GitHub link here!
```python
from random import randint

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import mlflow
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
```

## Generate data

Since this is an example and I don't want to get sued for using my company's data, synthetic data it is :) This simulates a dataset drawn from several pseudo-regions, each with its own characteristics. Real data will be far more varied, but I've made the differences deliberately obvious so they're easy to spot.
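To make that concrete, here is a minimal sketch of the kind of generator described above. The region names, feature names, distribution parameters, and the 10% missingness rate are all illustrative assumptions of mine, not the original generator:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

n_per_region = 500
frames = []
# Each pseudo-region gets deliberately different characteristics
# (means are illustrative assumptions) so the differences are obvious.
for region, (age_mu, income_mu) in {
    "north": (30, 40_000),
    "south": (45, 55_000),
    "east": (38, 70_000),
}.items():
    frames.append(pd.DataFrame({
        "region": region,
        "age": rng.normal(age_mu, 5, n_per_region),
        "income": rng.normal(income_mu, 8_000, n_per_region),
    }))

df = pd.concat(frames, ignore_index=True)

# Knock out ~10% of the numeric values at random to simulate the
# missing features the imputation pipeline will have to fill in.
for col in ["age", "income"]:
    mask = rng.random(len(df)) < 0.10
    df.loc[mask, col] = np.nan

print(df.isna().mean())  # fraction of missing values per column
```

Keeping the per-region distributions far apart like this makes it easy to verify later that an imputer which conditions on `region` beats one that only uses a global mean.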
...