Collaboration model for data science projects

Many data science teams are struggling with implementing end-to-end machine learning projects. It’s a very common phenomenon, so if you are experiencing this, you are not alone. Having worked in every stage of data science project lifecycle, in addition to normal web services deployments, this is what I think how we should collaborate. Collaboration model between teams Note: The diagram does not signify order of communication. Rather, it states the communication pathways between teams. ...

January 20, 2024 · 2 min · Karn Wong

Should data scientists deploy models to production?

Over the years I’ve heard stories of data teams struggling with deploying machine learning models to production. Clearly there is a pattern here. This article is my reflection on the matter. So what’s the problem? Data scientists, by definition, create mathematical models from data so some unknowns can become known. This is colloquially known as “prediction.” For example, if you have sales data from last year, you can use it to forecast sales performance of next year. ...

December 30, 2023 · 2 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, spark is often a recommended solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local spark development from scratch involves multiple steps, and definitely not for a faint of heart. Thankfully using docker means you can skip a lot of steps 😃 Instructions Install Docker Desktop Create docker-compose.yml in a directory somewhere version: "3.3" services: pyspark: container_name: pyspark image: jupyter/pyspark-notebook:latest ports: - "8888:8888" volumes: - ./:/home/jovyan/work ...

December 21, 2021 · 3 min · Karn Wong

Impute pipelines

Imagine having a dataset that you need to use for training a prediction model, but some of the features are missing. The good news is you don’t need to throw some data away, just have to impute them. Below are steps you can take in order to create an imputation pipeline. Github link here! from random import randint import pandas as pd import numpy as np from sklearn.preprocessing import OneHotEncoder from sklearn.impute import SimpleImputer from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline from sklearn.compose import ColumnTransformer from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor from sklearn.tree import DecisionTreeRegressor from sklearn.metrics import mean_squared_error, median_absolute_error from hyperopt import fmin, tpe, hp, Trials, STATUS_OK import mlflow import matplotlib.pyplot as plt import seaborn as sns sns.set() ...

May 22, 2020 · 8 min · Karn Wong