Collaboration model for data science projects

Many data science teams struggle with implementing end-to-end machine learning projects. It’s a very common phenomenon, so if you are experiencing this, you are not alone. Having worked in every stage of the data science project lifecycle, in addition to normal web service deployments, this is how I think we should collaborate. Collaboration model between teams. Note: the diagram does not signify the order of communication; rather, it shows the communication pathways between teams. ...

January 20, 2024 · 2 min · Karn Wong

Should data scientists deploy models to production?

Over the years I’ve heard stories of data teams struggling to deploy machine learning models to production. Clearly there is a pattern here, and this article is my reflection on the matter. So what’s the problem? Data scientists, by definition, create mathematical models from data so that some unknowns can become known. This is colloquially known as “prediction.” For example, if you have sales data from last year, you can use it to forecast next year’s sales performance. ...
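To make the sales example concrete, here is a minimal sketch of what such a forecast could look like, using made-up monthly figures and a plain linear trend (an illustration only, not taken from the article):

```python
# Minimal sketch: fit a trend on last year's monthly sales and
# extrapolate it to next year. All numbers are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

months = np.arange(1, 13).reshape(-1, 1)  # Jan..Dec of last year
sales = np.array([10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 22])  # in $k

model = LinearRegression().fit(months, sales)

next_year = np.arange(13, 25).reshape(-1, 1)  # Jan..Dec of next year
forecast = model.predict(next_year)
print(forecast.round(1))
```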

December 30, 2023 · 2 min · Karn Wong

Use pyspark locally with docker

For data that doesn’t fit into memory, Spark is often a recommended solution, since it can utilize map-reduce to work with data in a distributed manner. However, setting up local Spark development from scratch involves multiple steps, and it’s definitely not for the faint of heart. Thankfully, using Docker means you can skip a lot of steps 😃

Instructions

1. Install Docker Desktop.
2. Create docker-compose.yml in a directory somewhere:

```yaml
version: "3.3"
services:
  pyspark:
    container_name: pyspark
    image: jupyter/pyspark-notebook:latest
    ports:
      - "8888:8888"
    volumes:
      - ./:/home/jovyan/work
```

3. Run docker-compose up from the same folder where the above file is located. You should see something like the log below. It’s the same as running jupyter notebook locally. Click the link at the end to access Jupyter Notebook.

```text
Creating pyspark ... done
Attaching to pyspark
pyspark | WARNING: Jupyter Notebook deprecation notice https://github.com/jupyter/docker-stacks#jupyter-notebook-deprecation-notice.
pyspark | Entered start.sh with args: jupyter notebook
pyspark | /usr/local/bin/start.sh: running hooks in /usr/local/bin/before-notebook.d as uid / gid: 1000 / 100
pyspark | /usr/local/bin/start.sh: running script /usr/local/bin/before-notebook.d/spark-config.sh
pyspark | /usr/local/bin/start.sh: done running hooks in /usr/local/bin/before-notebook.d
pyspark | Executing the command: jupyter notebook
pyspark | [I 12:36:04.395 NotebookApp] Writing notebook server cookie secret to /home/jovyan/.local/share/jupyter/runtime/notebook_cookie_secret
pyspark | [W 2021-12-21 12:36:05.487 LabApp] 'ip' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [W 2021-12-21 12:36:05.488 LabApp] 'port' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
pyspark | [I 2021-12-21 12:36:05.497 LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.9/site-packages/jupyterlab
pyspark | [I 2021-12-21 12:36:05.498 LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
pyspark | [I 12:36:05.504 NotebookApp] Serving notebooks from local directory: /home/jovyan
pyspark | [I 12:36:05.504 NotebookApp] Jupyter Notebook 6.4.6 is running at:
pyspark | [I 12:36:05.504 NotebookApp] http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp]  or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark | [I 12:36:05.504 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
pyspark | [C 12:36:05.509 NotebookApp]
pyspark |
pyspark |     To access the notebook, open this file in a browser:
pyspark |         file:///home/jovyan/.local/share/jupyter/runtime/nbserver-7-open.html
pyspark |     Or copy and paste one of these URLs:
pyspark |         http://bd20652c22d3:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
pyspark |      or http://127.0.0.1:8888/?token=408f2020435dfb489c8d2710736a83f7a3c7cd969b3a1629
```

This snippet ...
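Once the notebook is running, a quick way to check that Spark works inside the container is to start a session and read a file from the mounted work directory. This is a minimal sketch; the data.csv filename is a hypothetical placeholder for whatever file you drop into the mounted folder:

```python
# Minimal sketch: run inside the Jupyter notebook served by the container.
# The host folder containing docker-compose.yml is mounted at /home/jovyan/work,
# so any file placed there is visible to Spark. "data.csv" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-pyspark").getOrCreate()

df = spark.read.csv("work/data.csv", header=True, inferSchema=True)
df.printSchema()
print(df.count())
```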

December 21, 2021 · 3 min · Karn Wong

Impute pipelines

Imagine having a dataset that you need to use for training a prediction model, but some of the features are missing. The good news is you don’t need to throw any data away; you just have to impute the missing values. Below are the steps you can take to create an imputation pipeline. Github link here!

```python
from random import randint

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, median_absolute_error
from hyperopt import fmin, tpe, hp, Trials, STATUS_OK
import mlflow
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
```

Generate data

Since this is an example and I don’t want to get sued for using my company’s data, synthetic data it is :) This simulates a dataset from different pseudo-regions, each with different characteristics. Real data will be much more varied, but I’ve made the differences more obvious so they’re easy to see. ...
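As a rough idea of where those imports lead, here is a minimal sketch of an imputation pipeline built from SimpleImputer, OneHotEncoder, StandardScaler, and ColumnTransformer. The column names and imputation strategies are assumptions for illustration, not the actual synthetic features from the post:

```python
# Minimal sketch of an imputation + preprocessing pipeline.
# Column names ("region", "income", "age") are hypothetical placeholders.
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestRegressor

numeric_features = ["income", "age"]
categorical_features = ["region"]

# Fill missing numeric values with the median, then scale.
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Fill missing categories with the most frequent value, then one-hot encode.
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

model = Pipeline([
    ("preprocess", preprocessor),
    ("regressor", RandomForestRegressor(n_estimators=100, random_state=42)),
])

# model.fit(X_train, y_train) once you have a training frame with these columns.
```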

May 22, 2020 · 8 min · Karn Wong