Terraform with ECS task on EC2 backend

Previously I wrote about setting up an ECS task on the Fargate backend, but we can also use EC2 as the backend. This makes sense when the workload is consistent (i.e. scaling is not required), since EC2 is cheaper than Fargate, even more so if you have reserved instances on top. There are a few modifications from the Fargate version to make it work with the EC2 backend; if you are curious you can try to hunt those down 😎. Repo here. ...
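The post's actual Terraform lives in the repo, but as a rough sketch of the main switch: the service targets the EC2 launch type instead of Fargate, so tasks land on container instances you run yourself. The boto3 call below is only an illustration of that difference, with made-up cluster, service, and task definition names.

```python
import boto3  # assumes AWS credentials are already configured

ecs = boto3.client("ecs")

# Same task definition flow as the Fargate version, but the service uses the
# EC2 launch type, so tasks are placed on EC2 instances joined to the cluster.
ecs.create_service(
    cluster="my-cluster",       # hypothetical cluster name
    serviceName="my-app",       # hypothetical service name
    taskDefinition="my-app:1",  # family:revision registered beforehand
    desiredCount=1,
    launchType="EC2",           # "FARGATE" in the Fargate version
)
```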

October 4, 2022 · 2 min · Karn Wong

Intro to Dagster Cloud

Imagine you have a few data pipelines to schedule. The simplest solution would be a cron job. Time goes by, and the next thing you know, you have around 50 pipelines to manage. The fun starts when you have to hunt down which pipeline didn't run normally, and by then it would be super hard to trace issues if you haven't set up logging and monitoring. Luckily there are tools we can use to improve the situation. Task orchestrators were born exactly for this: to schedule and monitor pipelines. These days there are more bells and whistles, such as backfilling and sensor triggers. Some also integrate with data catalog tools and provide table specs and data lineage. ...
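To make "schedule and monitor" concrete, here is a minimal sketch of a pipeline plus a cron schedule using Dagster's Python API (current releases; the op and job names are invented, not taken from the post):

```python
from dagster import Definitions, ScheduleDefinition, job, op


@op
def extract():
    # pretend this pulls rows from a source system
    return [1, 2, 3]


@op
def load(rows):
    # pretend this writes rows to a warehouse
    print(f"loaded {len(rows)} rows")


@job
def daily_pipeline():
    load(extract())


# Dagster (or Dagster Cloud) picks this up and runs the job every day at 06:00,
# with run history, logs, and alerting handled by the orchestrator.
defs = Definitions(
    jobs=[daily_pipeline],
    schedules=[ScheduleDefinition(job=daily_pipeline, cron_schedule="0 6 * * *")],
)
```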

September 27, 2022 · 3 min · Karn Wong

Intro to Pulumi

For IaC, there is no doubt that Terraform is the leader, but there are alternatives too, one of them being Pulumi. Currently Pulumi provides fun challenges to get started with their services. Best of all, they give you swag too! We are going to create a simple Pulumi project for hosting a static site through the CloudFront CDN. Challenge URL: https://www.pulumi.com/challenge/startup-in-a-box/ Prerequisites: a Pulumi account, a Checkly account, an AWS account, and the Pulumi CLI (brew install pulumi/tap/pulumi). Steps: init the Pulumi project ...
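The excerpt cuts off before the actual steps, but a minimal sketch of the core resources in Pulumi's Python SDK looks roughly like this. Resource names are placeholders, and the real challenge also wires in Checkly checks, which are omitted here.

```python
import pulumi
import pulumi_aws as aws

# S3 bucket configured for static website hosting
bucket = aws.s3.Bucket(
    "site-bucket",
    website=aws.s3.BucketWebsiteArgs(index_document="index.html"),
)

# CloudFront distribution serving the bucket's website endpoint
cdn = aws.cloudfront.Distribution(
    "site-cdn",
    enabled=True,
    origins=[
        aws.cloudfront.DistributionOriginArgs(
            origin_id="s3-website",
            domain_name=bucket.website_endpoint,
            custom_origin_config=aws.cloudfront.DistributionOriginCustomOriginConfigArgs(
                http_port=80,
                https_port=443,
                origin_protocol_policy="http-only",
                origin_ssl_protocols=["TLSv1.2"],
            ),
        )
    ],
    default_cache_behavior=aws.cloudfront.DistributionDefaultCacheBehaviorArgs(
        target_origin_id="s3-website",
        viewer_protocol_policy="redirect-to-https",
        allowed_methods=["GET", "HEAD"],
        cached_methods=["GET", "HEAD"],
        forwarded_values=aws.cloudfront.DistributionDefaultCacheBehaviorForwardedValuesArgs(
            query_string=False,
            cookies=aws.cloudfront.DistributionDefaultCacheBehaviorForwardedValuesCookiesArgs(
                forward="none"
            ),
        ),
    ),
    restrictions=aws.cloudfront.DistributionRestrictionsArgs(
        geo_restriction=aws.cloudfront.DistributionRestrictionsGeoRestrictionArgs(
            restriction_type="none"
        )
    ),
    viewer_certificate=aws.cloudfront.DistributionViewerCertificateArgs(
        cloudfront_default_certificate=True
    ),
)

pulumi.export("cdn_url", cdn.domain_name)
```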

September 26, 2022 · 9 min · Karn Wong

Minimal ECS task with Fargate backend

There are many ways to deploy a web application. I could spin up a bare VM and set up the environment manually. To make things easier, I could package the app into a Docker image, but this still means I have to "update" the app manually whenever I make changes. Things would be super cool if, after I push changes to the master branch, the app got deployed automatically. To achieve this, I could use an AWS ECS task to deploy the app and add CI/CD to it (because this is 2022 after all). ...
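The post's actual pipeline isn't shown in this excerpt, but the redeploy step a CI job runs after pushing a new image can be as small as the hedged boto3 sketch below (cluster and service names are made up; it assumes the task definition points at a mutable image tag such as `:latest`).

```python
import boto3  # run by CI after docker build/push, with AWS credentials set

ecs = boto3.client("ecs")

# Force the Fargate service to pull the freshly pushed image and roll new tasks.
ecs.update_service(
    cluster="my-cluster",   # hypothetical cluster name
    service="my-app",       # hypothetical service name
    forceNewDeployment=True,
)
```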

August 26, 2022 · 3 min · Karn Wong

Data engineer archetypes

I have been working in the data industry for almost half a decade. Over time I have noticed so-called archetypes within various data engineering roles. Below are the main skills and combinations I have seen over the years. This is by no means an exhaustive list, rather what I often see.

SQL + SSIS: using SQL to manipulate data via SSIS, where the data engine is Microsoft SQL Server. Commonly found in enterprise organizations that use the Microsoft stack.

SQL + Hive: using SQL to manipulate data via Hive, a data warehouse on top of Hadoop that supports columnar data formats, usually accessed via Zeppelin. Often found in enterprise organizations that worked with big data before Spark was released.

SQL + DBT: using SQL to manipulate data via DBT, an abstraction layer for data pipeline scheduling that allows users to work with various database engines through a SQL interface. DBT is often mentioned in the Modern Data Stack. Often found in organizations established in the last few years.

Python + pandas: using Python with pandas to manipulate data, usually data that can fit into memory (i.e. less than 5 GB). This is also common if you have data scientists manipulating data, since pandas is what they are familiar with. In addition, most people who write pandas are not known for writing well-optimized code, but that's negligible for small data.

Python + pyspark: using Python with pyspark to manipulate data, either through the DataFrame API or Spark SQL (a short sketch contrasting this with pandas follows the list). Organizations that use pyspark usually do machine learning as well. Often found in organizations that work with big data and have an established data lake platform.

Scala + Spark: using Scala to manipulate data via Spark. Often found in enterprise organizations that have been using Spark since before pyspark was released. Has a more limited data ecosystem.

Python + task orchestrator (Airflow, Dagster, etc.): using task orchestrators to run pipelines on a regular basis, with the application logic written in Python. Inside can be anything from pure Python to pyspark, or you can use bash and any Unix tools. People who fall under this category often have a software engineering background.

Platform engineering (setting up data infrastructure, etc.): these are the people who set up databases, infrastructure, networking, and everything required to allow engineers and users to create data pipelines and consume data downstream. Usually they are DevOps engineers who transitioned from working with app infra to data infra.

Updated 2022-09-02 ...
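As a rough illustration of the pandas vs pyspark archetypes above (not code from the post; the column names and file paths are invented), here is the same aggregation written both ways:

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: fine while the data fits in memory
df = pd.read_csv("sales.csv")                       # hypothetical local file
pandas_result = df.groupby("region")["amount"].sum()

# pyspark: same logic, but distributed, for data that doesn't fit on one machine
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.csv("s3://bucket/sales/", header=True, inferSchema=True)
spark_result = sdf.groupBy("region").agg(F.sum("amount").alias("amount"))
```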

August 26, 2022 · 2 min · Karn Wong