DuckDB vs Polars vs Spark!

I think everyone who has worked with data, in any role or function, has used pandas 🐼 at some point. I first used pandas in 2017, so it's been 6 years already. Things have come a long way, and so has the size of the data I'm working with! Pandas has its own issues, namely no native support for nested schemas. In addition, it's very heavy-handed regarding data type inference. That can be a blessing, but it's a bane for data engineering work, where you have to make sure your data conforms to an agreed-upon schema (hello data contracts!). But the worst issue? Pandas can't open data that doesn't fit into memory. So if you have a 16 GB RAM machine, you can't read 12 GB of data with pandas 😭. ...
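
To make the out-of-core point concrete, here is a minimal sketch of how DuckDB can aggregate over files it never fully loads into memory. The `events/*.parquet` dataset and its columns are hypothetical placeholders, not from the full post:

```python
import duckdb

# DuckDB scans the Parquet files and streams them through its query engine,
# so the full dataset never has to fit in RAM (unlike pandas.read_parquet).
# "events/*.parquet" and the column names are hypothetical.
result = duckdb.sql("""
    SELECT user_id, count(*) AS n_events
    FROM 'events/*.parquet'
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
""").df()  # only the small aggregated result is materialized as a pandas DataFrame

print(result)
```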

April 7, 2023 · 3 min · Karn Wong

Google Analytics v4 ingestion via BigQuery

Background You want to track who accesses your site; Google Analytics can do that. To see the data, you can use the Google Analytics dashboard, and the default settings are good enough for most use cases. But what if you have a lot of tracking data, and you want to streamline a way to analyze it? You could use Data Studio for this, so it's cool for the moment. But what if you want to use Google Analytics data in conjunction with other data? Say, product SKUs? There are multiple ways to achieve that, but one approach that provides a lot of flexibility and allows for long-term maintenance is to write a custom pipeline yourself. ...
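
As a rough sketch of the "GA data joined with product SKUs" idea, here is one way to query a GA4 BigQuery export with the google-cloud-bigquery client. The project, dataset, and table names (`my-project`, `analytics_123456789.events_*`, `shop.products`) are hypothetical placeholders, not from the full post:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application default credentials are configured

# `analytics_123456789` stands in for a GA4 BigQuery export dataset;
# `shop.products` stands in for your own product table keyed by SKU.
query = """
    SELECT
        p.sku,
        p.product_name,
        COUNT(*) AS purchase_events
    FROM `my-project.analytics_123456789.events_*` AS e,
         UNNEST(e.items) AS item
    JOIN `my-project.shop.products` AS p
      ON p.sku = item.item_id
    WHERE e.event_name = 'purchase'
      AND _TABLE_SUFFIX BETWEEN '20230301' AND '20230318'
    GROUP BY p.sku, p.product_name
    ORDER BY purchase_events DESC
"""

for row in client.query(query).result():
    print(row.sku, row.product_name, row.purchase_events)
```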

March 19, 2023 · 7 min · Karn Wong

Data transformation - Python vs SQL showdown

For most people, using SQL to transform data is a no-brainer, seeing as it's a very versatile language and doesn't have quite as steep a learning curve as Python. There are some cases where SQL is more suitable for a task, but the reverse can happen as well. For instance, given a string conversion problem: if a character occurs only once in the string, replace it with #; if it occurs multiple times, replace it with &.

> one → ###
> three → ###&&
> Heartbreak hotel → &&&&&#&&&##&#&&#

A solution in Python would be: ...
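
Since the excerpt cuts off before the code, here is one possible Python solution (a sketch, not necessarily the one from the full post), using collections.Counter and counting case-insensitively, as the "Heartbreak hotel" example implies:

```python
from collections import Counter

def convert(s: str) -> str:
    # Count each character case-insensitively ("H" and "h" count as one character).
    counts = Counter(s.lower())
    # A character seen once becomes "#"; one seen multiple times becomes "&".
    return "".join("#" if counts[c] == 1 else "&" for c in s.lower())

assert convert("one") == "###"
assert convert("three") == "###&&"
assert convert("Heartbreak hotel") == "&&&&&#&&&##&#&&#"
```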

March 18, 2023 · 1 min · Karn Wong

Intro to Dagster Cloud

Imagine you have a few data pipelines to schedule. The simplest solution would be a cronjob. Time goes by, and the next thing you know, you have around 50 pipelines to manage. The fun starts when you have to hunt down which pipeline didn't run normally. And by then it would be super hard to do tracing if you haven't set up logging and monitoring. Luckily, there are tools we can use to improve the situation. Task orchestrators were born exactly for this: to schedule and monitor pipelines. These days there are more bells and whistles, such as backfilling and sensor triggers. Some also integrate with data catalog tools and provide table specs and data lineage. ...
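
For a taste of what replacing a cronjob with an orchestrator looks like, here is a minimal Dagster sketch. The asset, job, and schedule names are hypothetical, not from the full post:

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job

@asset
def daily_sales():
    """Hypothetical pipeline step; real extraction/transformation logic goes here."""
    return [{"sku": "A-1", "amount": 42}]

# Wrap the asset in a job and run it every day at 06:00 --
# scheduling, run history, and logging come from the orchestrator,
# instead of being scattered across crontabs.
daily_job = define_asset_job("daily_sales_job", selection=[daily_sales])
daily_schedule = ScheduleDefinition(job=daily_job, cron_schedule="0 6 * * *")

defs = Definitions(assets=[daily_sales], jobs=[daily_job], schedules=[daily_schedule])
```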

September 27, 2022 · 3 min · Karn Wong

Data engineer archetypes

I have been working in the data industry for almost half a decade. Over time I have noticed so-called archetypes across various data engineering roles. Below are the main skills and combinations I have seen over the years. This is by no means an exhaustive list, rather what I often see.

- SQL + SSIS: Using SQL to manipulate data via SSIS, where the data engine is Microsoft SQL Server. Commonly found in enterprise organizations that use the Microsoft stack.
- SQL + Hive: Using SQL to manipulate data via Hive, a SQL engine on top of Hadoop storage that supports columnar data formats, usually accessed via Zeppelin. Often found in enterprise organizations that worked with big data before Spark was released.
- SQL + DBT: Using SQL to manipulate data via DBT, an abstraction layer for data pipeline schedulers that allows users to use a SQL interface with various database engines. DBT is often mentioned in the Modern Data Stack. Often found in organizations established in the last few years.
- Python + pandas: Using Python with pandas to manipulate data, usually data that can fit into memory (i.e. less than 5 GB). This is also common if you have data scientists manipulating data, since pandas is what they are familiar with. In addition, most people who write pandas are not known for writing well-optimized code, but that's negligible for small data.
- Python + PySpark: Using Python with PySpark to manipulate data, either through the DataFrame API or Spark SQL. Usually organizations that use PySpark do machine learning as well. Often found in organizations that work with big data and have an established data lake platform.
- Scala + Spark: Using Scala to manipulate data via Spark. Often found in enterprise organizations that were using Spark before PySpark was released. Has a more limited data ecosystem.
- Python + task orchestrator (Airflow, Dagster, etc.): Using task orchestrators to run pipelines on a regular basis, with the application logic written in Python. Inside can be anything from pure Python to PySpark, or you can use bash and any Unix tools. People who fall under this category often have a software engineering background.
- Platform engineering (setting up data infrastructure, etc.): These are the people who set up databases, infrastructure, networking, and everything required to allow engineers/users to create data pipelines and consume data downstream. Usually they are DevOps engineers who transitioned from app infra to data infra.

Updated 2022-09-02 ...

August 26, 2022 · 2 min · Karn Wong