Data engineering toolset (that I use) glossary

2021-06-04

Big data

Spark: Map-reduce framework for dealing with big data, especially for data that doesn't fit into memory. Utilizes parallelization.

AWS: Cloud platform for many tools used in software engineering.
AWS Fargate: A task launch mode for ECS task, where it automatically shuts down once a container exits. With EC2 launch mode, you'll have to turn off the machine yourself.
AWS Lambda: Serverless function, can be used with docker image too. Can also hook this with API gateway to make it act as API endpoint.
AWS RDS: Managed databases from AWS.
ECS Task: Launch a task in ECS cluster. For long-running services, launch via EC2. For small periodical tasks, trigger via Cloudwatch. For the latter, think of cron-like schedule for a task. Essentially at specified time, it runs a predefined docker image (you should configure your entrypoint.sh accordingly).

Parquet: Columnar data blob format, very efficient due to column-based compression with schema definition baked in.

Dagster: Task orchestration framework with built-in pipelines validation.
ETL: Stands for extract-transform-load. Essentially it means "moving data from A to B, with optional data wrangling in the middle."

NLP: Using machine (computer) to work on human languages. For instance, analyze whether a message is positive or negative.

Docker: Virtualization via containers.
Git: Version control.
Kubernetes: Container orchestration system.
Terraform: Infrastructure as code tool, essentially you use it to store a blueprint for your infra setup. If you were to move to another account, you can re-conjure existing infra with one command. This makes editing infra config easier too, since it automatically cleans up / update config automatically.

MLflow: A framework to track model parameters and output. Can also store model artifact as well.

Jupyter: Python notebook, used for exploring solutions before converting it to .py.