Data engineering toolset (that I use) glossary
2021-06-04
Big data
- Spark: Map-reduce framework for dealing with big data, especially for data that doesn't fit into memory. Utilizes parallelization.
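A minimal PySpark sketch of the idea (the bucket paths and column name here are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-counts").getOrCreate()

# Spark reads and processes the data in partitions across workers,
# so the full dataset never has to fit into one machine's memory.
events = spark.read.parquet("s3://my-bucket/events/")  # hypothetical path
daily_counts = events.groupBy(F.to_date("created_at").alias("day")).count()
daily_counts.write.mode("overwrite").parquet("s3://my-bucket/daily_counts/")
```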
Cloud
- AWS: Cloud platform offering many of the tools used in software engineering.
- AWS Fargate: A launch mode for ECS tasks, where the compute shuts down automatically once the container exits. With the EC2 launch mode, you'll have to turn off the machine yourself.
- AWS Lambda: Serverless function; can be deployed as a Docker image too. You can also hook it up to API Gateway to make it act as an API endpoint (see the sketch after this list).
- AWS RDS: Managed databases from AWS.
- ECS Task: Launch a task in an ECS cluster. For long-running services, launch via EC2. For small periodical tasks, trigger via CloudWatch: think of a cron-like schedule for a task. At the specified time, it runs a predefined Docker image (you should configure your entrypoint.sh accordingly).
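A minimal sketch of a Lambda handler sitting behind API Gateway (assuming the proxy integration, where the HTTP request arrives as `event` and the returned dict becomes the HTTP response):

```python
import json

def handler(event, context):
    # With API Gateway proxy integration, query parameters show up here.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```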
Data
- Parquet: Columnar data blob format, very efficient due to column-based compression with schema definition baked in.
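A quick pandas sketch (the file name is made up; writing Parquet needs pyarrow or fastparquet installed):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "country": ["FI", "SE", "NO"]})

# The schema (dtypes) is stored in the file, and each column is compressed separately.
df.to_parquet("users.parquet")

# Columnar layout also means you can read just the columns you need.
countries = pd.read_parquet("users.parquet", columns=["country"])
```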
Data engineering
- Dagster: Task orchestration framework with built-in pipeline validation.
- ETL: Stands for extract-transform-load. Essentially it means "moving data from A to B, with optional data wrangling in the middle."
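A tiny ETL sketch with pandas and SQLAlchemy (the CSV path, connection string, and table name are all made up):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost:5432/warehouse")

orders = pd.read_csv("exports/orders.csv")                         # extract (A)
orders["order_date"] = pd.to_datetime(orders["order_date"])        # transform (optional wrangling)
orders.to_sql("orders", engine, if_exists="replace", index=False)  # load (B)
```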
Data science
- NLP: Using machines (computers) to work on human language. For instance, analyzing whether a message is positive or negative.
Data wrangling
- Pandas: Dataframe wrangler, think of programmable Excel.
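A small example of the "programmable Excel" part, filtering and aggregating in a few lines:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "north", "south"],
    "amount": [120.0, 80.0, 200.0],
})

# Keep big sales only, then total them per region.
totals = sales[sales["amount"] > 100].groupby("region", as_index=False)["amount"].sum()
print(totals)
```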
Database
- Postgres: RDBMS with good performance.
DataOps
- Great expectations: A framework for data validation.
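A minimal sketch, assuming the PandasDataset-style API (`ge.from_pandas`) that the library exposes:

```python
import great_expectations as ge
import pandas as pd

df = ge.from_pandas(pd.DataFrame({"user_id": [1, 2, None]}))

# Declare what "good data" means; the result tells you whether this batch passes.
result = df.expect_column_values_to_not_be_null("user_id")
print(result.success)  # False here, since one user_id is missing
```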
DevOps
- Docker: Virtualization via containers.
- Git: Version control.
- Kubernetes: Container orchestration system.
- Terraform: Infrastructure-as-code tool; essentially you use it to store a blueprint for your infra setup. If you were to move to another account, you can re-conjure the existing infra with one command. This makes editing infra config easier too, since it cleans up and updates resources automatically.
GIS
- PostGIS: GIS extension for Postgres.
MLOps
- MLflow: A framework to track model parameters and outputs. Can store the model artifact as well.
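A minimal sketch of a tracked run (the scikit-learn model here is just a stand-in):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)                       # model parameters
    mlflow.log_metric("train_accuracy", model.score(X, y))  # model output/metrics
    mlflow.sklearn.log_model(model, "model")                # the model artifact itself
```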
Notebook
- Jupyter: Python notebook, used for exploring a solution before converting it to a .py file.