Airflow
Airflow is an orchestrator, not a processing framework. Process your gigabytes of data outside of Airflow (e.g., if you have a Spark cluster, you use an operator to submit a Spark job, and the data is processed in Spark).
The core components of Apache Airflow are:
- Web server
- Scheduler
- Metastore
- Triggerer
DAG
DAG stands for Directed Acyclic Graph: a graph that represents a data pipeline, with tasks as nodes and dependencies as directed edges, and no cycles (a task cannot depend on itself, directly or indirectly).
The scheduler scans the DAGs folder for new DAG files every 5 minutes by default.
When a DAG runs, the scheduler creates a DAG Run for that specific run.
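The "acyclic" part is what makes a pipeline schedulable: the dependencies yield a valid execution order only if there are no cycles. A minimal sketch in plain Python (a toy model, not Airflow's API; the task names are invented for illustration):

```python
from collections import deque

# Toy DAG: each task maps to the list of tasks it depends on.
# Task names (extract, transform, load) are made up for this example.
deps = {
    "extract": [],             # no upstream tasks
    "transform": ["extract"],  # runs after extract
    "load": ["transform"],     # runs after transform
}

def topological_order(deps):
    """Return a valid execution order, or raise if the graph has a cycle."""
    indegree = {task: len(ups) for task, ups in deps.items()}
    downstream = {task: [] for task in deps}
    for task, upstreams in deps.items():
        for up in upstreams:
            downstream[up].append(task)
    ready = deque(task for task, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in downstream[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError("not acyclic: a cycle prevents scheduling")
    return order

print(topological_order(deps))  # ['extract', 'transform', 'load']
```

If you add an edge that creates a cycle (e.g., making `extract` depend on `load`), no valid order exists, which is exactly why Airflow rejects cyclic DAGs.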
A DAG is a data pipeline; an Operator defines a single task.
An Executor defines how your tasks are executed, whereas a worker is the process actually executing a task.
The Scheduler schedules your tasks, the web server serves the UI, and the database stores Airflow's metadata.
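The executor/worker split can be illustrated with Python's standard library (this is an analogy, not Airflow's internals): the executor decides *how* tasks run, while the workers it manages actually run them.

```python
from concurrent.futures import ThreadPoolExecutor

# Analogy using Python's stdlib, not Airflow code: the executor defines
# how tasks are executed (here: in a pool of 2 threads), while each
# worker thread is the unit that actually executes a task.
def task(n):
    return n * n

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(task, [1, 2, 3]))

print(results)  # [1, 4, 9]
```

Airflow's executors (Local, Celery, Kubernetes, ...) make the same kind of decision at cluster scale: the executor picks the execution strategy, and workers do the work.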
Installing Apache Airflow
- Create a folder, e.g. `airflow-docker`. Within the folder, download the Docker Compose file from here and save it as `docker-compose.yaml`.
- Create a `.env` file within the folder and copy the following:

  ```
  AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
  AIRFLOW_UID=50000
  ```

- Open the terminal, go to that folder, and run `docker-compose up -d`. With this command, Docker pulls the Airflow images and starts the containers. To check, open a web browser and go to `localhost:8080` and you will see something like below: