Airflow

Airflow is an orchestrator, not a processing framework. Process your gigabytes of data outside of Airflow (e.g. if you have a Spark cluster, you use an operator to trigger a Spark job, and the data is processed in Spark).

The core components of Apache Airflow are:

  • Web server
  • Scheduler
  • Metastore
  • Triggerer

DAG

DAG stands for Directed Acyclic Graph. It is a graph that represents a data pipeline, with tasks as nodes and directed dependencies as edges, and no cycles allowed.
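The ordering implied by a DAG's edges can be sketched with Python's standard-library `graphlib`; the task names here are hypothetical, not part of Airflow:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
pipeline = {
    "extract": set(),          # no upstream dependencies
    "transform": {"extract"},  # runs after extract
    "load": {"transform"},     # runs after transform
}

# static_order() yields a valid execution order; it raises CycleError
# if the graph contains a cycle -- that is the "acyclic" in DAG.
order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['extract', 'transform', 'load']
```

Airflow does this dependency resolution for you; the sketch only illustrates why the graph must be acyclic for a run order to exist.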

The scheduler scans the DAGs folder for new DAG files every 5 minutes by default.
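These intervals are configurable in `airflow.cfg`; a sketch of the relevant options with their Airflow 2.x defaults:

```ini
[scheduler]
# How often (in seconds) to scan the DAGs directory for new files.
dag_dir_list_interval = 300

# Minimum interval (in seconds) between re-parses of the same DAG file.
min_file_process_interval = 30
```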

When a DAG runs, the scheduler creates a DAG Run for that specific run.

A DAG is a data pipeline; an Operator defines a single task within it.

An Executor defines how your tasks are executed, whereas a worker is a process that actually executes your task.

The Scheduler schedules your tasks, the web server serves the UI, and the database stores the metadata of Airflow.
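As a loose analogy in plain Python (not Airflow's actual implementation), the division of labour between scheduler, executor, workers, and metadata database can be sketched like this; all names here are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# The "DAG": tasks mapped to their upstream dependencies.
dag = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

def run_task(name: str) -> str:
    """Stand-in for the work an Operator would perform."""
    return f"{name} done"

metadata = {}  # stand-in for the metadata database: task name -> state

# The "scheduler" decides the order; the "executor" hands each task
# to a pool of worker threads, which do the actual execution.
with ThreadPoolExecutor(max_workers=2) as pool:
    for task in TopologicalSorter(dag).static_order():
        future = pool.submit(run_task, task)
        metadata[task] = future.result()  # record state, like the metastore

print(metadata)
```

In real Airflow these roles are separate long-running components, and the web server reads from the same metadata database to render the UI.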

Installing Apache Airflow

  1. Create a folder, e.g. airflow-docker. Within the folder, download the official docker-compose file from the Apache Airflow documentation and save it as docker-compose.yaml.

  2. Create a .env file within the folder and add the following:

     AIRFLOW_IMAGE_NAME=apache/airflow:2.4.2
     AIRFLOW_UID=50000
    
  3. Open a terminal, go to that folder, and run docker-compose up -d

With this command, Docker pulls the Airflow images and starts the containers in the background. To check, open a web browser and go to localhost:8080, where you should see the Airflow login page.