ELT Pipeline with AWS Step Functions & Athena

Interactive architecture and workflow explorer for incremental page view analytics
High-level architecture — ELT on AWS
[Architecture diagram] AWS Step Functions orchestrates the workflow, sequencing and branching the steps below. Raw data sources — customers.csv (partitioned by country, e.g. country=USA, country=UK) and pageviews.csv (partitioned by year/month/day) — are extracted and loaded into an Amazon S3 data lake. Amazon Athena runs serverless SQL directly on S3: it reads the raw files and transforms them with CTAS / INSERT INTO statements, writing the results to the pageviews_daily output table (Iceberg table format, Parquet files, Snappy compression).
PageViewsTransform — State machine workflow
1
Set up database ATHENA
Creates the appdb database if it doesn't already exist. Think of this as creating a folder to organise all the tables the pipeline will use.
State: Create Glue DB
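A minimal sketch of the statement this state runs — the database name appdb comes from the pipeline description; the rest is standard Athena DDL:

```sql
-- Create the database (a Glue Data Catalog database) if it doesn't exist yet.
CREATE DATABASE IF NOT EXISTS appdb;
```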
2
Register raw data tables PARALLEL
Two tasks run at the same time. Each one tells Athena where the raw CSV files live in S3 and scans the folder structure so all the data becomes queryable.
State: Parallel
RUNS IN PARALLEL
Customers
1. Point Athena to the customer CSV files in S3
2. Scan folders to discover all country partitions (USA, UK, etc.)
Page views
1. Point Athena to the page view CSV files in S3
2. Scan folders to discover all date partitions (year/month/day)
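The two registration tasks above each pair a CREATE EXTERNAL TABLE with a partition scan. A hedged sketch for the customers side — the S3 location and column names are illustrative, but the country partition layout comes from the source:

```sql
-- Point Athena at the raw customer CSVs. Location and columns are assumptions;
-- the partition column mirrors the country=USA / country=UK folder layout.
CREATE EXTERNAL TABLE IF NOT EXISTS appdb.customers (
  customer_id string,
  name        string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/customers/';

-- Scan the folder structure so every partition becomes queryable.
MSCK REPAIR TABLE appdb.customers;
```

The page views task follows the same pattern with year/month/day partition columns instead of country.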
3
Create empty output table ATHENA
Sets up the pageviews_daily table structure — defining what columns it will have (date, URL, country, view count) — but doesn't put any data in it yet. The table uses Iceberg format for efficient updates later.
State: Create Pageviews Daily Table
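A sketch of the output table DDL, assuming column names that match the description (date, URL, country, view count) — only the table name, Iceberg format, and Parquet/Snappy storage are confirmed by the source:

```sql
-- Empty Iceberg output table; no data is written at this stage.
CREATE TABLE IF NOT EXISTS appdb.pageviews_daily (
  view_date  date,
  url        string,
  country    string,
  view_count bigint
)
LOCATION 's3://example-data-lake/pageviews_daily/'
TBLPROPERTIES (
  'table_type'        = 'ICEBERG',
  'format'            = 'parquet',
  'write_compression' = 'snappy'
);
```

Iceberg is what makes the later incremental refresh cheap: it supports row-level updates and atomic table swaps that plain Hive-style tables lack.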
4
Check if data already exists CHOICE
Counts how many rows are in the output table. If it's empty, this is a first-time run and all data needs to be loaded from scratch. If data already exists, only the new and changed data needs to be processed.
States: Query Count → Get Count → Choice
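The query behind this check is a single count; the Choice state then branches on whether the result is zero:

```sql
-- Zero rows → first-time run (full load); otherwise → incremental refresh.
SELECT COUNT(*) AS row_count FROM appdb.pageviews_daily;
```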
EMPTY TABLE → Full load
5a Load everything ATHENA
Combines customer and page view data, counts up page views by date, URL, and country, and writes all the results into the output table. Since this is the first run, every record gets processed.
State: Full Load
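A hedged sketch of the full-load transform — the join key and column names are assumptions; the shape (join, then count by date, URL, and country) follows the description:

```sql
-- First run: aggregate everything from the raw tables into the output table.
INSERT INTO appdb.pageviews_daily
SELECT
  p.view_date,
  p.url,
  c.country,
  COUNT(*) AS view_count
FROM appdb.pageviews p
JOIN appdb.customers c
  ON p.customer_id = c.customer_id   -- hypothetical join key
GROUP BY p.view_date, p.url, c.country;
```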
HAS DATA → Incremental refresh
5b Save trusted old data ATHENA
Copies all the previously processed results into a temporary table — but excludes the most recent day, because that day might have been incomplete when it was last processed.
The most recent day acts as the "watermark" — the point from which data needs to be refreshed.
State: Incremental Load Step 1
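This step can be sketched as a CTAS into a temporary table (the _tmp name is hypothetical), filtering out the watermark day:

```sql
-- Keep everything strictly older than the most recent processed day;
-- that newest day is the watermark and may have been incomplete.
CREATE TABLE appdb.pageviews_daily_tmp AS
SELECT *
FROM appdb.pageviews_daily
WHERE view_date < (SELECT MAX(view_date) FROM appdb.pageviews_daily);
```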
6b Process new + recent data ATHENA
Goes back to the raw page view data and re-processes everything from the watermark date onwards. This picks up any new files that have arrived and recalculates the day that might have been partial. The fresh results are added to the temporary table.
Only data from the watermark onwards is reprocessed — everything before it is already safely stored from step 5b.
State: Incremental Load Step 2
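A sketch of the reprocessing step, reusing the same hypothetical join and temporary table name as the earlier sketches — the key point is the WHERE clause anchored at the watermark:

```sql
-- Re-aggregate raw page views from the watermark date onwards,
-- picking up new files and completing the possibly-partial day.
INSERT INTO appdb.pageviews_daily_tmp
SELECT
  p.view_date,
  p.url,
  c.country,
  COUNT(*) AS view_count
FROM appdb.pageviews p
JOIN appdb.customers c
  ON p.customer_id = c.customer_id
WHERE p.view_date >= (SELECT MAX(view_date) FROM appdb.pageviews_daily)
GROUP BY p.view_date, p.url, c.country;
```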
7b Remove old output table SWAP
Deletes the old version of the output table to make room for the updated one.
State: Incremental Load Step 3
8b Replace with updated table SWAP
Creates a fresh output table from the temporary table, which now contains old trusted data plus newly processed data. This swap ensures analysts always see a complete, up-to-date table.
State: Incremental Load Step 4
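Steps 7b and 8b together amount to a drop-and-recreate swap. A hedged sketch, assuming the same hypothetical temporary table and S3 location as above (Athena CTAS into Iceberg uses a WITH clause for table properties):

```sql
-- Step 7b: remove the old output table.
DROP TABLE IF EXISTS appdb.pageviews_daily;

-- Step 8b: recreate it from the temp table, which now holds
-- trusted old data plus the freshly reprocessed data.
CREATE TABLE appdb.pageviews_daily
WITH (
  table_type  = 'ICEBERG',
  location    = 's3://example-data-lake/pageviews_daily/',
  is_external = false
) AS
SELECT * FROM appdb.pageviews_daily_tmp;
```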
9b Clean up CLEANUP
Removes the temporary table since it's no longer needed. The pipeline is now ready for its next run.
State: Incremental Load Step 5