ELT Pipeline with AWS Step Functions & Athena

Interactive architecture and workflow explorer for incremental page view analytics
High-level architecture — ELT on AWS
[Architecture diagram] AWS Step Functions orchestrates the workflow, sequencing and branching the steps below. Raw data sources — customers.csv (partitioned by country, e.g. country=USA, country=UK) and pageviews.csv (partitioned by year/month/day) — are extracted and loaded into an Amazon S3 data lake. Amazon Athena runs serverless SQL directly on S3: it reads the raw files and transforms them with CTAS / INSERT INTO statements, writing the results to the pageviews_daily output table (Iceberg table format, Parquet files, Snappy compression).
PageViewsTransform — State machine workflow
1
Set up database ATHENA
Creates the appdb database if it doesn't already exist. Think of this as creating a folder to organise all the tables the pipeline will use.
State: Create Glue DB
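A minimal sketch of the statement this state runs — the database name appdb comes from the pipeline description; the rest is standard Athena DDL:

```sql
-- Create the database (a Glue Data Catalog database) if it doesn't exist yet.
CREATE DATABASE IF NOT EXISTS appdb;
```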
2
Register raw data tables PARALLEL
Two tasks run at the same time. Each one tells Athena where the raw CSV files live in S3 and scans the folder structure so all the data becomes queryable.
State: Parallel
RUNS IN PARALLEL
Customers
1. Point Athena to the customer CSV files in S3
2. Scan folders to discover all country partitions (USA, UK, etc.)
Page views
1. Point Athena to the page view CSV files in S3
2. Scan folders to discover all date partitions (year/month/day)
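The two registration tasks above each pair a CREATE EXTERNAL TABLE with a partition scan. A hedged sketch for the customers side — the S3 location and column names are illustrative, but the country partition layout comes from the source:

```sql
-- Point Athena at the raw customer CSVs. Location and columns are assumptions;
-- the partition column mirrors the country=USA / country=UK folder layout.
CREATE EXTERNAL TABLE IF NOT EXISTS appdb.customers (
  customer_id string,
  name        string
)
PARTITIONED BY (country string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://example-data-lake/customers/';

-- Scan the folder structure so every partition becomes queryable.
MSCK REPAIR TABLE appdb.customers;
```

The page views task follows the same pattern with year/month/day partition columns instead of country.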
3
Create empty output table ATHENA
Sets up the pageviews_daily table structure — defining what columns it will have (date, URL, country, view count) — but doesn't put any data in it yet. The table uses Iceberg format for efficient updates later.
State: Create Pageviews Daily Table
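A sketch of the output table DDL, assuming column names that match the description (date, URL, country, view count) — only the table name, Iceberg format, and Parquet/Snappy storage are confirmed by the source:

```sql
-- Empty Iceberg output table; no data is written at this stage.
CREATE TABLE IF NOT EXISTS appdb.pageviews_daily (
  view_date  date,
  url        string,
  country    string,
  view_count bigint
)
LOCATION 's3://example-data-lake/pageviews_daily/'
TBLPROPERTIES (
  'table_type'        = 'ICEBERG',
  'format'            = 'parquet',
  'write_compression' = 'snappy'
);
```

Iceberg is what makes the later incremental refresh cheap: it supports row-level updates and atomic table swaps that plain Hive-style tables lack.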
4
Check if data already exists CHOICE
Counts how many rows are in the output table. If it's empty, this is a first-time run and all data needs to be loaded from scratch. If data already exists, only the new and changed data needs to be processed.
States: Query Count → Get Count → Choice
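The query behind this check is a single count; the Choice state then branches on whether the result is zero:

```sql
-- Zero rows → first-time run (full load); otherwise → incremental refresh.
SELECT COUNT(*) AS row_count FROM appdb.pageviews_daily;
```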
EMPTY TABLE → Full load
5a Load everything ATHENA
Combines customer and page view data, counts up page views by date, URL, and country, and writes all the results into the output table. Since this is the first run, every record gets processed.
State: Full Load
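A hedged sketch of the full-load transform — the join key and column names are assumptions; the shape (join, then count by date, URL, and country) follows the description:

```sql
-- First run: aggregate everything from the raw tables into the output table.
INSERT INTO appdb.pageviews_daily
SELECT
  p.view_date,
  p.url,
  c.country,
  COUNT(*) AS view_count
FROM appdb.pageviews p
JOIN appdb.customers c
  ON p.customer_id = c.customer_id   -- hypothetical join key
GROUP BY p.view_date, p.url, c.country;
```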
HAS DATA → Incremental refresh
5b Save trusted old data ATHENA
Copies all the previously processed results into a temporary table — but excludes the most recent day, because that day might have been incomplete when it was last processed.
The most recent day acts as the "watermark" — the point from which data needs to be refreshed.
State: Incremental Load Step 1
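This step can be sketched as a CTAS into a temporary table (the _tmp name is hypothetical), filtering out the watermark day:

```sql
-- Keep everything strictly older than the most recent processed day;
-- that newest day is the watermark and may have been incomplete.
CREATE TABLE appdb.pageviews_daily_tmp AS
SELECT *
FROM appdb.pageviews_daily
WHERE view_date < (SELECT MAX(view_date) FROM appdb.pageviews_daily);
```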
6b Process new + recent data ATHENA
Goes back to the raw page view data and re-processes everything from the watermark date onwards. This picks up any new files that have arrived and recalculates the day that might have been partial. The fresh results are added to the temporary table.
Only data from the watermark onwards is reprocessed — everything before it is already safely stored from step 5b.
State: Incremental Load Step 2
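A sketch of the reprocessing step, reusing the same hypothetical join and temporary table name as the earlier sketches — the key point is the WHERE clause anchored at the watermark:

```sql
-- Re-aggregate raw page views from the watermark date onwards,
-- picking up new files and completing the possibly-partial day.
INSERT INTO appdb.pageviews_daily_tmp
SELECT
  p.view_date,
  p.url,
  c.country,
  COUNT(*) AS view_count
FROM appdb.pageviews p
JOIN appdb.customers c
  ON p.customer_id = c.customer_id
WHERE p.view_date >= (SELECT MAX(view_date) FROM appdb.pageviews_daily)
GROUP BY p.view_date, p.url, c.country;
```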
7b Remove old output table SWAP
Deletes the old version of the output table to make room for the updated one.
State: Incremental Load Step 3
8b Replace with updated table SWAP
Creates a fresh output table from the temporary table, which now contains old trusted data plus newly processed data. This swap ensures analysts always see a complete, up-to-date table.
State: Incremental Load Step 4
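Steps 7b and 8b together amount to a drop-and-recreate swap. A hedged sketch, assuming the same hypothetical temporary table and S3 location as above (Athena CTAS into Iceberg uses a WITH clause for table properties):

```sql
-- Step 7b: remove the old output table.
DROP TABLE IF EXISTS appdb.pageviews_daily;

-- Step 8b: recreate it from the temp table, which now holds
-- trusted old data plus the freshly reprocessed data.
CREATE TABLE appdb.pageviews_daily
WITH (
  table_type  = 'ICEBERG',
  location    = 's3://example-data-lake/pageviews_daily/',
  is_external = false
) AS
SELECT * FROM appdb.pageviews_daily_tmp;
```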
9b Clean up CLEANUP
Removes the temporary table since it's no longer needed. The pipeline is now ready for its next run.
State: Incremental Load Step 5