How We Built a Config-Driven ELT Engine to Orchestrate 39 Dataset Ingestions on AWS

When we started building data lakes for companies running complex IT landscapes, we quickly ran into a problem that every data team eventually faces: pipeline sprawl.
The client had dozens of ERP tables to ingest. Sales data, master data, quality records, finance exports, environmental health and safety reports. Each one had its own extraction logic, its own schedule, its own quirks. The classic approach would have been to write a dedicated pipeline for each. Copy the template, tweak the config, deploy. Repeat forty times.
We did not go down that road.
Instead, we built a single engine that reads a YAML configuration file and dynamically generates the entire ingestion workflow. Today, that engine handles 39 data sources across 8 business domains, orchestrated by AWS Step Functions, powered by five core Lambda functions, and maintained by a team that rarely touches ingestion code anymore. When a new dataset needs to come in, someone adds a few lines to a YAML file and deploys. That is it.
This post walks through how we designed it, why configuration-driven ELT changes the game for enterprise data platforms, and the specific AWS patterns that make it work.
The Problem with Pipeline-Per-Source
Most data teams start with a reasonable approach. You need to ingest a table from your ERP? Write a Lambda, set up a Step Function, wire the Glue catalog, deploy. Takes a day, maybe two. Done.
Now multiply that by forty. Suddenly you have forty slightly different Lambda functions, forty CloudFormation stacks, forty things to monitor, patch, and debug when something breaks at 3 a.m. They all do essentially the same thing with minor variations in schema, scheduling, and transformation logic. But because each one is its own codebase, a bug fix means patching forty repos.
We saw this pattern at companies of all sizes, from growing mid-market teams to large organizations. The data engineering team becomes a bottleneck, not because the work is hard, but because the work is repetitive and every new source means another bespoke pipeline.
The question we asked ourselves was simple: what actually changes between these pipelines? The answer turned out to be surprisingly little.
Configuration as the Single Source of Truth
The core insight was that most of the variation between pipelines can be expressed as data, not code. What changes from one ERP table to another is the object name, the file format, whether you need delta extraction, which fields to select, and whether the dataset is large enough to require batch processing.
So we centralized all of that into a single YAML configuration file. Here is a simplified version of what a dataset family looks like:
```yaml
qualityManagement:
  description: "Quality management objects"
  objects:
    - name: QALS
      format: tsv
      deltaField: "ENSTEHDAT:2020:CURRENTYEAR"
      deltaStrategy: "1"
      batchProcessing: true
      fieldsList: "PRUEFLOS;WERK;ART;STAT"

    - name: QAVE
      format: tsv

    - name: QASR
      format: tsv
      fieldsList: "PRUEFLOS;VESSION;VESSION_EXT"
```

Notice how each object inherits sensible defaults. If you just need a simple full extraction in TSV format, the config entry is two lines. If you need delta extraction with field selection and batch processing, you add those properties. The engine handles the rest.
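To make the inheritance concrete, here is a minimal Python sketch of how an engine like this can overlay per-object entries on engine-wide defaults. The default values and the `resolve_object_config` name are our illustration, not the actual implementation.

```python
# Hypothetical sketch of config defaults; key names mirror the YAML above,
# the default values are assumptions for illustration.
DEFAULTS = {
    "format": "tsv",
    "deltaField": None,        # None -> full extraction, no delta
    "deltaStrategy": None,
    "batchProcessing": False,
    "fieldsList": None,        # None -> select all fields
}

def resolve_object_config(obj: dict) -> dict:
    """Overlay an object's YAML entry on top of the engine defaults."""
    resolved = dict(DEFAULTS)
    resolved.update(obj)
    return resolved

# A two-line config entry expands into a full execution spec:
qave = resolve_object_config({"name": "QAVE", "format": "tsv"})
assert qave["batchProcessing"] is False
assert qave["deltaField"] is None
```

The point is that the engine only ever sees fully resolved specs; the YAML stays minimal because the defaults live in one place.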
We grouped datasets into nine logical families: master data, sales, finance, quality, HR, authorizations, and so on. Each family has its own schedule and its own set of objects. Adding a new table to an existing family is literally adding a YAML block and deploying.
The Architecture: One Engine, Two Paths
The engine runs on AWS Step Functions with a two-level orchestration pattern.
At the top level, a Master State Machine fires on a schedule via EventBridge. Each dataset family has its own CRON trigger. Most families run once daily. High-velocity families like sales data run twice a day. The Master State Machine calls a Job Config Mapper Lambda, which reads the YAML configuration and returns a list of payloads, one per object in that family.
Step Functions then iterates over that list using a Map state, processing up to ten objects in parallel. Each object triggers an ELT State Machine that handles the actual extraction, loading, and transformation.
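In Amazon States Language terms, the family-level fan-out looks roughly like the following fragment, shown here as a Python dict. The state names, paths, and the `${EltStateMachineArn}` placeholder are assumptions for illustration; only the concurrency cap of ten comes from our setup.

```python
import json

# Illustrative ASL fragment for the family-level Map state.
map_state = {
    "IngestObjects": {
        "Type": "Map",
        "ItemsPath": "$.payloads",   # list produced by the Job Config Mapper
        "MaxConcurrency": 10,        # cap parallel objects per family
        "Iterator": {
            "StartAt": "RunEltStateMachine",
            "States": {
                "RunEltStateMachine": {
                    "Type": "Task",
                    # Synchronous nested-workflow service integration
                    "Resource": "arn:aws:states:::states:startExecution.sync:2",
                    "Parameters": {
                        "StateMachineArn": "${EltStateMachineArn}",
                        "Input.$": "$",
                    },
                    "End": True,
                }
            },
        },
    }
}

print(json.dumps(map_state, indent=2))
```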
The ELT State Machine has two execution paths depending on how the data arrives.
For pull-based extraction, the engine calls the ERP system directly through RFC connectors, retrieves the data, aggregates multi-part files into a single landing file on S3, and then runs an Athena INSERT query to move the data from the landing zone into the raw zone.
For push-based ingestion (file drops from managed file transfer), the engine detects the new file via an S3 event, validates the schema against expected column headers, applies any custom transformations through Athena SQL, and archives the processed file.
Both paths end at the same place: a clean, typed table in the Glue Catalog, sitting in one of eight domain-specific raw databases, ready for downstream consumers.
Five Lambda Functions That Do Everything
The entire engine runs on five core Lambda functions. Each one has a single responsibility.
The Job Config Mapper is the brain. It receives a family name, loads the YAML config, and produces a list of execution payloads. Each payload contains everything the downstream steps need: the object name, the target S3 bucket, delta extraction parameters, batch processing flags. Environment variables inject the runtime context like database names and bucket paths, so the same code works across dev, staging, and production without modification.
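A minimal sketch of that mapping step, assuming hypothetical payload keys and environment variable names (`RAW_DATABASE`, `LANDING_BUCKET`); the real handler would pull `env` from `os.environ`:

```python
# Hypothetical sketch of the Job Config Mapper's core logic.
def build_payloads(family_name: str, config: dict, env: dict) -> list[dict]:
    """Turn one dataset family into a list of per-object execution payloads."""
    payloads = []
    for obj in config[family_name]["objects"]:
        payloads.append({
            "family": family_name,
            "object": obj["name"],
            "format": obj.get("format", "tsv"),
            "deltaField": obj.get("deltaField"),
            "batchProcessing": obj.get("batchProcessing", False),
            # Runtime context injected from the Lambda environment, so the
            # same code runs in dev, staging, and production unchanged.
            "rawDatabase": env["RAW_DATABASE"],
            "landingBucket": env["LANDING_BUCKET"],
        })
    return payloads

config = {"qualityManagement": {"objects": [{"name": "QALS", "batchProcessing": True},
                                            {"name": "QAVE"}]}}
env = {"RAW_DATABASE": "quality_raw", "LANDING_BUCKET": "lake-landing"}
payloads = build_payloads("qualityManagement", config, env)
assert len(payloads) == 2 and payloads[0]["rawDatabase"] == "quality_raw"
```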
The Batch Processor handles large historical datasets. When an object has batch processing enabled, this function splits the extraction into yearly chunks. A table with data from 2020 to 2026 becomes seven separate extractions, each covering one year. This keeps memory pressure low, isolates failures (if the 2023 batch fails, the others still complete), and gives you year-by-year monitoring granularity. Batches run sequentially to avoid overwhelming the source system.
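The yearly split can be sketched from the `deltaField` format shown in the config earlier (`FIELD:STARTYEAR:CURRENTYEAR`). The `YYYYMMDD` window format and the function name are assumptions:

```python
from datetime import date

# Illustrative sketch of the Batch Processor's yearly split.
def yearly_batches(delta_field, today=None):
    """Expand a delta spec into one extraction window per calendar year."""
    field, start, end = delta_field.split(":")
    end_year = (today or date.today()).year if end == "CURRENTYEAR" else int(end)
    return [
        {"field": field, "from": f"{year}0101", "to": f"{year}1231"}
        for year in range(int(start), end_year + 1)
    ]

# A table spanning 2020 through 2026 yields seven sequential batches:
batches = yearly_batches("ENSTEHDAT:2020:CURRENTYEAR", today=date(2026, 6, 1))
assert len(batches) == 7
assert batches[0] == {"field": "ENSTEHDAT", "from": "20200101", "to": "20201231"}
```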
The Files Aggregator deals with multi-part exports. Enterprise ERP systems often split large extractions into multiple files. This function consolidates them into a single file using S3 multipart uploads. Small files (under 10MB) get aggregated in memory; large files stream directly as upload parts. At the end, it generates the Athena SQL query that will move data from landing to raw.
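The size-based routing decision can be sketched in isolation (the actual function drives a boto3 multipart upload, which we omit here). Only the 10 MB threshold comes from the design above; the plan structure is illustrative:

```python
# Sketch of the Files Aggregator's per-file decision only.
SMALL_FILE_LIMIT = 10 * 1024 * 1024  # 10 MB

def aggregation_plan(parts: list[tuple[str, int]]) -> list[dict]:
    """Decide, per source file, whether to buffer it in memory or stream
    it directly as an S3 multipart upload part."""
    return [
        {"key": key,
         "strategy": "buffer_in_memory" if size < SMALL_FILE_LIMIT
                     else "stream_as_part"}
        for key, size in parts
    ]

plan = aggregation_plan([("part-000.tsv", 2_000_000),
                         ("part-001.tsv", 50_000_000)])
assert plan[0]["strategy"] == "buffer_in_memory"
assert plan[1]["strategy"] == "stream_as_part"
```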
The Prepare and Analyze handler processes file-drop ingestions in two phases. The prepare phase extracts metadata from the S3 event, maps the file to its target domain and table, and sets up the copy and archive paths. The analyze phase reads the first few kilobytes of the file, validates the delimiter with Python's CSV sniffer, and checks column headers against expected schemas. If anything looks wrong, the pipeline stops before bad data enters the raw zone.
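Python's standard library covers both checks. A minimal sketch of the analyze phase, with example headers rather than our real schemas:

```python
import csv
import io

# Illustrative version of the analyze-phase validation.
def validate_sample(sample: str, expected_headers: list[str]) -> None:
    """Sniff the delimiter from the first few KB of a file and verify the
    header row matches the expected schema; raise on mismatch."""
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
    headers = next(csv.reader(io.StringIO(sample), dialect))
    if headers != expected_headers:
        raise ValueError(
            f"Header mismatch: got {headers}, expected {expected_headers}")

# A tab-delimited sample with the right columns passes silently:
validate_sample("PRUEFLOS\tWERK\tART\n100\t0001\tQ1\n",
                ["PRUEFLOS", "WERK", "ART"])
```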
The Error Handler is the safety net. Every failure in the state machine routes here. It normalizes the error information and pushes it to the centralized job orchestration system, where the operations team can see it alongside failures from all other ingesters.
Why Athena for Transformations
A question we get often is why we use Athena for the landing-to-raw transformation instead of doing it in Lambda or Glue.
The answer is practical. The transformation at this stage is relatively simple: take the data from the landing table, add an integration timestamp, apply some type casting or date parsing for specific tables, and insert into the raw table. Athena handles this with a single SQL statement, and it scales automatically. No cluster to manage, no Spark overhead for what is essentially a SELECT ... INSERT INTO.
For standard tables, the SQL is generated dynamically:
```sql
INSERT INTO master_data_raw.erp_material
SELECT [columns],
       CAST(CURRENT_TIMESTAMP AS TIMESTAMP) AS integrationdate
FROM landing_database.erp_material_landing
```

For tables that need custom transformations (date format parsing, conditional logic, field renaming), we have dedicated SQL functions that produce the appropriate query. The key is that even these custom queries are invoked through the same state machine step. The pipeline does not care whether the SQL is trivial or complex. It runs the query, checks for errors, and moves on.
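A query builder for the standard case might look like the following; the function name and parameters are illustrative, not our exact code:

```python
# Hypothetical sketch of the dynamic query builder for standard tables.
def landing_to_raw_sql(raw_db: str, landing_db: str, table: str,
                       columns: list[str]) -> str:
    """Generate the Athena INSERT that promotes a landing table to raw,
    stamping each row with an integration timestamp."""
    column_list = ",\n       ".join(columns)
    return (
        f"INSERT INTO {raw_db}.{table}\n"
        f"SELECT {column_list},\n"
        f"       CAST(CURRENT_TIMESTAMP AS TIMESTAMP) AS integrationdate\n"
        f"FROM {landing_db}.{table}_landing"
    )

sql = landing_to_raw_sql("master_data_raw", "landing_database",
                         "erp_material", ["matnr", "werks"])
assert sql.startswith("INSERT INTO master_data_raw.erp_material")
assert sql.endswith("FROM landing_database.erp_material_landing")
```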
What Onboarding a New Dataset Actually Looks Like
This is where the design pays off. When a business team comes to us and says they need a new ERP table in the data lake, here is what happens.
First, we add the object to the YAML configuration under the appropriate family. If it is a simple full extraction, that is two lines. If it needs delta loading or batch processing, a few more.
Second, we create the Glue Catalog tables for the landing and raw zones. This is a one-time schema definition.
Third, we deploy. The SAM build picks up the new configuration, and the next scheduled run automatically includes the new object.
No new Lambda functions. No new state machines. No new CloudFormation stacks. The engine already knows how to handle it because the logic is generic and the specifics live in the config.
In practice, onboarding a new standard dataset takes about thirty minutes. Most of that time is spent on the Glue table definition, not on pipeline code.
Lessons We Learned Along the Way
Concurrency limits matter more than you think. We cap parallel execution at ten objects per family. Early on, we ran everything wide open and occasionally overwhelmed the ERP system's RFC layer. The rate limit is a deliberate choice: fast enough to complete within the schedule window, gentle enough to avoid impacting transactional workloads.
Sequential batch processing is a feature, not a limitation. When processing large historical datasets, running batches in parallel sounds appealing but creates unpredictable memory spikes and can trigger throttling. Sequential processing with yearly partitions gives you predictable resource consumption and cleaner error isolation.
Schema validation catches problems early. The CSV sniffer and column header validation in the analyze phase have saved us more than once from silent data corruption. A file with the wrong delimiter or a missing column gets rejected before it touches the raw zone, instead of producing confusing results downstream.
Exponential backoff is not optional. Every state machine step has retry logic with exponential backoff (starting at 15 seconds, up to 5 retries). Transient failures from Athena throttling, S3 eventual consistency, or Lambda cold starts resolve themselves without human intervention in the vast majority of cases.
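For reference, a 15-second base with five retries produces the following wait schedule, assuming a backoff rate of 2.0 (the rate is our assumption here; the base interval and retry count are from our configuration):

```python
# Wait times implied by the retry policy described above.
def retry_delays(base: float = 15.0, rate: float = 2.0,
                 attempts: int = 5) -> list[float]:
    """Seconds to wait before each successive retry attempt."""
    return [base * rate**i for i in range(attempts)]

assert retry_delays() == [15.0, 30.0, 60.0, 120.0, 240.0]
```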
Separate landing from raw, always. The landing zone is ephemeral. Files arrive, get validated, get transformed into the raw zone, and then get archived or deleted. This separation means you can always replay from source if something goes wrong, and the raw zone stays clean and typed.
The Bigger Picture
This engine is one piece of a larger data platform that serves over thirty client applications across sixteen business domains. The config-driven approach at the ingestion layer set the tone for the entire architecture: convention over configuration, generic engines over bespoke pipelines, and operational simplicity as a first-class design goal.
For organizations dealing with strict data governance requirements, this pattern has another advantage. Because every pipeline follows the same path through the same engine, auditing and compliance become straightforward. You can trace any record from the raw zone back to its source file, its ingestion timestamp, and the exact configuration that governed its processing.
If you are building a data platform and find yourself copying pipeline templates for each new source, take a step back. Identify what actually varies. Put that variance in configuration. Build one engine that handles the rest. Your future self (and your operations team) will thank you.
At VERAPLOT, we build data platforms for companies that need to move fast without breaking things. Whether you are a growing team with ten data sources or a large organization with hundreds, our focus is on AWS-native architectures using Python, Step Functions, Glue, Athena, and the broader serverless ecosystem. If pipeline sprawl is slowing you down, we would love to talk.
Get in touch at veraplot.com