TCP #42: Zero-ETL Patterns on AWS with DynamoDB and Redshift
Part 2: How to Achieve Zero-ETL with DynamoDB and Redshift Using Lambda Triggers
You can also read my newsletters from the Substack mobile app and be notified when a new issue is available.
👋 Before we begin... a big thank you to today's sponsor, Lemon.io
Your Backlog Shouldn't Be a Bottleneck
Finding skilled developers feels impossible, even for the most experienced engineering leaders. While you're stuck in months-long hiring cycles, your critical features gather dust.
Lemon.io changes everything by:
Rigorously vetting engineers through a multi-step process
Evaluating technical skills, problem-solving, and communication
Matching you with perfect-fit talent in just 48 hours
Need specialized skills or quick scaling?
Our pre-vetted developers from Europe and Latin America integrate seamlessly into your workflow without long hiring cycles or long-term contracts.
➡️ Turn your backlog into shipped features at lemon.io
This is part 2 of my 3-part series on Zero-ETL Patterns on AWS.
Check out part 1, where I discussed Zero-ETL Patterns on AWS with Aurora, Redshift, and OpenSearch.
Analyzing data quickly and efficiently is crucial for making informed business decisions in today's data-driven world.
Traditionally, this process has involved Extract, Transform, Load (ETL) operations that move data from operational databases to analytical systems. However, these operations can be complex, resource-intensive, and introduce latency.
Zero-ETL is an architectural pattern that eliminates or minimizes traditional ETL processes by automatically synchronizing data between operational and analytical systems.
In this newsletter, I'll explore how to implement Zero-ETL patterns on AWS using DynamoDB as your operational database and Redshift as your analytical data warehouse, with Lambda functions serving as the connecting tissue.
Understanding the Zero-ETL Concept
Before diving into implementation details, let's clarify what Zero-ETL means.
Despite the name, Zero-ETL doesn't eliminate the need to move and transform data.
Instead, it automates these processes to make them transparent to users and reduce the engineering effort required.
In a Zero-ETL architecture:
Data moves automatically from source to destination
Transformations happen in-flight or at the destination
The process runs continuously with minimal latency
Users don't need to manage complex ETL pipelines
The goal is to create a seamless flow of data that maintains consistency between your operational and analytical systems.
The AWS Components: DynamoDB, Redshift, and Lambda
DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. It's designed for applications that need consistent, single-digit millisecond latency at any scale.
Its key features include:
Automatic scaling
Built-in security
Backup and restore capabilities
Global tables for multi-region deployment
Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It's designed for high-performance analysis of structured data using standard SQL.
Key features include:
Columnar storage technology
Massively parallel processing (MPP) architecture
Advanced compression capabilities
Integration with various AWS services
AWS Lambda
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers.
It executes your code only when needed and scales automatically. In our Zero-ETL architecture, Lambda will:
Detect changes in DynamoDB
Transform data as needed
Load the transformed data into Redshift
"While other companies were still reviewing resumes, we were already shipping new features..."
That's the Lemon.io difference. They rigorously vet engineers from Europe and Latin America who match your specific needs and are ready to contribute in just 48 hours.
No lengthy hiring cycles. No long-term contracts.
Just skilled developers who integrate seamlessly into your workflow, helping you ship faster and scale on demand.
Visit lemon.io to transform your development velocity.
Building the Zero-ETL Pipeline
Let's walk through creating a Zero-ETL pipeline between DynamoDB and Redshift.
Step 1: Set Up DynamoDB Streams
DynamoDB Streams captures a time-ordered sequence of item-level modifications in a DynamoDB table.
To enable streams:
Go to the DynamoDB console
Select your table
Choose the "Manage Stream" option
Select "New and old images" as the view type
This configuration allows Lambda to access the previous and new state of modified items.
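If you prefer scripting over the console, a minimal boto3 sketch for enabling the stream might look like the following; the table name is a placeholder:

import boto3

dynamodb = boto3.client('dynamodb')

# Enable a stream that emits both the new and old item images on every change
response = dynamodb.update_table(
    TableName='customer_orders',  # placeholder table name
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)

# The stream ARN is needed later when wiring up the Lambda trigger
print(response['TableDescription']['LatestStreamArn'])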
Step 2: Create the Lambda Function
Now, let's create a Lambda function that will process the stream records and load them into Redshift:
import json
import boto3
import os
from decimal import Decimal
from boto3.dynamodb.types import TypeDeserializer

# Initialize AWS clients and the DynamoDB type deserializer
redshift_client = boto3.client('redshift-data')
deserializer = TypeDeserializer()

# Get environment variables
redshift_cluster_id = os.environ['REDSHIFT_CLUSTER_ID']
redshift_database = os.environ['REDSHIFT_DATABASE']
redshift_user = os.environ['REDSHIFT_USER']
redshift_table = os.environ['REDSHIFT_TABLE']

# Helper class to serialize Decimal values (DynamoDB's number type) as JSON
class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return super(DecimalEncoder, self).default(obj)

def lambda_handler(event, context):
    for record in event['Records']:
        # Get the source DynamoDB table name from the stream ARN
        table_name = record['eventSourceARN'].split('/')[1]

        # Process only if it's an INSERT or MODIFY event
        if record['eventName'] in ['INSERT', 'MODIFY']:
            # Get the new image of the item (DynamoDB JSON with type descriptors)
            new_image = record['dynamodb']['NewImage']

            # Convert DynamoDB JSON into a plain Python dictionary
            item = {key: deserializer.deserialize(value) for key, value in new_image.items()}

            # Prepare a parameterized SQL statement; the Redshift Data API uses
            # :name placeholders and string parameter values
            columns = ', '.join(item.keys())
            placeholders = ', '.join(f':{key}' for key in item.keys())
            sql = f"INSERT INTO {redshift_table} ({columns}) VALUES ({placeholders})"

            parameters = [
                {
                    'name': key,
                    # Nested structures are serialized to JSON; everything else becomes a string
                    'value': json.dumps(value, cls=DecimalEncoder)
                    if isinstance(value, (dict, list)) else str(value)
                }
                for key, value in item.items()
            ]

            # Execute the statement on Redshift
            response = redshift_client.execute_statement(
                ClusterIdentifier=redshift_cluster_id,
                Database=redshift_database,
                DbUser=redshift_user,
                Sql=sql,
                Parameters=parameters
            )

    return {
        'statusCode': 200,
        'body': json.dumps('Data successfully processed and loaded into Redshift')
    }
This Lambda function:
Listens for changes in DynamoDB via the stream
Processes each record in the stream
Converts the DynamoDB JSON format (with its type descriptors) into a plain dictionary
Constructs and executes a parameterized SQL INSERT statement through the Redshift Data API to load the data into Redshift
Step 3: Configure Lambda Permissions and Environment Variables
For the Lambda function to work correctly, you need to:
Create an IAM role with permissions for:
DynamoDB Streams (read access)
Redshift Data API (execute statements)
CloudWatch Logs (for logging)
Set the following environment variables:
REDSHIFT_CLUSTER_ID: Your Redshift cluster identifier
REDSHIFT_DATABASE: The name of your Redshift database
REDSHIFT_USER: Username for Redshift access
REDSHIFT_TABLE: The target table in Redshift
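As a rough illustration, the sketch below creates an inline policy and sets the environment variables with boto3. The role name, function name, account ID, region, and resource ARNs are placeholders, and your exact policy may differ (for example, using the Data API with a DbUser typically also requires redshift:GetClusterCredentials):

import json
import boto3

iam = boto3.client('iam')
lambda_client = boto3.client('lambda')

# Minimal permissions sketch; tighten the resources for your environment
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeStream",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:ListStreams"
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/customer_orders/stream/*"
        },
        {
            "Effect": "Allow",
            "Action": ["redshift-data:ExecuteStatement", "redshift:GetClusterCredentials"],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*"
        }
    ]
}

iam.put_role_policy(
    RoleName='dynamodb-to-redshift-lambda-role',   # placeholder role name
    PolicyName='zero-etl-pipeline-policy',
    PolicyDocument=json.dumps(policy_document)
)

# Attach the environment variables the function reads at runtime
lambda_client.update_function_configuration(
    FunctionName='dynamodb-to-redshift-loader',    # placeholder function name
    Environment={
        'Variables': {
            'REDSHIFT_CLUSTER_ID': 'my-redshift-cluster',
            'REDSHIFT_DATABASE': 'analytics',
            'REDSHIFT_USER': 'awsuser',
            'REDSHIFT_TABLE': 'customer_orders'
        }
    }
)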
Step 4: Create the Trigger
Now, connect your Lambda function to the DynamoDB stream:
In the Lambda console, add a trigger for DynamoDB
Select your DynamoDB table
Set the starting position to "Latest"
Save the configuration
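The same trigger can also be created programmatically as an event source mapping. In this minimal sketch, the stream ARN and function name are placeholders:

import boto3

lambda_client = boto3.client('lambda')

# Wire the DynamoDB stream to the Lambda function
response = lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:us-east-1:123456789012:table/customer_orders/stream/2025-01-01T00:00:00.000',
    FunctionName='dynamodb-to-redshift-loader',
    StartingPosition='LATEST',
    BatchSize=100,   # records delivered per invocation
    Enabled=True
)
print(response['UUID'])  # mapping ID, useful for later updates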
Step 5: Prepare the Redshift Table
Your Redshift table schema should match the structure of your DynamoDB data. For example:
CREATE TABLE customer_orders (
    order_id VARCHAR(100) PRIMARY KEY,
    customer_id VARCHAR(100),
    order_date TIMESTAMP,
    total_amount DECIMAL(10,2),
    items SUPER
);
Note the SUPER data type, which can store semi-structured data like JSON, making it ideal for DynamoDB items that may contain nested attributes.
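Once data lands, you can navigate nested attributes in the SUPER column with Redshift's PartiQL-style syntax. Here's a rough sketch using the Data API; the nested field names (product_id, quantity), cluster ID, database, and user are assumptions:

import boto3

redshift_client = boto3.client('redshift-data')

# Unnest the SUPER array and pull out hypothetical nested attributes
sql = """
    SELECT o.order_id,
           item.product_id,
           item.quantity
    FROM customer_orders AS o, o.items AS item
    WHERE o.order_date > '2025-01-01'
"""

response = redshift_client.execute_statement(
    ClusterIdentifier='my-redshift-cluster',   # placeholder cluster ID
    Database='analytics',                      # placeholder database
    DbUser='awsuser',                          # placeholder DB user
    Sql=sql
)
print(response['Id'])  # statement ID; fetch rows later with get_statement_result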
Handling Schema Evolution
One challenge with Zero-ETL architectures is handling changes to the source schema.
Here are strategies to make your pipeline resilient:
Use Schema-on-Read in Redshift: Leverage the SUPER data type to store JSON documents, then use JSON parsing functions when querying.
Implement a Schema Registry: Maintain metadata about current and historical schemas.
Version Your Data: Include schema version information with each record.
Lambda Transformation Logic: Update your Lambda function to handle different schema versions.
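As a rough sketch of the last two strategies, the Lambda transformation step might branch on a version field before building the INSERT; the field names and version values below are purely illustrative:

# Hypothetical version-aware transformation inside the Lambda function
def transform(item):
    version = item.get('schema_version', '1')

    if version == '1':
        # v1 stored the amount in cents as an integer
        item['total_amount'] = float(item.pop('amount_cents', 0)) / 100
    elif version == '2':
        # v2 already uses a decimal total_amount; nothing to change
        pass
    else:
        # Unknown versions are passed through untouched but flagged for review
        item['needs_review'] = 'true'

    return item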
Optimizing Performance
To ensure your Zero-ETL pipeline performs well at scale:
Batch Operations: Modify the Lambda function to collect multiple records and send them to Redshift in a single call (a sketch follows this list).
Implement Retries: Add retry logic to handle transient failures.
Monitor Latency: Set up CloudWatch metrics to track the time from DynamoDB change to Redshift availability.
Scale Lambda Appropriately: Adjust concurrency limits based on your DynamoDB throughput.
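For the batching idea, one option is the Data API's batch_execute_statement, which accepts several SQL statements in one call. The sketch below is illustrative only; the cluster ID, database, and user are placeholders, and the naive value quoting would need proper escaping or the parameterized approach from Step 2 in production:

import boto3
from boto3.dynamodb.types import TypeDeserializer

redshift_client = boto3.client('redshift-data')
deserializer = TypeDeserializer()

def build_insert_sql(record, table='customer_orders'):
    # Turn one stream record into a single-row INSERT statement (illustrative helper)
    item = {k: deserializer.deserialize(v) for k, v in record['dynamodb']['NewImage'].items()}
    columns = ', '.join(item.keys())
    values = ', '.join(f"'{str(v)}'" for v in item.values())  # naive quoting, illustration only
    return f"INSERT INTO {table} ({columns}) VALUES ({values})"

def lambda_handler(event, context):
    # Collect one statement per relevant record, then send them in a single call
    statements = [
        build_insert_sql(record)
        for record in event['Records']
        if record['eventName'] in ['INSERT', 'MODIFY']
    ]
    if statements:
        redshift_client.batch_execute_statement(
            ClusterIdentifier='my-redshift-cluster',   # placeholder cluster ID
            Database='analytics',                      # placeholder database
            DbUser='awsuser',                          # placeholder DB user
            Sqls=statements
        )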
Best Practices and Considerations
When implementing Zero-ETL patterns, keep these best practices in mind:
Security: Ensure proper encryption of data in transit and at rest
Monitoring: Set up alerts for pipeline failures or unusual latency
Cost Management: Monitor Lambda invocations and Redshift usage
Error Handling: Implement dead-letter queues for failed records (see the sketch after this list)
Testing: Thoroughly test your pipeline with various data scenarios
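For the dead-letter queue item above, stream event source mappings support an on-failure destination plus retry controls. A minimal sketch, assuming a hypothetical mapping UUID and SQS queue ARN:

import boto3

lambda_client = boto3.client('lambda')

# Route records that still fail after retries to an SQS queue for inspection
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',   # placeholder mapping ID
    MaximumRetryAttempts=3,            # stop retrying a poison record after 3 attempts
    BisectBatchOnFunctionError=True,   # split failing batches to isolate the bad record
    DestinationConfig={
        'OnFailure': {
            'Destination': 'arn:aws:sqs:us-east-1:123456789012:zero-etl-dlq'
        }
    }
)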
Final Thoughts
While the implementation requires careful planning, the benefits of reduced maintenance and faster insights make Zero-ETL worthwhile for many use cases.
By leveraging the serverless capabilities of Lambda and the managed services of DynamoDB and Redshift, you can create a robust, scalable data pipeline that grows with your business needs while minimizing operational overhead.
Next Steps
To get started with your own Zero-ETL implementation:
Assess your current data flow and identify candidates for Zero-ETL
Start with a pilot project on non-critical data
Monitor performance and iteratively improve your implementation
Gradually expand to more critical data flows as you gain confidence
Remember that while Zero-ETL reduces traditional ETL work, it doesn't eliminate the need for thoughtful data architecture and governance.
The goal is to make data movement transparent, not to avoid thinking about how data should be structured and used.
Hire smarter, build faster.
Lemon.io connects you with top-tier engineers without the hiring headaches.
That’s it for today!
Did you enjoy this newsletter issue?
Share with your friends, colleagues, and your favorite social media platform.
Until next week — Amrut
Whenever you’re ready, there are 4 ways I can help you:
NEW! Get certified as an AWS AI Practitioner in 2025. Sign up today to elevate your cloud skills. (link)
Are you thinking about getting certified as a Google Cloud Digital Leader?
Here’s a link to my Udemy course, which has helped 628+ students prepare for and pass the exam. Currently rated 4.37/5. (link)
Free guides and helpful resources: https://thecloudplaybook.gumroad.com/
Sponsor The Cloud Playbook Newsletter:
https://www.thecloudplaybook.com/p/sponsor-the-cloud-playbook-newsletter
Get in touch
You can find me on LinkedIn or X.
If you wish to request a topic you would like to read, you can contact me directly via LinkedIn or X.