TCP #42: Zero-ETL Patterns on AWS with DynamoDB and Redshift
Part 2: How to Achieve Zero-ETL with DynamoDB and Redshift Using Lambda Triggers
You can also read my newsletters from the Substack mobile app and be notified when a new issue is available.
👋 Before we begin... a big thank you to today's sponsor, Lemon.io
Your Backlog Shouldn't Be a Bottleneck
Finding skilled developers feels impossible, even for the most experienced engineering leaders. While you're stuck in months-long hiring cycles, your critical features gather dust.
Lemon.io changes everything by:
Rigorously vetting engineers through a multi-step process
Evaluating technical skills, problem-solving, and communication
Matching you with perfect-fit talent in just 48 hours
Need specialized skills or quick scaling?
Our pre-vetted developers from Europe and Latin America integrate seamlessly into your workflow without long hiring cycles or long-term contracts.
➡️ Turn your backlog into shipped features at lemon.io
This is part 2 of my 3-part series on Zero-ETL Patterns on AWS.
Check out part 1, where I discussed Zero-ETL Patterns on AWS with Aurora, Redshift, and OpenSearch.
Analyzing data quickly and efficiently is crucial for making informed business decisions in today's data-driven world.
Traditionally, this process has involved Extract, Transform, Load (ETL) operations that move data from operational databases to analytical systems. However, these operations can be complex, resource-intensive, and introduce latency.
Zero-ETL is an architectural pattern that eliminates or minimizes traditional ETL processes by automatically synchronizing data between operational and analytical systems.
In this newsletter, I'll explore how to implement Zero-ETL patterns on AWS using DynamoDB as your operational database and Redshift as your analytical data warehouse, with Lambda functions serving as the connecting tissue.
Understanding the Zero-ETL Concept
Before diving into implementation details, let's clarify what Zero-ETL means.
Despite the name, Zero-ETL doesn't eliminate the need to move and transform data.
Instead, it automates these processes to make them transparent to users and reduce the engineering effort required.
In a Zero-ETL architecture:
Data moves automatically from source to destination
Transformations happen in-flight or at the destination
The process runs continuously with minimal latency
Users don't need to manage complex ETL pipelines
The goal is to create a seamless flow of data that maintains consistency between your operational and analytical systems.
The AWS Components: DynamoDB, Redshift, and Lambda
DynamoDB
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. It's designed for applications that need consistent, single-digit millisecond latency at any scale.
Its key features include:
Automatic scaling
Built-in security
Backup and restore capabilities
Global tables for multi-region deployment
Amazon Redshift
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. It's designed for high-performance analysis of structured data using standard SQL.
Key features include:
Columnar storage technology
Massively parallel processing (MPP) architecture
Advanced compression capabilities
Integration with various AWS services
AWS Lambda
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers.
It executes your code only when needed and scales automatically. In our Zero-ETL architecture, Lambda will:
Detect changes in DynamoDB
Transform data as needed
Load the transformed data into Redshift
"While other companies were still reviewing resumes, we were already shipping new features..."
That's the Lemon.io difference. They rigorously vet engineers from Europe and Latin America who match your specific needs and are ready to contribute in just 48 hours.
No lengthy hiring cycles. No long-term contracts.
Just skilled developers who integrate seamlessly into your workflow, helping you ship faster and scale on demand.
Visit lemon.io to transform your development velocity.
Building the Zero-ETL Pipeline
Let's walk through creating a Zero-ETL pipeline between DynamoDB and Redshift.
Step 1: Set Up DynamoDB Streams
DynamoDB Streams captures a time-ordered sequence of item-level modifications in a DynamoDB table.
To enable streams:
Go to the DynamoDB console
Select your table
Choose the "Manage Stream" option
Select "New and old images" as the view type
This configuration allows Lambda to access the previous and new state of modified items.
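If you prefer scripting over the console, a minimal boto3 sketch for enabling the stream might look like the following; the table name is a placeholder:

import boto3

dynamodb = boto3.client('dynamodb')

# Enable a stream that emits both the new and old item images on every change
response = dynamodb.update_table(
    TableName='customer_orders',  # placeholder table name
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)

# The stream ARN is needed later when wiring up the Lambda trigger
print(response['TableDescription']['LatestStreamArn'])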
Step 2: Create the Lambda Function
Now, let's create a Lambda function that will process the stream records and load them into Redshift:
import json
import boto3
import os
from decimal import Decimal
from boto3.dynamodb.types import TypeDeserializer

# Initialize AWS clients and the DynamoDB type deserializer
redshift_client = boto3.client('redshift-data')
deserializer = TypeDeserializer()

# Get environment variables
redshift_cluster_id = os.environ['REDSHIFT_CLUSTER_ID']
redshift_database = os.environ['REDSHIFT_DATABASE']
redshift_user = os.environ['REDSHIFT_USER']
redshift_table = os.environ['REDSHIFT_TABLE']

# Helper class to serialize Decimal values (DynamoDB's number type) as JSON
class DecimalEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        return super(DecimalEncoder, self).default(obj)

def lambda_handler(event, context):
    for record in event['Records']:
        # Get the source DynamoDB table name from the stream ARN
        table_name = record['eventSourceARN'].split('/')[1]

        # Process only if it's an INSERT or MODIFY event
        if record['eventName'] in ['INSERT', 'MODIFY']:
            # Get the new image of the item (DynamoDB JSON with type descriptors)
            new_image = record['dynamodb']['NewImage']

            # Convert DynamoDB JSON into a plain Python dictionary
            item = {key: deserializer.deserialize(value) for key, value in new_image.items()}

            # Prepare a parameterized SQL statement; the Redshift Data API uses
            # :name placeholders and string parameter values
            columns = ', '.join(item.keys())
            placeholders = ', '.join(f':{key}' for key in item.keys())
            sql = f"INSERT INTO {redshift_table} ({columns}) VALUES ({placeholders})"

            parameters = [
                {
                    'name': key,
                    # Nested structures are serialized to JSON; everything else becomes a string
                    'value': json.dumps(value, cls=DecimalEncoder)
                    if isinstance(value, (dict, list)) else str(value)
                }
                for key, value in item.items()
            ]

            # Execute the statement on Redshift
            response = redshift_client.execute_statement(
                ClusterIdentifier=redshift_cluster_id,
                Database=redshift_database,
                DbUser=redshift_user,
                Sql=sql,
                Parameters=parameters
            )

    return {
        'statusCode': 200,
        'body': json.dumps('Data successfully processed and loaded into Redshift')
    }
This Lambda function:
Listens for changes in DynamoDB via the stream
Processes each record in the stream
Converts the DynamoDB JSON format (with its type descriptors) into a plain dictionary
Constructs and executes a parameterized SQL INSERT statement through the Redshift Data API to load the data into Redshift
Step 3: Configure Lambda Permissions and Environment Variables
For the Lambda function to work correctly, you need to:
Create an IAM role with permissions for:
DynamoDB Streams (read access)
Redshift Data API (execute statements)
CloudWatch Logs (for logging)
Set the following environment variables:
REDSHIFT_CLUSTER_ID: Your Redshift cluster identifier
REDSHIFT_DATABASE: The name of your Redshift database
REDSHIFT_USER: Username for Redshift access
REDSHIFT_TABLE: The target table in Redshift
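As a rough illustration, the sketch below creates an inline policy and sets the environment variables with boto3. The role name, function name, account ID, region, and resource ARNs are placeholders, and your exact policy may differ (for example, using the Data API with a DbUser typically also requires redshift:GetClusterCredentials):

import json
import boto3

iam = boto3.client('iam')
lambda_client = boto3.client('lambda')

# Minimal permissions sketch; tighten the resources for your environment
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "dynamodb:DescribeStream",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:ListStreams"
            ],
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/customer_orders/stream/*"
        },
        {
            "Effect": "Allow",
            "Action": ["redshift-data:ExecuteStatement", "redshift:GetClusterCredentials"],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*"
        }
    ]
}

iam.put_role_policy(
    RoleName='dynamodb-to-redshift-lambda-role',   # placeholder role name
    PolicyName='zero-etl-pipeline-policy',
    PolicyDocument=json.dumps(policy_document)
)

# Attach the environment variables the function reads at runtime
lambda_client.update_function_configuration(
    FunctionName='dynamodb-to-redshift-loader',    # placeholder function name
    Environment={
        'Variables': {
            'REDSHIFT_CLUSTER_ID': 'my-redshift-cluster',
            'REDSHIFT_DATABASE': 'analytics',
            'REDSHIFT_USER': 'awsuser',
            'REDSHIFT_TABLE': 'customer_orders'
        }
    }
)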
Step 4: Create the Trigger
Now, connect your Lambda function to the DynamoDB stream:
In the Lambda console, add a trigger for DynamoDB
Select your DynamoDB table
Set the starting position to "Latest"
Save the configuration
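The same trigger can also be created programmatically as an event source mapping. In this minimal sketch, the stream ARN and function name are placeholders:

import boto3

lambda_client = boto3.client('lambda')

# Wire the DynamoDB stream to the Lambda function
response = lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:us-east-1:123456789012:table/customer_orders/stream/2025-01-01T00:00:00.000',
    FunctionName='dynamodb-to-redshift-loader',
    StartingPosition='LATEST',
    BatchSize=100,   # records delivered per invocation
    Enabled=True
)
print(response['UUID'])  # mapping ID, useful for later updates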
Step 5: Prepare the Redshift Table
Your Redshift table schema should match the structure of your DynamoDB data. For example:
CREATE TABLE customer_orders (
    order_id VARCHAR(100) PRIMARY KEY,
    customer_id VARCHAR(100),
    order_date TIMESTAMP,
    total_amount DECIMAL(10,2),
    items SUPER
);
Note the SUPER data type, which can store semi-structured data like JSON, making it ideal for DynamoDB items that may contain nested attributes.
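Once data lands, you can navigate nested attributes in the SUPER column with Redshift's PartiQL-style syntax. Here's a rough sketch using the Data API; the nested field names (product_id, quantity), cluster ID, database, and user are assumptions:

import boto3

redshift_client = boto3.client('redshift-data')

# Unnest the SUPER array and pull out hypothetical nested attributes
sql = """
    SELECT o.order_id,
           item.product_id,
           item.quantity
    FROM customer_orders AS o, o.items AS item
    WHERE o.order_date > '2025-01-01'
"""

response = redshift_client.execute_statement(
    ClusterIdentifier='my-redshift-cluster',   # placeholder cluster ID
    Database='analytics',                      # placeholder database
    DbUser='awsuser',                          # placeholder DB user
    Sql=sql
)
print(response['Id'])  # statement ID; fetch rows later with get_statement_result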
Handling Schema Evolution
One challenge with Zero-ETL architectures is handling changes to the source schema.
Here are strategies to make your pipeline resilient:
Use Schema-on-Read in Redshift: Leverage the SUPER data type to store JSON documents, then use JSON parsing functions when querying.
Implement a Schema Registry: Maintain metadata about current and historical schemas.
Version Your Data: Include schema version information with each record.
Lambda Transformation Logic: Update your Lambda function to handle different schema versions.
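As a rough sketch of the last two strategies, the Lambda transformation step might branch on a version field before building the INSERT; the field names and version values below are purely illustrative:

# Hypothetical version-aware transformation inside the Lambda function
def transform(item):
    version = item.get('schema_version', '1')

    if version == '1':
        # v1 stored the amount in cents as an integer
        item['total_amount'] = float(item.pop('amount_cents', 0)) / 100
    elif version == '2':
        # v2 already uses a decimal total_amount; nothing to change
        pass
    else:
        # Unknown versions are passed through untouched but flagged for review
        item['needs_review'] = 'true'

    return item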
Optimizing Performance
To ensure your Zero-ETL pipeline performs well at scale:
Batch Operations: Modify the Lambda function to collect multiple records and send them to Redshift in a single call (a sketch follows this list).
Implement Retries: Add retry logic to handle transient failures.
Monitor Latency: Set up CloudWatch metrics to track the time from DynamoDB change to Redshift availability.
Scale Lambda Appropriately: Adjust concurrency limits based on your DynamoDB throughput.
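For the batching idea, one option is the Data API's batch_execute_statement, which accepts several SQL statements in one call. The sketch below is illustrative only; the cluster ID, database, and user are placeholders, and the naive value quoting would need proper escaping or the parameterized approach from Step 2 in production:

import boto3
from boto3.dynamodb.types import TypeDeserializer

redshift_client = boto3.client('redshift-data')
deserializer = TypeDeserializer()

def build_insert_sql(record, table='customer_orders'):
    # Turn one stream record into a single-row INSERT statement (illustrative helper)
    item = {k: deserializer.deserialize(v) for k, v in record['dynamodb']['NewImage'].items()}
    columns = ', '.join(item.keys())
    values = ', '.join(f"'{str(v)}'" for v in item.values())  # naive quoting, illustration only
    return f"INSERT INTO {table} ({columns}) VALUES ({values})"

def lambda_handler(event, context):
    # Collect one statement per relevant record, then send them in a single call
    statements = [
        build_insert_sql(record)
        for record in event['Records']
        if record['eventName'] in ['INSERT', 'MODIFY']
    ]
    if statements:
        redshift_client.batch_execute_statement(
            ClusterIdentifier='my-redshift-cluster',   # placeholder cluster ID
            Database='analytics',                      # placeholder database
            DbUser='awsuser',                          # placeholder DB user
            Sqls=statements
        )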
Best Practices and Considerations
When implementing Zero-ETL patterns, keep these best practices in mind:
Security: Ensure proper encryption of data in transit and at rest
Monitoring: Set up alerts for pipeline failures or unusual latency
Cost Management: Monitor Lambda invocations and Redshift usage
Error Handling: Implement dead-letter queues for failed records (see the sketch after this list)
Testing: Thoroughly test your pipeline with various data scenarios
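For the dead-letter queue item above, stream event source mappings support an on-failure destination plus retry controls. A minimal sketch, assuming a hypothetical mapping UUID and SQS queue ARN:

import boto3

lambda_client = boto3.client('lambda')

# Route records that still fail after retries to an SQS queue for inspection
lambda_client.update_event_source_mapping(
    UUID='11111111-2222-3333-4444-555555555555',   # placeholder mapping ID
    MaximumRetryAttempts=3,            # stop retrying a poison record after 3 attempts
    BisectBatchOnFunctionError=True,   # split failing batches to isolate the bad record
    DestinationConfig={
        'OnFailure': {
            'Destination': 'arn:aws:sqs:us-east-1:123456789012:zero-etl-dlq'
        }
    }
)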
Final Thoughts
While the implementation requires careful planning, the benefits of reduced maintenance and faster insights make Zero-ETL worthwhile for many use cases.
By leveraging the serverless capabilities of Lambda and the managed services of DynamoDB and Redshift, you can create a robust, scalable data pipeline that grows with your business needs while minimizing operational overhead.
Next Steps
To get started with your own Zero-ETL implementation:
Assess your current data flow and identify candidates for Zero-ETL
Start with a pilot project on non-critical data
Monitor performance and iteratively improve your implementation
Gradually expand to more critical data flows as you gain confidence
Remember that while Zero-ETL reduces traditional ETL work, it doesn't eliminate the need for thoughtful data architecture and governance.
The goal is to make data movement transparent, not to avoid thinking about how data should be structured and used.
Hire smarter, build faster.
Lemon.io connects you with top-tier engineers without the hiring headaches.
That’s it for today!
Did you enjoy this newsletter issue?
Share with your friends, colleagues, and your favorite social media platform.
Until next week — Amrut
Whenever you’re ready, there are 4 ways I can help you:
NEW! Get certified as an AWS AI Practitioner in 2025. Sign up today to elevate your cloud skills. (link)
Are you thinking about getting certified as a Google Cloud Digital Leader?
Here’s a link to my Udemy course, which has helped 628+ students prepare for and pass the exam. Currently rated 4.37/5. (link)
Free guides and helpful resources: https://thecloudplaybook.gumroad.com/
Sponsor The Cloud Playbook Newsletter:
https://www.thecloudplaybook.com/p/sponsor-the-cloud-playbook-newsletter
Get in touch
You can find me on LinkedIn or X.
If you wish to request a topic you would like to read, you can contact me directly via LinkedIn or X.