TCP #43: Zero-ETL Patterns on AWS with DynamoDB and OpenSearch
Part 3: How to Achieve Zero-ETL Patterns on AWS with DynamoDB and OpenSearch
👋 Before we begin... a big thank you to today's sponsor, Lemon.io
How Much Does Your Hiring Delay Cost You?
When critical features sit in your backlog for months during traditional hiring cycles, you're losing more than time—you're losing market opportunity.
Lemon.io eliminates this risk.
Through our rigorous multi-step vetting process, we assess technical skills, problem-solving abilities, and communication—delivering top engineers in just 48 hours.
Need to scale quickly or find specialized talent? Our developers from Europe and Latin America become part of your team immediately—no lengthy hiring, no long-term commitments.
This is part 3 of the 3-part series on Zero-ETL Patterns on AWS.
You can find the other parts here:
Traditional ETL pipelines require significant engineering resources to build and maintain, introduce latency, and can become bottlenecks as your data needs grow.
Zero-ETL represents a paradigm shift in how we think about data movement.
Rather than treating data transfer as a separate, batch-oriented process, Zero-ETL aims to make data flow automatically and in real time between different systems.
When appropriately implemented, this approach can dramatically reduce data latency, simplify architecture, and allow your teams to focus on deriving insights rather than managing pipelines.
In this newsletter, I'll explore how to implement Zero-ETL patterns on AWS using DynamoDB as your operational database and OpenSearch as your search and analytics engine.
Let's dive in!
Understanding the Building Blocks
Before we get into implementation details, let's understand the core AWS services we'll be working with:
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
It excels at handling high-throughput, low-latency workloads and is an excellent operational database for many applications.
Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service) is a managed service that makes it easy to deploy, operate, and scale OpenSearch clusters. It's ideal for log analytics, full-text search, application monitoring, and more.
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. In our Zero-ETL architecture, Lambda will serve as the connector that reacts to changes in DynamoDB and updates OpenSearch accordingly.
The Architecture: Building a Real-Time Data Pipeline
Our Zero-ETL architecture follows these principles:
Changes in DynamoDB are captured automatically
Data transformations happen in-flight
OpenSearch is updated in near real-time
The entire process is automatic and resilient
Here's how the components fit together:
DynamoDB Streams captures item-level changes in your DynamoDB table
Lambda function is triggered by these streams
The function transforms the data as needed
The transformed data is indexed in OpenSearch
Your applications can then search and analyze this data
Let's build this pipeline step by step.
Imagine Your Entire Backlog Cleared by Next Month
While other engineering teams still post job ads, you could ship those critical features. Lemon.io makes this possible by matching you with rigorously vetted engineers in just 48 hours.
Unlike platforms that merely check résumés, our multi-step vetting process thoroughly assesses technical skills, problem-solving abilities, and communication—delivering developers who make an immediate impact.
Need specialized talent that's hard to find locally?
Scale quickly with our pre-vetted developers from Europe and Latin America—no long hiring cycles, no long-term commitments.
Step 1: Setting Up DynamoDB Streams
DynamoDB Streams is the foundation of our Zero-ETL architecture. It provides a time-ordered sequence of item-level modifications in your DynamoDB table.
To enable DynamoDB Streams:
Open the DynamoDB console
Select your table
Choose the "Exports and streams" tab
Under "DynamoDB stream details," click "Enable"
Select "New and old images" for the view type
This configuration captures both the previous and new state of each modified item, giving you maximum flexibility for handling updates and deletes.
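If you'd rather script this than click through the console, the same setting can be applied with boto3. Here's a minimal sketch; the table name is a placeholder:
import boto3

dynamodb = boto3.client('dynamodb')

# Enable a stream that captures both the old and new image of each item
dynamodb.update_table(
    TableName='YourTable',  # placeholder table name
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'
    }
)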
Step 2: Creating the Lambda Function
Now, let's create a Lambda function that will process the stream records and update OpenSearch. Here's a detailed implementation:
import json
import os
from decimal import Decimal

import boto3
# requests and requests_aws4auth are not included in the Lambda runtime by default;
# package them with your function or in a Lambda layer
import requests
from boto3.dynamodb.types import TypeDeserializer
from requests_aws4auth import AWS4Auth

# Configuration from environment variables
region = os.environ['AWS_REGION']
opensearch_endpoint = os.environ['OPENSEARCH_ENDPOINT']
opensearch_index = os.environ['OPENSEARCH_INDEX']

# Get credentials for signing OpenSearch requests
credentials = boto3.Session().get_credentials()
awsauth = AW4Auth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    'es',
    session_token=credentials.token
)

# Reusable clients and helpers, created once per container rather than per record
dynamodb = boto3.resource('dynamodb')
deserializer = TypeDeserializer()


class DecimalEncoder(json.JSONEncoder):
    """Serialize the Decimal values (and sets) DynamoDB produces into JSON-friendly types."""
    def default(self, obj):
        if isinstance(obj, Decimal):
            return float(obj)
        if isinstance(obj, set):
            return list(obj)
        return super().default(obj)


def unmarshal(ddb_image):
    """Convert a DynamoDB-typed image (e.g., {"name": {"S": "foo"}}) to plain Python values."""
    return {k: deserializer.deserialize(v) for k, v in ddb_image.items()}


def lambda_handler(event, context):
    success_count = 0
    failure_count = 0

    # Process each record in the DynamoDB stream batch
    for record in event['Records']:
        try:
            # The event source ARN looks like:
            # arn:aws:dynamodb:region:account:table/TableName/stream/timestamp
            table_name = record['eventSourceARN'].split('/')[1]

            # Look up the table's primary key configuration
            key_schema = dynamodb.Table(table_name).key_schema

            if record['eventName'] == 'REMOVE':
                # For deletions, the item state is in the old image
                ddb_image = record['dynamodb'].get('OldImage', {})
            else:
                # For inserts and updates, use the new image
                ddb_image = record['dynamodb'].get('NewImage', {})

            # Convert DynamoDB's typed JSON to standard JSON,
            # turning Decimals into floats along the way
            item = json.loads(json.dumps(unmarshal(ddb_image), cls=DecimalEncoder))

            # Build a deterministic document ID from the primary key
            document_id = build_document_id(key_schema, item)

            # Handle the different event types
            if record['eventName'] == 'REMOVE':
                response = delete_from_opensearch(document_id)
            else:
                response = index_in_opensearch(document_id, item)

            # Check whether the operation succeeded
            if 200 <= response.status_code < 300:
                success_count += 1
            else:
                failure_count += 1
                print(f"Error processing record: {response.text}")
        except Exception as e:
            failure_count += 1
            print(f"Error processing record: {str(e)}")

    # Return a summary of the batch
    return {
        'statusCode': 200,
        'body': json.dumps({
            'processedRecords': len(event['Records']),
            'successCount': success_count,
            'failureCount': failure_count
        })
    }


def build_document_id(key_schema, item):
    """Build a document ID from the primary key attributes (hash key, then range key)."""
    parts = []
    for key in key_schema:
        attr_name = key['AttributeName']
        if attr_name in item:
            parts.append(str(item[attr_name]))
    return "#".join(parts)


def index_in_opensearch(document_id, document):
    """Index (create or overwrite) a document in OpenSearch."""
    url = f"{opensearch_endpoint}/{opensearch_index}/_doc/{document_id}"
    headers = {"Content-Type": "application/json"}
    return requests.put(url, auth=awsauth, json=document, headers=headers)


def delete_from_opensearch(document_id):
    """Delete a document from OpenSearch."""
    url = f"{opensearch_endpoint}/{opensearch_index}/_doc/{document_id}"
    return requests.delete(url, auth=awsauth)
This Lambda function:
Processes each record from the DynamoDB stream
Handles different event types (INSERT, MODIFY, REMOVE)
Converts DynamoDB's JSON format to standard JSON
Creates a unique document ID based on the item's primary key
Indexes or deletes documents in OpenSearch as appropriate
Includes proper error handling and reporting
Step 3: Configuring Lambda Permissions and Environment Variables
For the Lambda function to work correctly, you need to:
Create an IAM role with these permissions:
DynamoDB: dynamodb:DescribeTable, dynamodb:GetItem
DynamoDB Streams: dynamodb:GetRecords, dynamodb:GetShardIterator, dynamodb:DescribeStream, dynamodb:ListStreams
OpenSearch: es:ESHttpPut, es:ESHttpDelete, es:ESHttpGet
CloudWatch Logs: logs:CreateLogGroup, logs:CreateLogStream, logs:PutLogEvents
Set these environment variables:
OPENSEARCH_ENDPOINT: The endpoint URL of your OpenSearch domain
OPENSEARCH_INDEX: The name of the index where you want to store the data
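Translated into a policy document, the permission list above might look like this minimal sketch; the account ID, region, and resource names are placeholders you'd replace with your own:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["dynamodb:DescribeTable", "dynamodb:GetItem"],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/YourTable"
    },
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetRecords",
        "dynamodb:GetShardIterator",
        "dynamodb:DescribeStream",
        "dynamodb:ListStreams"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/YourTable/stream/*"
    },
    {
      "Effect": "Allow",
      "Action": ["es:ESHttpPut", "es:ESHttpDelete", "es:ESHttpGet"],
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/your-domain/*"
    },
    {
      "Effect": "Allow",
      "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
      "Resource": "arn:aws:logs:us-east-1:123456789012:*"
    }
  ]
}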
Step 4: Creating the OpenSearch Index
Before data can flow into OpenSearch, you must create an index with the appropriate mapping. The mapping defines how documents and their fields are stored and indexed.
Here's how to create a basic index using the OpenSearch Dashboard's Dev Tools:
PUT /your-index-name
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text" },
      "description": { "type": "text" },
      "price": { "type": "float" },
      "created_at": { "type": "date" },
      "tags": { "type": "keyword" }
    }
  }
}
Adjust the field names and types to match your DynamoDB table structure. The mapping should reflect the structure of your data after transformation in the Lambda function.
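Once records start flowing, a quick way to verify the end-to-end path is a search from Dev Tools; the field name here matches the example mapping above:
GET /your-index-name/_search
{
  "query": {
    "match": { "name": "example" }
  }
}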
Step 5: Setting Up the Lambda Trigger
Now, connect your Lambda function to the DynamoDB stream:
In the Lambda console, go to your function
Click on "Add trigger"
Select "DynamoDB" as the source
Choose your DynamoDB table
Set "Starting position" to "Latest"
Click "Add"
Your Lambda function will be invoked whenever changes occur in your DynamoDB table.
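The console steps above can also be scripted. Here's a boto3 sketch of the same trigger; the stream ARN and function name are placeholders:
import boto3

lambda_client = boto3.client('lambda')

# Equivalent to the console steps above
lambda_client.create_event_source_mapping(
    EventSourceArn='arn:aws:dynamodb:us-east-1:123456789012:table/YourTable/stream/2025-01-01T00:00:00.000',
    FunctionName='your-zero-etl-function',
    StartingPosition='LATEST',
    BatchSize=100  # records per invocation; tune for your workload
)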
Handling Schema Evolution
One of the challenges in maintaining Zero-ETL pipelines is dealing with schema changes.
Here are strategies to make your pipeline resilient:
Dynamic Mapping in OpenSearch: Enable dynamic mapping in OpenSearch to adapt to new fields automatically.
Schema Version Tracking: Include a schema version with each item and handle different versions in your Lambda function.
Field Pruning: In the Lambda function, selectively include or exclude fields based on your needs.
Transform Layer: Implement a transformation layer in your Lambda function that can adapt to schema changes.
Here's how you might implement versioned transformations in your Lambda function:
def transform_item(item, version):
    if version == 1:
        return transform_v1(item)
    elif version == 2:
        return transform_v2(item)
    else:
        return item  # Default: pass the item through unchanged

def transform_v1(item):
    # Original transformation logic
    pass

def transform_v2(item):
    # Updated transformation logic
    pass
Advanced Pattern: Using SQS for Resilience
For improved resilience, you can introduce an SQS queue between the Lambda function and OpenSearch:
The Lambda function sends records to an SQS queue instead of directly to OpenSearch
A separate Lambda function processes messages from the queue and updates OpenSearch
Failed operations can be sent to a Dead Letter Queue for later processing
This pattern provides better buffering and retry capabilities, essential for handling OpenSearch service disruptions or throttling.
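Here's a minimal sketch of the forwarding side of this pattern, assuming the queue URL arrives in a QUEUE_URL environment variable (a name chosen for illustration):
import json
import os

import boto3

sqs = boto3.client('sqs')
queue_url = os.environ['QUEUE_URL']  # hypothetical environment variable

def lambda_handler(event, context):
    # Forward stream records to SQS in batches of 10 (the SQS batch maximum)
    records = event['Records']
    for i in range(0, len(records), 10):
        batch = records[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {'Id': str(idx), 'MessageBody': json.dumps(rec)}
                for idx, rec in enumerate(batch)
            ]
        )
    return {'statusCode': 200}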
Optimizing for Performance and Cost
To ensure your Zero-ETL pipeline performs well while keeping costs under control:
Batch Processing: Modify your Lambda function to process multiple records in a single OpenSearch bulk operation (see the sketch after this list).
Error Handling: Implement robust error handling and consider using Dead Letter Queues for records that can't be processed.
Monitoring: Set up CloudWatch Alarms to monitor the performance and health of your pipeline.
Scaling: Configure appropriate provisioned capacity for DynamoDB and choose the right instance type and count for OpenSearch.
Data Lifecycle: Implement index lifecycle management in OpenSearch to handle aging data.
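Here's one way the batch-processing idea could look, reusing the opensearch_endpoint, opensearch_index, awsauth, and DecimalEncoder globals from the handler in Step 2:
def bulk_index(documents):
    """Index many documents in one request via the OpenSearch _bulk API.

    `documents` maps document_id -> document body. A sketch, not a drop-in
    replacement for the single-document helpers above.
    """
    lines = []
    for doc_id, doc in documents.items():
        # Each operation is an action line followed by the document source
        lines.append(json.dumps({'index': {'_index': opensearch_index, '_id': doc_id}}))
        lines.append(json.dumps(doc, cls=DecimalEncoder))
    payload = '\n'.join(lines) + '\n'  # the bulk API requires a trailing newline
    return requests.post(
        f"{opensearch_endpoint}/_bulk",
        auth=awsauth,
        data=payload,
        headers={"Content-Type": "application/x-ndjson"}
    )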
Monitoring Your Zero-ETL Pipeline
To ensure the health and performance of your Zero-ETL pipeline, monitor these key metrics:
Lambda Execution Metrics: Invocations, errors, duration, throttles
DynamoDB Metrics: Read/write capacity, throttled requests, stream records
OpenSearch Metrics: Cluster health, indexing rate, search latency
End-to-End Latency: Time from DynamoDB update to OpenSearch availability
Set up CloudWatch Dashboards to visualize these metrics and configure alarms for important thresholds.
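As a starting point, here's a boto3 sketch that alarms on any errors from the sync function; the alarm name, function name, and threshold are illustrative:
import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when the sync function reports any errors over a 5-minute window
cloudwatch.put_metric_alarm(
    AlarmName='zero-etl-lambda-errors',
    Namespace='AWS/Lambda',
    MetricName='Errors',
    Dimensions=[{'Name': 'FunctionName', 'Value': 'your-zero-etl-function'}],
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching'
)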
Best Practices and Considerations
When implementing Zero-ETL patterns with DynamoDB and OpenSearch, keep these best practices in mind:
Security: Use IAM roles with least privilege, encrypt data in transit and at rest
Testing: Thoroughly test your pipeline with various data scenarios and failure modes
Documentation: Document your architecture, including error handling and recovery procedures
Versioning: Version your Lambda functions and OpenSearch mappings
Costs: Regularly review and optimize costs across all components
Final Thoughts
Implementing Zero-ETL patterns between DynamoDB and OpenSearch opens up new real-time search and analytics possibilities.
By eliminating traditional ETL processes, you can:
Reduce data latency from hours to seconds
Simplify your overall architecture
Improve developer productivity
Enable new real-time use cases
While setting up a Zero-ETL pipeline requires careful planning and consideration, the long-term benefits in terms of agility, performance, and reduced maintenance make it a worthwhile investment for many organizations.
As you embark on your Zero-ETL journey, remember that the goal is not just to move data more efficiently, but to enable your organization to make better, faster decisions based on the most current information available.
Next Steps
To get started with your own Zero-ETL implementation:
Evaluate your current data flow and identify opportunities for real-time processing
Start with a pilot project using non-critical data
Iterate and refine based on lessons learned
Gradually expand to more critical data flows
Hiring doesn’t have to slow you down.
Lemon.io delivers pre-vetted developers who integrate seamlessly into your team—fast.
That’s it for today!
Did you enjoy this newsletter issue?
Share with your friends, colleagues, and your favorite social media platform.
Until next week — Amrut
Whenever you’re ready, there are 4 ways I can help you:
NEW! Get certified as an AWS AI Practitioner in 2025. Sign up today to elevate your cloud skills. (link)
Are you thinking about getting certified as a Google Cloud Digital Leader?
Here’s a link to my Udemy course, which has helped 628+ students prepare for and pass the exam. Currently rated 4.37/5. (link)
Free guides and helpful resources: https://thecloudplaybook.gumroad.com/
Sponsor The Cloud Playbook Newsletter:
https://www.thecloudplaybook.com/p/sponsor-the-cloud-playbook-newsletter
Get in touch
You can find me on LinkedIn or X.
If you wish to request a topic you would like to read, you can contact me directly via LinkedIn or X.