TCP #47: Zero-ETL Architectures with Amazon MSK
Building the Data Backbone of Modern Applications
You can also read my newsletters from the Substack mobile app and be notified when a new issue is available.
The Cloud Playbook is now offering sponsorship slots in each issue. If you want to feature your product or service in my newsletter, explore my sponsor page.
In today's digital economy, data doesn't just grow; it flows.
Organizations increasingly need to process and analyze information as it's created, not hours or days later. This shift from batch to real-time processing has made Apache Kafka a cornerstone technology in modern data architectures.
Amazon Managed Streaming for Apache Kafka (MSK) brings Kafka's power to AWS as a fully managed service, eliminating much of the operational overhead while maintaining compatibility with the Kafka ecosystem.
When appropriately designed, MSK can be the foundation for truly Zero-ETL architectures, where data flows seamlessly from producers to consumers without traditional extract, transform, and load processes.
In this post, I'll explore how to leverage Amazon MSK to build resilient, scalable Zero-ETL pipelines that can transform how your organization manages data.
The Promise of MSK in Zero-ETL Architectures
Traditional ETL processes often introduce delays, require significant engineering resources, and create brittle dependencies.
With Amazon MSK at the center of your data architecture, you can:
Process data in real-time as events occur
Decouple producers and consumers for greater flexibility
Scale horizontally to handle growing data volumes
Preserve the complete data history for reprocessing when needed
Integrate natively with numerous AWS services
Perhaps most importantly, MSK enables a publish-subscribe model where producers don't need to know who will consume their data. This fundamental shift eliminates the need to build and maintain point-to-point integrations between systems.
Building Blocks of an MSK-Based Zero-ETL Architecture
Let's explore the key components that make up a comprehensive Zero-ETL solution using Amazon MSK:
1. Amazon MSK Cluster
The central nervous system of your architecture, an MSK cluster:
Provides durable storage for event streams
Handles partitioning for parallel processing
Manages replication for high availability
Scales to accommodate growing workloads
When creating your MSK cluster, consider these key configurations (a minimal provisioning sketch follows the list):
Broker size and count based on throughput needs
Availability Zones for resilience (two minimum; three recommended for production)
Storage volume size and optional provisioned storage throughput for high-throughput workloads
Encryption settings for data in transit and at rest
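To make these choices concrete, here's a minimal provisioning sketch using boto3's create_cluster_v2; the Kafka version, instance type, subnets, and security group are placeholders you'd replace with your own values.

import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# Minimal provisioned cluster: 3 brokers spread across 3 AZs,
# TLS in transit, IAM auth, and EBS storage (all IDs are placeholders).
response = kafka.create_cluster_v2(
    ClusterName="zero-etl-events",
    Provisioned={
        "KafkaVersion": "3.6.0",
        "NumberOfBrokerNodes": 3,  # one broker per subnet/AZ below
        "BrokerNodeGroupInfo": {
            "InstanceType": "kafka.m5.large",
            "ClientSubnets": [
                "subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"
            ],
            "SecurityGroups": ["sg-0123456789abcdef0"],
            "StorageInfo": {"EbsStorageInfo": {"VolumeSize": 1000}},  # GiB per broker
        },
        "EncryptionInfo": {
            "EncryptionInTransit": {"ClientBroker": "TLS", "InCluster": True}
        },
        "ClientAuthentication": {"Sasl": {"Iam": {"Enabled": True}}},
    },
)
print(response["ClusterArn"])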
2. MSK Connect for Seamless Integration
One of the most powerful features for Zero-ETL workflows is MSK Connect, which provides managed Kafka Connect workers that can move data between Kafka and other systems without custom code.
MSK Connect S3 Sink Connector Configuration
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "4",
  "topics": "web-events,app-events,system-metrics",
  "s3.region": "us-east-1",
  "s3.bucket.name": "my-analytics-data-lake",
  "topics.dir": "raw-events",
  "flush.size": "10000",
  "rotate.schedule.interval.ms": "300000",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
  "parquet.codec": "snappy",
  "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
  "partition.duration.ms": "3600000",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
  "locale": "en-US",
  "timezone": "UTC",
  "schema.compatibility": "FULL",
  "transforms": "addMetadata",
  "transforms.addMetadata.type": "org.apache.kafka.connect.transforms.InsertField$Value",
  "transforms.addMetadata.timestamp.field": "s3_ingestion_time"
}
With pre-built connectors for popular systems, you can easily:
Stream data to S3 for your data lake
Feed OpenSearch for real-time search and analytics
Update DynamoDB tables as events occur
Sync with RDS databases for operational workloads
Send data to Redshift for analytical queries
The beauty of MSK Connect is that once configured, these data flows run continuously and automatically, embodying the Zero-ETL principle.
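Deploying a connector like the S3 sink above is itself a one-time API call. Here's a minimal sketch with boto3's kafkaconnect client; the plugin ARN, IAM role, bootstrap servers, and networking values are placeholders, and the configuration dictionary would contain the JSON shown earlier.

import boto3

kafkaconnect = boto3.client("kafkaconnect", region_name="us-east-1")

s3_sink_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "4",
    "topics": "web-events,app-events,system-metrics",
    # ... remaining keys from the configuration shown above ...
}

response = kafkaconnect.create_connector(
    connectorName="s3-sink-raw-events",
    connectorConfiguration=s3_sink_config,
    kafkaConnectVersion="2.7.1",
    capacity={"provisionedCapacity": {"mcuCount": 1, "workerCount": 2}},
    kafkaCluster={
        "apacheKafkaCluster": {
            "bootstrapServers": "b-1.zero-etl-events.example:9098",  # placeholder
            "vpc": {
                "subnets": ["subnet-aaaa1111", "subnet-bbbb2222"],
                "securityGroups": ["sg-0123456789abcdef0"],
            },
        }
    },
    kafkaClusterClientAuthentication={"authenticationType": "IAM"},
    kafkaClusterEncryptionInTransit={"encryptionType": "TLS"},
    plugins=[{
        "customPlugin": {
            "customPluginArn": "arn:aws:kafkaconnect:us-east-1:123456789012:custom-plugin/s3-sink/example",  # placeholder
            "revision": 1,
        }
    }],
    serviceExecutionRoleArn="arn:aws:iam::123456789012:role/msk-connect-s3-sink",
)
print(response["connectorArn"])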
3. Lambda for Custom Processing
While MSK Connect handles many common scenarios, AWS Lambda provides flexibility for custom transformations and routing:
import json
import base64
import boto3
from datetime import datetime
# Initialize DynamoDB client
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('customer-profile-updates')
def lambda_handler(event, context):
    processed_records = 0

    # MSK events group records by topic-partition, e.g. {"my-topic-0": [...]}
    for topic_partition, records in event['records'].items():
        for record in records:
            # Decode the base64-encoded Kafka message value
            payload = base64.b64decode(record['value']).decode('utf-8')
            message = json.loads(payload)

            # Extract customer ID and enrich the record
            customer_id = message.get('customerId')
            if not customer_id:
                print(f"Warning: Record missing customerId: {message}")
                continue

            # Add processing timestamp
            message['processedAt'] = datetime.utcnow().isoformat()

            # Write to DynamoDB
            try:
                table.put_item(Item={
                    'customerId': customer_id,
                    'timestamp': message.get('eventTime', datetime.utcnow().isoformat()),
                    'profileData': message
                })
                processed_records += 1
            except Exception as e:
                print(f"Error writing to DynamoDB: {str(e)}")

    print(f"Successfully processed {processed_records} records")
    return {
        'statusCode': 200,
        'body': json.dumps(f'Processed {processed_records} records')
    }
Lambda functions can be triggered by new messages in your MSK topics to:
Transform data formats (e.g., XML to JSON)
Enrich events with additional context
Filter out unwanted records
Route data based on content
Aggregate events for efficiency
For throughput, tune the event source mapping's batch size and batching window so multiple partitions are processed in parallel. Note that delivery is at-least-once, so design your handler to tolerate the occasional duplicate. A minimal sketch of the mapping follows.
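In this sketch, the cluster ARN, function name, and topic are placeholders; batch size and batching window control how many records each invocation receives.

import boto3

lambda_client = boto3.client("lambda", region_name="us-east-1")

# Poll the MSK topic and invoke the function with batches of records.
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kafka:us-east-1:123456789012:cluster/zero-etl-events/example",  # placeholder
    FunctionName="customer-profile-processor",
    Topics=["customer-profile-events"],
    StartingPosition="LATEST",
    BatchSize=500,
    MaximumBatchingWindowInSeconds=5,
)
print(response["UUID"])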
4. Kafka Streams and KSQL for Stream Processing
For more complex stream processing, you can use:
Kafka Streams API for building Java applications that transform, aggregate, or join streams
KSQL for SQL-like stream processing without writing code
These technologies let you implement sophisticated processing logic while maintaining the Zero-ETL principle, as all processing happens within the Kafka ecosystem.
Real-World Zero-ETL Patterns with MSK
Let's explore some common patterns that exemplify Zero-ETL with MSK:
Pattern 1: Real-Time Data Lake Ingestion
Organizations must populate their data lakes with fresh data while preserving the original events for compliance and reprocessing.
Solution:
Configure MSK Connect with the S3 Sink Connector
Set up partitioning by date/time for optimal query performance
Choose Parquet or Avro format for storage efficiency
Implement schema registry for evolution
Benefits:
Data lands in S3 within seconds of being produced
The original event structure is preserved
No custom code required
Scales automatically with volume increases
Pattern 2: Multi-Environment Replication
Enterprises must share data across development, testing, and production environments while maintaining isolation.
Solution:
Set up MSK Replicator to copy topics between clusters (a sketch follows this list)
Configure topic-level filtering for security
Implement cross-region replication for disaster recovery
Use MirrorMaker 2.0 for advanced replication needs
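Here's a minimal sketch of configuring Amazon MSK Replicator with boto3; all ARNs, subnets, and security groups are placeholders, and the topic selection would match your own isolation requirements.

import boto3

kafka = boto3.client("kafka", region_name="us-east-1")

# Continuously replicate selected topics (and consumer group offsets)
# from a source cluster to a target cluster. All IDs are placeholders.
response = kafka.create_replicator(
    ReplicatorName="prod-to-staging",
    KafkaClusters=[
        {
            "AmazonMskCluster": {"MskClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/prod/example"},
            "VpcConfig": {"SubnetIds": ["subnet-aaaa1111"], "SecurityGroupIds": ["sg-111"]},
        },
        {
            "AmazonMskCluster": {"MskClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/staging/example"},
            "VpcConfig": {"SubnetIds": ["subnet-bbbb2222"], "SecurityGroupIds": ["sg-222"]},
        },
    ],
    ReplicationInfoList=[
        {
            "SourceKafkaClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/prod/example",
            "TargetKafkaClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/staging/example",
            "TargetCompressionType": "NONE",
            "TopicReplication": {"TopicsToReplicate": ["web-events", "app-events"]},
            "ConsumerGroupReplication": {"ConsumerGroupsToReplicate": [".*"]},
        }
    ],
    ServiceExecutionRoleArn="arn:aws:iam::123456789012:role/msk-replicator",
)
print(response["ReplicatorArn"])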
Benefits:
Environments remain isolated but share necessary data
No custom ETL code to maintain
Changes propagate automatically
Supports global distribution of data
Pattern 3: Event-Driven Microservices
Organizations need responsive microservices that react to changes happening across the business.
Solution:
Each service publishes domain events to MSK
Services subscribe only to relevant topics
Use consumer groups for load balancing
Implement dead-letter queues for error handling (sketched below)
Benefits:
Services are loosely coupled
New consumers can access historical data
The system handles backpressure gracefully
Scales horizontally as needs grow
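Here's a minimal sketch of one service's consumer loop using the kafka-python client (a generic Kafka client, not anything MSK-specific); the topics, consumer group, and dead-letter topic names are placeholders, and TLS/IAM authentication settings are omitted for brevity.

import json
from kafka import KafkaConsumer, KafkaProducer

BOOTSTRAP = "b-1.zero-etl-events.example:9092"  # placeholder bootstrap servers

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# All instances sharing this group_id split the topic's partitions between them.
consumer = KafkaConsumer(
    "order-events",
    bootstrap_servers=BOOTSTRAP,
    group_id="billing-service",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Runs until the process is stopped.
for record in consumer:
    try:
        event = record.value
        # ... domain-specific handling, e.g. create an invoice ...
        print(f"Billed order {event.get('orderId')}")
    except Exception:
        # Route records that can't be processed to a dead-letter topic
        producer.send("order-events.dlq", value=record.value)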
Best Practices for MSK-Based Zero-ETL
To ensure your MSK-based Zero-ETL architecture succeeds:
1. Schema Management
Implement a schema registry, such as the AWS Glue Schema Registry (sketched after this list), to:
Enforce compatibility as schemas evolve
Document the structure of your events
Enable consumers to interpret data correctly
Support multiple serialization formats
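Here's a minimal sketch of registering an Avro schema in the Glue Schema Registry with boto3; the registry and schema names are placeholders, and the registry is assumed to already exist.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

customer_event_schema = """
{
  "type": "record",
  "name": "CustomerProfileUpdated",
  "fields": [
    {"name": "customerId", "type": "string"},
    {"name": "eventTime", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

# BACKWARD compatibility lets new versions add optional fields
# without breaking existing consumers.
response = glue.create_schema(
    RegistryId={"RegistryName": "zero-etl-events"},
    SchemaName="customer-profile-updated",
    DataFormat="AVRO",
    Compatibility="BACKWARD",
    SchemaDefinition=customer_event_schema,
)
print(response["SchemaArn"], response["SchemaVersionId"])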
2. Monitoring and Alerting
Set up comprehensive monitoring:
Track broker health and performance
Monitor consumer lag to identify processing delays (an alarm sketch follows this list)
Alert on connectivity issues or throughput drops
Create dashboards for visibility across teams
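MSK publishes consumer-lag metrics (such as MaxOffsetLag per consumer group and topic) to CloudWatch, so a lag alert can be an ordinary CloudWatch alarm. Here's a minimal sketch; the cluster, consumer group, topic, SNS topic, and threshold are placeholders you'd adapt to your own workload.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Alarm when the billing-service consumer group falls more than
# 100,000 messages behind on order-events for two 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="msk-consumer-lag-billing-service",
    Namespace="AWS/Kafka",
    MetricName="MaxOffsetLag",
    Dimensions=[
        {"Name": "Cluster Name", "Value": "zero-etl-events"},
        {"Name": "Consumer Group", "Value": "billing-service"},
        {"Name": "Topic", "Value": "order-events"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=100000,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:data-platform-alerts"],
)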
3. Security and Compliance
Implement proper security controls:
Use IAM for authentication and authorization
Encrypt data in transit and at rest
Implement network isolation with a private VPC
Audit access and changes to configuration
4. Cost Optimization
Control costs while maintaining performance:
Right-size your brokers based on actual usage
Implement topic compaction where appropriate (sketched after this list)
Use tiered storage for historical data
Monitor and adjust as workloads change
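As an illustration of compaction, here's a sketch that creates a compacted "current state" topic with kafka-python's admin client; the topic name and sizing are placeholders. For an existing topic, you'd change cleanup.policy with standard Kafka tooling instead.

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="b-1.zero-etl-events.example:9092")  # placeholder

# Keep only the latest value per key instead of the full history,
# which bounds storage for "current state" topics like customer profiles.
admin.create_topics([
    NewTopic(
        name="customer-profile-current",
        num_partitions=6,
        replication_factor=3,
        topic_configs={"cleanup.policy": "compact"},
    )
])
admin.close()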
Challenges and Solutions in MSK Zero-ETL Architectures
While powerful, MSK-based Zero-ETL is not without challenges:
1. Schema Evolution
As your systems evolve, so must your data structures.
Solution: Implement a schema registry with compatibility checks and use formats like Avro or Protobuf that support evolution. Plan for backward compatibility in your data models.
2. Exactly-Once Semantics
Some use cases require guarantees that events are processed exactly once.
Solution: Use Kafka's transactional API, idempotent producers, and consumer offsets to achieve exactly-once semantics. Design consumers to handle potential duplicates gracefully.
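For consume-transform-produce pipelines, the transactional API commits output records and consumer offsets atomically. Here's a minimal sketch using the confluent-kafka Python client; the topics, group ID, and transactional ID are placeholders, and broker authentication is omitted.

from confluent_kafka import Consumer, Producer

BOOTSTRAP = "b-1.zero-etl-events.example:9092"  # placeholder

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "enrichment-service",
    "enable.auto.commit": False,        # offsets are committed inside the transaction
    "isolation.level": "read_committed",
})
consumer.subscribe(["web-events"])

producer = Producer({
    "bootstrap.servers": BOOTSTRAP,
    "transactional.id": "enrichment-service-1",  # stable per producer instance
})
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    producer.begin_transaction()
    producer.produce("web-events-enriched", value=msg.value())
    # Commit the consumed offsets in the same transaction, so the input
    # is only marked as processed if the output was actually written.
    producer.send_offsets_to_transaction(
        consumer.position(consumer.assignment()),
        consumer.consumer_group_metadata(),
    )
    producer.commit_transaction()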
3. Reprocessing Historical Data
Sometimes, you need to reprocess data with new logic.
Solution: Configure adequate retention periods for your topics, implement consumer groups with explicit offset management, and design your processing to be idempotent.
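When you do need to reprocess, a consumer can rewind to a point in time rather than starting from the beginning. Here's a minimal sketch with kafka-python; the topic, consumer group, and lookback window are placeholders.

from datetime import datetime, timedelta, timezone
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="b-1.zero-etl-events.example:9092",  # placeholder
    group_id="web-events-backfill",
    enable_auto_commit=False,
)

# Assign all partitions of the topic, then rewind them to 7 days ago.
partitions = [TopicPartition("web-events", p) for p in consumer.partitions_for_topic("web-events")]
consumer.assign(partitions)

start_ms = int((datetime.now(timezone.utc) - timedelta(days=7)).timestamp() * 1000)
offsets = consumer.offsets_for_times({tp: start_ms for tp in partitions})

for tp, offset_and_ts in offsets.items():
    if offset_and_ts is not None:
        consumer.seek(tp, offset_and_ts.offset)

for record in consumer:
    # ... reprocess each record with the new logic (idempotent writes) ...
    print(record.offset, record.value)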
Getting Started with MSK Zero-ETL
Ready to implement your own MSK-based Zero-ETL architecture?
Here's a roadmap:
Start small: Identify a specific use case with a clear value
Plan your topics: Design a thoughtful topic structure with partitioning in mind
Set up infrastructure: Deploy your MSK cluster with appropriate sizing
Implement producers: Begin sending data to your topics
Add consumers: Start with simple consumers and add complexity gradually
Monitor and optimize: Watch for bottlenecks and adjust as needed
Final Thoughts
As organizations generate more data faster, traditional ETL approaches become increasingly untenable.
Amazon MSK offers a path to true Zero-ETL architectures where data flows continuously and seamlessly between systems.
By embracing event streaming as a core architectural principle and leveraging MSK's managed capabilities, you can build data pipelines that are:
More responsive to business needs
More resilient to failures
More adaptable to changing requirements
More scalable as volumes grow
Next Steps and Resources
To dive deeper into MSK-based Zero-ETL:
Explore the AWS documentation: Amazon MSK Developer Guide
Try MSK Connect: Getting Started with MSK Connect
Learn Kafka fundamentals: Apache Kafka Documentation
That’s it for today!
Did you enjoy this newsletter issue?
Share with your friends, colleagues, and your favorite social media platform.
Until next week — Amrut
Whenever you’re ready, there are 4 ways I can help you:
NEW! Get certified as an AWS AI Practitioner in 2025. Sign up today to elevate your cloud skills. (link)
Are you thinking about getting certified as a Google Cloud Digital Leader?
Here’s a link to my Udemy course, which has helped 628+ students prepare and pass the exam. Currently rated 4.37/5. (link)
Free guides and helpful resources: https://thecloudplaybook.gumroad.com/
Sponsor The Cloud Playbook Newsletter:
https://www.thecloudplaybook.com/p/sponsor-the-cloud-playbook-newsletter
Get in touch
You can find me on LinkedIn or X.
If you wish to request a topic you would like to read, you can contact me directly via LinkedIn or X.