Serverless Data Lake Architecture

How We Replaced a $15,000/month EMR Pipeline with a Serverless Data Lake for Under $300

Thu Jun 20 2024

The Problem

Our data team was operating a powerful but extremely costly pipeline: Kafka topics were streamed into Flink jobs running on EMR, and the resulting data was stored, monitored, and queried for downstream products. The cost? Between $5,000 and $15,000 per month per product, driven mostly by EMR and CloudWatch usage.


The Mission

We needed to eliminate persistent compute, simplify the architecture, and still deliver rich data products that supported dynamic event schemas and analytics.


The Serverless Architecture

  • Step 1: Kafka writes messages to S3 using an S3 Sink Connector.
  • Step 2: S3 triggers an AWS Lambda function on each file drop.
  • Step 3: Lambda reads the .ndjson data using Pandas, infers a schema from the headers and column data types, and writes batches of Parquet data back to S3.
  • Step 4: The Lambda function registers the schema with AWS Glue if it doesn’t already exist.
  • Step 5: The data is made queryable in Amazon Athena and shared through the organization’s internal data marketplace.

The Impact

This new architecture reduced costs from $15,000/month to less than $300/month while maintaining near real-time ingestion and full schema registration.

With no persistent compute and no need for EMR management, the engineering overhead dropped significantly—and the team gained agility in iterating on new data products.


Final Thoughts

If you’re maintaining costly Flink- or EMR-based streaming infrastructure and your use case doesn’t require per-record low latency, this serverless approach could save you thousands per month, with batch-level ingestion latency as the main tradeoff.


Need Help Designing Your Serverless Data Architecture?

Cipher Codex LLC helps teams modernize infrastructure with scalable, cost-effective cloud-native solutions. Contact us to learn more.