Friday, September 26, 2025
Iceberg on AWS: A Lakehouse CTOs Can Love (Without the Warehouse Tax)

Introduction
Modern data platforms are evolving beyond traditional warehouses. Apache Iceberg – an open table format for data lakes – enables a data Lakehouse architecture that combines the flexibility of data lakes with warehouse-like features. An Iceberg-based Lakehouse on AWS can serve analytics directly to tools like Tableau, Power BI, or Amazon QuickSight without needing a separate Snowflake data warehouse for compute. This article, aimed at technical architects and business decision-makers, compares an Iceberg data Lakehouse on AWS with a Snowflake data warehouse solution. We’ll also examine a serverless approach (exemplified by Subsec’s solution) that uses AWS Lambda microservices and autonomous AI-driven data quality – avoiding heavy Spark/EMR clusters. Finally, we’ll provide a cost comparison (including cloud resources and full-time employees) and recommendations based on workload and cost sensitivity.
Executive takeaway: An Apache Iceberg–based Lakehouse on Amazon S3 gives you warehouse-grade reliability (ACID, schema evolution, time travel) with the flexibility and cost model of a data lake. BI tools—Tableau, Power BI, and Amazon QuickSight—can query directly on the lake via engines like Athena or Trino, so you don’t need to load and pay for a separate Snowflake warehouse just to compute on your data. For many teams, that means lower TCO, faster access to data, and less lock-in—especially when paired with serverless compute and autonomous data-quality agents.
Three viable patterns (and where each shines)
1) Snowflake + dbt (classic warehouse)
Salesforce lands in a Bronze schema (via managed connectors), dbt models cleanse and conform into Silver, and business-ready marts are published in Gold. BI connects to Snowflake directly. This route is fast to stand up and SQL-centric, with strong performance and a rich ecosystem. The trade-offs are platform lock-in and costs that scale with credits and storage—plus substantial manual, rules-based data-quality effort by engineers. Expect 64 credits to process 1 TB ($192 at $3/credit) and 640 credits for 10 TB ($1,920), with Snowflake storage $40/TB-month (on-demand) vs $23/TB-month on capacity. The largest line item is often people: 2 data engineers maintaining models and DQ checks.
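To make the credit math easy to check, here is a minimal sketch assuming this article's illustrative rates (a warehouse metering 16 credits/hr for roughly 4 hours per TB at $3.00/credit); it is a back-of-the-envelope model, not Snowflake's billing logic.

```python
# Back-of-the-envelope Snowflake ETL cost using this article's assumptions:
# ~4 hours end-to-end per TB on a warehouse metering 16 credits/hr, at $3/credit.
CREDIT_PRICE = 3.00      # $ per credit (assumed on-demand rate)
CREDITS_PER_HOUR = 16    # warehouse meter rate assumed in the article's model
HOURS_PER_TB = 4.0       # Bronze -> Silver -> Gold, end to end

def snowflake_etl_cost(terabytes: float) -> tuple[float, float]:
    """Return (credits, dollars) for one full monthly pass over `terabytes`."""
    credits = CREDITS_PER_HOUR * HOURS_PER_TB * terabytes
    return credits, credits * CREDIT_PRICE

for tb in (1, 10):
    credits, dollars = snowflake_etl_cost(tb)
    print(f"{tb} TB: {credits:.0f} credits -> ${dollars:,.0f}")
# 1 TB: 64 credits -> $192
# 10 TB: 640 credits -> $1,920
```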
2) Iceberg on S3 + AWS Glue/Spark (open Lakehouse)
Keep data in Amazon S3; define Iceberg tables for Bronze/Silver/Gold; run AWS Glue (Spark) jobs to transform and write ACID snapshots; query with Athena/Trino. You get pay-per-use compute, cheap storage (S3 $23/TB-month), and no vendor lock-in—the same tables work with Spark, Trino, Flink, and more. The compromise is more engineering (Spark code, job orchestration) and BI latency that may trail Snowflake for some interactive workloads. Compute is typically tens of dollars per TB transformed (e.g., $80–$90/TB via Glue in the paper’s model). Most teams still budget 2 data engineers for pipelines and rules-based DQ unless they add automation.
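For readers new to Iceberg on Glue, the sketch below shows the general shape of a Glue Spark job that conforms Bronze data and commits it as a Silver Iceberg table. The catalog name, bucket, and table identifiers are placeholders, and the Iceberg catalog configuration shown is one common setup rather than the only option.

```python
# Minimal Glue (Spark) sketch: read Bronze Parquet from S3, conform it, and
# commit it as an ACID Iceberg snapshot in a Silver table registered in the
# Glue Data Catalog. Names and paths below are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    # Register an Iceberg catalog ("glue_catalog") backed by the Glue Data Catalog.
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-lakehouse/warehouse/")
    .getOrCreate()
)

# Bronze: raw Salesforce extracts landed as Parquet.
bronze = spark.read.parquet("s3://my-lakehouse/bronze/salesforce/accounts/")

# Silver: light cleansing and conforming (dedupe, add a load date).
silver = (
    bronze.dropDuplicates(["account_id"])
          .withColumn("load_date", F.current_date())
)

# Atomic Iceberg commit; Athena/Trino readers see either the old or new snapshot.
(silver.writeTo("glue_catalog.silver.salesforce_accounts")
       .using("iceberg")
       .createOrReplace())
```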
3) Serverless Iceberg with Lambda + autonomous DQ agents (Subsec-style)
Keep the same Iceberg S3 backbone, but replace clusters with thousands of AWS Lambdas running in parallel and introduce autonomous, AI-driven data-quality agents. In the paper’s model, this pattern processes 1 TB in about 12 minutes (vs ~4 hours on Spark or a warehouse) by fanning out to 1,000 functions and then writing back atomic Iceberg commits. The DQ agents continuously detect drift and anomalies and auto-remediate within guardrails, escalating only when needed. The result: minutes-level SLAs, near-zero idle cost, and a fraction of the FTE effort (e.g., 0.5–1 FTE of oversight instead of multiple engineers babysitting rules and pipelines).
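To make the fan-out pattern concrete, here is a deliberately simplified coordinator/worker sketch. The function names, event shape, and per-partition work are illustrative assumptions (not Subsec's implementation); a production version would batch dispatch, track completion, and coordinate the final Iceberg commit (e.g., via Step Functions or SQS).

```python
# Illustrative fan-out: a coordinator asynchronously invokes one worker Lambda
# per partition (or file group); each worker transforms its slice, and results
# are gathered into a single atomic Iceberg commit. Names are placeholders.
import json
import boto3

lambda_client = boto3.client("lambda")

def coordinator_handler(event, context):
    """Fan a large job out across N worker invocations, one per partition."""
    partitions = event["partitions"]              # e.g. 1,000 partition keys
    for partition in partitions:
        lambda_client.invoke(
            FunctionName="silver-transform-worker",   # hypothetical worker function
            InvocationType="Event",                   # async, fire-and-forget
            Payload=json.dumps({"partition": partition}),
        )
    return {"dispatched": len(partitions)}

def worker_handler(event, context):
    """Transform one partition's rows and stage output for an Iceberg commit."""
    partition = event["partition"]
    # ... read the partition's files from S3, cleanse/join/augment, write new
    # data files, and record them so a single commit can make them visible ...
    return {"partition": partition, "status": "ok"}
```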
What this means for cost & risk
- Compute & storage: In the paper’s scenario, Snowflake ETL compute for 1 TB is $192 in credits and $1,920 for 10 TB; S3 + Iceberg + Glue for the same work runs tens to low hundreds of dollars; Lambda microservices come in around $40/TB or less (paying by the millisecond of execution). Storage is $40/TB-month in Snowflake vs $23/TB-month on S3, multiplied by however many layers you persist (Bronze/Silver/Gold).
- BI/query: With Iceberg, Athena charges $5/TB scanned; well-partitioned Gold tables keep scans small (see the Athena sketch after this list). Snowflake query cost is embedded in warehouse credits—great speed, but the meter runs whenever users hammer dashboards.
- People: Traditional pipelines spend a lot of human time on cleaning and rule management. The autonomous approach offloads “data janitor” work to AI agents that monitor, fix, and document actions, so engineers focus on products, not plumbing. That’s a TCO lever most spreadsheets miss.
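As a concrete illustration of the Athena economics above, this sketch runs a query against a hypothetical Gold table and converts the bytes scanned into an approximate cost at $5/TB. The database, table, and output location are placeholders, and the estimate ignores Athena's per-query minimum.

```python
# Illustrative Athena query against a Gold Iceberg table, plus the approximate
# $5/TB-scanned cost it implies. Database, table, and paths are placeholders.
import time
import boto3

athena = boto3.client("athena")

def run_query_and_cost(sql: str) -> float:
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "gold"},
        ResultConfiguration={"OutputLocation": "s3://my-lakehouse/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query finishes (a real pipeline would back off and time out).
    while True:
        execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
        if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(1)

    scanned_bytes = execution["Statistics"]["DataScannedInBytes"]
    return scanned_bytes / 1e12 * 5.0        # ~$5 per TB scanned

# A well-partitioned Gold table keeps scans (and cost) small:
cost = run_query_and_cost(
    "SELECT region, SUM(amount) FROM sales_gold "
    "WHERE sale_date >= DATE '2025-09-01' GROUP BY region"
)
print(f"Query cost ≈ ${cost:.4f}")
```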
Key architectural differences (business impact, not just tech)
- Lock-in vs portability: Iceberg on S3 keeps data in open formats—query anywhere, switch engines at will. Snowflake offers convenience but anchors data and compute to one platform. This matters for long-horizon costs and optionality.
- Performance model: Snowflake optimizes for interactive SQL and concurrency. Iceberg’s performance depends on your query engine (Athena/Trino) and table design (partitioning, file sizes). The serverless model trades monolithic scale-up for massive parallel fan-out and “scale-to-zero” when idle.
- Operations & DQ: Rules-only approaches miss “unknown unknowns.” Autonomy shifts the model from “suggest” to “do”—detect, correct, and log, in flight—so defects are quarantined before they hit dashboards or models (see the sketch below). That reduces rework and decision risk.
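As a toy illustration of that detect-correct-log loop (not Subsec's agents), the sketch below quarantines rows when a column's null rate drifts past a baseline and records the action for audit; the table name, threshold, and check are assumptions, and real agents would learn baselines and escalate when appropriate.

```python
# Toy "detect, correct, log" guardrail: flag a drifting null rate on a critical
# column, quarantine the offending rows in flight, and record the action.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DQAction:
    table: str
    check: str
    rows_quarantined: int
    detail: str
    timestamp: str

def guard_null_rate(rows: list[dict], column: str, baseline: float,
                    tolerance: float = 0.05) -> tuple[list[dict], DQAction | None]:
    """Quarantine rows with a null `column` if the null rate drifts past baseline."""
    null_rows = [r for r in rows if r.get(column) is None]
    null_rate = len(null_rows) / max(len(rows), 1)
    if null_rate <= baseline + tolerance:
        return rows, None                      # within guardrails: pass through
    clean = [r for r in rows if r.get(column) is not None]
    action = DQAction(
        table="silver.salesforce_accounts",    # placeholder table name
        check=f"null_rate({column})",
        rows_quarantined=len(null_rows),
        detail=f"null rate {null_rate:.1%} vs baseline {baseline:.1%}",
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    return clean, action                       # defects never reach Gold or BI
```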
A simple decision frame
- Choose Snowflake + dbt if you need a turnkey SQL experience and rapid initial time-to-value, and can tolerate a higher unit cost as data and users scale. Great for teams that are primarily SQL shops and want managed infrastructure. Watch credit usage and the engineering time spent on DQ rules.
- Choose Iceberg + Glue/Athena if you’re AWS-first and cost-sensitive at scale. You’ll cut platform spend, keep data portable, and serve BI directly from the lake. Plan for solid data engineering and DevOps to get the best from Spark and table design.
- Choose Serverless Iceberg + autonomous DQ if you want the lowest TCO and fastest SLAs with minimal ops. Lambda microservices provide elastic compute; AI agents handle drift, anomalies, and remediation—with full audit trails—so humans focus on outcomes. This is the modern default for bursty, multi-TB workloads.
Bottom line
You don’t need a proprietary warehouse to get reliable, fast analytics. An Apache Iceberg Lakehouse on AWS lets you store once in S3 and query anywhere, plug in the engines you prefer, and point BI tools at curated Gold tables without copying data into Snowflake. For many organizations, that’s the shortest path to lower cost, higher agility, and fewer late-night pipeline fixes. Add a serverless, autonomous compute and data-quality layer, and you shift from maintaining pipelines to shipping data products—with fresher data, tighter SLAs, and far better economics.
(This summary condenses the full analysis, including cost models for 1 TB and 10 TB, medallion pipelines for Snowflake vs Iceberg, and the serverless/AI approach to quality and operations.)
*** See white paper for detailed architecture solutions and cost profiles
Grid 1 — Monthly costs to process 1 TB (AWS us-east-1)
Grid 2 — Monthly costs to process 10 TB (AWS us-east-1)
(The full grids appear in the white paper; the assumptions behind them are summarized below.)
What’s included & key assumptions (edit as needed)
- Region & scope: AWS us-east-1, monthly view, Salesforce → Bronze → Silver → Gold with transformations, joins/augmentations, and DQ at each stage.
- Processing time: For Snowflake and Glue/Spark, ~4 hours per 1 TB end-to-end (Bronze→Silver→Gold). For Subsec, ~12 minutes per 1 TB using 1,000 concurrent Lambdas (same rows as Snowflake).
- Per-layer compute split: Bronze 25%, Silver 50%, Gold 25% of total ETL compute.
- Storage pricing: Snowflake $40/TB-mo on-demand; S3 $23/TB-mo. (Shown as three copies—Bronze, Silver, Gold—for like-for-like comparison.)
- Snowflake compute: $3.00/credit; a Large warehouse meters 8 credits/hr and an X-Large 16 credits/hr. For a fixed workload, total credits are roughly equal either way (a bigger warehouse simply runs for fewer hours).
  - 1 TB: 64 credits ≈ $192 total ETL (48/96/48 across layers).
  - 10 TB: 640 credits ≈ $1,920 total ETL (480/960/480).
- Glue (Spark) compute: $0.44/DPU-hr. Modeled at ~$88/TB end-to-end (22/44/22 per layer); 10 TB ⇒ $880 (220/440/220).
- Subsec Lambda compute: Lambda GB-second pricing; ~$40/TB end-to-end (10/20/10 per layer), ~$360 at 10 TB (90/180/90). Includes a small allowance for orchestration & S3 I/O.
- BI/query cost:
  - Snowflake: modeled as 8 hrs/mo on a Large warehouse for the 1 TB estate (≈$192) and 40 hrs/mo for 10 TB (≈$960).
  - Athena (Iceberg/Subsec): $5/TB scanned, modeled at 0.5 TB/mo (= $2.50) for the 1 TB estate and 2 TB/mo (= $10) for 10 TB.
- FTEs: assumes $150k/yr fully loaded per engineer (160 hrs/mo, ≈ $12.5k/mo).
  - Snowflake: 2 data engineers (dbt & rules-based DQ) = $25k/mo.
  - Glue/Spark: 2 data engineers (EMR/Glue & rules-based DQ) = $25k/mo.
  - Subsec: 0.5 FTE oversight (autonomous DQ agents remediate anomalies) = $6,250/mo.
Notes: These figures are illustrative and conservative so you can swap in your technologies and actual rates (Snowflake $/credit, Glue DPU profile, Athena scanned TB, Lambda memory/seconds, S3 class, and FTE mix); the sketch below parameterizes these rates so the substitution is straightforward. If you keep only one physical copy at Bronze (and derive Silver/Gold as Iceberg views or incremental snapshots), S3 storage can be lower than shown. Likewise, if your BI scans are heavier or lighter, adjust the Athena/Snowflake BI rows accordingly.
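Here is that small parameterized sketch of the monthly model, built from the article's illustrative rates; every constant below is an assumption to be replaced with your own numbers.

```python
# Monthly cost model built from the assumptions listed above (all figures are
# this article's illustrative rates; replace them with your own).
S3_STORAGE = 23.0        # $/TB-month (S3 Standard)
SNOW_STORAGE = 40.0      # $/TB-month (Snowflake on-demand)
LAYERS = 3               # Bronze, Silver, Gold copies persisted

def snowflake_monthly(tb: float, bi_hours: float, credit_price: float = 3.0) -> dict:
    return {
        "etl":     64 * tb * credit_price,        # ~64 credits per TB end-to-end
        "bi":      8 * bi_hours * credit_price,   # Large warehouse, 8 credits/hr
        "storage": SNOW_STORAGE * tb * LAYERS,
        "people":  25_000,                        # 2 data engineers
    }

def iceberg_glue_monthly(tb: float, athena_tb_scanned: float) -> dict:
    return {
        "etl":     88 * tb,                       # ~$88/TB via Glue (Spark)
        "bi":      5 * athena_tb_scanned,         # Athena, $5/TB scanned
        "storage": S3_STORAGE * tb * LAYERS,
        "people":  25_000,                        # 2 data engineers
    }

def serverless_monthly(tb: float, athena_tb_scanned: float) -> dict:
    return {
        "etl":     40 * tb,                       # ~$40/TB (the model shows ~$360 at 10 TB)
        "bi":      5 * athena_tb_scanned,
        "storage": S3_STORAGE * tb * LAYERS,
        "people":  6_250,                         # 0.5 FTE oversight
    }

for label, costs in [
    ("Snowflake, 1 TB",          snowflake_monthly(1, bi_hours=8)),
    ("Iceberg + Glue, 1 TB",     iceberg_glue_monthly(1, athena_tb_scanned=0.5)),
    ("Serverless Iceberg, 1 TB", serverless_monthly(1, athena_tb_scanned=0.5)),
]:
    print(f"{label}: ${sum(costs.values()):,.2f}  {costs}")
```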