AWS announced S3 Tables yesterday, which brings native support for Apache Iceberg to S3. It’s hard to overstate how exciting this is for the data analytics ecosystem. Here’s a quick rundown of my thoughts so far:

  • The integration is deep.

    This isn’t a separate service that sits on top of S3. Rather, AWS has added a new type of bucket to the S3 service itself: a table bucket.

    This design seems to be S3’s new standard practice for these major, paradigm-shifting features. Express One Zone works analogously: you need to create a directory bucket to use the Express One Zone storage class. I’m not sure whether this stems from an underlying technical constraint or whether this is simply an API design choice.

    Table buckets come with a host of new APIs. It’s the stuff you’d expect for working with Iceberg tables. To name a few: CreateNamespace, CreateTable, ListTables, RenameTable, PutTableMaintenanceConfiguration.
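
    For a flavor of what this looks like from code, here’s a rough sketch of creating a namespace and a table with the AWS SDK for Java v2. The s3tables client is real, but the exact class, builder, and field names below are my guesses from the API operation names, so treat this as pseudocode rather than something to copy-paste:

    // Hypothetical sketch: create a namespace and an Iceberg table in a table
    // bucket. Names are guessed from the S3 Tables API operations, not verified.
    S3TablesClient s3tables = S3TablesClient.create();

    String bucketArn = "arn:aws:s3tables:us-east-1:111122223333:bucket/analytics";

    s3tables.createNamespace(r -> r
        .tableBucketARN(bucketArn)
        .namespace("web"));              // namespaces group related tables

    s3tables.createTable(r -> r
        .tableBucketARN(bucketArn)
        .namespace("web")
        .name("events")
        .format("ICEBERG"));             // Iceberg is the only format supported today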

    Table maintenance (i.e., data file compaction, snapshot management, leaked file cleanup) is handled automatically by S3. You’re billed for the maintenance operations, though—more on this later—and so you can disable automatic maintenance on a per-table basis if you’d prefer to handle maintenance yourself.

    Integration with AWS analytics services (Athena, Redshift, EMR, QuickSight, Data Firehose) is possible but goofy. It requires enabling an AWS Glue Data Catalog feature1 that automatically mirrors the catalogs from the S3 table buckets in your account like so:

    [Diagram: the AWS Glue Data Catalog mirroring the catalogs of the S3 table buckets in your account]

  • The price seems right.

    Here’s the quick cost comparison against S3 standard buckets:

    Resource               S3 standard bucket     S3 table bucket          Δ
    Storage                $0.023 per GB-month    $0.0265 per GB-month     +15%
    PUTs                   $0.005 per 1k reqs     $0.005 per 1k reqs       0%
    GETs                   $0.0004 per 1k reqs    $0.0004 per 1k reqs      0%
    Monitoring             —                      $0.025 per 1k objects    —
    Compaction (objects)   —                      $0.004 per 1k objects    —
    Compaction (GB)        —                      $0.05 per GB processed   —

    The prices of PUT and GET requests are unchanged. Storage costs are 15% higher, which seems tolerable. A 15% increase on “very cheap” is still “very cheap.”

    My quick back-of-the-envelope calculation is that monitoring costs will be immaterial, assuming a 100MB+ average file size. A 1TB table will cost $27.14/mo in storage but only $0.26/mo in monitoring.

    Compaction costs are more of a mixed bag. For analytic workloads that write infrequently, they also look to be immaterial. But for streaming workloads that write frequently (say, once per second, or once every ten seconds), compaction costs may be prohibitive. The cost per object processed looks tolerable (writing one object per second results in only ~2.6MM objects per month that need to be compacted), but write amplification will be severe, and the cost per GB processed is likely to add up.

    To really get a sense for compaction costs, someone will need to run some experiments. A lot depends on how often S3 chooses to compact data files for a given workload, which is not something that’s directly under the user’s control.
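
    To make that arithmetic concrete, here’s a small back-of-the-envelope sketch. The prices come from the table above; the workload parameters (a 1TB table, 100MB average file size, one object written per second, roughly 100GB of new data per month, each byte compacted exactly once) are assumptions I’m making for illustration, not measurements:

    // Rough S3 table bucket cost estimate. Prices are from the comparison table
    // above; the workload numbers are illustrative assumptions.
    class TableBucketCosts {
        public static void main(String[] args) {
            double storagePerGbMonth      = 0.0265; // $/GB-month
            double monitoringPer1kObjects = 0.025;  // $/1k objects per month
            double compactionPer1kObjects = 0.004;  // $/1k objects processed
            double compactionPerGb        = 0.05;   // $/GB processed

            // Steady-state analytic table: 1TB split into 100MB files.
            double tableGb     = 1024;
            double objectCount = tableGb * 1024 / 100;
            double storage     = tableGb * storagePerGbMonth;                 // ≈ $27.14/mo
            double monitoring  = objectCount / 1000 * monitoringPer1kObjects; // ≈ $0.26/mo

            // Streaming table: one object per second, ~100GB of new data per month,
            // assuming each byte is compacted exactly once (write amplification
            // would multiply the per-GB term).
            double objectsPerMonth = 60 * 60 * 24 * 30;                       // ≈ 2.6MM
            double compaction = objectsPerMonth / 1000 * compactionPer1kObjects
                + 100 * compactionPerGb;                                      // ≈ $15/mo

            System.out.printf("storage=$%.2f monitoring=$%.2f compaction=$%.2f%n",
                storage, monitoring, compaction);
        }
    }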

  • The effort required for a data tool to write Iceberg tables has dropped dramatically.

    While most OLAP tools today have good support for reading Iceberg tables, only a lucky few tools (Databricks, Snowflake, Spark, Hive, Flink, to name the big players) have good support for writing Iceberg tables.

    Notably, DuckDB can’t write Iceberg tables, ClickHouse can’t write Iceberg tables, and Amazon Redshift can’t write Iceberg tables. The system I work on, Materialize, also can’t write Iceberg tables.

    Why? To read an Iceberg table, all you need to do is parse a bit of metadata on S3 that tells you which Parquet data files to download and decode. But writing an Iceberg table requires much more care. Each write to an Iceberg table creates new data files in S3. Over time, the table’s data gets split across many small files, and reading the data from the table becomes prohibitively slow. The solution is to periodically compact the table, which combines the data files into fewer, larger files and drops any data that has been overwritten or deleted. See this article from Dremio for a better explanation.

    The only way I know of today to compact an Iceberg table is to run a Spark or Flink job that looks like this:

    // Compact the table's data files with Iceberg's Spark actions API.
    import org.apache.iceberg.Table;
    import org.apache.iceberg.spark.actions.SparkActions;

    Table table = ...               // load the table from your catalog
    SparkActions
        .get()
        .rewriteDataFiles(table)    // combine small data files into larger ones
        .execute();
    

    Neither the Python Iceberg library nor the Rust Iceberg library supports compaction today (apache/iceberg-python#1092, apache/iceberg-rust#624).2 So unless your OLAP system is written in Java or another JVM-based language, you’re basically out of luck if you want to write Iceberg tables.

    That changes with S3 table buckets, which handle all the required maintenance operations for Iceberg automatically, in the background, for a small fee. Compaction no longer needs to be reinvented in each system that wants to write Iceberg tables.

    As a developer of a system that wants to write Iceberg tables without managing Iceberg compaction, I see this as a huge win.3 It’s early days, but when we build an Iceberg sink for Materialize, we’re planning to lean on S3 table buckets to handle compaction. This is essentially a bet that other object storage technologies (Google Cloud Storage, Azure Blob Storage, Cloudflare R2, MinIO, Ceph) will eventually follow suit and provide their own equivalent of S3 table buckets that transparently handle Iceberg table maintenance.

  • Perhaps Iceberg REST catalogs can go quietly into the night? 🌶️4

    Iceberg catalogs keep track of which Iceberg tables live at which path in the S3 bucket and, crucially, which files within that path comprise the live version of the table. This catalog is key to the consistency guarantees that have made Iceberg successful. It’s what ensures that two systems attempting to write to the table at the same time don’t corrupt each other’s writes. Reading from an Iceberg table doesn’t require interacting with the catalog5, but writing to an Iceberg table always does.
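
    To make that concrete, here’s roughly what appending data to an Iceberg table looks like with the Java API. The final commit() is an atomic swap of the table’s metadata pointer, performed through the catalog, which is what keeps concurrent writers from clobbering each other. (The variable names are mine, and I’m eliding how the Parquet file and its DataFile descriptor get produced.)

    // Appending to an Iceberg table: write Parquet data files to S3 yourself,
    // then commit them through the catalog-backed table so the snapshot swap
    // is atomic.
    Catalog catalog = ...               // whichever catalog implementation you use
    Table table = catalog.loadTable(TableIdentifier.of("web", "events"));

    DataFile newFile = ...              // describes a Parquet file already written to S3

    table.newAppend()
        .appendFile(newFile)
        .commit();                      // atomic metadata swap via the catalog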

    Dealing with the catalog is, unfortunately, quite annoying. There is not one canonical Iceberg catalog, but rather several catalog implementations. Some catalog implementations exist only as Java libraries. Some catalog implementations are REST services that you can interact with over HTTP from any language.

    If you’re lucky, someone in your organization has already set up an Iceberg catalog for you to use. If you’re unlucky, you’ll need to evaluate the several possible options (Hive metastore? JDBC? AWS Glue? Polaris?), figure out how to get it deployed into your production environment, and then figure out how to set up authentication and distribute credentials to all your Iceberg-using applications.

    Now, with S3 table buckets, AWS has introduced yet another catalog implementation. It’s called, unsurprisingly, the S3TablesCatalog. The source code for the S3TablesCatalog is available on GitHub: https://github.com/awslabs/s3-tables-catalog. It’s a pretty straightforward wrapper over the aforementioned S3 Tables API. In fact, most Iceberg catalog operations map one-to-one onto S3 Tables API operations.6
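
    Based on the repository and the launch docs, wiring it up from plain Java looks something like this. I believe the class name and the warehouse property (set to the table bucket’s ARN) are right, but consider the details a sketch rather than gospel:

    // A sketch of using the S3TablesCatalog through the standard Iceberg Catalog
    // interface (class: software.amazon.s3tables.iceberg.S3TablesCatalog).
    S3TablesCatalog catalog = new S3TablesCatalog();
    catalog.initialize("s3tables", Map.of(
        "warehouse", "arn:aws:s3tables:us-east-1:111122223333:bucket/analytics"));

    // From here it behaves like any other Iceberg catalog:
    catalog.listTables(Namespace.of("web"));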

    For me personally, as an end user of Iceberg, this actually seems great. I don’t want to think about which catalog implementation to use. I don’t want to have to run a separate catalog service. I don’t want to have to set up authentication for my catalog that’s separate from my S3 bucket.

    S3 tables take all these questions off the table. I just use the catalog implementation that comes built-in to where I’m storing my data. The slickest part here might be the authentication story: the same AWS IAM policy that grants an application access to the data files can also grant access to the necessary catalog APIs.

    In my ideal world, all the major object store providers would eventually provide table buckets with integrated Iceberg catalogs, and we could do away with separate catalog services entirely. 🌶️4

    Are there niche uses for custom catalog implementations that I’m missing? The catalog API is so straightforward that I just can’t imagine what differentiated value a separate catalog implementation (e.g., Polaris) could provide over the S3-native catalog.

  • The S3 feature drought is over.

    S3 has now launched three major features in 12 months. Last November we got S3 Express One Zone and this summer we got conditional writes.

    Maybe it’s just my perception, but it feels like something material has shifted in S3’s development culture. During S3’s first decade of life we regularly saw major feature releases (Glacier, object versioning, object lifecycle rules, multi-region buckets), but those dried up around 2015—until now.

  • The name implies more table formats could be added in the future.

    It’s telling that the name of the new feature is “S3 Tables” and not “S3 Iceberg.” It seems like AWS wanted to leave the door open to supporting other table formats in the future.

    Perhaps we’ll see support for Apache Hudi or Apache Paimon tables in the future, or an as-yet-to-be-developed open table format. AWS seems to be tracking the open table format ecosystem quite closely.

  1. Naturally, because this is AWS, setting up the integration requires installing a Byzantine IAM policy to allow the LakeFormation service to access the S3 Tables API, then making a LakeFormation RegisterResource API call to register all S3 Table buckets with LakeFormation.

    Alternatively, you can click a button in the AWS console to handle these steps automatically. Remember to click this button for each AWS account you control and in each AWS region you use. Finally, consult your organization’s DevOps team to determine how to atone for your ClickOps sins. 

  2. There isn’t even a C++ library for Iceberg at all. I think ClickHouse had to hack together their own basic C++ library for reading Iceberg tables. Which isn’t too bad if you can limit yourself to reading from S3 and not interacting with the catalog. Much harder if you need to both write to the catalog and manage maintenance operations like compaction. 

  3. I can imagine complaints that S3 table buckets make Iceberg less open, because they move compaction from an open-source implementation in the Iceberg project to a proprietary, closed-source implementation provided by AWS. It seems fine to me though. The primary value of Iceberg is having an open standard for how to read and write tables on object storage while preserving ACID semantics. Compaction is just another write operation that can be applied to a table. Using a proprietary compaction implementation doesn’t impact the interoperability of the table at all. 

  4. Disclaimer: I’m not deeply familiar with the Iceberg community or the project’s development history. Apologies in advance if I’m missing something important about what benefits REST catalogs provide or am otherwise being insensitive! 

  5. If you need a fully consistent read where you’re guaranteed to see any writes that committed before your read started, you need to interact with the Iceberg catalog. But most OLAP systems don’t mind a little bit of staleness, and so they can read directly from S3 without interacting with the catalog. 

  6. It is a bit surprising that they chose to implement this catalog as a Java library, instead of making the S3 Tables API directly implement the Iceberg Catalog REST specification. Maybe it had something to do with IAM authentication. The Iceberg REST spec does not permit using AWS Signature Version 4 as an authentication mechanism. Or perhaps AWS’s internal tooling and policies for service APIs would have made it impossible to hew to the Iceberg REST spec.