Goldilox Optimizes Snowflake Iceberg Tables, Too
Snowflake-managed Apache Iceberg tables get the same automated, workload-based clustering recommendations Goldilox Insights gives native tables. In our TPC-H SF100 benchmark, applying those recommendations to a managed Iceberg table delivered 32% partition pruning and 40% less data scanned, on par with native, all with zero manual tuning.
Why Snowflake's Apache Iceberg support matters
Apache Iceberg is an open table format for large analytic tables, and it's quickly becoming the standard for the open data lakehouse. In Snowflake, Iceberg tables pair Snowflake's query engine and table semantics with data stored as open Parquet files plus Iceberg metadata, either in Snowflake-managed storage or in your own cloud storage through an external volume.
For Snowflake-managed Iceberg tables that use Snowflake as the catalog, the appeal is real:
- No lock-in. Your data stays in the open Iceberg table format with Parquet files, rather than Snowflake's native proprietary table format. Depending on configuration, those files live in Snowflake-managed storage or customer-managed cloud storage.
- Interoperability. Through Snowflake Horizon Catalog and the Iceberg REST Catalog API, the same tables can be read by external engines such as Spark, Trino, Flink, and DuckDB under supported configurations. One copy of the data, many engines.
- Snowflake performance and maintenance. Snowflake provides read/write access, lifecycle maintenance such as compaction, and table optimization including Automatic Clustering. (Some Snowflake features have Iceberg-specific limitations, and externally-managed Iceberg tables, where another system is the catalog, have more.)
You get much of the openness of a data lake with the performance and simplicity of Snowflake. But that performance point comes with a catch most teams overlook.
Iceberg tables still need good clustering
Adopting an open format doesn't change the physics of query execution. Table scanning is still the most expensive part of most analytical queries, and an Iceberg table written in arrival order, with no thought given to physical layout, still forces queries to read nearly every file to answer a selective filter. That's wasted I/O and wasted compute, exactly as it would be on an unclustered native table.
The good news: Snowflake-managed Iceberg tables that use Snowflake as the catalog support clustering keys and Automatic Clustering, just like native tables. (Externally-managed Iceberg tables, where another system owns the catalog, don't currently support Snowflake clustering.) So for the common managed case, the opportunity is identical: physically organize the data around how it's actually queried, and Snowflake prunes away the files it doesn't need. The hard part, as always, is knowing which keys to choose, and keeping that choice current as workloads evolve.
Goldilox recommends clustering keys for Iceberg automatically
This is exactly the kind of recommendation Goldilox Insights produces. Insights analyzes your actual Snowflake workload, the real query history, query profiles, filters, joins, and scan patterns that show how each table is genuinely used, and recommends clustering keys from that evidence rather than from generic rules of thumb. Because those workload signals are captured the same way for Iceberg and native tables, the same engine produces clustering key recommendations for Snowflake-managed Iceberg tables, which is exactly what we put to the test below. The recommended key is applied by setting it on the table:
-- Apply the recommended clustering key to a managed Iceberg table
ALTER ICEBERG TABLE analytics.public.lineitem CLUSTER BY (l_shipmode, l_shipdate);
Setting a clustering key is a simple DDL change. Snowflake then performs the physical reorganization asynchronously through Automatic Clustering, so existing readers keep working while it runs. That keeps the workflow operationally simple. As with native auto-clustering, the reorganization runs as a separate, billed maintenance process rather than a pure metadata flip.
The benchmark: TPC-H SF100, native vs. Iceberg, same keys
To measure this directly, we used the industry-standard TPC-H benchmark at scale factor 100 (a 600-million-row LINEITEM table). We loaded the data into both a native Snowflake schema and a Snowflake-managed Iceberg schema, then applied the same Goldilox-recommended clustering keys to both formats, so the only variable is the storage format itself. We ran the 22 TPC-H queries with result caching disabled and measured partition pruning, data scanned, and query time.
Result 1: Goldilox optimization on Iceberg
Starting from an unoptimized Iceberg table (load order, effectively zero pruning), applying the Goldilox-recommended keys delivered:
| Metric (22-query workload) | Unoptimized Iceberg | Goldilox-optimized |
|---|---|---|
| Partition pruning | 0% | 32% |
| Data scanned | 450 GB | 268 GB (−40%) |
| Query time (Small warehouse) | 108 s | 91 s (−15%) |
Partition pruning went from effectively zero to 32%, and the workload scanned 40% less data, achieved entirely from Goldilox's workload-based recommendation, with no manual analysis.
Result 2: Iceberg keeps pace with native at the same keys
The natural worry with any open format is whether you give up performance for openness. With the identical clustering keys applied to both, the Iceberg table kept pace with native, scanning essentially the same data in comparable time:
| Metric (22-query workload) | Native table | Iceberg table |
|---|---|---|
| Partition pruning | 29.6% | 32.4% |
| Data scanned | 271 GB | 268 GB |
| Query time | 93.8 s | 91.3 s |
The differences here, a few percent on data scanned and time, are within the run-to-run variance you'd expect from rebuilding and re-clustering tables this large (clustering relies on sampling, so each rebuild lands a little differently). The point isn't that one format wins. It's the reassurance that matters: you don't trade query performance for the open format. Iceberg keeps pace with native, and Goldilox's workload-based optimization works the same on both.
How to read these numbers
Partition pruning and data scanned are the most stable measures of this optimization because they're primarily driven by physical layout and query predicates, not warehouse size. Wall-clock time and cost conversion depend more heavily on warehouse size, cache state, concurrency, and how scan-heavy the workload is. The time savings are largest when scanning dominates query runtime. We report Small-warehouse times above.
Takeaways
- Iceberg gives you openness without giving up performance. One open copy of your data, with full Snowflake query performance.
- Iceberg tables still benefit from clustering. The open format doesn't remove the pruning problem; well-chosen clustering keys cut data scanned by a third or more.
- Goldilox optimizes Iceberg exactly like native. The same automated, workload-based recommendations, the same real gains, with zero manual tuning, so your lakehouse stays fast and cost-efficient as it grows.
Running Iceberg or native tables on Snowflake? Install Goldilox Insights from the Snowflake Marketplace and preview clustering key recommendations for your own tables, free, entirely inside your Snowflake account.
Try Free on Snowflake Marketplace