The Medallion Architecture, Explained: Raw, Refined, Curated

The medallion architecture organizes a data platform into three layers, Raw, Refined, and Curated, that take source data and refine it step by step into trusted, business-ready tables. It is the most common blueprint for the modern data lakehouse, and it is the foundation OptimaFlo is built on. This guide explains each layer in plain English, then shows how to stand one up without hiring a data team.

What is the medallion architecture?

The medallion architecture is a data design pattern popularized by Databricks. It splits data processing into three progressively cleaner layers. You may know these layers by their traditional names, bronze, silver, and gold; we use clearer terms: Raw, Refined, and Curated. Each stage adds structure, quality, and business meaning:

Raw holds data exactly as it arrived.
Refined holds cleaned, validated, joined data.
Curated holds business-ready metrics for dashboards and reports.

The point is simple. You never transform data in one big, fragile step. You move it through three small, reliable ones, and every stage stays queryable and auditable.

The three layers

Raw: untouched source data

Raw is your system of record. Data lands here exactly as the source sent it, with zero transforms and full history. Nothing is thrown away. If a downstream rule turns out to be wrong, you can always reprocess from Raw.

Because Raw keeps everything, the table format matters. OptimaFlo stores every layer in an open lakehouse table format, which gives Raw ACID transactions, schema evolution, and time travel. You can query the table as it looked last Tuesday, not just today.

Refined: cleaned and conformed

Refined is where source data becomes reliable. This layer validates types, removes duplicates, fixes nulls, and joins sources into clean, conformed tables. It is the layer most analysts actually build on.

Refined is also where SQL lives. In a traditional stack this is the dbt layer, written and maintained by data engineers. The transforms are version-controlled and reviewed, because a bad Refined query quietly poisons everything downstream.

Curated: business-ready

Curated is the layer your business sees. It holds aggregated metrics, star schemas, and the tables that power dashboards and reports. Curated tables are usually updated incrementally, so a daily metric refreshes in seconds instead of rebuilding from scratch.

Here is how the three layers compare:

Layer	Holds	Transforms	Typical consumer
Raw	Source data, full history	None	Reprocessing, audits
Refined	Cleaned, validated, joined	Type, dedupe, join	Analysts, data scientists
Curated	Metrics, star schema	Aggregate, model	Dashboards, executives

Why data teams adopt it

The pattern is popular because it solves real problems at once:

Reliability. Failures are isolated to one hop, not the whole pipeline.
Reproducibility. Keeping Raw plus lakehouse time travel means any result can be rebuilt exactly.
Incremental refinement. Each layer does one job, so changes are small and safe.
Auditability. You can trace any Curated number back through Refined to the original source record.

The catch: it usually takes a whole stack and a team

The architecture is elegant. Building it is not. A typical stack stitches together six to ten tools: ingestion (Fivetran), a warehouse (Snowflake), transforms (dbt), orchestration (Airflow), a BI tool (Tableau), and data observability (Monte Carlo). Each one is a contract, a bill, and a thing that breaks at 2 a.m.

It also takes people. Someone has to write the Refined and Curated SQL, wire up the orchestration, and own the schemas. For a data-rich team without a dedicated data engineering function, that is the wall most projects hit.

How OptimaFlo builds the medallion architecture for you

OptimaFlo is one platform that runs the full Raw, Refined, Curated flow inside your own cloud. It is built to remove the stack and the staffing problem at the same time.

LLM-native. Describe the table you want in plain language. OptimaFlo generates the Refined and Curated SQL, you review and approve it, and it ships. The AI writes the data engineering, you keep the judgment.
Open lakehouse foundation. Every layer is an open table, so ACID guarantees, schema evolution, and time travel apply end to end, not just in one tool. No proprietary lock-in.
Bring your own cloud (BYOC). The platform runs in your environment. Your source data never leaves your account.
The right compute, automatically. OptimaFlo matches the compute to the size of your data, from small jobs to very large ones. You never size or manage clusters.
One platform, not ten. Ingestion, transforms, orchestration, dashboards, and data quality checks live in one place.

The result: you can go from your first data source connection to a live, Curated-backed dashboard in a single afternoon. See the getting started guide for the five-minute version, or the architecture overview for the full picture.

What a Refined transform looks like

A Refined model is just SQL: take the Raw table, cast the types, drop the bad rows, and write a clean table. OptimaFlo generates SQL like this, and you approve it before it runs:

SELECT
    order_id,
    customer_id,
    CAST(order_total AS DECIMAL(12, 2)) AS order_total,
    DATE_TRUNC('day', ordered_at)       AS order_day
FROM raw.orders
WHERE order_total IS NOT NULL;

That clean refined.orders table then feeds a Curated metric, like daily revenue by customer segment, which lands in your dashboard.

Frequently asked questions

What are the Raw, Refined, and Curated layers? Raw stores source data with full history and no transforms. Refined stores cleaned, validated, and joined data. Curated stores business-ready metrics and star-schema tables for dashboards.

Is the medallion architecture only for Databricks? No. Databricks popularized the name, but the pattern is tool-agnostic. Any lakehouse built on an open table format can implement the Raw, Refined, and Curated layers.

Do I need Spark to use it? No. Spark shines on very large data, but for most workloads a lighter engine is faster and cheaper. OptimaFlo matches the compute to your data automatically, so you never have to choose.

How is it different from a traditional ELT data warehouse? ELT usually loads source data and transforms it inside one warehouse. The medallion architecture formalizes that into three explicit, queryable layers on an open lakehouse, which adds reproducibility, time travel, and clearer separation of concerns.

Can I build a medallion architecture without a data team? Yes. OptimaFlo generates the Refined and Curated SQL with an LLM, runs every layer on an open lakehouse in your own cloud, and replaces the six-to-ten-tool stack with one platform, so you can stand up the full flow without dedicated data engineers.

Key takeaways

The medallion architecture refines data through three layers: Raw source data, Refined clean data, and Curated business-ready tables.
Its real cost is the multi-tool stack and the data team needed to run it.
OptimaFlo collapses that into one LLM-native platform on an open lakehouse, running in your own cloud.

Ready to build your first Raw-to-Curated pipeline? Start building with OptimaFlo or read the getting started guide.