Databricks Briefing · prepared for Roy Gatling
Partner Briefing · 2026

Databricks for Data Pipeline Ingestion and AI-Ready Lakehouse Architecture

A consultant's view of how Databricks ingests, governs, and operationalizes data — and how it sits alongside Snowflake on the modern data stack.

Audience: Roy Gatling, AI Implementation Consultancy
Focus: Ingestion · Governance · AI-readiness
Format: 12-slide partner briefing

Platform overview

A unified data intelligence platform

The Databricks Data Intelligence Platform combines ETL, machine learning, AI, data warehousing, BI, and governance on a single lakehouse foundation — built on open formats and available across AWS, Azure, and GCP.

Lakehouse foundation

One platform for ETL, ML/AI, and DWH/BI workloads — built on lakehouse architecture using open formats like Delta Lake and Apache Iceberg.

Unity Catalog governance

Central governance for data and AI: metadata management, access control, auditing, discovery, lineage, and monitoring across every asset.

Mosaic AI workflows

End-to-end ML and AI workflows from data preparation through model building and serving — governed and observable.

Databricks SQL & AI/BI

Databricks SQL, AI/BI dashboards, and Genie Spaces deliver governed analytics directly on lakehouse tables — no copy required.

Real-time analytics

Structured Streaming, Auto Loader, and Lakeflow Declarative Pipelines support continuous, low-latency data flows alongside batch.

Multi-cloud, open formats

Runs on AWS, Azure, and GCP. Built on open table formats so data is not locked into a single engine or vendor.

The case for the lakehouse

From fragmented pipelines to AI-ready data products

Most enterprise data estates still consist of disconnected ETL jobs, BI marts, and ML notebooks. Databricks consolidates them into governed data products — the prerequisite for production AI.

Before — Fragmented
  • Separate stacks for ETL, warehousing, ML, and analytics
  • Multiple copies of data across systems with drift
  • Governance reapplied per tool, often inconsistently
  • AI projects blocked on data readiness, not modeling
  • Slow iteration as data flows through brittle handoffs
After — Lakehouse
  • One platform for ingestion, transformation, BI, and AI
  • Single governed copy in Delta Lake or Iceberg
  • Unity Catalog applies governance once, everywhere
  • AI-ready data products with traceable lineage
  • Faster iteration — data, model, and dashboard in one place
3 → 1: Stacks consolidated
Governed: Data & AI in one catalog
Open: Delta & Iceberg formats

Ingestion · Auto Loader

Incremental ingestion from cloud object storage

Auto Loader is Databricks' recommended source for ingesting files as they land in cloud storage. It exposes a Structured Streaming source called cloudFiles and works across every major cloud and file format teams use today.

Cloud storage sources

Watches and incrementally processes new files as they arrive in the cloud:

Amazon S3 · Azure Data Lake Storage Gen2 · Google Cloud Storage · Azure Blob Storage · Unity Catalog volumes

File formats supported

Native support for the formats data engineering teams actually use:

JSON · CSV · XML · Parquet · Avro · ORC · TEXT · BINARYFILE

Authoring

Auto Loader can be authored in Python or SQL and runs inside Lakeflow Spark Declarative Pipelines, so the same source definition works for ad-hoc streaming jobs and production pipelines alike.
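
A minimal Python sketch of what a cloudFiles source definition can look like. The helper names `auto_loader_options` and `read_landing_zone` are hypothetical, introduced only for illustration; `spark` is assumed to be an active SparkSession on Databricks, since the cloudFiles source is not part of open-source Spark.

```python
from typing import Any, Dict


def auto_loader_options(file_format: str, schema_location: str) -> Dict[str, str]:
    """Build the option map for a cloudFiles source (hypothetical helper)."""
    return {
        # File format of the incoming files, e.g. "json", "csv", "parquet".
        "cloudFiles.format": file_format,
        # Location where Auto Loader tracks the inferred schema over time.
        "cloudFiles.schemaLocation": schema_location,
    }


def read_landing_zone(
    spark: Any, source_path: str, file_format: str, schema_location: str
) -> Any:
    """Return a streaming DataFrame over new files under source_path.

    Assumes `spark` is a SparkSession running on Databricks.
    """
    options = auto_loader_options(file_format, schema_location)
    return spark.readStream.format("cloudFiles").options(**options).load(source_path)
```

On Databricks, the returned DataFrame would typically be written out with `writeStream` and a `checkpointLocation` option, which is where the ingestion progress described on the next slide is recorded.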

Reliability guarantees

Exactly-once, recoverable, schema-aware ingestion

Auto Loader's reliability semantics are what let teams replace bespoke ingestion code with declarative pipelines and trust the result in production.

Exactly-once processing

Checkpoint metadata records every file that has been ingested, providing exactly-once delivery semantics — even across job restarts or failures.

Failure recovery

Pipelines resume from the last successful checkpoint after a failure, picking up from the exact file boundary they left off on — no manual replay logic.

Schema evolution

Auto Loader detects new columns in incoming files and evolves the target schema. In the default evolution mode the stream stops when a new column appears and resumes with the updated schema on restart, so historical data is preserved.
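
The idea of additive schema evolution can be shown with a toy model in plain Python. This is not Auto Loader's implementation (Auto Loader manages this through its schema location and the `cloudFiles.schemaEvolutionMode` option); the function names and sample records are made up for illustration.

```python
from typing import Dict, List, Optional


def evolve_schema(known_columns: List[str], record: Dict[str, object]) -> List[str]:
    """Append any columns seen in `record` that the schema does not have yet."""
    new_columns = [name for name in record if name not in known_columns]
    return known_columns + new_columns


def project(record: Dict[str, object], columns: List[str]) -> Dict[str, Optional[object]]:
    """Shape a record to the current schema; missing columns become None."""
    return {name: record.get(name) for name in columns}


schema = ["id", "amount"]
batch_1 = [{"id": 1, "amount": 10.0}]
batch_2 = [{"id": 2, "amount": 5.5, "currency": "EUR"}]  # a new column arrives

table: List[Dict[str, object]] = []
for batch in (batch_1, batch_2):
    for record in batch:
        schema = evolve_schema(schema, record)  # schema only ever grows
        table.append(record)

# Older rows are kept and read back with None for the column they never had.
rows = [project(record, schema) for record in table]
```

The older row survives unchanged and simply reads as null for `currency`, which is the sense in which historical data is preserved.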

Idempotent loading

Files are processed once and only once. Re-running a pipeline against the same source is safe — duplicates do not appear in the destination tables.
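
A toy model of how a file ledger gives recovery and idempotent re-runs. This is a sketch of the general checkpointing idea, not Auto Loader's internals: `FileLedger`, `run_once`, and the file names are hypothetical, and in this simplified version the write and the commit are not atomic, which a real engine has to coordinate.

```python
from typing import Iterable, List, Optional, Set


class FileLedger:
    """Toy stand-in for checkpoint metadata: the set of files already ingested."""

    def __init__(self) -> None:
        self.ingested: Set[str] = set()

    def pending(self, listing: Iterable[str]) -> List[str]:
        """Files in the listing that have not been committed yet, in sorted order."""
        return [path for path in sorted(listing) if path not in self.ingested]

    def commit(self, path: str) -> None:
        self.ingested.add(path)


def run_once(
    ledger: FileLedger,
    listing: Iterable[str],
    sink: List[str],
    fail_on: Optional[str] = None,
) -> None:
    """Ingest every pending file; optionally raise on one to simulate a crash."""
    for path in ledger.pending(listing):
        if path == fail_on:
            raise RuntimeError(f"simulated failure on {path}")
        sink.append(path)    # stands in for writing the file's rows
        ledger.commit(path)  # record the file only after the write succeeds


ledger, sink = FileLedger(), []
files = ["a.json", "b.json", "c.json"]

try:
    run_once(ledger, files, sink, fail_on="b.json")  # crashes after a.json
except RuntimeError:
    pass

run_once(ledger, files, sink)               # resumes at b.json, skips a.json
run_once(ledger, files + ["d.json"], sink)  # re-run only picks up d.json
```

After the three runs, `sink` holds each file exactly once, even though the same listing was processed repeatedly and one run failed partway through.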

Why it matters for partner work: these guarantees move ingestion from "bespoke ETL we maintain" to "declarative pipelines we configure" — freeing implementation time for the AI work that actually differentiates the practice.

Reference architecture

The lakehouse pipeline pattern

Databricks recommends Auto Loader as the entry point into a medallion architecture, with Delta Lake as the storage layer and Lakeflow Declarative Pipelines orchestrating the flow into Databricks SQL, BI, and AI.

01 · Auto Loader
Incrementally ingests new files from cloud storage and Unity Catalog volumes.
02 · Delta Lake
Open ACID table format that stores the Bronze, Silver, and Gold layers; lineage across them is tracked in Unity Catalog.
03 · Lakeflow Declarative Pipelines
Declarative orchestration in SQL or Python with monitoring, retries, and dependency management.
04 · Databricks SQL · BI · AI
Serve analytics, dashboards, and AI workloads directly from governed Gold tables.

Medallion layers within Delta Lake

Each layer represents a stage of progressive refinement.

Layer 01 · Bronze

Raw, append-only

Auto Loader lands source data verbatim. Full history, minimal transformation, source-of-truth for reprocessing.

Layer 02 · Silver

Cleansed, conformed

Joined, deduplicated, and validated tables — ready for cross-domain analytics and feature engineering.

Layer 03 · Gold

Business-ready products

Aggregated metrics, semantic tables, and feature stores. Direct sources for dashboards, AI agents, and ML models.
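
The three layers can be illustrated with a small plain-Python sketch. The order records, column names, and the `to_silver` / `to_gold` functions are invented for illustration; on Databricks each step would normally be a Delta table fed by a pipeline rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: raw records exactly as they landed, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.50"},
    {"order_id": "1", "region": "EU", "amount": "10.50"},  # duplicate delivery
    {"order_id": "2", "region": "US", "amount": "oops"},   # fails validation
    {"order_id": "3", "region": "EU", "amount": "4.50"},
]


def to_silver(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Deduplicate on order_id, cast types, and drop rows that fail casting."""
    seen, clean = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # invalid amount: keep it in Bronze, exclude it from Silver
        seen.add(row["order_id"])
        clean.append(
            {"order_id": int(row["order_id"]), "region": row["region"], "amount": amount}
        )
    return clean


def to_gold(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Aggregate validated orders into revenue per region."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze keeps the rejected and duplicate rows, the Silver and Gold steps can be rerun with different rules without going back to the source system.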

Governance · Unity Catalog

One catalog for data and AI assets

Unity Catalog is Databricks' central governance layer. The same catalog governs tables, volumes, models, features, dashboards, and AI agents — applying policy once and inheriting it everywhere.

Metadata management

A unified registry for every governed asset across catalogs, schemas, and workspaces.

Access control

Fine-grained, role-based controls applied uniformly to data and AI assets.
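
As a rough illustration, access in Unity Catalog is granted with SQL statements against three-level names (catalog.schema.table). The sketch below only renders such statements as strings; the helper `grant_select`, the table names, and the group name are hypothetical.

```python
from typing import List


def grant_select(tables: List[str], principal: str) -> List[str]:
    """Render one GRANT SELECT statement per fully qualified table name."""
    statements = []
    for table in tables:
        if len(table.split(".")) != 3:
            raise ValueError(f"expected catalog.schema.table, got {table!r}")
        statements.append(f"GRANT SELECT ON TABLE {table} TO `{principal}`")
    return statements


# Hypothetical Gold tables and group name, for illustration only.
statements = grant_select(
    ["main.gold.revenue_by_region", "main.gold.customer_360"],
    "analysts",
)
```

In practice the principal also needs `USE CATALOG` and `USE SCHEMA` on the parent objects before a table-level grant takes effect.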

Auditing

Complete audit history of access, change, and policy enforcement across the catalog.

Discovery

Search across data, ML models, dashboards, and notebooks from a single surface.

Lineage

Column-level lineage stitched from notebooks, jobs, dashboards, and pipelines automatically.

Monitoring

Quality and operational monitoring built into the same governance plane as data and AI.

Analytics & BI

Serving analytics directly from the lakehouse

Databricks delivers analytics where the data already lives — so dashboards, embedded insights, and conversational analytics all share the same governed source.

Databricks SQL

A serverless SQL warehouse engine that runs directly on lakehouse tables. Compatible with the BI tools enterprises already use.

  • SQL editor and query history with workspace governance
  • Direct connectivity to Tableau, Power BI, Looker, Sigma
  • Inherits Unity Catalog access controls automatically

AI/BI Dashboards & Genie

A native dashboard layer plus Genie Spaces — conversational analytics over governed datasets.

  • AI/BI Dashboards — built and shared inside Databricks
  • Genie Spaces — natural-language Q&A on curated tables
  • Same governance, lineage, and access policy as the source

One ingestion path, one governance layer, one analytics surface — no copy-out, no separate semantic layer to maintain.

AI & ML · Mosaic AI

An end-to-end AI workflow on governed data

Mosaic AI takes the Databricks workflow from data preparation through model building and serving — without leaving the lakehouse, and without losing Unity Catalog governance.

Stage 01 · Prepare
Use governed lakehouse tables and a feature store as the AI substrate.
Stage 02 · Build
Train, fine-tune, and evaluate models in notebooks with governed compute.
Stage 03 · Register
Track models in Unity Catalog with version, lineage, and approval workflows.
Stage 04 · Serve
Deploy via Model Serving for real-time inference, batch scoring, or AI agents.

Foundation model access

Access proprietary and open-source LLMs through a single governed endpoint.

Vector search & RAG

Build retrieval-augmented agents over Unity Catalog–governed embeddings.
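
The retrieval step behind RAG can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the Databricks Vector Search API: the vectors are hand-written stand-ins for real embeddings, and the document ids and the `top_k` helper are hypothetical.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        index.items(), key=lambda item: cosine(query, item[1]), reverse=True
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hand-written 3-dimensional vectors standing in for real embeddings.
index = {
    "pricing_policy": [0.9, 0.1, 0.0],
    "onboarding_guide": [0.1, 0.8, 0.1],
    "refund_terms": [0.7, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

hits = top_k(query, index, k=2)
```

The retrieved document ids would then be resolved to text and passed to the model as context; keeping the embeddings in governed tables means the same access policy applies to what the agent can retrieve.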

Evaluation & observability

Monitor model quality, drift, and inference cost from the same control plane.

Side-by-side comparison

Databricks vs Snowflake — both partner platforms

Both are leading data platforms with overlapping capabilities. The differences are in heritage, primary workload, and architectural posture — not in whether one is "better."

Dimension | Databricks | Snowflake
Heritage | Lakehouse: Apache Spark, Delta Lake, ML/AI native | Cloud data warehouse: SQL-first, separation of storage and compute
Primary ingestion | Auto Loader: incremental file processing from S3, ADLS, GCS, Azure Blob, Unity Catalog volumes; JSON, CSV, XML, Parquet, Avro, ORC, TEXT, BINARYFILE | Snowpipe Streaming: row-level loads up to 10 GB/s per table, sub-second availability, exactly-once delivery, in-flight transformations, schema evolution
Pipelines | Lakeflow Declarative Pipelines: SQL or Python, orchestration with monitoring built in | Dynamic Tables: SQL or Python declarative transformations with freshness targets and incremental refresh
Streaming sources | Structured Streaming + Auto Loader; Lakeflow Connect for SaaS sources | Snowflake Openflow for streaming sources alongside Snowpipe Streaming
Storage format | Open: Delta Lake and Apache Iceberg | Proprietary micro-partitions; Iceberg tables for open interop
Governance | Unity Catalog: data and AI assets in one catalog | Horizon Catalog: governance for data products and pipelines
AI / ML stack | Mosaic AI: full ML/AI lifecycle native to the platform | Cortex: LLM functions, agents, and search built into SQL surface
Developer surface | Notebooks, Spark, Python, SQL, MLflow (engineer-led) | Snowsight, SQL, Python, Streamlit, Notebooks (analyst-led)

Client fit framework

When to use Databricks

Databricks is the right fit when the data strategy needs to scale beyond reporting into governed engineering, streaming, and AI-ready data products.

Large-scale ETL & streaming

Use Databricks when pipelines involve high-volume data movement, complex transformations, or sustained streaming patterns that benefit from Spark and Auto Loader.

Collaborative engineering teams

Use Databricks when data engineers, analysts, and ML teams need to collaborate in Python, SQL, Spark, and notebooks against the same governed data foundation.

Custom ML & AI

Use Databricks when the roadmap includes fine-tuning, model serving, RAG applications, MLOps pipelines, or AI products that require full lifecycle control.

Open data architecture

Use Databricks when the organization wants portable lakehouse data, open formats such as Delta Lake and Iceberg, and flexibility across multiple processing engines.

Quick decision rule

If the primary need is governed reporting and SQL analytics, Snowflake is often the simpler fit. If the strategy spans large-scale data engineering, streaming, and custom AI, Databricks can provide more leverage from a single governed platform.

Platform fit

Databricks and Snowflake can work together

The right data architecture is rarely either/or. It should match each platform to the workloads it supports best, then connect them through governed data, clear ownership, and an AI-ready roadmap.

Snowflake — SQL analytics foundation

Governed reporting & SQL analytics

Snowflake remains an excellent default for SQL warehousing, business intelligence, and data sharing. Snowpipe Streaming and Dynamic Tables make near-real-time pipelines straightforward, and Horizon Catalog covers governance over data products.

Databricks — engineering and AI foundation

Intelligent data products & AI

Databricks becomes the stronger fit when the roadmap moves beyond reporting — large-scale engineering, streaming at the lakehouse layer, custom ML, RAG agents, and AI products that need lifecycle ownership in one governed platform.

Client decision principle

Start with the workload, not the logo. Use Snowflake where the primary value is fast, governed SQL analytics. Use Databricks where the primary value is engineered data products that feed AI. Where both are present, align governance, data ownership, and open table strategy early.
