Dagster & Databricks

You can orchestrate Databricks from Dagster in several ways, depending on your needs: Databricks Connect, Dagster Pipes, the Dagster Databricks components (`DatabricksAssetBundleComponent` and `DatabricksWorkspaceComponent`), or the Dagster Databricks Connection (Dagster+ only).

Choosing an integration approach

Approach	How it works	Choose when
Databricks Connect	Write Spark code in Dagster assets that executes on Databricks compute	You want to write Spark code directly in your Dagster assets; you prefer centralized code that doesn't need deployment to Databricks; you're doing interactive development or quick iterations; your workloads are moderate in size and duration; you want simpler debugging with local code execution
Dagster Pipes	Submit Databricks jobs and stream logs/metadata back to Dagster	You need real-time log streaming from Databricks to Dagster; Databricks job code needs to report custom metadata back to Dagster; you're running large batch jobs that should execute independently; you want fine-grained control over job submission parameters; your code is already deployed to Databricks
DatabricksAssetBundleComponent	Reads your `databricks.yml` bundle config and creates Dagster assets from job tasks	Your team is using Databricks Asset Bundles; you want Databricks job definitions version-controlled with your Dagster code; you're starting fresh or migrating jobs to a new structure; you need tight integration between job config and Dagster definitions; CI/CD pipelines manage your Databricks deployments
DatabricksWorkspaceComponent	Connects to your workspace, discovers jobs, and exposes them as Dagster assets	Jobs already exist in Databricks and are managed there; you want Dagster to discover and orchestrate existing jobs; teams manage jobs directly in the Databricks UI/API; you need flexibility without maintaining local config files
Dagster Databricks Connection (Dagster+ only)	Automatically discovers Databricks tables and catalogs as external assets in the Dagster+ UI	You're on Dagster+; you want visibility into Databricks tables/catalogs as external assets; data governance and lineage tracking are priorities; you don't need to orchestrate jobs, just observe metadata; you want automatic schema and statistics extraction
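
The table above can be read as a lookup from requirements to approaches. The following is a minimal sketch of that reading in plain Python. The requirement flags (`log_streaming`, `existing_jobs`, and so on) and the `suggest_approaches` helper are hypothetical names chosen for illustration; they are not part of any Dagster or Databricks API, and the flags cover only a few of the criteria listed in the table.

```python
# Hypothetical requirement flags derived from the "Choose when" column.
# These names are illustrative only and are NOT a Dagster or Databricks API.
APPROACHES = {
    "Databricks Connect": {"spark_code_in_assets", "local_debugging"},
    "Dagster Pipes": {"log_streaming", "custom_metadata", "large_batch_jobs"},
    "DatabricksAssetBundleComponent": {"asset_bundles", "ci_cd_deployments"},
    "DatabricksWorkspaceComponent": {"existing_jobs", "jobs_managed_in_databricks"},
    "Dagster Databricks Connection (Dagster+ only)": {
        "observe_tables_only",
        "lineage_tracking",
    },
}


def suggest_approaches(requirements):
    """Return approaches that match at least one requirement, best match first."""
    wanted = set(requirements)
    # Score each approach by how many of the requested flags it covers.
    scored = [(len(tags & wanted), name) for name, tags in APPROACHES.items()]
    # sorted() is stable, so ties keep the order used in the table.
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ranked if score > 0]


if __name__ == "__main__":
    print(suggest_approaches(["log_streaming", "custom_metadata"]))
```

In practice the approaches are not mutually exclusive, so a helper like this is a starting point for discussion rather than a rule.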

About Databricks

Databricks is a unified data analytics platform for building big data and AI solutions. It integrates with Apache Spark and supports a variety of data sources and formats. Databricks provides tools to create, run, and manage data pipelines, which makes complex data engineering tasks easier to handle. Its collaborative, scalable environment suits data engineers, scientists, and analysts who need to process and analyze large datasets efficiently.
