Dagster & Databricks

You can integrate Dagster with Databricks in several ways, depending on your needs: Databricks Connect, Dagster Pipes, the Dagster Databricks components (DatabricksAssetBundleComponent and DatabricksWorkspaceComponent), or a Dagster Databricks Connection (Dagster+ only).

Choosing an integration approach

Approach | How it works | Choose when
Databricks Connect | Write Spark code in Dagster assets that executes on Databricks compute
  • You want to write Spark code directly in your Dagster assets
  • You prefer centralized code that doesn't need deployment to Databricks
  • You're doing interactive development or quick iterations
  • Your workloads are moderate in size and duration
  • You want simpler debugging with local code execution
Dagster Pipes | Submit Databricks jobs and stream logs/metadata back to Dagster
  • You need real-time log streaming from Databricks to Dagster
  • Databricks job code needs to report custom metadata back to Dagster
  • You're running large batch jobs that should execute independently
  • You want fine-grained control over job submission parameters
  • Your code is already deployed to Databricks
DatabricksAssetBundleComponent | Reads your databricks.yml bundle config and creates Dagster assets from job tasks
  • Your team is using Databricks Asset Bundles
  • You want Databricks job definitions version-controlled with your Dagster code
  • You're starting fresh or migrating jobs to a new structure
  • You need tight integration between job config and Dagster definitions
  • CI/CD pipelines manage your Databricks deployments
DatabricksWorkspaceComponent | Connects to your workspace, discovers jobs, and exposes them as Dagster assets
  • Jobs already exist in Databricks and are managed there
  • You want Dagster to discover and orchestrate existing jobs
  • Teams manage jobs directly in Databricks UI/API
  • You need flexibility without maintaining local config files
Dagster Databricks Connection (Dagster+ only) | Automatically discover Databricks tables and catalogs as external assets in the Dagster+ UI
  • You're on Dagster+
  • You want visibility into Databricks tables/catalogs as external assets
  • Data governance and lineage tracking are priorities
  • You don't need to orchestrate jobs, just observe metadata
  • You want automatic schema and statistics extraction

About Databricks

Databricks is a unified data analytics platform for building big data and AI solutions. It integrates with Apache Spark and supports a variety of data sources and formats. Databricks provides tools to create, run, and manage data pipelines, which makes complex data engineering tasks easier to handle. Its collaborative, scalable environment suits data engineers, scientists, and analysts who need to process and analyze large datasets efficiently.