I came across this recommendation in the official Google Cloud documentation, where Dataform is mentioned as an orchestration tool.
I’m a bit unclear on why Dataform would be recommended for orchestrating notebooks, since it doesn’t support creating or managing notebooks directly—unlike BigQuery Notebooks (Colab Enterprise + Scheduler) or Vertex AI Pipelines. Could you please clarify the reasoning behind this recommendation or why it’s listed as an option?
Hi a8mad_mohammad,
While you won’t find a “create notebook” button in the Dataform interface, Dataform is recommended as an orchestration tool for notebooks in Google Cloud because of its behind-the-scenes integration with BigQuery Studio and the Colab Enterprise notebooks that BigQuery Studio uses. The key point is that Dataform provides the underlying engine for important lifecycle management features of these notebooks, even though the notebooks are not directly visible or manageable within the Dataform UI.
In essence, when you schedule a BigQuery notebook, you are leveraging Dataform’s orchestration capabilities. Notebooks within BigQuery are treated as code assets, and Dataform is the technology that powers the versioning and scheduling of these assets.
Here’s a breakdown of Dataform’s role in the lifecycle of a BigQuery notebook:
- Scheduling: The ability to run a notebook at a specific time or on a recurring schedule is a feature directly powered by Dataform. The scheduling service for BigQuery notebooks is built upon Dataform’s infrastructure. In fact, to manage notebook schedules, you often need Dataform-related IAM roles, such as Dataform Editor (roles/dataform.editor).
- Versioning: When you save versions of your BigQuery notebook, it is Dataform’s underlying infrastructure that handles the tracking of these changes. This allows you to view version history and revert to previous versions if needed.
- Metadata Management: Information about the notebook, including its name, creation date, and version history, is managed as a dataform-code-asset within Google Cloud’s data catalog, Dataplex.
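If you want to see this relationship for yourself, one option is to list the Dataform repositories in your project through the Dataform API and look for the ones that back notebooks. The following is only a rough sketch: it assumes the google-cloud-dataform Python package is installed, that your credentials can list repositories, and that notebook-backed repositories carry a label such as single-file-asset-type=notebook. That label key and value are an assumption on my part, so print the labels in your own project to confirm before relying on them. The function names are just illustrative.

```python
from typing import Iterable, List


def repository_parent(project_id: str, location: str) -> str:
    """Build the parent resource path used by the Dataform API."""
    return f"projects/{project_id}/locations/{location}"


def notebook_repositories(
    repos: Iterable,
    label_key: str = "single-file-asset-type",  # assumed label key, verify in your project
    label_value: str = "notebook",              # assumed label value, verify in your project
) -> List[str]:
    """Return the names of repositories whose labels mark them as notebooks."""
    return [r.name for r in repos if dict(r.labels).get(label_key) == label_value]


def list_notebook_assets(project_id: str, location: str) -> List[str]:
    """Query the Dataform API and keep only notebook-backed repositories."""
    # Imported here so the helpers above work without the client library.
    from google.cloud import dataform_v1beta1  # pip install google-cloud-dataform

    client = dataform_v1beta1.DataformClient()
    repos = client.list_repositories(parent=repository_parent(project_id, location))
    return notebook_repositories(repos)
```

Calling list_notebook_assets("my-project", "us-central1") with your own project ID and region should return the resource names of any repositories matching that label. If the list comes back empty, dropping the label filter and printing each repository's labels will show how your notebooks are actually tagged.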
Dataform itself is a powerful tool for developing, testing, and scheduling complex SQL-based data transformation pipelines (ELT workflows). Its interface is tailored to that purpose, focusing on SQLX files and dependency management, which is why notebooks do not appear there.
In conclusion, other services such as Vertex AI Pipelines offer more explicit, visual pipeline-building experiences that can include notebooks. Dataform is recommended for notebook orchestration within BigQuery because it powers the scheduling and versioning of these notebooks in a deeply integrated but less visible way.