Ability to link a top-level folder in a Git repository to a Dataform repository

I am exploring the use of Cloud Dataform to transform raw data into its final state and generate modelled tables within a single pipeline for our new project, with over 100 pipelines anticipated.

Does each transformation pipeline require a separate repository in Git connected to a new Dataform repository?

Is it possible to link a top-level folder within a Git repository to a Dataform repository?

In our previous project, we maintained a single private Git repository to store the code for our various pipelines. We want to avoid creating multiple Git repositories, but we also want to avoid complicating the workflow by consolidating multiple pipelines into one. What approach should we take?


For your new project with over 100 anticipated pipelines, it is not necessary to create a separate Git repository for each Dataform pipeline. Dataform is designed to manage multiple data transformation scripts within a single project repository, leveraging SQLX files for organization. This approach aligns with Dataform’s capabilities and is conducive to efficient project management within a data warehousing environment like BigQuery.

Here are the revised points to consider:

  • Modular Organization: Dataform supports a modular setup within a single repository, allowing you to use directories and naming conventions to organize your SQLX models and transformations. This structure maintains a clear separation of concerns while keeping all related data models within one repository.

  • Isolation and Code Changes: Dataform provides features such as assertions, tests, and the ability to create separate environments (development, staging, production) to help catch issues early and prevent them from affecting production datasets. Branching and pull requests in Git further support isolation of changes during the development process.

  • Accessibility Control: While separate repositories can offer more granular access control, within a single repository you can still manage access through your Git hosting service (for example, protected branches or code owners) and at the dataset level in BigQuery. This is sufficient for many projects.

  • Version Control: A single repository can effectively track and version your codebase. Git is adept at managing multiple directories and changes, which means you can still maintain a comprehensive version history for all your pipelines in one place.

  • Developer Collaboration: Using a single repository doesn’t hinder collaboration. Developers can work on different aspects of the project simultaneously using branches, and changes can be reviewed and merged via pull requests, ensuring that collaboration is structured and controlled.
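
As an illustration of the modular layout described above, here is one possible single-repository structure. The pipeline and file names are hypothetical, and this assumes a recent Dataform setup that uses `workflow_settings.yaml` for project configuration:

```
definitions/
  pipeline_orders/          # one folder per pipeline
    staging/
      stg_orders.sqlx       # tagged "pipeline_orders"
    marts/
      fct_orders.sqlx
  pipeline_customers/
    staging/
      stg_customers.sqlx
    marts/
      dim_customers.sqlx
includes/
  constants.js              # shared JavaScript helpers
workflow_settings.yaml      # project-level Dataform settings
```

Tagging every SQLX file in a folder with its pipeline name lets you later execute one pipeline at a time, even though everything lives in a single repository.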

Given these considerations, a single repository is often recommended for managing Dataform projects, especially when dealing with interdependent data transformations. This setup simplifies dependency management, deployment processes, and project maintenance. It also reduces the overhead of managing multiple repositories and can streamline your development workflow.

For a large-scale project with numerous pipelines, using a single Git repository for your Dataform project is likely the most efficient and manageable approach. This strategy will leverage Dataform’s strengths in handling complex data projects, while still providing the benefits of version control, collaboration, and modular organization.
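
To make the dependency point concrete, here is a small pure-Python sketch of how declarations like `ref()` induce an execution order across models in one repository. This is an illustrative model only, not Dataform's actual implementation, and the table names are hypothetical:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each action lists the upstream tables it references via ref(),
# mirroring how Dataform builds its dependency graph from SQLX files.
refs = {
    "stg_orders": [],                      # reads only raw source data
    "stg_customers": [],
    "fct_orders": ["stg_orders", "stg_customers"],
    "daily_revenue": ["fct_orders"],
}

# A topological sort yields a valid execution order: every model runs
# only after all of its ref()'d dependencies have run.
order = list(TopologicalSorter(refs).static_order())
print(order)
```

Because the graph is explicit, adding a new model is just a matter of declaring its `ref()` dependencies; the ordering takes care of itself.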

If we adopt a single repository approach for 100 pipelines, how can we effectively manage the following:

  1. Implementing different permissions for service accounts for pipelines handling sensitive and personally identifiable information (PII) datasets.
  2. Ensuring the prevention of unnecessary dependencies between pipelines.
  3. Maintaining codebase readability.
  4. Avoiding interdependencies between individual pipelines.
  5. Managing a single pipeline with a large number of tasks (30–40 tasks).
  6. Scheduling different pipelines at various timings.
  7. Handling situations where certain otherwise independent pipelines depend on an external ingestion job, requiring orchestration with Airflow.
  8. If, over time, we need more pipelines (200 more) in the same repository, will this create any compilation or memory/space issues?
  9. Preventing accidental triggering of execution with all actions and tags.

When adopting a single repository approach for managing a large number of Dataform pipelines, it’s crucial to implement a structured and scalable strategy. Here’s how you can address your concerns effectively:

  1. Service Account Permissions: Configure which service account Dataform uses to execute SQL in BigQuery. Manage access at the dataset level within BigQuery, ensuring that only authorized pipelines have access to sensitive datasets.

  2. Data Masking and Access Control: Implement data masking and use BigQuery’s column-level security features to protect sensitive data. Dataform will execute the SQL that enforces these protections.

  3. Modular Design and Codebase Readability: Use Dataform’s features to create modular SQLX files and reusable JavaScript includes. Document transformations within Dataform, which integrates with Git for version control, to ensure that your codebase remains clear and maintainable.

  4. Dependency Management: Utilize Dataform’s ref() function to manage dependencies between datasets explicitly. This built-in feature helps maintain clear and manageable interdependencies.

  5. Task Organization: Organize transformations into separate SQLX files. While Dataform compiles these into a single run operation, keeping them modular helps manage large pipelines with many tasks.

  6. Scheduling Pipelines: Use Dataform’s scheduling features to define execution times for each pipeline. For more complex scheduling needs, such as dependencies on external ingestion jobs, consider using a workflow orchestration tool like Airflow.

  7. Airflow Orchestration: Integrate Dataform with Airflow to manage complex workflows and orchestrate pipelines that depend on external triggers or ingestion jobs.

  8. Scalability and Performance: Dataform compiles SQLX code server-side, so compilation overhead is largely handled for you, although very large projects can approach compilation limits. Monitor and optimize your BigQuery queries for performance, and consider splitting your Dataform project into multiple smaller projects if necessary.

  9. Preventing Accidental Triggering: Configure schedules and dependencies within Dataform carefully to control execution. In your CI/CD pipeline or orchestration tool, implement manual approval steps and other safeguards to prevent accidental deployments.
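
The tag-based safeguard behind points 5 and 9 can be modelled with a short pure-Python sketch. This is illustrative only — in Dataform itself the filtering happens through the invocation's included tags — and the action names and tags are hypothetical:

```python
# Map each compiled action to its tags, as declared in SQLX config blocks.
actions = {
    "stg_orders": {"pipeline_orders", "staging"},
    "fct_orders": {"pipeline_orders", "daily"},
    "stg_customers": {"pipeline_customers", "staging"},
    "dim_customers": {"pipeline_customers", "daily"},
}

def select_actions(included_tags):
    """Return only the actions matching the requested tags.

    Refusing to run with an empty tag list is the safeguard against
    accidentally executing every action in the repository.
    """
    if not included_tags:
        raise ValueError("refusing to run all actions: pass explicit tags")
    return sorted(a for a, tags in actions.items() if tags & set(included_tags))

print(select_actions(["pipeline_orders"]))  # only the orders pipeline
```

Wiring your scheduler or CI job to always pass an explicit tag list means a misconfigured trigger fails loudly instead of silently running all 100+ pipelines.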

As your project scales and potentially doubles in the number of pipelines, regular refactoring and optimization of your Dataform code will be essential to manage complexity. If you encounter performance bottlenecks or manageability issues, consider restructuring your project, for example by splitting your Dataform project into multiple smaller projects or reorganizing your repository to better reflect the logical grouping of pipelines.

By following these strategies, you can maintain a clean, efficient, and scalable workflow within a single Dataform repository. Continuous evaluation and adaptation of your repository structure and practices will be key to successfully managing a growing number of data pipelines.