Managing different environments in Dataform

Hi,

We have 2 environments, staging and production. Currently I am using a staging environment to create sqlx files and using the Executions in Dataform can push these tables into a specified dataset in Staging for testing. However, the data in the staging datasets is not aligned with the production environment. They match schemas, but the data in staging is not up-to-date or accurate. I would rather test using a production data.

I am struggling to workout how I can create and run files that use the production tables, but push the outputs to staging for testing. When we commit files from staging and merge into production the files get run as normal in production.

As a workaround we print the full environment-dataset-table in the sqlx, but I want to use ${ref} and keep the compiled lineage.

Hi @datatom,

Welcome to the Google Cloud Community!

It looks like you are encountering an issue where testing data transformations in your staging environment presents challenges due to the need for production-like data, while avoiding the hardcoding of environment-specific details in your Dataform SQLX files, all while ensuring proper data lineage.

Here are the potential ways that might help with your use case:

  • Configure environment-specific datasets using the dataform.json file: You may want to set up environment-specific configurations in the ‘dataform.json’ file and reference them in your SQLX scripts. For example, you could define variables that link to production datasets for data extraction while ensuring that any resulting tables are stored in a staging environment.
  • Using ref() in your SQLX files: When you refer to your production tables, specify the ‘production_dataset’ variable from your ‘dataform.json’ file in your ‘ref()’ calls. Keep ‘defaultSchema’ as the dataset for your output tables.
  • Schema Management: To keep your datasets organized and avoid naming conflicts, consider using different schemas for your staging and production environments (e.g., ‘staging_schema’ vs. ‘production_schema’). The configuration ‘{ schema: “staging_schema” }’ in your SQLX file is very useful here, ensuring your output lands in the correct staging schema.
  • Permissions: Make sure the service account used by Dataform has the appropriate permissions to read data from production datasets and write to staging or development datasets. A frequent mistake is neglecting to grant the staging environment’s service account read access to production tables.

Was this helpful? If so, please accept this answer as “Solution”. If you need additional assistance, reply here within 2 business days and I’ll be happy to help.