The Lakehouse architecture has become the go-to design pattern for modern data platforms, combining the flexibility and scalability of a data lake with the performance of a data warehouse. Within this architecture, the medallion architecture (bronze, silver, and gold layers) is a proven way to keep data organized, trusted, and optimized for business use, with each layer progressively improving data quality and readiness for analytics and machine learning.
Manually building and maintaining these multi-stage pipelines, with their varied file formats, transformations, and quality checks, is time-consuming: it demands highly skilled data engineers and significantly increases the time-to-insight.
That’s where the BigQuery Data Engineering Agent (currently in preview) steps in. The agent lets you build data pipelines using natural language: it generates Dataform SQLX files, which you can then modify manually or continue to refine using the agent.
In this post, we’ll walk through how you can accelerate the creation of end-to-end data pipelines across your medallion architecture on Google Cloud, leveraging BigQuery’s Data Engineering Agent together with BigQuery Pipelines to simplify and speed up your data workflow.
* Follow along: This tutorial is designed for hands-on learning. We encourage you to follow along in your own Google Cloud project, creating datasets, pipelines, and prompts as we go.
Roles and permissions
Before you begin, make sure you have the required IAM roles:
- Dataform Editor (roles/dataform.codeEditor)
- BigQuery Job User (roles/bigquery.jobUser)
Additionally, ensure the following APIs are enabled:
- Data Analytics API with Gemini
High-level architecture
Figure 1 : High-level architecture
Bronze/raw layer
In a medallion architecture, the bronze layer acts as the immutable landing zone, storing raw data as it was sourced.
For the purpose of this blog, we assume you have already ingested your raw data into a Cloud Storage (GCS) bucket. This initial ingestion step can be handled by various ETL services, such as Dataflow, Data Fusion, or the Data Transfer Service.
In our example, the raw layer contains multiple file formats across different GCS folders.
- customer/ → Avro file
- customer_transaction/ → Parquet file
- product/ → JSON file
- product_category/ → CSV file
These files are available here.
Figure 2 : Raw layer
The goal is simply to create external tables in BigQuery and run SQL queries against the raw files in GCS as if they were standard BigQuery tables, without moving or transforming the data yet.
This is precisely where the BigQuery Data Engineering Agent accelerates our process: we don’t need to manually write complex DDL statements for each format; we simply tell the agent what we want.
In this post, we’ll delegate the repetitive tasks to our AI assistant, Agent Smith.
Let’s get started.
Step 1: Access the BigQuery Pipeline environment
- Open BigQuery Studio: Navigate to BigQuery in the Google Cloud Console.
- Create 3 datasets: Create bronze, silver, and gold datasets, representing each layer of your Medallion architecture. The steps to create a dataset are described here. Note that Agent Smith will create the datasets if they do not exist.
- Create Pipeline: Select your project, right-click on the Pipelines section, and choose Create Pipeline.
Figure 3 : Pipeline creation
- Specify the authentication method: In the next window, you can choose between a Service Account and the authenticated user’s credentials. For this demonstration, select the authenticated user’s credentials. Click “Get started.”
Figure 4 : Pipeline credentials
- Begin the implementation: The BigQuery Pipeline canvas is now displayed. To begin creating your pipeline using AI, click “Ask Agent” or the Gemini icon found under the “try out agent experience for data pipeline” section.
Step 2: Create external tables in the bronze layer
As mentioned above, for the bronze layer, our goal is to quickly expose all raw GCS files as queryable tables in BigQuery. The challenge here is the heterogeneity of the files and the need for a dynamic naming convention (table name = folder name).
Instead of running four separate CREATE EXTERNAL TABLE statements (one for each file type and folder), we’ll let Agent Smith handle the pattern recognition and code generation in one go.
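For context, this is roughly the kind of statement we would otherwise write by hand for each folder and format. The project, dataset, and bucket names below are placeholders, and the code the agent actually generates may look different:

```sql
-- Illustrative only: a hand-written external table over the Avro folder.
-- Replace my-project and my-bronze-bucket with your own project and bucket.
CREATE OR REPLACE EXTERNAL TABLE `my-project.bronze.customer`
OPTIONS (
  format = 'AVRO',
  uris = ['gs://my-bronze-bucket/customer/*']
);
```

Multiply this by four formats and four folders, and the value of handing the pattern to the agent becomes clear.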
Prompt 1 : Bulk creation of external tables
Enter the following natural language prompt and hit enter:
“Load all data in the Cloud Storage bucket < Your bronze layer GCS Bucket > into the bronze dataset. Each resulting table should be named after the parent folder containing the source files.”
Figure 5 : Prompt 1
Within seconds, Agent Smith will generate the pipelines to create the external tables.
Figure 6 : Bronze pipelines
To execute the pipelines generated by Agent Smith and create the corresponding external tables in BigQuery, click the Apply button in the top-right corner, then press Run.
Navigate to the Execution tab to view the results of your pipeline run.
Figure 7 : Bronze pipelines execution
You should now see the external tables in your dataset, and you can execute SQL queries just like you would with any standard managed table.
Figure 8 : Bronze pipelines execution result
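For example, a quick query like the one below (with placeholder project and dataset names) reads the raw Avro files in place through the external table:

```sql
-- Query raw GCS data through the external table, no load required.
SELECT *
FROM `my-project.bronze.customer`
LIMIT 10;
```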
Rename the pipeline by double-clicking on ‘Untitled pipeline’ in the top right (we chose Bronze).
Just like that, you created external tables over all your GCS Bronze folders in about 2 minutes.
Silver/enriched layer
Admittedly, creating external tables for the Bronze layer was straightforward. But Agent Smith is designed not just for easy tasks; it accelerates your most complex data engineering challenges as well.
Next, we move on to the Silver layer, where data quality is paramount. In the Medallion architecture, this layer involves more intensive transformations, including data validation, cleansing, and deduplication.
Step 1: Define the source and the target
We’ll begin processing the customer data.
- Create a new pipeline and name it Silver.
- Enter the following prompt and press Enter:
Prompt 2 : Define the source and the target
“Load the < Your project>.< bronze dataset>.customer table into < Your project>.< silver dataset>.ref_customer.”
Figure 9 : Prompt 2
Do not apply or run the pipeline yet. We’ll execute the flow only after all steps are complete.
Now, we’ll implement our data quality checks, starting with uniqueness for the Customer ID, where we’ll filter out duplicate customer records and route them to a separate review table.
Enter the following prompt and hit Enter:
Prompt 3: Uniqueness (Referential Integrity): Customer ID
“
- If customer_id is unique, load the record into < Your project>.< silver dataset>.ref_customer.
- If customer_id is duplicated, load the record into < Your project>.< bronze dataset>.customer_rejected_records.
”
As shown in the screenshot below, customer_id = 1 is duplicated and will be filtered out by this logic.
Figure 10 : Customer Id duplicates
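Under the hood, this kind of rule is typically expressed with a window function. Here is a minimal sketch of equivalent logic, using placeholder project and dataset names; the agent’s generated SQLX may be structured differently:

```sql
-- Rows whose customer_id appears exactly once flow on to ref_customer.
SELECT * EXCEPT (dup_count)
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY customer_id) AS dup_count
  FROM `my-project.bronze.customer`
)
WHERE dup_count = 1;

-- Duplicated customer_ids are routed to the rejection table instead.
SELECT * EXCEPT (dup_count)
FROM (
  SELECT *, COUNT(*) OVER (PARTITION BY customer_id) AS dup_count
  FROM `my-project.bronze.customer`
)
WHERE dup_count > 1;
```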
Next, we address Completeness. First or last names should not be empty for any record. When data is missing, we’ll apply a default value and load the data into both the target table and the defects table for review.
Prompt 4: Completeness: First Name and Last Name are empty or NULL
“
- If the first_name or last_name fields are NULL (missing) or empty, transform the data by replacing the NULL or blank value with the string ‘N/A’ and load the record into the < Your project>.< silver dataset>.ref_customer table.
- Whenever a record fails this completeness check (i.e., when a NULL or blank value is replaced by ‘N/A’), load the original, unmodified record into the rejection table: < Your project>.< bronze dataset>.customer_rejected_records.
”
As shown in the screenshot below, customer ID = 2 has a NULL last name. We’ll correct it in the Silver table while sending the original record to the rejection table.
Figure 11 : Empty last name
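The equivalent transformation in SQL is a simple COALESCE over trimmed values. The sketch below uses illustrative column, project, and dataset names and is not the agent’s exact output:

```sql
-- Cleansed rows for ref_customer: blank or NULL names become 'N/A'.
SELECT
  customer_id,
  COALESCE(NULLIF(TRIM(first_name), ''), 'N/A') AS first_name,
  COALESCE(NULLIF(TRIM(last_name), ''), 'N/A') AS last_name,
  email
FROM `my-project.bronze.customer`;

-- Original, unmodified rows that failed the completeness check.
SELECT *
FROM `my-project.bronze.customer`
WHERE first_name IS NULL OR TRIM(first_name) = ''
   OR last_name IS NULL OR TRIM(last_name) = '';
```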
The third dimension of quality we tackle is Validity. We must ensure that the email field conforms to a standard, valid format. We will log invalid records for audit but still allow the data to proceed to the Silver table.
Prompt 5: Validity: Email
“The email must conform to a valid email format. If the email is invalid, load the record into both of the following tables:
- < Your project>.< bronze dataset>.customer_rejected_records (for audit)
- < Your project>.< silver dataset>.ref_customer (allowing the data through to the enriched layer)”
As shown in the screenshot below, customer ID = 3 has an invalid email address. Agent Smith will generate the logic to validate the email format, ensuring the original record is sent to both the rejection table and the target table in the silver layer.
Figure 12 : Invalid email address
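A common way to express this check in BigQuery SQL is a regular expression. The pattern below is a deliberately simple sketch (placeholder names again), not a full RFC-compliant validation:

```sql
-- Records with an invalid email format go to the rejection table for audit,
-- while all records still continue on to ref_customer.
SELECT *
FROM `my-project.bronze.customer`
WHERE NOT REGEXP_CONTAINS(email, r'^[^@\s]+@[^@\s]+\.[^@\s]+$');
```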
Once you execute all the prompts above, Agent Smith will finalize the pipeline logic. Note that because the agent uses generative AI, your final pipeline’s structure may differ slightly from the visual shown here, as the process is not strictly deterministic. However, the core data quality and transformation logic will be the same.
Figure 13 : Final data pipeline
Now that all the complex data quality logic is defined, hit Apply and then Run the pipeline.
You can confirm the successful execution by checking the Execution tab.
This completes the implementation of the customer data pipeline, moving data from the Bronze to the Silver layer. You can now review the data in both the target table (< Your project>.< silver dataset>.ref_customer) and the rejected records table (< Your project>.< bronze dataset>.customer_rejected_records) to confirm that the data quality rules were applied correctly.
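A quick way to sanity-check the run is to compare row counts across the two tables (project and dataset names below are placeholders for your own):

```sql
-- Row counts for the cleansed table and the rejected records.
SELECT 'ref_customer' AS table_name, COUNT(*) AS row_count
FROM `my-project.silver.ref_customer`
UNION ALL
SELECT 'customer_rejected_records', COUNT(*)
FROM `my-project.bronze.customer_rejected_records`;
```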
Gold/curated layer
In the Gold layer, we will load data from our Silver layer’s customer reference table into a dimension table, dim_customer. Agent Smith will handle the creation of both the Gold dataset and the dim_customer table if they do not yet exist. We will instruct the agent to generate a surrogate key named customer_key and implement an upsert load method, updating records if they exist and inserting new ones otherwise.
Prompt 6: Load dim_customer
“We want to load data from < Your project>.< silver dataset>.ref_customer into < Your project>.< gold dataset>.dim_customer.
- If they do not exist, create the Gold dataset and the dim_customer table.
- Generate a surrogate key for dim_customer named customer_key. The surrogate key is an integer and must start with the value 1.
- Implement an upsert (Update/Insert) logic. Use the customer ID for record matching to update existing rows or insert new ones.
”
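For reference, the upsert the agent builds is conceptually a MERGE statement. The sketch below uses illustrative column, project, and dataset names, and its surrogate-key handling is deliberately simplified (keys continue from the current maximum and may leave gaps); the generated SQLX may implement this differently:

```sql
MERGE `my-project.gold.dim_customer` AS t
USING (
  SELECT
    s.customer_id,
    s.first_name,
    s.last_name,
    s.email,
    -- Simplified surrogate key: continue numbering after the current max key.
    (SELECT IFNULL(MAX(customer_key), 0) FROM `my-project.gold.dim_customer`)
      + ROW_NUMBER() OVER (ORDER BY s.customer_id) AS customer_key
  FROM `my-project.silver.ref_customer` AS s
) AS src
ON t.customer_id = src.customer_id
WHEN MATCHED THEN
  UPDATE SET first_name = src.first_name,
             last_name  = src.last_name,
             email      = src.email
WHEN NOT MATCHED THEN
  INSERT (customer_key, customer_id, first_name, last_name, email)
  VALUES (src.customer_key, src.customer_id, src.first_name, src.last_name, src.email);
```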
With these steps, you are now fully equipped to leverage Agent Smith for building robust data pipelines within your Medallion architecture.
What’s next?
Once you’ve implemented the data pipelines using the Data Engineering Agent, the next step is to turn them into reliable, production-ready assets.
- Scheduling: Once the pipeline is fully validated, you can schedule it to run automatically. Whether you require a daily, weekly, or monthly refresh, or a completely custom schedule, you can configure it directly within the pipeline tool. You can also orchestrate the data pipeline using Cloud Composer.
- CI/CD Integration: The pipelines generated by the Data Engineering Agent are standard Dataform SQLX files (see the sketch below). This means you can implement full Continuous Integration/Continuous Delivery (CI/CD), exactly as you would with any native Dataform project, including connecting your pipeline to a remote GitHub repository.
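For orientation, each generated file follows the standard Dataform SQLX shape: a config block followed by a SQL query. The example below is illustrative (the schema name, description, and columns are assumptions), not the agent’s exact output:

```sql
config {
  type: "table",
  schema: "silver",
  description: "Cleansed customer reference table"
}

SELECT
  customer_id,
  COALESCE(NULLIF(TRIM(first_name), ''), 'N/A') AS first_name,
  COALESCE(NULLIF(TRIM(last_name), ''), 'N/A') AS last_name,
  email
FROM ${ref("customer")}
```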
Conclusion
Leveraging the BigQuery Data Engineering Agent transforms the way you implement data pipelines in general and the medallion architecture in particular. By automating the creation of external tables, enforcing complex data quality rules, and enabling rapid, natural language-driven development, you can drastically reduce time-to-insight while maintaining high-quality, production-ready pipelines.
With these capabilities, data teams can focus less on repetitive coding and more on deriving business value—building analytics-ready, trustworthy datasets that fuel BI and machine learning at scale. Agent Smith is not just an assistant; it’s your accelerator for modern data engineering in BigQuery.