How to UPSERT in BigQuery

Sorry for the confusion. You cannot directly update existing rows in a BigQuery table using the appendRows method. The appendRows method of the BigQuery Write API is designed primarily to add new rows of data to a table.

To achieve the effect of updating existing records, the common strategy is to:

  1. Identify the records that need updating (as explained in the previous steps).
  2. Prepare the updated data (with the modified values you want).
  3. Use appendRows to append these updated rows to the BigQuery table.

BigQuery’s Handling of Appended Rows:

When you use appendRows with data that has the same unique identifier(s) as existing rows:

  • BigQuery will add the appended rows as new rows to the table.
  • You will effectively have duplicate rows(one with the old values and one with the updated values).

Workarounds to Simulate Updates:

To address this and achieve the desired update behavior, you usually need to employ one of the following techniques:

1. Client-Side Deduplication and Overwrite:

  • Query the table to get the latest data (including the newly appended duplicates).
  • In your application logic, deduplicate on the unique key, keeping only the latest version of each record.
  • Delete the entire table content and re-insert all the deduplicated data.
  • Caveats:
    • Can be inefficient for large tables due to data transfer and processing overhead.
    • Temporary data unavailability during the delete and re-insert process.

2. Using MERGE Statement (for larger datasets or frequent updates):

  • Write a SQL query using the MERGEstatement. This statement allows you to match records based on a condition and conditionally perform updates or inserts.
  • Execute this MERGE query as a BigQuery Script.
  • Advantage: Handles the update logic more efficiently within the BigQuery engine.