Hi there
We have an issue with a Python Cloud Function that writes data to a BigQuery table using the google-cloud-bigquery Python client library.
First, to give you some context:
We have a Python 3.9 Cloud Function that reads data from a spreadsheet and writes that data into a BigQuery table. Since the data collected from the spreadsheet is very sensitive, we would like to add Data Catalog policy tags to the BigQuery table to make sure the data isn't visible to all the developers in our organisation. To add those policy tags, we've attached them to the relevant SchemaField entries in the LoadJobConfig in our Python code. To write the data to the table we use the WRITE_TRUNCATE disposition, which overwrites the existing data with the new data.
Every time we run the Cloud Function, the policy tags are completely removed from every field in the table, and we have to put them back manually to secure our sensitive data. After days of debugging and research, we found that this probably happens because WRITE_TRUNCATE replaces the complete table, including its schema, before adding the new data. This wouldn't be a problem if the Cloud Function re-applied the tags after loading the data, but that never happens. It looks like this is a known issue and we are not the only ones who have it. Here is a StackOverflow post from developers with the same problem: https://stackoverflow.com/questions/64440753/bigqueryoperator-changes-the-table-schema-and-column-modes-when-write-dispositio
For now, we first truncate all our data and then add the new data with the WRITE_APPEND disposition, since this does not overwrite the whole table but just appends the rows. This is of course not ideal: if the Cloud Function crashes or times out after truncating the data but before it has had the chance to add the new data back, our table is left empty until the next run.
Is there any way we can get some help finding a good, working solution to this problem? And if anyone from Google reads this: is it possible to check whether the problem lies in Google's backend code/API?
Thanks in advance