Context
I have many Google Sheets spreadsheets with the same layout. In a master sheet called “links db,” I store the URLs for all these sheets (one per row). I want a process that goes through that list, downloads each sheet with its contents, and deposits them consolidated into a Google Cloud Storage (GCS) bucket, either in CSV (minimal) or Parquet (ideal).
I’m looking for the best practical approach. I’m sharing what I’ve put together so you can suggest improvements or best practices.
You can try a Cloud Function, triggered by a Cloud Scheduler job. The function will read the URLs from your “links db” sheet, download each spreadsheet as a CSV, and merge them into a single file. This consolidated CSV is then uploaded to a Google Cloud Storage (GCS) bucket.
To enhance reliability and analytical capability, consider converting the data to Parquet. This can be done within the same Cloud Function using a library like pyarrow for smaller files, or by using a managed service like Cloud Dataflow for larger datasets. This method is highly efficient for data analysis because Parquet is a columnar format, allowing services like BigQuery to read only the necessary data.
For secure authentication, use a service account with the principle of least privilege, granting it only the necessary permissions to read from Sheets and write to GCS. To ensure reliability, include robust error handling with exponential backoff and use Cloud Logging and Cloud Monitoring to track the process and receive alerts for any failures.
Yes, it’s exactly the pipeline I had in mind: a Cloud Function triggered by Cloud Scheduler that reads the URLs from the “db links” sheet, consolidates the data into a CSV, and uploads it to the GCS bucket. I’d even considered the possibility of transforming it into Parquet to optimize analysis in BigQuery.
What I’m struggling with is the practical part of building the Cloud Function, since I don’t come from the programming industry. Do you know of any repositories or resources where I can see examples of similar projects, specifically those involving ingesting data from Google Spreadsheets to GCS? That would be quite helpful as a starting point.