Uploading millions of documents to Google Cloud Storage

I’ve got a project to upload millions of files to Google Cloud Storage.

Total objects: 100 million+, total size: 14 TB.

My original plan was to use a 40 TB Google Transfer Appliance for this task. I could treat it as a portable NFS server, load the files onto it, ship it to Google, and have the contents loaded into GCS. My main motivation was how simple it is to get data onto the appliance, and that it avoids using our network bandwidth to move the data into the cloud.

Then I found the “known limitation” that Transfer Appliances only accept files >= 1 MB in size. Our files are primarily < 1 MB in size (about 140 KB on average, going by the totals above), so this seems to rule it out.

“The minimum file size for files copied onto the appliance is 1MB. To copy many smaller files, we recommend that you archive the files before copying them.”

I’m now pivoting to Google’s Storage Transfer Service to manage the uploads, transfer multiple files in parallel, and restrict bandwidth so the transfer doesn’t disrupt the daily operations of my network.

I’d love to hear from others about this. Is my interpretation that a Transfer Appliance can’t do this correct? Will I have success with Transfer Service and does anyone have any recommendations for my workload?

Thanks

Hi @maxwellpinna ,

Welcome to Google Cloud Community!

You are right in your interpretation that the Google Transfer Appliance is not a good fit for your workload because it doesn’t support files smaller than 1 MB, ruling out most of your dataset.

Switching to the Google Transfer Service is a wise decision. This service is ideal for managing small file datasets, allowing for parallel file uploads with restricted bandwidth usage to prevent network disruptions.
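
To get a feel for what a bandwidth limit means for the schedule, here is a rough back-of-the-envelope sketch. The 14 TB total comes from the question; the caps of 100, 200 and 500 Mbit/s are hypothetical values chosen only for illustration, and the estimate ignores per-object overhead, which can be significant with 100 million small files.

```python
def transfer_days(total_bytes: float, limit_mbit_per_s: float) -> float:
    """Days needed to move total_bytes at a sustained rate of limit_mbit_per_s."""
    seconds = (total_bytes * 8) / (limit_mbit_per_s * 1_000_000)
    return seconds / 86_400


TOTAL_BYTES = 14 * 10**12  # 14 TB from the question (decimal terabytes assumed)

for limit in (100, 200, 500):  # hypothetical bandwidth caps in Mbit/s
    print(f"{limit} Mbit/s -> {transfer_days(TOTAL_BYTES, limit):.1f} days")
# 100 Mbit/s -> 13.0 days
# 200 Mbit/s -> 6.5 days
# 500 Mbit/s -> 2.6 days
```

Treat these numbers as a lower bound on the transfer time; the real duration also depends on how well the small files can be uploaded in parallel.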

For your workload, you may:

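  • Use gsutil -m to run uploads in parallel (multiple threads and processes) for better throughput.
  • Limit the bandwidth used by the transfer so that network resources aren’t overwhelmed. As far as I know, gsutil has no built-in throttling flag, so this is usually done with the Transfer Service’s bandwidth limit or with traffic shaping on your network.
  • If feasible, aggregate files into larger archives before uploading. This may improve performance because there are far fewer objects to transfer.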
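
As an illustration of the archiving suggestion, here is a minimal sketch that groups small files into uncompressed tar archives of roughly a target size, using only the Python standard library. The function names, the `batch-NNNNNN.tar` naming scheme and the 1 GiB target are assumptions made for this example, not something prescribed by Google.

```python
import os
import tarfile
from typing import Iterator, List

TARGET_BYTES = 1 * 1024**3  # hypothetical target of about 1 GiB per archive


def batches(root: str, target_bytes: int = TARGET_BYTES) -> Iterator[List[str]]:
    """Yield lists of file paths under root whose sizes add up to about target_bytes."""
    batch, size = [], 0
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            batch.append(path)
            size += os.path.getsize(path)
            if size >= target_bytes:
                yield batch
                batch, size = [], 0
    if batch:  # leftover files that did not fill a whole batch
        yield batch


def write_archives(root: str, out_dir: str, target_bytes: int = TARGET_BYTES) -> List[str]:
    """Write one tar archive per batch into out_dir and return the archive paths.

    out_dir should live outside root, otherwise the walk would pick up the archives.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for i, batch in enumerate(batches(root, target_bytes)):
        archive = os.path.join(out_dir, f"batch-{i:06d}.tar")
        with tarfile.open(archive, "w") as tar:
            for path in batch:
                # Store paths relative to root so the layout survives extraction.
                tar.add(path, arcname=os.path.relpath(path, root))
        written.append(archive)
    return written
```

Keep in mind that the archives end up in the bucket as archives. If you need each document as its own object in GCS, they would have to be unpacked again after the upload.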

I hope the above information is helpful.
