Ingesting a high volume of small .gz JSON files into Cloud Storage leads to issues such as slow query performance, increased metadata management overhead, and higher storage costs. This is a common problem in data lakes with high-frequency ingestion, and regular file compaction is needed to keep the data lake operating efficiently.
To handle millions of small .gz JSON files in Cloud Storage, you can compact them into larger files using Dataflow (Apache Beam). You may try the following approach:
Create a Dataflow Pipeline:
The pipeline reads the small files from Cloud Storage, groups them, and writes them back as a smaller number of larger files (e.g., .gz, .parquet, or .avro); a minimal sketch follows.
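A minimal sketch of such a pipeline in Python, assuming placeholder bucket paths and project/region values (gs://my-bucket/..., my-project, us-central1) that you would replace with your own:

```python
import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder paths; replace with your own bucket and prefixes.
INPUT_PATTERN = "gs://my-bucket/raw/*.json.gz"
OUTPUT_PREFIX = "gs://my-bucket/compacted/part"


def run():
    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",              # assumption: your GCP project ID
        region="us-central1",              # assumption: your Dataflow region
        temp_location="gs://my-bucket/tmp",
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            # ReadFromText decompresses .gz files automatically based on the extension.
            | "ReadSmallFiles" >> beam.io.ReadFromText(INPUT_PATTERN)
            # Re-shard into a small, fixed number of output files so that each
            # output file is large instead of having millions of tiny ones.
            | "WriteCompacted" >> beam.io.WriteToText(
                OUTPUT_PREFIX,
                file_name_suffix=".json.gz",
                num_shards=32,  # tune so each shard lands in a reasonable size range
                compression_type=CompressionTypes.GZIP,
            )
        )


if __name__ == "__main__":
    run()
```

The key knob here is num_shards: forcing a fixed, small shard count is what turns millions of inputs into a handful of large compacted outputs.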
Schedule the Pipeline:
Use Cloud Scheduler with Cloud Functions or Composer to run the compaction job on a regular schedule (hourly or daily), as sketched below.
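One way to wire this up is an HTTP-triggered Cloud Function that Cloud Scheduler calls on a cron schedule. The sketch below assumes the compaction pipeline has already been packaged as a classic Dataflow template at a placeholder path (gs://my-bucket/templates/compact-json); the project, region, and function name are likewise assumptions.

```python
# main.py for an HTTP-triggered Cloud Function invoked by Cloud Scheduler.
from googleapiclient.discovery import build

PROJECT = "my-project"                                    # assumption
REGION = "us-central1"                                    # assumption
TEMPLATE_PATH = "gs://my-bucket/templates/compact-json"   # assumption


def trigger_compaction(request):
    """Launches the compaction pipeline from its Dataflow template."""
    dataflow = build("dataflow", "v1b3", cache_discovery=False)
    response = (
        dataflow.projects()
        .locations()
        .templates()
        .launch(
            projectId=PROJECT,
            location=REGION,
            gcsPath=TEMPLATE_PATH,
            body={"jobName": "compact-small-json-files"},
        )
        .execute()
    )
    job_id = response.get("job", {}).get("id", "unknown")
    return f"Launched Dataflow job: {job_id}"
```

If you already run Cloud Composer, an Airflow DAG with the Google Dataflow operators achieves the same thing without the extra Cloud Function.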
Use Efficient Formats:
Convert the data to columnar formats like Parquet or Avro for better query performance and compression, especially if the data will be queried from BigQuery; see the sketch below.
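As an illustration, the same Beam pipeline can parse the JSON records and write Parquet instead of re-gzipped text. The schema and field names below are hypothetical; adjust them to match your actual records (requires pyarrow alongside apache-beam[gcp]).

```python
import json

import apache_beam as beam
import pyarrow as pa
from apache_beam.io.parquetio import WriteToParquet

# Hypothetical schema: the field names are placeholders for your JSON keys.
SCHEMA = pa.schema([
    ("event_id", pa.string()),
    ("event_time", pa.string()),
    ("payload", pa.string()),
])

with beam.Pipeline() as p:
    (
        p
        | "ReadJsonGz" >> beam.io.ReadFromText("gs://my-bucket/raw/*.json.gz")
        # Each parsed record must be a dict whose keys match the schema above.
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteParquet" >> WriteToParquet(
            "gs://my-bucket/compacted/events",
            schema=SCHEMA,
            codec="snappy",
            file_name_suffix=".parquet",
            num_shards=32,
        )
    )
```

Large, snappy-compressed Parquet files are generally cheaper to scan and load into BigQuery than millions of gzipped JSON objects.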