Error importing JSONL data from GCS into RAG Data Store – invalid unstructured data format

When attempting to synchronize/import data from a Google Cloud Storage bucket into a RAG-enabled Data Store (Discovery Engine), the import fails with the following error:

“The provided GCS URI has invalid unstructured data format. Please provide a valid GCS path in either NDJSON (.ndjson) or JSON Lines (.jsonl) format.”

The bucket contains:

  • metadata.jsonl (JSON Lines format)

  • applications.txt (referenced via GCS URI inside metadata.jsonl)

The .jsonl file follows the documented structure and references the .txt file using a gs:// URI. However, Discovery Engine rejects the GCS path during synchronization and does not index the data.
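
For reference, here is a minimal sketch of how one line of such a metadata file could be generated. The field names (`id`, `structData`, `content.mimeType`, `content.uri`) are my understanding of the documented layout for unstructured documents with metadata and should be checked against the current docs. The bucket name and document ID are hypothetical placeholders.

```python
import json


def metadata_line(doc_id, uri, mime_type="text/plain", struct_data=None):
    """Build one JSON Lines record that points at an external content file.

    Field names are an assumption about the documented schema, not a
    confirmed contract.
    """
    record = {
        "id": doc_id,
        "structData": struct_data or {},
        "content": {"mimeType": mime_type, "uri": uri},
    }
    # json.dumps escapes newlines inside strings, so the record stays on one line.
    return json.dumps(record)


# Hypothetical bucket and ID, for illustration only.
line = metadata_line("applications-1", "gs://example-bucket/applications.txt")
print(line)
```

Writing each record with `json.dumps` plus a single trailing newline avoids pretty-printed, multiline objects, which are not valid JSON Lines.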

I have verified:

  • The bucket and Data Store are in the same region

  • The file extension is .jsonl

  • The JSON structure matches the documentation

  • MIME types were tested (text/plain, application/json, text/markdown)

  • Full text embedding inside .jsonl was also attempted

Despite this, the import consistently fails with the same format validation error.

Hi @Mary_Oud, this usually means the file is not valid JSON Lines. Each line must be exactly one valid JSON object: no array brackets, no multiline JSON, no empty lines, and no BOM. Also make sure the GCS path points directly to the .jsonl file and that the service account has access. As a test, try importing a minimal JSONL file with one simple inline document. If that works, the issue is in your structure or in the external file reference.
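
If it helps, those checks can be sketched as a small local script. This is only an approximation of what a strict JSON Lines reader might reject; it is not Discovery Engine's actual validator, and the function name `check_jsonl` is made up for this example.

```python
import json


def check_jsonl(data: bytes) -> list:
    """Return a list of problems found in raw JSONL bytes (empty list = none found)."""
    problems = []
    # A UTF-8 byte order mark at the start of the file breaks strict parsers.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
        data = data[3:]
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return problems + [f"not valid UTF-8: {exc}"]

    lines = text.split("\n")
    # One trailing newline is normal; drop the empty string it produces.
    if lines and lines[-1] == "":
        lines.pop()

    for n, line in enumerate(lines, start=1):
        line = line.rstrip("\r")
        if not line.strip():
            problems.append(f"line {n}: empty line")
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            # Multiline (pretty-printed) JSON and trailing junk both land here.
            problems.append(f"line {n}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            # Catches a file that wraps the records in a top-level array.
            problems.append(
                f"line {n}: top-level value is {type(obj).__name__}, not an object"
            )
    return problems
```

Pass it the raw bytes of the file, for example `check_jsonl(open("metadata.jsonl", "rb").read())` on a local copy. Reading in binary mode matters, because a text-mode read can silently hide a BOM.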

Hi @a_aleinikov, thank you so much for your help.

I simplified the setup and tested with just one JSON object per file (one line each). The files are valid JSONL, with no BOM, no multiline content, and no extra characters.

I checked the GCS URI as well, and it is valid and accessible.

Despite this, the import still fails with this error message: “extraneous characters after end of JSON object.”
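
To narrow down where such extra characters might be, here is a rough sketch that parses the first JSON value on a line and reports anything left over. It uses Python's standard `json.JSONDecoder.raw_decode`. I do not know which parser the import service uses, so this only mimics the kind of check the message suggests, and `find_trailing_data` is a made-up helper name.

```python
import json


def find_trailing_data(line: str):
    """Return (offset, remainder) if anything other than whitespace follows
    the first JSON value on the line, otherwise None."""
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so strip it first.
    stripped = line.lstrip()
    _, end = decoder.raw_decode(stripped)
    rest = stripped[end:]
    if rest.strip() == "":
        return None
    # Offset is relative to the original, unstripped line.
    offset = len(line) - len(stripped) + end
    return offset, rest


# Hypothetical local copy of the file; repr() makes invisible characters visible.
# with open("metadata.jsonl", encoding="utf-8") as fh:
#     for n, raw in enumerate(fh, start=1):
#         hit = find_trailing_data(raw)
#         if hit:
#             print(n, hit[0], repr(hit[1]))
```

Printing the remainder with `repr()` is deliberate: a second object on the same line, a trailing comma, or a stray zero-width character all look different there, even when an editor shows nothing unusual.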

Currently, metadata.jsonl references applications.jsonl through the uri field.