Dataset creation: Unable to import data due to errors.

Trying to import a simple dataset like the one provided here:

https://cloud.google.com/vertex-ai/generative-ai/docs/models/tune-text-models-supervised?hl=pt-br#sample-datasets

I always get “Unable to import data due to errors.” The details don’t help much:

Error: Error parsing index file gs://features/simple_dataset.jsonl line 2 for:Error: Error parsing index file gs://features/simple_dataset.jsonl line 1 for:Error: Data item expected to have non-empty blob_data_gcs_uri for example id 5238579814764755606 for:

If I’m unable to import a dataset provided by the docs, how can I import my own dataset and make sure it works?

Why does this happen?


Hi @gensummary,

Welcome to Google Cloud Community!

From my understanding of the link you provided, you are trying to import a sample dataset by following the supervised tuning guide for the PaLM 2 model.

The error message “Unable to import data due to errors” and the specific details you provided indicate a problem with the format or content of your dataset file. In this case, gs://features/simple_dataset.jsonl.

Here’s why the error occurs:

  • Missing or Invalid Data: The error message suggests that the dataset file is missing data or contains invalid data. Specifically, it mentions a missing blob_data_gcs_uri for a particular example ID.
    • blob_data_gcs_uri: This field likely points to the actual data (e.g., a text file) associated with the example. If it’s missing or invalid, Vertex AI cannot locate the data it needs.
  • Incorrect JSONL Formatting: JSON Lines (JSONL) files are designed to have each JSON object on a separate line. If your file has incorrect formatting (e.g., multiple JSON objects on a single line or missing line breaks), it will lead to parsing errors.

And here are the troubleshooting steps:

1. Verify the blob_data_gcs_uri Field:

  • Download the simple_dataset.jsonl file: Use gsutil cp gs://features/simple_dataset.jsonl local_file.jsonl to download the file locally.
  • Inspect the File: Open the local_file.jsonl file in a text editor. Look for the line corresponding to example ID 5238579814764755606 (or any other example ID mentioned in the error).
  • Check the blob_data_gcs_uri: Make sure the blob_data_gcs_uri field exists and contains a valid Google Cloud Storage (GCS) URI pointing to the actual data associated with that example.
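The check in the last bullet can be automated. Below is a minimal sketch (the function name `find_missing_blob_uri` is mine, not from any SDK) that scans a downloaded JSONL file and reports every line whose record lacks a usable `blob_data_gcs_uri`:

```python
import json

def find_missing_blob_uri(path):
    """Return the line numbers whose record has no non-empty blob_data_gcs_uri."""
    bad_lines = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            uri = record.get("blob_data_gcs_uri", "")
            # A valid value must be present and look like a GCS URI.
            if not uri or not uri.startswith("gs://"):
                bad_lines.append(lineno)
    return bad_lines
```

Running this on the downloaded `local_file.jsonl` tells you exactly which lines Vertex AI would reject for a missing data reference.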

2. Inspect the JSONL Format:

  • Use a JSON Validator: Tools like JSONLint or your IDE’s JSON validation capabilities can help identify formatting issues.
  • Look for Missing Line Breaks: Make sure each JSON object is on a separate line in your file.
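If you prefer not to paste the file into an online validator, a short script can do the same check locally. This is a sketch (the name `validate_jsonl` is mine) that parses each line independently, exactly as a JSONL consumer would:

```python
import json

def validate_jsonl(path):
    """Yield (line_number, error_message) for every line that is not valid JSON."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # blank lines are not records
            try:
                json.loads(line)
            except json.JSONDecodeError as e:
                # Two objects crammed onto one line, or a missing line break,
                # surface here as a parse error with a column offset.
                yield lineno, str(e)
```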

3. Examine Your Dataset Structure:

  • Check for Consistency: Ensure that all the examples in your dataset have the same structure and include the necessary fields.
  • Use the Correct File Extension: Use .jsonl for JSON Lines files.
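To check structural consistency across the whole file, you can collect the set of field names used by each record; a consistent dataset yields exactly one key set. A minimal sketch (the helper name is mine):

```python
import json

def check_consistent_keys(path):
    """Return the distinct key sets found across all records.

    A single-element result means every record has the same fields.
    """
    key_sets = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # frozenset of a dict captures its keys only.
                key_sets.add(frozenset(json.loads(line)))
    return key_sets
```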

Here are some guidelines on importing your own dataset:

1. Prepare Your Data:

  • Create a JSONL file: If you don’t have a JSONL file, convert your data into this format.
  • Store Your Data in GCS: Upload your JSONL file to Google Cloud Storage (GCS). Make sure it’s publicly accessible or that the Vertex AI service account has access to the bucket.

2. Use the ImportData Function:

  • In the Vertex AI UI: Follow the steps in the documentation to create a model and import your dataset.
  • Using the API: The documentation you linked provides code examples for importing data using the Vertex AI Python library.

3. Adapt the blob_data_gcs_uri: Make sure the blob_data_gcs_uri field in your JSONL file correctly points to the GCS location of your actual data files.
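If your data files follow a predictable layout in GCS, you can fill in `blob_data_gcs_uri` programmatically rather than by hand. The sketch below assumes a hypothetical naming scheme (`<bucket_prefix>/<line_number>.txt`); adapt it to however your files are actually named:

```python
import json

def set_blob_uris(in_path, out_path, bucket_prefix):
    """Rewrite a JSONL file so each record's blob_data_gcs_uri points at
    <bucket_prefix>/<line_number>.txt (a hypothetical naming scheme)."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            record["blob_data_gcs_uri"] = f"{bucket_prefix}/{i}.txt"
            dst.write(json.dumps(record) + "\n")
```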

Additional Tips:

  • Logging: Enable logging for your Vertex AI job to get more detailed error messages.
  • Debug Locally: If possible, try to debug the data import process locally by loading the dataset into a Python environment and simulating the steps Vertex AI takes.

By carefully inspecting your dataset, addressing any formatting or content errors, and following the guidelines for importing data into Vertex AI, you should be able to resolve the “Unable to import data due to errors” issue.

Should you want to try to import a dataset using the Gemini model, you may check this documentation for reference.

I hope the above information is helpful.


It looks like your dataset might have formatting issues. Here are some quick tips to resolve them:

  1. Check JSONL Format: Ensure each line is a valid JSON object.

    {"text": "Example 1", "label": "positive"}
    {"text": "Example 2", "label": "negative"}
    
  2. Ensure blob_data_gcs_uri is Non-Empty: Each entry should have a valid blob_data_gcs_uri.

    {"text": "Example 1", "label": "positive", "blob_data_gcs_uri": "gs://your-bucket/path/to/data1"}
    {"text": "Example 2", "label": "negative", "blob_data_gcs_uri": "gs://your-bucket/path/to/data2"}
    
  3. Match Schema: Verify the schema in Vertex AI matches your JSONL structure.

  4. UTF-8 Encoding: Ensure your file is saved in UTF-8 format.
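The four tips above can be bundled into one local lint pass before you upload anything. This is a sketch (the name `lint_dataset` is mine) that checks UTF-8 decoding, per-line JSON validity, non-empty `blob_data_gcs_uri`, and a single shared schema:

```python
import json

def lint_dataset(path):
    """Run the four checks above; return a list of human-readable problems."""
    problems = []
    try:
        raw = open(path, "rb").read().decode("utf-8")
    except UnicodeDecodeError as e:
        return [f"not UTF-8: {e}"]
    schemas = set()
    for lineno, line in enumerate(raw.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as e:
            problems.append(f"line {lineno}: invalid JSON ({e})")
            continue
        if not record.get("blob_data_gcs_uri"):
            problems.append(f"line {lineno}: empty blob_data_gcs_uri")
        schemas.add(frozenset(record))  # key set of this record
    if len(schemas) > 1:
        problems.append("records do not share a single schema")
    return problems
```

An empty result means the file passes all four local checks; anything Vertex AI still rejects after that is likely a schema mismatch on the service side.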

By following these steps, you should be able to import your dataset successfully. If issues persist, consider reaching out to Google Cloud support.

I’ve checked all that. The provided JSONL in the docs follows this structure:

{"input_text": "aaa", "output_text": "bbb"}
{"input_text": "ccc", "output_text": "ddd"}

File is in UTF-8:

$ file -i dataset.jsonl
dataset.jsonl: application/json; charset=utf-8

Step 3 is intriguing. Where can I check that?
