Importing to Vertex dataset does not import labels.

In Vertex AI I am updating an image dataset, thus:

from google.cloud import aiplatform
import_schema_uri = aiplatform.schema.dataset.ioformat.image.single_label_classification
dataset_id = "my_ds_id"

ds = aiplatform.ImageDataset(dataset_id)
ds.import_data(gcs_source=DATASET_PATH, import_schema_uri=import_schema_uri)

The images are uploaded to the dataset, but their labels are ignored and they are classed as Unlabeled. What am I doing wrong? TIA!

PS they are in a csv, like:

gs://path/to/file/barnacles.jpg,label1

which worked fine for the dataset creation.

You could check this sample code for importing data for single-label image classification:

from google.cloud import aiplatform

def import_data_image_classification_single_label_sample(
    project: str,
    dataset_id: str,
    gcs_source_uri: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
    timeout: int = 1800,
):
    # The AI Platform services require regional API endpoints.
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    # This client only needs to be created once, and can be reused for multiple requests.
    client = aiplatform.gapic.DatasetServiceClient(client_options=client_options)
    import_configs = [
        {
            "gcs_source": {"uris": [gcs_source_uri]},
            "import_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/ioformat/image_classification_single_label_io_format_1.0.0.yaml",
        }
    ]
    name = client.dataset_path(project=project, location=location, dataset=dataset_id)
    response = client.import_data(name=name, import_configs=import_configs)
    print("Long running operation:", response.operation.name)
    import_data_response = response.result(timeout=timeout)
    print("import_data_response:", import_data_response)
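A frequent cause of images landing as Unlabeled is a malformed row in the import CSV (a header line, stray whitespace, or an empty label column). As a sketch, assuming the two-column `gs://uri,label` format shown in the question (the helper name is mine, not part of the Vertex SDK), a quick local sanity check could be:

```python
import csv
import io

def check_rows(csv_text: str):
    """Return (bad_row_numbers, labels) for a two-column image-classification CSV."""
    bad, labels = [], set()
    for i, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        # Each row must be exactly: gs://bucket/path/image.jpg,label
        if len(row) != 2 or not row[0].startswith("gs://") or not row[1].strip():
            bad.append(i)
        else:
            labels.add(row[1].strip())
    return bad, labels

sample = "gs://path/to/file/barnacles.jpg,label1\n"
print(check_rows(sample))  # ([], {'label1'})
```

If it reports bad rows, or the label set isn't what you expect, fix the CSV before re-importing.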

Thanks, but exactly the same result.

From this Tensorflow blog post:

> In addition to image files, we’ve provided a CSV file (all_data.csv) containing the image URIs and labels. We randomly split this data into two files, train_set.csv and eval_set.csv, with 90% data for training and 10% for eval, respectively.
>
> gs://cloud-ml-data/img/flower_photos/dandelion/17388674711_6dca8a2e8b_n.jpg,dandelion
> gs://cloud-ml-data/img/flower_photos/sunflowers/9555824387_32b151e9b0_m.jpg,sunflowers
> gs://cloud-ml-data/img/flower_photos/daisy/14523675369_97c31d0b5b.jpg,daisy
> gs://cloud-ml-data/img/flower_photos/roses/512578026_f6e6f2ad26.jpg,roses
> gs://cloud-ml-data/img/flower_photos/tulips/497305666_b5d4348826_n.jpg,tulips
>
> We also need a text file containing all the labels (dict.txt), which is used to sequentially map labels to internally used IDs. In this case, daisy would become ID 0 and tulips would become 4. If the label isn’t in the file, it will be ignored from preprocessing and training.
>
> daisy
> dandelion
> roses
> sunflowers
> tulips

Therefore, you need to create the dict.txt file containing all the labels used, as shown above.
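If you do want to try that (pre-Vertex) workflow, a minimal sketch for generating dict.txt from the two-column CSV could look like this (the function name and file paths are placeholders, not from the thread):

```python
import csv

def write_label_dict(csv_path: str, dict_path: str) -> list:
    """Collect unique labels from a `uri,label` CSV and write them one per line."""
    labels = set()
    with open(csv_path, newline="") as f:
        # Assumes every row has exactly two fields: gcs_uri, label
        for uri, label in csv.reader(f):
            labels.add(label.strip())
    ordered = sorted(labels)  # line order defines the label -> ID mapping
    with open(dict_path, "w") as f:
        f.write("\n".join(ordered) + "\n")
    return ordered
```

Sorting the labels gives a stable label-to-ID mapping across runs.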


Thanks but that is six years old and not a Vertex AI dataset.

Could you please raise a private thread in the issue tracker (referencing this question, as stated in the template) with the project ID, the job ID, and a sample of your input CSV file? (We don’t want the entire file or any PII.)