Having an issue with data

Hello,
Im a beginner trying to do a project. I generated script from chat gpt for python where it will generate dummy data of employees name, address, ssn, password etc. load it into cloud storage bucket and use data fusion wrangler to transform data. Unfortunatley when I pulled data from cloud buck in data fusion i saw some of the field are missing value, and last two columns is totally empty. Can anyone help me trouble shoot this problem?

I'm a beginner working on a project. I used ChatGPT to generate a Python script that creates dummy employee data (name, address, SSN, password, etc.), loads it into a Cloud Storage bucket, and then uses the Data Fusion Wrangler to transform the data. Unfortunately, when I pulled the data from the Cloud Storage bucket into Data Fusion, I saw that some fields are missing values and the last two columns are completely empty. Can anyone help me troubleshoot this problem?

Here is my Python code:

import csv
from faker import Faker
from google.cloud import storage
import os

# Set Google Cloud project environment variable
os.environ['GOOGLE_CLOUD_PROJECT'] = 'marine-champion-432318-n3'

# Initialize Faker
fake = Faker()

# Generate dummy data
def generate_employee_data():
    data = {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "address": fake.address(),
        "phone_number": fake.phone_number(),
        "ssn": fake.ssn(),
        "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=65).isoformat(),
        "password": fake.password(length=12, special_chars=True, digits=True, upper_case=True, lower_case=True)
    }
    print(data)  # Print generated data for debugging
    return data

# Save data to CSV
def save_to_csv(file_path, data_list):
    with open(file_path, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=data_list[0].keys())
        writer.writeheader()
        writer.writerows(data_list)
    print(f"Data saved to {file_path}")

# Upload file to GCS
def upload_to_gcs(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # Initialize a client
    storage_client = storage.Client(project='marine-champion-432318-n3')
    # Get the bucket
    bucket = storage_client.bucket(bucket_name)
    # Create a blob object
    blob = bucket.blob(destination_blob_name)
    # Upload the file
    blob.upload_from_filename(source_file_name)
    print(f"File {source_file_name} uploaded to {destination_blob_name}.")

if __name__ == "__main__":
    # Generate a list of employee data
    employees = [generate_employee_data() for _ in range(10)]  # Adjust the number of records as needed

    # Define file paths
    csv_file_path = "employee_data.csv"

    # Save data to CSV
    save_to_csv(csv_file_path, employees)

    # Define GCS parameters
    bucket_name = "employee-project"  # Replace with your bucket name
    source_file_name = "employee_data.csv"
    destination_blob_name = "employee_data.csv"  # Blob name in GCS

    # Upload the file to GCS
    upload_to_gcs(bucket_name, source_file_name, destination_blob_name)

Here is a screenshot of the data viewed via the Wrangler in Data Fusion:

I didn't do anything in the GCP UI; I only used the Python script to load the data into the bucket. Please guide me.

Hi @Asif_Shaharia ,

Welcome to Google Cloud Community!

One possible reason you're seeing missing values and completely blank columns is a wrong delimiter setting in the Data Fusion Wrangler. If the Wrangler is configured for a different delimiter, for example a semicolon (;), you'll see data loss. Since employee_data.csv is comma-separated, you should be using the comma (,) delimiter.
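A related parsing pitfall worth checking: Faker's address() values typically contain an embedded newline, and Python's csv module handles that by quoting the field. A quote-aware reader round-trips such rows correctly, but a parser that splits on raw lines sees one logical row as several physical lines, which is exactly how trailing columns can show up empty. A minimal sketch (the sample values here are made up for illustration):

```python
import csv
import io

# One record with a multi-line address, like Faker's address() produces.
row = {"name": "Jane Doe", "address": "123 Main St\nSpringfield", "ssn": "000-00-0000"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=row.keys())
writer.writeheader()
writer.writerow(row)
text = buf.getvalue()

# A quote-aware CSV reader recovers exactly one logical row, intact:
rows = list(csv.DictReader(io.StringIO(text)))
assert len(rows) == 1
assert rows[0]["ssn"] == "000-00-0000"

# But the file physically contains three lines (header + two), so a
# naive line-by-line parser would see a broken second "row" with the
# last columns missing:
assert len(text.splitlines()) == 3
```

If this is the cause, one workaround is to flatten the field before writing, e.g. fake.address().replace("\n", ", ").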

Also, in your Data Fusion pipeline, examine the data types assigned to each field in the Wrangler transformations, and make sure the types are compatible with the data you're trying to process.
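For example, a column typed as "date" needs values in a format it can actually parse. A minimal sketch (assuming date_of_birth was written with .isoformat(), as in the script above; the sample value is made up):

```python
from datetime import date

# The ISO format produced by .isoformat() parses cleanly as a date:
value = "1990-05-17"
parsed = date.fromisoformat(value)
assert parsed == date(1990, 5, 17)

# By contrast, fields like Faker phone numbers can contain non-numeric
# characters (parentheses, dashes, extensions such as "x1234"), so typing
# such a column as an integer would fail conversion and can surface as
# null/missing values downstream.
```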

Note: Data Fusion is a visual, point-and-click interface that enables code-free deployment of ETL/ELT data pipelines. If you really want to use Python code in your pipeline, I'd highly suggest using Dataflow instead.

I hope the above information is helpful.