How to insert JSON along with PDF into Document AI Warehouse using API

Hi,

Our use case is to process PDF documents with a Document AI processor and pass the resulting JSON along with the PDF to Document AI Warehouse. I am using contentwarehouse.CreateDocumentRequest. It works well if I supply only the PDF, but if I process the file with Document AI and push the result along with the PDF, it fails with the following error:

File "", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document(docDictionary)

File "D:\Anaconda3\lib\site-packages\proto\message.py", line 566, in __init__
"Unknown field for {}: {}".format(self.__class__.__name__, key)

ValueError: Unknown field for Document: _pb

Following is the code snippet:

def process_document_sample(
    project_id: str,
    location: str,
    processor_id: str,
    file_path: str,
    mime_type: str,
    field_mask: str = None,
):
    # You must set the api_endpoint if you use a location other than 'us'.
    opts = storage.Client(project_id)

    client = documentai.DocumentProcessorServiceClient()

    # The full resource name of the processor, e.g.:
    # projects/{project_id}/locations/{location}/processors/{processor_id}
    name = client.processor_path(project_id, location, processor_id)

    # Read the file into memory
    with open(file_path, "rb") as image:
        image_content = image.read()

    # Load Binary Data into Document AI RawDocument Object
    raw_document = documentai.RawDocument(content=image_content, mime_type=mime_type)

    # Configure the process request
    request = documentai.ProcessRequest(
        name=name, raw_document=raw_document, field_mask=field_mask
    )

    result = client.process_document(request=request)

    return result

    # TODO(developer): Uncomment these variables before running the sample.
    # project_number = 'YOUR_PROJECT_NUMBER'
    # location = 'YOUR_PROJECT_LOCATION' # Format is 'us' or 'eu'

    #print(result.document.entities)

    # Create a Schema Service client
    import json
    with open('test.json', 'w') as f:
        json.dump(documentai.Document.to_dict(result.document), f)

    documentai.Document.to_dict(result)

    document_schema_client = contentwarehouse.DocumentSchemaServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_schema_client.common_location_path(
        project=config['project_number'], location=config['location']
    )

    # Create a Document Service client
    document_client = contentwarehouse.DocumentServiceClient()

    # The full resource name of the location, e.g.:
    # projects/{project_number}/locations/{location}
    parent = document_client.common_location_path(
        project=config['project_number'], location=config['location']
    )
    #print(result.document._pb)
    docDictionary = result.document.__dict__
    doc = documentai.types.Document(docDictionary)

    # Define Document
    document = contentwarehouse.Document(
        raw_document_file_type=1,
        display_name="60.pdf",
        document_schema_name=schema_URI,
        inline_raw_document=open('60.pdf', 'rb').read(),
        #plain_text=str(result.document)
        cloud_ai_document=doc
    )

    # Define Request
    create_document_request = contentwarehouse.CreateDocumentRequest(
        parent=parent, document=document
    )

    # Create a Document for the given schema
    response = document_client.create_document(request=create_document_request)

    print(response)

process_document_sample(config['project_id'], config['location'], config['Custom_processor_id'], file_path, mime_type)

I have read all the documentation, but couldn't find why the dictionary is not being accepted by the Document object.
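
One possible reading of the ValueError above, assuming the conversion line reads the instance's `__dict__`: proto-plus style message wrappers seem to keep their data in a single private attribute named `_pb`, so the instance `__dict__` lists `_pb` rather than the message fields, and the constructor then rejects `_pb` as an unknown field. The sketch below reproduces that behaviour with a toy class. `ToyMessage` and its two fields are hypothetical stand-ins for illustration, not the Document AI library.

```python
class ToyMessage:
    """Toy stand-in for a proto-plus style message wrapper (not the real library)."""

    _fields = ("text", "mime_type")

    def __init__(self, mapping=None, **kwargs):
        values = dict(mapping or {}, **kwargs)
        # Reject any key that is not a declared field, as the traceback suggests.
        for key in values:
            if key not in self._fields:
                raise ValueError(
                    "Unknown field for {}: {}".format(type(self).__name__, key)
                )
        # All data lives in one private attribute, so __dict__ is {"_pb": ...}.
        self._pb = {name: values.get(name, "") for name in self._fields}

    @classmethod
    def to_dict(cls, instance):
        """Return the field values as a plain dictionary."""
        return dict(instance._pb)


original = ToyMessage(text="hello", mime_type="application/pdf")

# The instance __dict__ exposes the private wrapper attribute, not the fields.
print(list(vars(original)))  # ['_pb']

try:
    ToyMessage(vars(original))
except ValueError as exc:
    error_message = str(exc)
    print(error_message)  # Unknown field for ToyMessage: _pb

# A dictionary produced by to_dict contains only real fields and round-trips.
copy = ToyMessage(ToyMessage.to_dict(original))
print(ToyMessage.to_dict(copy) == ToyMessage.to_dict(original))  # True
```

If the real Document class behaves the same way, a dictionary built with documentai.Document.to_dict(result.document), as already used for test.json, would be the safer input than the instance `__dict__`.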

Hi @malimasood ,

Welcome to Google Cloud Community.

It looks like the error comes from the following line:

doc = documentai.types.Document(docDictionary)

The docDictionary variable holds a dictionary representation of the Document object returned by the Document AI API, but it appears that the Document class constructor does not accept this dictionary as a parameter.

Instead, you should generate a new Document object from the dictionary representation using the from_dict class method of the Document class. Here's how to change your code so that it uses from_dict:

doc = documentai.types.Document.from_dict(docDictionary)

After making this modification, the doc variable should hold a valid Document object that you can pass along in the CreateDocumentRequest.

Here is some documentation that might help you:
https://cloud.google.com/document-ai/docs/reference/rest/v1/Document?_ga=2.116734662.-1392753435.1676655686
https://cloud.google.com/document-ai/docs/handle-response?_ga=2.116734662.-1392753435.1676655686
https://cloud.google.com/discovery-engine/media/docs/documents?_ga=2.138870387.-1392753435.1676655686

Thanks for the prompt response. I tried your solution, but it still throws an exception:

Traceback (most recent call last):

File "", line 1, in <module>
runfile('C:/Users/HP/Downloads/test simple.py', wdir='C:/Users/HP/Downloads')

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/HP/Downloads/test simple.py", line 110, in <module>
process_document_sample(config['project_id'],config['location'],config['Custom_processor_id'],file_path,mime_type)

File "C:/Users/HP/Downloads/test simple.py", line 88, in process_document_sample
doc=documentai.types.Document.from_dict(docDictionary)

AttributeError: type object 'Document' has no attribute 'from_dict'

Code changes:
docDictionary = result.document.__dict__
doc = documentai.types.Document.from_dict(docDictionary)
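
A possible way forward, offered as an untested sketch rather than a confirmed fix: the AttributeError is consistent with proto-plus message classes providing to_dict, to_json and from_json but no from_dict. Since result.document is already a Document object, it may be possible to skip the dictionary step and pass it to cloud_ai_document directly. This assumes the warehouse library expects the same Document class that process_document returns, which I have not verified for every library version. The helper name build_warehouse_document and its warehouse parameter are hypothetical; the parameter is only there so the function can be tried without the client library installed.

```python
def build_warehouse_document(
    processed_document,
    pdf_bytes,
    display_name,
    schema_name,
    warehouse=None,
):
    """Sketch: bundle a PDF and its Document AI result into one warehouse document.

    processed_document is assumed to be result.document from process_document,
    passed through unchanged instead of being rebuilt from a dictionary.
    """
    if warehouse is None:
        # Imported lazily so this module loads without the client library.
        from google.cloud import contentwarehouse as warehouse

    return warehouse.Document(
        # 1 is the value used in the original snippet (understood to mean PDF).
        raw_document_file_type=1,
        display_name=display_name,
        document_schema_name=schema_name,
        inline_raw_document=pdf_bytes,
        cloud_ai_document=processed_document,
    )
```

Inside process_document_sample this would replace the two docDictionary lines, for example document = build_warehouse_document(result.document, open('60.pdf', 'rb').read(), "60.pdf", schema_URI). If an independent copy of the Document is needed, documentai.Document.from_json(documentai.Document.to_json(result.document)) might serve as the round trip.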