JasonC
December 15, 2023, 9:24pm
I’m relatively new to AI and still trying to wrap my head around some concepts. I’ve followed this lab, which uses Document AI OCR to scan PDFs so that you can have conversations against them.
https://cloud.google.com/blog/products/ai-machine-learning/ask-your-documents-document-ai-and-palm2-for-question-answering
What I’m trying to understand is: what if I have an entire shared Google Drive or GCS bucket that I want a broad user community to be able to have conversations against? Do all of those documents need to go through Doc AI first? How are new documents handled?
Hi @JasonC,
Welcome and thank you for reaching out to our community.
Here are the things to consider when making a document collection available for users to query.
Processing of Documents:
Existing:
Full OCR extraction: This runs all the documents from your drive or storage bucket through Document AI to extract their data
Light extraction: This fetches only basic information, such as document type, title, and author, which can be used for quick categorization
New:
Set up scheduled processing or a trigger to capture newly added documents in your drive or storage bucket
Apply your chosen processing method, either full OCR or light extraction, depending on your needs
Update your database and indexes by adding the extracted information from your new documents
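As a rough illustration of the "New" steps above, here is a minimal sketch of incremental processing. It assumes you keep a record of which files have already been handled, so a scheduled job or trigger only sends new files for extraction. The names `process_new_documents` and `fake_extract` are hypothetical, and `fake_extract` is only a stand-in for whatever Document AI call you choose (full OCR or light extraction).

```python
from typing import Callable, Dict, Iterable, Set


def process_new_documents(
    listing: Iterable[str],
    processed: Set[str],
    extract: Callable[[str], str],
) -> Dict[str, str]:
    """Run `extract` on every listed document that has not been seen before.

    listing   -- current object names in the bucket or drive
    processed -- names handled on earlier runs (updated in place)
    extract   -- placeholder for the chosen extraction method
    """
    results: Dict[str, str] = {}
    for name in sorted(listing):
        if name in processed:
            continue  # already extracted on an earlier run
        results[name] = extract(name)
        processed.add(name)
    return results


def fake_extract(name: str) -> str:
    # Hypothetical stand-in for a real Document AI request; returns dummy text.
    return f"text of {name}"


seen: Set[str] = set()
# First run: every existing document is processed.
first = process_new_documents(["a.pdf", "b.pdf"], seen, fake_extract)
# Later run: only the newly added document is processed.
second = process_new_documents(["a.pdf", "b.pdf", "c.pdf"], seen, fake_extract)
print(sorted(first))   # ['a.pdf', 'b.pdf']
print(sorted(second))  # ['c.pdf']
```

In a real setup the `seen` set would live in your database rather than in memory, so it survives between runs.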
Storage:
You can store the extracted information in a database like Cloud SQL or BigQuery
You can also create indexes for faster information search and retrieval based on supplied keywords
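To make the storage and indexing idea concrete, here is a small sketch that uses Python's built-in `sqlite3` purely as a local stand-in for Cloud SQL or BigQuery. The table names (`documents`, `keyword_index`) and helper functions are assumptions for illustration, not a prescribed schema.

```python
import re
import sqlite3
from typing import List


def build_store() -> sqlite3.Connection:
    # In-memory database; a real deployment would connect to Cloud SQL instead.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (name TEXT PRIMARY KEY, text TEXT)")
    # The composite primary key doubles as an index for keyword lookups.
    conn.execute(
        "CREATE TABLE keyword_index ("
        "keyword TEXT, name TEXT, PRIMARY KEY (keyword, name))"
    )
    return conn


def add_document(conn: sqlite3.Connection, name: str, text: str) -> None:
    # Store the extracted text, replacing any earlier version of the document.
    conn.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)", (name, text))
    # Rebuild this document's keyword entries from its lowercase words.
    conn.execute("DELETE FROM keyword_index WHERE name = ?", (name,))
    keywords = set(re.findall(r"[a-z0-9]+", text.lower()))
    conn.executemany(
        "INSERT INTO keyword_index VALUES (?, ?)",
        [(keyword, name) for keyword in keywords],
    )


def search(conn: sqlite3.Connection, keyword: str) -> List[str]:
    # Return the names of all documents containing the keyword.
    rows = conn.execute(
        "SELECT name FROM keyword_index WHERE keyword = ? ORDER BY name",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]


conn = build_store()
add_document(conn, "invoice.pdf", "Invoice total due in March")
add_document(conn, "memo.pdf", "Meeting moved to March")
print(search(conn, "march"))    # ['invoice.pdf', 'memo.pdf']
print(search(conn, "invoice"))  # ['invoice.pdf']
```

Each newly processed document only needs one `add_document` call, which keeps the index in step with the collection.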
Considerations:
Document AI pricing: Full OCR costs more than light extraction, so be clear about which capabilities your users actually need before choosing.
I also found a post that discusses Document AI Extraction. You can check it out, as the solution there might be helpful to you.