I am trying to improve the “Invoice Parser” processor so that it recognizes some additional labels. The major problem is that it does not see some numbers. I tested the “OCR processor” and it extracted these numbers without issue, but on the training screen, when I select these numbers as in the screenshot, the value comes up empty. Even if I enter the correct value myself, I cannot train the model, because it says the selected labels are empty in the documents. How can I fix this issue? I have multiple documents and all of them have this problem. It does not skip all numbers, only some of them (especially “0” in this table).
There are several reasons why Document AI might be struggling to recognize zeros in your training data. Here’s a consolidated view of the potential causes and solutions:
Labeling Issues:
Incorrect Labeling: Double-check your labeling to ensure zeros are accurately selected. They might be accidentally skipped, labeled as spaces, or merged with adjacent characters.
Label Format: Verify that the labeling tool uses the expected format (bounding boxes or text annotations) and positions them correctly around the zeros.
Labeled Documents: Labeled documents are required to train, up-train, or evaluate a processor version.
Data Quality and Preprocessing:
Document Clarity: Analyze your training documents. Are zeros clear and well-formatted, or are there issues with blurry scans, small font sizes, or low contrast with the background?
Preprocessing Options: If available, explore Document AI’s preprocessing settings. Look for options that might improve small character recognition, adjust contrast specifically for numbers, or filter out noise that might obscure zeros.
Clear Labeling: Ensure consistent and accurate labeling of zeros throughout your training data.
Data Quality: Use high-quality training documents with clear and well-formatted zeros.
Balanced Training: Balance your training data set to avoid biasing the model towards more frequent characters.
Iterative Training: Train, test, and refine your model iteratively, adjusting labeling, preprocessing, or the data set based on your findings.
By addressing these factors and implementing the appropriate solutions, you should be able to improve Document AI’s ability to recognize zeros in your training data and successfully train your model.
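One way to check whether the zeros are present in the text the processor itself produced (as opposed to the output of a separate OCR processor) is to inspect the tokens in an exported Document JSON. The following is a minimal sketch, assuming the document was exported as JSON with the `text` and `pages[].tokens[].layout.textAnchor.textSegments` fields; the helper names `token_text` and `find_tokens` and the sample data are hypothetical and only for illustration.

```python
def token_text(document: dict, layout: dict) -> str:
    """Return the text covered by a layout's textAnchor segments."""
    text = document.get("text", "")
    segments = layout.get("textAnchor", {}).get("textSegments", [])
    parts = []
    for seg in segments:
        # In JSON exports the indices are often strings, and startIndex
        # may be omitted when it is 0, so default and convert explicitly.
        start = int(seg.get("startIndex", 0))
        end = int(seg.get("endIndex", 0))
        parts.append(text[start:end])
    return "".join(parts)


def find_tokens(document: dict, wanted: str) -> list:
    """List (page_number, token_index) for tokens whose stripped text equals `wanted`."""
    hits = []
    for page_no, page in enumerate(document.get("pages", []), start=1):
        for tok_idx, token in enumerate(page.get("tokens", [])):
            if token_text(document, token.get("layout", {})).strip() == wanted:
                hits.append((page_no, tok_idx))
    return hits


# Hypothetical sample standing in for a real exported document.
sample = {
    "text": "Qty 0 Total 15\n",
    "pages": [{
        "tokens": [
            {"layout": {"textAnchor": {"textSegments": [{"endIndex": "4"}]}}},
            {"layout": {"textAnchor": {"textSegments": [{"startIndex": "4", "endIndex": "6"}]}}},
            {"layout": {"textAnchor": {"textSegments": [{"startIndex": "6", "endIndex": "12"}]}}},
            {"layout": {"textAnchor": {"textSegments": [{"startIndex": "12", "endIndex": "15"}]}}},
        ]
    }],
}

zero_hits = find_tokens(sample, "0")
print(zero_hits)
```

If a zero that is visible on the page does not appear among the tokens, the labeling tool likely has no underlying text to attach to the selection, which would be consistent with an empty value on the training screen.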
Label Format: Verify that the labeling tool uses the expected format (bounding boxes or text annotations) and positions them correctly around the zeros - I tried both, and neither will select the numbers.
Clear Labeling: Ensure consistent and accurate labeling of zeros throughout your training data. - Checked. This is a number field in a table, so it may contain zeros.
Data Quality: Use high-quality training documents with clear and well-formatted zeros - If the OCR processor can extract these numbers, document quality does not look like the issue.
Balanced Training: Balance your training data set to avoid biasing the model towards more frequent characters - I need these numbers, which is why I am training the model to extract them.
Iterative Training: Train, test, and refine your model iteratively, adjusting labeling, preprocessing, or the data set based on your findings - I cannot train if I cannot select the values.
Preprocessing Options: If available, explore Document AI’s preprocessing settings. Look for options that might improve small character recognition, adjust contrast specifically for numbers, or filter out noise that might obscure zeros - Where can I activate these settings on the training screen? This is the Invoice Parser, not the OCR processor.
Thanks for the reply. Sorry, but I have already spent weeks training the “Invoice Parser” on new labels.
Are you saying that the “Expense Parser” is better at learning to recognize numbers than the “Invoice Parser”? Why is the training playground different for these parsers, if both are used by humans, not robots? Also, does this mean that even though I need to parse bills, and the Invoice Parser is the logical choice, I still need to use the “Expense Parser” because it is somehow better? I just need to know which one is best to select in the future.
No worries, oleks_vasyliev, I am here to help you with this. I understand that switching parsers may take a lot of effort and time. Here is some documentation on the Invoice Parser that might help you understand why zeros are not recognized by the processor, as well as the limits of Document AI for the processor you are using.
If none of these suggestions resolve the issue, consider reaching out to Google Cloud support for further assistance. Thank you.
Last question, @McMaco. If we switch to the “Custom Extractor”, can it handle such cases (with these numbers) better than the Invoice/Expense parsers (though it will cost us 3 times more for such functionality)? Thanks