Hello,
I am building a Custom Extractor in Document AI to extract data from a form that has changed over time and, therefore, has a number of different, but similar, layouts all containing the same information.
I’ve met the “full requirements” (i.e. 50 test and training instances of each label) and, for the most part, I am happy with my percentages.
However, I am stuck with a section that may or may not be tabular.
In some cases the form has a table that can contain zero or more records/rows:
In the case of tabular data, DocAI evaluates and auto-labels everything with a high degree of accuracy:
In other cases, though, the form has separate fields to capture the details for only a single record:
However, DocAI seems incapable of evaluating and auto-labeling this as a single record and instead evaluates it as multiple incomplete records:
In the above example, note that it groups section “7. Seams” as its own parent label, separate from section “6. Shell”, when in fact sections 6 & 7 together represent a single record.
Because it’s possible, in the tabular case, for a form to have multiple records, I am using a Parent Label to capture the details. I was hoping that DocAI would be smart enough to figure out, given enough examples, that in some cases there is only a single record that is not in a tabular format.
I have uploaded and labeled a large number of documents (500+) in both formats, yet DocAI is still breaking the non-tabular format into multiple records rather than treating it as a single record.
So the question is: are Parent Labels capable of this? That is, if I keep adding examples, will DocAI eventually figure it out, or should this actually be two different processors: one for tabular, multi-record form layouts and another for non-tabular, single-record form layouts?
I would really prefer to have a single processor, if possible.
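For what it's worth, one workaround I can imagine while keeping a single processor is to merge the fragments back together in post-processing. Below is a minimal sketch of that idea, not something DocAI provides: it assumes each parent entity has already been flattened into a plain dict of child label to text, and the field names (`shell_material`, `seam_type`, etc.) and the `merge_fragments` helper are hypothetical, not my real schema. Consecutive fragments are merged only when they share no child labels, which is what the "6. Shell" / "7. Seams" split looks like.

```python
def merge_fragments(records):
    """Greedily merge consecutive partial records whose child labels do not overlap.

    `records` is a list of dicts (child label -> extracted text), in document order.
    This is an assumed, simplified stand-in for the parent entities a processor returns.
    """
    merged = []
    for rec in records:
        # No shared keys with the previous record: treat it as a continuation
        # of the same logical record rather than a new one.
        if merged and not (merged[-1].keys() & rec.keys()):
            merged[-1] = {**merged[-1], **rec}
        else:
            merged.append(dict(rec))
    return merged


# Non-tabular layout: the single record arrives split into two fragments.
non_tabular = [
    {"shell_material": "SA-516", "shell_thickness": "0.5"},
    {"seam_type": "welded"},
]

# Tabular layout: each row is complete, so the labels repeat and nothing merges.
tabular = [
    {"shell_material": "SA-516", "seam_type": "welded"},
    {"shell_material": "SA-240", "seam_type": "seamless"},
]

single = merge_fragments(non_tabular)
rows = merge_fragments(tabular)
```

The obvious weakness is that adjacent table rows with missing cells and no labels in common would also be merged, so this only makes sense as a stopgap. I would much rather have the model get it right.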
This question is related to another post (Document AI don’t recognize parent label area correctly, and does it only on per line basis), for which there was no answer.
Thank you!



