Precision over proximity: Why Semantic search fails for hierarchical data

Introduction

Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding Large Language Models in enterprise data. The primary benefit of RAG is its ability to significantly reduce LLM hallucinations and provide domain-specific accuracy by retrieving relevant information from a knowledge base before generating a response, all without the need for costly model fine-tuning. However, traditional RAG systems depend on semantic similarity, matching text chunks based on the proximity of their embeddings rather than their broader relevance to the input query. This is particularly problematic in complex, hierarchical documents, where the standard approach of chunking text by character count can render the document’s ontology useless and lead to contextually fragmented snippets that don’t relate to the intent of the question.

Large Language Models demonstrate a sophisticated capacity for interpreting relevance through native document hierarchies. By implementing ontological chunking—an ingestion method that decomposes a document according to its inherent logical framework—a system can traverse the data down to its specific leaf nodes while maintaining the integrity of the document’s internal lineage. This architectural choice optimizes retrieval by allowing the LLM to target information based on systemic relevance rather than simple semantic proximity, substantially boosting overall precision.


Implementation bakeoff:

To demonstrate this approach, we chose one of the most complex and rigidly structured documents we could think of: the US Tax Code. Specifically, we used IRS Publication 15 for our proof-of-concept.
The goal of the bakeoff was to chunk the same document by its ontology, retrieve from it in two ways (once with raw embeddings, once with ontological retrieval), and measure which approach was better on latency, token usage, and accuracy.

To do this we set up AlloyDB, Google’s fully managed, PostgreSQL-compatible database. We used it because it supports the ltree extension along with pgvector, which gave us one storage location for both methods and allowed a simple comparison of ontological versus semantic retrieval.

To preserve this hierarchy, we created a deterministic address space for the information. Similar to how a file path (/folder/subfolder/file.txt) uniquely identifies a file on a hard drive, a deterministic address space in a document provides a unique, predictable location for every piece of information. In this context, it means every paragraph has a fixed “address” that reflects its position in the document’s ontology. This allows us to target retrieval at specific levels of the hierarchy, ensuring we can pull either precise facts or entire logical containers of context.

In AlloyDB we used the PostgreSQL ltree extension. This creates a deterministic “Address Space” for the tax law:

Pub15 (Root)
 └── Chapter_1 (Logical Container)
      ├── New_hire_reporting (Context Anchor)
      │    ├── p1 (Leaf Node / Vector Target)
      │    └── p2 (Leaf Node / Vector Target)
      └── Penalties (Table/Figure Container)
           ├── Rate_Table (Structured Data)
           └── Summary (Paragraph)

  • Semantic search targets these tiny, precise leaf nodes using vectors.
  • Vectorless search targets the parent ltree container (Pub15.Chapter_1.New_hire_reporting), retrieving all child paragraphs in their exact original sequence using a native GiST index.
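
As a rough illustration of how such an address space could be produced at ingestion time, here is a minimal Python sketch. The nested `outline` structure and the helper names `to_label` and `assign_paths` are assumptions made for this example, not the code used in the bakeoff.

```python
import re

def to_label(title: str) -> str:
    """Turn a heading into an ltree-safe label (letters, digits, underscores)."""
    label = re.sub(r"[^A-Za-z0-9]+", "_", title).strip("_")
    return label or "untitled"

def assign_paths(node: dict, parent: str = "") -> list:
    """Walk a nested outline depth-first; return (ltree_path, text) per leaf paragraph."""
    label = to_label(node["title"])
    path = f"{parent}.{label}" if parent else label
    rows = []
    # Paragraphs get positional labels p1, p2, ... matching the diagram above.
    for i, text in enumerate(node.get("paragraphs", []), start=1):
        rows.append((f"{path}.p{i}", text))
    for child in node.get("children", []):
        rows.extend(assign_paths(child, path))
    return rows

# Hypothetical outline mirroring part of the tree shown above.
outline = {
    "title": "Pub15",
    "children": [{
        "title": "Chapter 1",
        "children": [{
            "title": "New hire reporting",
            "paragraphs": ["First paragraph text", "Second paragraph text"],
        }],
    }],
}

rows = assign_paths(outline)
for path, text in rows:
    print(path, "->", text)
```

Because the same outline always yields the same paths, re-ingesting a document does not move its content to a new address. One caveat: labels compare as text, so if ordering relies on the path alone, a separate sequence column or zero-padded labels is one way to keep p10 after p2.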

Methods:

The ontological retrieval

During ingestion, we generated a Minimap: a structured Table of Contents derived from the ltree paths and their human-readable titles. We then treat the LLM as a Navigator rather than just a generator. When a user submits a query, we send the query to the LLM Navigator along with this Minimap. The LLM performs a semantic match between the query and the descriptions in the Minimap and emits the relevant path.

Example navigator prompt:

You are an expert tax attorney. Below is a Table of Contents:
Path: Pub15.Ch1.Hiring | Description: New Hire Reporting
Path: Pub15.Ch1.Rates  | Description: Tax Rates and Wage Bases

User Query: "What is the SS rate for 2026?"
Respond with ONLY the Path Locations.
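
A prompt like the one above could be assembled from the Minimap, and the Navigator's reply checked against it, roughly as follows. This is a sketch: `build_navigator_prompt` and `parse_paths` are hypothetical helper names, and the actual LLM call is left out.

```python
def build_navigator_prompt(minimap, query):
    """minimap: list of (path, description) pairs taken from the ingestion step."""
    lines = ["You are an expert tax attorney. Below is a Table of Contents:"]
    lines += [f"Path: {path} | Description: {desc}" for path, desc in minimap]
    lines += ["", f'User Query: "{query}"', "Respond with ONLY the Path Locations."]
    return "\n".join(lines)

def parse_paths(response, minimap):
    """Keep only the emitted tokens that are real paths in the Minimap."""
    known = {path for path, _ in minimap}
    tokens = response.replace(",", " ").split()
    return [token for token in tokens if token in known]

minimap = [
    ("Pub15.Ch1.Hiring", "New Hire Reporting"),
    ("Pub15.Ch1.Rates", "Tax Rates and Wage Bases"),
]
prompt = build_navigator_prompt(minimap, "What is the SS rate for 2026?")
print(prompt)

# A reply is only trusted if it names a path we actually stored.
print(parse_paths("Pub15.Ch1.Rates\n", minimap))
```

Filtering the reply against the known paths means a hallucinated or malformed path is dropped instead of being passed on to the database.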

The application then uses the emitted path (e.g., Pub15.Ch1.Rates) to construct a SQL query with the ltree descendant operator (<@) to fetch all text payloads in that container in their original sequence.
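
The query construction might look something like the sketch below. The table name `doc_nodes` and the columns `path`, `body`, and `seq` are assumptions for illustration, and the `%s` placeholder follows the style used by common PostgreSQL drivers for Python.

```python
import re

# Letters, digits, and underscores per label, with labels separated by dots.
LTREE_PATH = re.compile(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*")

def container_query(path):
    """Build a parameterized query for every node at or below `path`."""
    if not LTREE_PATH.fullmatch(path):
        raise ValueError(f"not a valid ltree path: {path!r}")
    sql = (
        "SELECT path, body "
        "FROM doc_nodes "            # hypothetical table name
        "WHERE path <@ %s::ltree "   # <@ matches the container and its descendants
        "ORDER BY seq"               # hypothetical column holding document order
    )
    return sql, (path,)

sql, params = container_query("Pub15.Ch1.Rates")
print(sql)
print(params)
```

The path is passed as a bound parameter rather than spliced into the SQL string, and the shape check rejects Navigator output that is not a plausible ltree path before it reaches the database.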

We also tested a multi-step approach whereby the minimap is broken into levels to guide the LLM down the tree. For the query:

  1. We send only the high-level nodes to the LLM and ask, “Which general area is this query about?” This prompt is much smaller than the full Minimap, and the answer tells us which branch of the tree to descend into.
  2. Once the LLM picks a topic (e.g., Pub15.Chapter_1.Depositing_Taxes), we go back to the database and pull only the sub-sections that belong to that specific topic. We show that tiny list to the LLM and ask, “Now, which specific subsection is it?” and retrieve the resulting address.
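
The two steps above generalize to a loop that descends one level at a time. The sketch below assumes a `choose(query, options)` callable that stands in for the LLM call; the sub-section paths under Depositing_Taxes are made up for the example.

```python
def children_of(paths, parent):
    """Return the distinct paths exactly one level below `parent`."""
    depth = parent.count(".") + 2   # number of labels in a direct child
    prefix = parent + "."
    return sorted({
        ".".join(p.split(".")[:depth]) for p in paths if p.startswith(prefix)
    })

def navigate(paths, root, query, choose, max_steps=5):
    """Descend from `root`, asking `choose` to pick one child per step."""
    current = root
    for _ in range(max_steps):
        options = children_of(paths, current)
        if not options:
            break                   # nothing below this node
        picked = choose(query, options)
        if picked not in options:
            break                   # off-list answer: stay at the last valid node
        current = picked
    return current

# Hypothetical paths; only Pub15.Chapter_1.Depositing_Taxes appears in the text.
paths = [
    "Pub15.Chapter_1.Depositing_Taxes.When_to_deposit",
    "Pub15.Chapter_1.Depositing_Taxes.How_to_deposit",
    "Pub15.Chapter_1.New_hire_reporting",
    "Pub15.Chapter_2.Wages",
]

def fake_choose(query, options):
    """Stand-in for the LLM Navigator: always picks the first option."""
    return options[0]

result = navigate(paths, "Pub15", "How do I deposit taxes?", fake_choose)
print(result)
```

Each call to `choose` sees only the siblings at one level, which is what keeps the prompt small. The cost is one LLM round trip per level.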

Semantic retrieval

For the Semantic search we followed the classic RAG playbook with one change: instead of chunking by fixed character counts (like every 500 characters), we generated embeddings for the smallest logical units in the document, the “leaf nodes” (individual paragraphs and bullet points). These embeddings were stored in the same AlloyDB instance using the pgvector extension.

When a user asked a question:

  1. The system generated an embedding for the user’s query.
  2. It ran a standard vector similarity search (using cosine distance or inner product) to find the top K most semantically similar leaf nodes in the database.
  3. It sent those disjointed chunks to the LLM to synthesize an answer.
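
Step 2 can be illustrated in plain Python with toy vectors. In the database, the same ranking would be expressed with pgvector's distance operators (for example `<=>` for cosine distance). The embeddings and leaf names below are invented for the example, and non-zero vectors are assumed.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def top_k(query_vec, leaves, k=3):
    """leaves: list of (path, embedding). Return the k paths nearest to the query."""
    ranked = sorted(leaves, key=lambda leaf: cosine_distance(query_vec, leaf[1]))
    return [path for path, _ in ranked[:k]]

# Toy two-dimensional embeddings; real ones have hundreds of dimensions.
leaves = [
    ("leaf_a", [1.0, 0.1]),
    ("leaf_b", [0.0, 1.0]),
    ("leaf_c", [0.7, 0.7]),
]
nearest = top_k([1.0, 0.0], leaves, k=2)
print(nearest)
```

Note that the ranking looks only at vector geometry. Nothing in it knows which chapter or section a leaf came from, which is the source of the disjointed chunks mentioned in step 3.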

Comparative results

To evaluate these approaches, we implemented both engines and conducted a comparative test across a series of queries. We used Gemini 3.1 Pro to navigate the hierarchy and synthesize responses.

Side-by-side empirical results

  • Ctx: Context Tokens (retrieved content size).
  • Tot: Total Tokens consumed (Nav + Gen + Thoughts).
  • Lat: Total Query Latency (seconds).

| Query | Vectorless (Ctx / Tot / Lat) | Semantic (Ctx / Tot / Lat) | Multi-Step (Ctx / Tot / Lat) | Winner (Accuracy) |
|---|---|---|---|---|
| Social Security Tax Rate | 195 / 8,585 / 9.55s | 195 / 644 / 4.92s | 195 / 4,232 / 12.69s | TIE |
| Max Wage Base 2026 | 195 / 8,692 / 10.08s | 195 / 755 / 6.58s | 195 / 4,593 / 15.56s | TIE |
| Backup Withholding Rate | 403 / 9,017 / 10.34s | 173 / 796 / 6.19s | 403 / 4,336 / 12.18s | Semantic |
| Small Business Tax Credit | 473 / 9,024 / 9.80s | 473 / 1,356 / 7.44s | 473 / 5,024 / 16.85s | TIE |
| Supplemental Wage Rate | 86 / 9,054 / 13.33s | 240 / 1,290 / 9.44s | 86 / 4,329 / 14.95s | Vectorless/MS |
| Hiring a New Employee | 162 / 9,426 / 15.14s | 306 / 1,407 / 10.19s | 162 / 4,899 / 18.68s | Vectorless/MS |
| Deposit Methods | 407 / 9,909 / 17.27s | 491 / 1,708 / 10.76s | 407 / 5,522 / 22.25s | Vectorless/MS |
| Recordkeeping Req. | 869 / 10,235 / 15.87s | 903 / 2,436 / 12.65s | 869 / 6,397 / 20.91s | Vectorless/MS |
| Change of Address/Name | 15,696 / 26,041 / 22.16s | 119 / 881 / 6.80s | 68 / 4,714 / 20.26s | Multi-Step |
| Nonpayroll Withholding | 427 / 9,691 / 15.55s | 108 / 1,162 / 10.03s | 427 / 5,163 / 17.20s | Vectorless/MS |
| Electronic W-2 Filing | 321 / 9,759 / 18.83s | 970 / 1,902 / 8.90s | 279 / 5,686 / 24.84s | Multi-Step |

The experiment demonstrates a clear trade-off between high-fidelity retrieval and execution efficiency across different query classes. Hierarchy-Aware (Vectorless) search is superior for broad, procedural queries because it guarantees context recall and avoids topic drift, but it initially suffered from a “Navigator Tax” of ~8,000 prompt tokens per query. By implementing a Multi-Step (Iterative) Navigation approach, we mitigated this cost, reducing navigation prompt tokens by over 50% and improving path precision. However, this optimization introduces a latency trade-off because of the sequential LLM calls: in the table above, multi-step was slower than single-step navigation on 10 of the 11 queries, by roughly 2 to 7 seconds. Ultimately, while multi-step structural search provides a tunable path for cost reduction in complex synthesis, traditional Semantic search remains the fastest and most cost-effective option for simple point-fact lookups, operating at a fraction of the total token footprint.
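
To make the trade-off concrete, the per-query figures can be aggregated with a few lines of Python. The numbers below are copied from the Tot and Lat columns of the results table in row order. The aggregation itself is only an illustration and was not part of the original evaluation.

```python
from statistics import mean

# (total_tokens, latency_seconds) per query, in the row order of the results table.
results = {
    "Vectorless": [(8585, 9.55), (8692, 10.08), (9017, 10.34), (9024, 9.80),
                   (9054, 13.33), (9426, 15.14), (9909, 17.27), (10235, 15.87),
                   (26041, 22.16), (9691, 15.55), (9759, 18.83)],
    "Semantic": [(644, 4.92), (755, 6.58), (796, 6.19), (1356, 7.44),
                 (1290, 9.44), (1407, 10.19), (1708, 10.76), (2436, 12.65),
                 (881, 6.80), (1162, 10.03), (1902, 8.90)],
    "Multi-Step": [(4232, 12.69), (4593, 15.56), (4336, 12.18), (5024, 16.85),
                   (4329, 14.95), (4899, 18.68), (5522, 22.25), (6397, 20.91),
                   (4714, 20.26), (5163, 17.20), (5686, 24.84)],
}

def summarize(rows):
    """Return (mean total tokens, mean latency in seconds) for one method."""
    return mean([t for t, _ in rows]), mean([lat for _, lat in rows])

summary = {name: summarize(rows) for name, rows in results.items()}

# Share of total tokens saved by multi-step navigation relative to single-step.
vectorless_tokens = sum(t for t, _ in results["Vectorless"])
multistep_tokens = sum(t for t, _ in results["Multi-Step"])
token_reduction = 1 - multistep_tokens / vectorless_tokens

for name, (tokens, latency) in summary.items():
    print(f"{name:<11} mean tokens {tokens:>8.0f}   mean latency {latency:5.2f}s")
print(f"Multi-Step vs Vectorless total-token reduction: {token_reduction:.0%}")
```

Across these 11 queries, multi-step uses roughly 54% fewer total tokens than single-step navigation, while Semantic search stays well below both on tokens and mean latency.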

Scaling to thousands of documents

Moving from a single document to a massive corpus containing thousands of documents shifts the challenges and behaviors of both retrieval approaches significantly. Here is how we could scale both systems and the trade-offs involved.

1. Vectorless search on checklists: Tackling the Minimap size

  • The scaling challenge: The LLM Navigator cannot hold a “Minimap” of thousands of documents in its context window to find the correct container.
  • The solutions: To scale this approach and keep the prompt size manageable, we can use two strategies:
    • A. Multi-Step (Iterative) Navigation: Instead of showing the full hierarchy at once, we guide the LLM down the tree in steps (e.g., picking a high-level topic first, and then a specific sub-section). In our tests, this reduced prompt tokens by over 50%. At scale, this can be used to navigate document folders or chapters before specific sections.
    • B. Vector-Assisted Navigation (Hybrid): We generate embeddings only for the Minimap entries (the paths and titles) rather than the full text. We use these embeddings to prefilter the Minimap and present only the Top N candidates to the LLM Navigator. The LLM makes the final, precise routing decision, preserving the guarantee of 100% context recall for the chosen container.
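
Strategy B could be prototyped along these lines. This is a sketch with toy three-dimensional embeddings: the `prefilter_minimap` helper and the `Pub15.Ch1.Deposits` entry are hypothetical, and in practice the vectors would come from an embedding model applied to each path and title.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prefilter_minimap(query_vec, entries, n=2):
    """entries: list of (path, title, embedding). Return the top-n (path, title) pairs."""
    best = heapq.nlargest(
        n, entries, key=lambda entry: cosine_similarity(query_vec, entry[2])
    )
    return [(path, title) for path, title, _ in best]

entries = [
    ("Pub15.Ch1.Hiring", "New Hire Reporting", [0.9, 0.1, 0.0]),
    ("Pub15.Ch1.Rates", "Tax Rates and Wage Bases", [0.1, 0.9, 0.2]),
    ("Pub15.Ch1.Deposits", "Depositing Taxes", [0.0, 0.3, 0.9]),  # hypothetical entry
]
candidates = prefilter_minimap([0.2, 1.0, 0.1], entries, n=2)
print(candidates)
```

Only the short candidate list goes into the Navigator prompt. The embeddings narrow the field, but the routing decision and the container fetch still happen the same way as before.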

2. Semantic search on point-facts

  • The scaling challenge: In a massive corpus, the risk of false positives increases dramatically. A query for a specific rate or fact might retrieve a highly similar-sounding paragraph from a completely irrelevant document or a different year’s filing, leading to precise but incorrect answers.
  • The solution: To maintain accuracy and efficiency at scale, standard vector search must be augmented with a complex pipeline of optimization techniques:
    • Hybrid search: Combining dense vector search (for semantic meaning) with sparse keyword search (like BM25) to ensure precise facts are not missed.
    • Reranking: Using a fast initial retrieval for a larger candidate set, followed by a precise cross-encoder model to rerank the results.
    • Query expansion: Generating synonyms or using techniques like HyDE to align the user’s query with stored content.
    • Metadata pre-filtering: Restricting the search space by document type, date, or chapter before running the vector search.
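
As one example of what the hybrid search step involves, the dense and sparse result lists have to be merged into a single ranking. The text above does not name a fusion method; Reciprocal Rank Fusion (RRF) is a common choice and is used here purely as an assumption. The document IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["a", "b", "c"]    # e.g. from vector similarity search
sparse_hits = ["c", "a", "d"]   # e.g. from BM25 keyword search
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

RRF works on ranks rather than raw scores, so the dense and sparse retrievers do not need comparable score scales. Items that appear near the top of both lists rise above items that only one retriever found.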

Which approach should you use

In today’s stack, standard vector search, even with the complex pipeline of optimizations required to work at scale, is essentially a solved infrastructure problem. The modern AI stack has paved this road completely.

  • Managed infrastructure: Fully managed services like Vertex AI Vector Search handle massive-scale vector storage and low-latency retrieval natively. For a fully automated pipeline, Vertex AI Search provides hybrid search out-of-the-box—automatically combining dense vector retrieval with sparse keyword matching without requiring you to manually tune the underlying algorithms.
  • Plug-and-play reranking: Implementing a cross-encoder for reranking sounds computationally daunting, but in practice, it is a single API call to the Vertex AI Ranking API, which re-scores your initial candidate set using Google’s pre-trained semantic models.
  • Streamlined data architecture: The data preparation phase is highly forgiving. You can extract and chunk text from complex files using Document AI, generate representations via the Vertex AI Text Embeddings API, and sink them directly into your vector store. You do not need to logically map the relationship between Chunk A and Chunk B for the system to function correctly.

Scaling a hierarchical, multi-step navigation system across thousands of documents introduces massive engineering bottlenecks. Systems that rely on tree-based retrieval (like RAPTOR) are powerful, but they require building custom infrastructure.

  • The parsing trap: To build a functional “Minimap,” you first have to extract a perfectly clean hierarchy (Chapters → Sections → Subsections) from your corpus. If your data includes visually complex PDFs, legacy Word docs, or raw text dumps, building a universal parser that reliably extracts this structure is a monumental data engineering task.
  • Massive indexing costs: Tree-based systems often require recursive LLM processing. You have to cluster documents, use an LLM to summarize those clusters, and then cluster the summaries to build the tree structure. This burns massive amounts of compute and tokens before a user ever asks a single question.
  • Multi-step latency: Guiding an LLM down a tree iteratively (routing to a folder, then a document, then a section) means the user is waiting for multiple sequential LLM generations to finish before the system even begins to draft the final answer.

Conclusion

While traditional semantic RAG is efficient and easy to deploy for simple point-fact lookups, it struggles with complex, hierarchical documents where preserving context and ontology is critical. Vectorless (hierarchical) search offers superior precision and context recall for procedural or broad queries but requires more investment in infrastructure and parsing. For the enterprise developer, the choice isn’t binary; the most robust solutions often leverage a hybrid approach—combining vector search for speed and semantic matching with structural retrieval (like ltree) to navigate document hierarchies, ensuring both accuracy and compliance across complex datasets.

Takeaways for the Enterprise Developer

  • Preserve document ontology: If your source material has a strict hierarchy, preserve it. ltree is an effective tool for maintaining human reading patterns in database storage.
  • Evaluate empirically: RAG systems are non-deterministic. Build side-by-side evaluation suites to measure Token Efficiency and Context Recall across different query classes.
  • Hybrid search is often necessary: Relying solely on vectors for regulatory data introduces compliance risks. Combining the precision of vectors with the safety of structural search provides a more robust solution.