I’m trying to set up Vertex AI Search (not the chatbot) for our documentation site, as Sitesearch/PSE is being shut down. A few questions:
How do I set certain words NOT to be stemmed? For example, yugabyted is automatically converted to YugabyteDB. I don’t want this to happen. (Putting the word in quotes, as “yugabyted”, doesn’t work.)
For Boost/Bury, how do I create a filter where the URL path matches a pattern?
How do I deeplink to specific headers within a page? How do I identify whether a hit is on a header or not?
Here are some possible approaches for these Vertex AI Search configurations:
Preventing Stemming for Specific Words:
Vertex AI Search does not directly support custom stemming rules, but you might consider the following approaches:
Create a list of words to exclude from stemming.
Preprocess your text data: before uploading your documentation to Vertex AI Search, use a custom tokenizer or text-processing library to handle these words.
Boost/Bury Filters Based on URL Patterns
To create a filter where the URL path matches a pattern, you can use the filter parameter in Vertex AI Search.
You can then use boostSpec or servingControls (Preview) to boost or bury results based on your URL path filter.
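As a rough illustration of the boostSpec option, here is a minimal sketch of a search request body. The conditionBoostSpecs, condition and boost fields are part of the search request; the helper name and the condition string are placeholders, because which fields can be used for URL matching depends on how the data store was set up.

```python
import json


def build_bury_request(query: str, condition: str, boost: float = -1.0) -> dict:
    """Build a search request body that boosts or buries matching results."""
    # The API documents boost values in the range [-1.0, 1.0].
    if not -1.0 <= boost <= 1.0:
        raise ValueError("boost must be between -1.0 and 1.0")
    return {
        "query": query,
        "boostSpec": {
            "conditionBoostSpecs": [
                {"condition": condition, "boost": boost},
            ]
        },
    }


# PLACEHOLDER: replace with a filter expression that is valid for your data store.
body = build_bury_request("yugabyted", "YOUR_URL_FILTER_EXPRESSION")
print(json.dumps(body, indent=2))
```

Note that a boost of -1 is a demotion, not an exclusion, so a buried page that is highly relevant to the query may still rank near the top. To remove pages entirely, the filter parameter is probably the better fit.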
Deeplinking to Specific Headers:
Vertex AI Search itself doesn’t provide direct support for deep linking to headers or for identifying them within search results. The following implementation steps might help:
To deeplink to specific headers within a page, ensure your documents are structured with identifiable headers. You can use HTML tags or specific markers in your documents.
When you preprocess your documentation data, extract the header information (e.g., H1, H2, H3 tags) and store it alongside the content.
Store this header information in separate fields (e.g., h1_text, h2_text) within your documents when indexing them in Vertex AI Search.
To identify whether it is a hit on a header, you can analyze the search results and check if the hit corresponds to a header tag or marker within your document structure.
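The header-extraction step above could look something like this. It is only a sketch, assuming you can run your own preprocessing over the HTML (for example, when importing documents yourself rather than relying on the crawler). It uses Python's standard html.parser; the class, function and field names are made up for illustration.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect level, id and text for every h1-h6 element in a page."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # the heading currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = {
                "level": int(tag[1]),
                "id": dict(attrs).get("id"),
                "text": "",
            }

    def handle_data(self, data):
        # Accumulate text, including text inside nested tags such as <em>.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag in self.HEADING_TAGS:
            # Collapse runs of whitespace before storing the heading.
            self._current["text"] = " ".join(self._current["text"].split())
            self.headings.append(self._current)
            self._current = None


def extract_headings(html: str) -> list:
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = '<h1 id="top">Guide</h1><p>Intro</p><h2 id="section1">Some <em>Text</em></h2>'
headings = extract_headings(sample)
```

The resulting records could then be stored in separate fields next to the page content, as suggested above.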
For more information about Vertex AI Search you can read through this documentation.
To prevent stemming in Vertex AI Search, try custom synonym rules, as there’s no direct way to stop it. For Boost/Bury, use filter expressions like “url_path LIKE ‘/docs/%’” to match URL patterns. To deeplink to specific headers, ensure headers have unique IDs in your HTML (e.g., a heading tag with an id attribute around the text Header), and Vertex AI Search will index them as searchable entities, enabling links to specific sections in search results.
Thanks @MJane. Preventing Stemming/Spell Suggestion for Specific Words:
Preprocess your text data: how do I do this when using the crawler?
I’m thinking of adding
"spellCorrectionSpec":{"mode":"AUTO"}
for most queries, keeping a list of query exceptions, and setting the following just for those cases:
"spellCorrectionSpec":{"mode":"SUGGESTION_ONLY"}
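The per-query switch described here might be sketched as follows. The exception list and function names are hypothetical, and I am assuming the suggestion-only enum value is spelled SUGGESTION_ONLY in the API, so check that against the reference before relying on it.

```python
# Hypothetical exception list: queries containing these terms are not auto-corrected.
NO_AUTOCORRECT_TERMS = frozenset({"yugabyted"})


def spell_correction_mode(query: str, exceptions=NO_AUTOCORRECT_TERMS) -> str:
    """Pick the spell correction mode for a single query."""
    # Lowercase each token and strip surrounding quotes so "yugabyted" also matches.
    tokens = {token.strip("\"'“”").lower() for token in query.split()}
    return "SUGGESTION_ONLY" if tokens & exceptions else "AUTO"


def build_search_body(query: str) -> dict:
    """Build a search request body with the chosen spell correction mode."""
    return {
        "query": query,
        "spellCorrectionSpec": {"mode": spell_correction_mode(query)},
    }
```

For example, build_search_body("start yugabyted") would select SUGGESTION_ONLY, while any query without a listed term would keep AUTO.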
Boost/Bury Filters Based on URL Patterns:
I have the filter set as
and have set the boost/bury score to -1, but that does not seem to work correctly. For some queries, the first result is still from the same path. I don’t understand why.
Deeplinking to Specific Headers:
Again, how can I preprocess data when using the crawler? The data is just HTML (crawled by the Vertex crawler), and the headers are correctly defined with proper IDs. Still, I’m unable to tell from the JSON search response whether a hit is on a header. What additional parameters do I have to pass in the request to get this info?
@shaikhsharmeen4, the headers have unique IDs and are correctly marked up and indexed, for example:
<h2 id="section1">SomeText</h2>
But for a search on “sometext”, I’m unable to identify whether the hit was on the header. I want to change the URL in the result listing to url_path#section1, so that the page scrolls to the header/anchor when the result is clicked. How do I do this?
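One possible workaround, assuming the search response does not say which part of the page matched, is to recompute the header match on the client side. The sketch below takes (id, text) pairs for a page's headings, which would have to come from your own parsing of the page, and uses a simple case-insensitive substring match. That is not how the search engine ranks results, and the function names are made up for illustration.

```python
from typing import Iterable, Optional, Tuple
from urllib.parse import urldefrag


def find_heading_anchor(query: str, headings: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Return the id of the first heading whose text contains every query term."""
    terms = query.lower().split()
    if not terms:
        return None
    for heading_id, text in headings:
        lowered = text.lower()
        if heading_id and all(term in lowered for term in terms):
            return heading_id
    return None


def deep_link(url: str, query: str, headings: Iterable[Tuple[str, str]]) -> str:
    """Append #id to the result URL when the query matches one of the page's headings."""
    base, _ = urldefrag(url)  # drop any fragment already present on the URL
    anchor = find_heading_anchor(query, headings)
    return f"{base}#{anchor}" if anchor else base


page_headings = [("section1", "SomeText"), ("section2", "Other Topic")]
link = deep_link("https://example.com/docs/page/", "sometext", page_headings)
```

With the sample data, link points at the section1 anchor. A query that matches no heading leaves the URL unchanged.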