Custom stopwords in Postgres in Cloud SQL

I am trying to use a custom stopwords file in our Postgres DB hosted in Cloud SQL. We need access to the $SHAREDIR/tsearch_data/ directory, but this is not accessible with the hosted Cloud solution. Is there a workaround for this? Surely there are others who want a custom stopwords file in their database.

Accessing the $SHAREDIR/tsearch_data/ directory in Cloud SQL for PostgreSQL is not possible due to the managed nature of the service, which restricts file system access. However, a table-based approach, combined with custom text search configurations, offers a robust and flexible solution for implementing custom stopwords within the Cloud SQL environment. Here are some steps you can take:

  1. Create a Stopword Table

    Create a table to store your custom stopwords:

    CREATE TABLE custom_stopwords (
        word TEXT PRIMARY KEY
    );
    
    
  2. Populate the Stopword Table

    Add your custom stopwords:

    INSERT INTO custom_stopwords (word) VALUES
        ('example1'),
        ('example2'),
        ('example3');
    
    
  3. Create a Custom Text Search Dictionary

    Create a custom dictionary based on the pg_catalog.simple template:

    CREATE TEXT SEARCH DICTIONARY custom_stopword_dict (
        TEMPLATE = pg_catalog.simple,
        STOPWORDS = custom_stopwords
    );
    
    
  4. Create a Custom Text Search Configuration

    Modify a copy of the pg_catalog.english configuration to use the custom dictionary:

    CREATE TEXT SEARCH CONFIGURATION custom_config (COPY = pg_catalog.english);
    ALTER TEXT SEARCH CONFIGURATION custom_config
        ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
        WITH custom_stopword_dict, english_stem;
    
    
  5. Use the Custom Configuration in Queries

    Apply the custom configuration in full-text search queries:

    SELECT *
    FROM your_table
    WHERE to_tsvector('custom_config', your_column) @@ to_tsquery('custom_config', 'search term');
    

The table-based approach, combined with PostgreSQL’s custom text search configurations, is an efficient, secure, and Cloud SQL-compatible solution for managing custom stopwords. It provides flexibility and optimal performance while fully leveraging PostgreSQL’s powerful text search features, making it an ideal choice for managed database environments.

1 Like

Thanks for the response, but step 3 does not work. I see this error:

ERROR: could not open stop-word file "/share/tsearch_data/custom_stopwords.stop": No such file or directory SQL state: F0000

Sorry for the confusion. Yes, in Postgres, the STOPWORDS parameter in a TEXT SEARCH DICTIONARY does requires a file in the $SHAREDIR/tsearch_data/ directory which as mentioned is not possible in Cloud SQL

To implement custom stopwords in Cloud SQL, the best solution is to preprocess the text to remove stopwords dynamically using a custom_stopwords table before creating a tsvector. This ensures compatibility with Cloud SQL and leverages PostgreSQL’s full-text search capabilities.

1. Create a Stopword Table

Start by creating a table to store your custom stopwords:

CREATE TABLE custom_stopwords (
    word TEXT PRIMARY KEY
);

2. Populate the Stopword Table

Add your custom stopwords:

INSERT INTO custom_stopwords (word) VALUES
    ('example1'),
    ('example2'),
    ('example3');

3. Preprocess Text and Filter Stopwords

Filter out stopwords dynamically when generating the tsvector. Use an UPDATE statement to populate a search_vector column:

ALTER TABLE your_table ADD COLUMN search_vector tsvector;

UPDATE your_table
SET search_vector = to_tsvector('pg_catalog.english', (
    SELECT string_agg(word, ' ')
    FROM (
        SELECT word
        FROM unnest(string_to_array(your_column, ' ')) word  -- Tokenize the text
        WHERE word NOT IN (SELECT word FROM custom_stopwords) -- Filter out stopwords
    ) AS filtered_words
));

4. Create a GIN Index

Index the search_vector column for efficient full-text search:

CREATE INDEX idx_search_vector ON your_table USING gin(search_vector);

5. Perform Full-Text Searches

Use the search_vector column in your queries:

SELECT *
FROM your_table
WHERE search_vector @@ to_tsquery('pg_catalog.english', 'search term');

This approach combines the flexibility of table-based stopword management with the performance and functionality of PostgreSQL’s full-text search engine. It ensures efficient, maintainable, and scalable text search within the constraints of a managed database environment.