Building a Fast Full Text Search in TimescaleDB for Trading Instruments

May 27, 2025

When building our algorithmic trading platform, we faced an interesting challenge: how could our AI agent efficiently search and identify the correct financial instruments when researching trading opportunities? The solution needed to be both blazingly fast and highly accurate - two requirements that don't always go hand in hand.

The Problem: Finding Needles in a Haystack

The platform stores market data in TimescaleDB (a PostgreSQL extension optimized for time-series data), where we maintain tick-level data for thousands of instruments. While TimescaleDB excels at storing and retrieving time-series data efficiently, we needed a complementary search capability for our instruments metadata.

The AI agent needs to quickly find trading symbols with identifiers like NSE:RELIND:EQ, NSE:NIFTY50:OPT:26-Jun-2025:23500:CE, or NFO:IDFBAN:FUT:26-Jun-2025 based on partial information or even fuzzy search terms.
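
These identifiers pack several fields into one colon-separated string. As a rough illustration, here is how the agent side might decompose one; the field layout is inferred from the three examples above and is an assumption, since the real format may have more variants:

```python
from dataclasses import dataclass
from typing import Optional

# Field layout inferred from the example identifiers above
# (NSE:RELIND:EQ, NSE:NIFTY50:OPT:26-Jun-2025:23500:CE,
#  NFO:IDFBAN:FUT:26-Jun-2025) -- an assumption, not the official spec.
@dataclass
class Instrument:
    exchange: str
    symbol: str
    instrument_type: str               # EQ, OPT, FUT, ...
    expiry: Optional[str] = None       # derivatives only
    strike: Optional[float] = None     # options only
    option_type: Optional[str] = None  # CE / PE, options only

def parse_identifier(identifier: str) -> Instrument:
    """Split an exchange identifier into its positional fields."""
    parts = identifier.split(":")
    exchange, symbol, kind = parts[0], parts[1], parts[2]
    expiry = parts[3] if len(parts) > 3 else None
    strike = float(parts[4]) if len(parts) > 4 else None
    option_type = parts[5] if len(parts) > 5 else None
    return Instrument(exchange, symbol, kind, expiry, strike, option_type)
```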

Here's a simplified version of our instruments table schema:

CREATE TABLE instruments (
    token VARCHAR(50),
    symbol VARCHAR(50),
    short_name VARCHAR(100),
    series VARCHAR(50),
    company_name VARCHAR(255),
    expiry_date VARCHAR(50),
    strike_price NUMERIC,
    option_type VARCHAR(10),
    lot_size INTEGER,
    tick_size NUMERIC,
    instrument_type VARCHAR(50),
    exchange VARCHAR(50),
    source_file VARCHAR(50),
    identifier VARCHAR(100),
    isec_token VARCHAR(50),
    isec_token_level1 VARCHAR(50),
    isec_token_level2 VARCHAR(50),
    PRIMARY KEY (token, symbol, series, exchange)
);

While PostgreSQL offers several ways to query text data (LIKE, ILIKE, regular expressions), these methods become increasingly inefficient as the dataset grows. Full Text Search (FTS) provides several advantages:

  1. Performance: FTS is optimized for searching through large volumes of text
  2. Relevance ranking: Results can be ranked by relevance
  3. Linguistic features: Support for stemming, stop words, and language-specific processing
  4. Partial matching: Finding words even when the search term is incomplete

Implementing Full Text Search for Instruments

Step 1: Create a Search Vector Column

First, we need to add a column to store the tsvector (text search vector) data:

ALTER TABLE instruments ADD COLUMN search_vector tsvector;

Step 2: Populate the Search Vector

We want to make specific fields searchable. Each field might have different importance (weight) in the search:

UPDATE instruments
SET search_vector = 
    setweight(to_tsvector('english', coalesce(symbol, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(company_name, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(short_name, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(instrument_type, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(exchange, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(identifier, '')), 'A');

This assigns different weights to different fields:

  • Weight A (highest): symbol and identifier
  • Weight B (medium): company_name and short_name
  • Weight C (lower): instrument_type and exchange

Step 3: Create a Trigger to Update the Search Vector

To keep the search vector updated whenever instruments data changes:

CREATE OR REPLACE FUNCTION instruments_search_vector_update() RETURNS trigger AS $$
BEGIN
    NEW.search_vector := 
        setweight(to_tsvector('english', coalesce(NEW.symbol, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(NEW.company_name, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(NEW.short_name, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(NEW.instrument_type, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(NEW.exchange, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(NEW.identifier, '')), 'A');
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER instruments_search_vector_trigger
BEFORE INSERT OR UPDATE ON instruments
FOR EACH ROW EXECUTE FUNCTION instruments_search_vector_update();

Step 4: Create a GIN Index for Fast Searching

CREATE INDEX instruments_search_idx ON instruments USING GIN (search_vector);

GIN (Generalized Inverted Index) indexes each component value of a composite item separately - here, each lexeme in the tsvector - mapping it to the rows that contain it, making it ideal for full-text search.
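
To build intuition for why this is fast, a GIN index is conceptually an inverted index: a query only touches the postings lists for its terms instead of scanning every row. A toy Python sketch of the idea (an illustration, not TimescaleDB/PostgreSQL internals):

```python
from collections import defaultdict

# Toy inverted index over (row_id, text) pairs; a GIN index stores the
# same lexeme -> row-set mapping, which is why '@@' lookups stay fast.
class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # lexeme -> {row ids}

    def add(self, row_id, text):
        for lexeme in text.lower().split():
            self.postings[lexeme].add(row_id)

    def search_all(self, *terms):
        """AND-query: rows containing every term (like 'a & b')."""
        sets = [self.postings.get(t.lower(), set()) for t in terms]
        return set.intersection(*sets) if sets else set()

idx = InvertedIndex()
idx.add(1, "RELIND Reliance Industries NSE")
idx.add(2, "TCS Tata Consultancy Services NSE")
idx.add(3, "INFY Infosys NSE")
```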

Step 5: Implement the Search Query

Now we can create a function for the AI agent to search for instruments:

CREATE OR REPLACE FUNCTION search_instruments(search_term TEXT)
RETURNS TABLE (
    token VARCHAR,
    symbol VARCHAR,
    company_name VARCHAR,
    instrument_type VARCHAR,
    exchange VARCHAR,
    identifier VARCHAR,
    rank REAL  -- ts_rank() returns real; the output column type must match
) AS $$
BEGIN
    RETURN QUERY
    SELECT 
        i.token,
        i.symbol,
        i.company_name,
        i.instrument_type,
        i.exchange,
        i.identifier,
        ts_rank(i.search_vector, to_tsquery('english', search_term)) AS rank
    FROM 
        instruments i
    WHERE 
        i.search_vector @@ to_tsquery('english', search_term)
    ORDER BY 
        -- repeat the expression: a bare "rank" is ambiguous inside plpgsql,
        -- since it also names the OUT column
        ts_rank(i.search_vector, to_tsquery('english', search_term)) DESC
    LIMIT 10;
END
$$ LANGUAGE plpgsql;
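
One caveat: to_tsquery() expects operator syntax, so raw phrases like "Reliance Industries" raise a syntax error. PostgreSQL's plainto_tsquery() or websearch_to_tsquery() handle this server-side; alternatively, the caller can pre-join the words, roughly like this sketch (the function name is ours, for illustration):

```python
import re

def to_and_query(raw: str) -> str:
    """Join the words of a free-text phrase with '&' so it becomes valid
    to_tsquery() input (e.g. before calling search_instruments()).
    PostgreSQL's plainto_tsquery()/websearch_to_tsquery() do this and more
    server-side; this client-side sketch just extracts word characters
    (discarding tsquery operators) and joins them."""
    words = re.findall(r"\w+", raw.lower())
    return " & ".join(words)
```

For example, to_and_query("Reliance Industries") yields "reliance & industries", which is safe to pass to search_instruments().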

Optimizing the Search Experience

After implementing the basic search functionality, we made several optimizations:

1. Handle Symbols with a Custom Configuration

The default PostgreSQL FTS configuration stems words into lexemes (root words), which works poorly for ticker symbols like "RELIND" (for Reliance Industries). We created a custom configuration that maps word tokens to the simple dictionary, so symbols are indexed verbatim; combined with prefix queries (e.g. to_tsquery('RELIND:*')), this also lets the agent match partial symbols:

-- Create a custom text search configuration
CREATE TEXT SEARCH CONFIGURATION trading_instruments (COPY = english);

-- Modify it to handle partial words and special cases
ALTER TEXT SEARCH CONFIGURATION trading_instruments
    ALTER MAPPING FOR asciiword, word, numword, asciihword, hword, numhword
    WITH simple;

2. Implement Fuzzy Matching for Typos

To handle typos and slight variations in search terms, we implemented trigram-based fuzzy matching:

-- Enable the pg_trgm extension
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Add a trigram index on key fields
CREATE INDEX instruments_trigram_idx ON instruments 
-- coalesce() so a single NULL field doesn't null out the whole expression
USING GIN ((coalesce(symbol, '') || ' ' || coalesce(company_name, '') || ' ' || coalesce(identifier, '')) gin_trgm_ops);

-- Create a function for fuzzy search
CREATE OR REPLACE FUNCTION fuzzy_search_instruments(search_term TEXT)
RETURNS TABLE (
    token VARCHAR,
    symbol VARCHAR,
    company_name VARCHAR,
    instrument_type VARCHAR,
    exchange VARCHAR,
    identifier VARCHAR,
    similarity REAL  -- similarity() returns real
) AS $$
BEGIN
    RETURN QUERY
    SELECT 
        i.token,
        i.symbol,
        i.company_name,
        i.instrument_type,
        i.exchange,
        i.identifier,
        similarity(coalesce(i.symbol, '') || ' ' || coalesce(i.company_name, '') || ' ' || coalesce(i.identifier, ''), search_term) AS similarity
    FROM 
        instruments i
    WHERE 
        coalesce(i.symbol, '') || ' ' || coalesce(i.company_name, '') || ' ' || coalesce(i.identifier, '') % search_term
    ORDER BY 
        -- repeat the expression: a bare "similarity" is ambiguous inside
        -- plpgsql, since it also names the OUT column
        similarity(coalesce(i.symbol, '') || ' ' || coalesce(i.company_name, '') || ' ' || coalesce(i.identifier, ''), search_term) DESC
    LIMIT 10;
END
$$ LANGUAGE plpgsql;
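
To see why the % operator tolerates typos, pg_trgm's similarity measure can be approximated in a few lines of Python. This sketch mirrors the documented trigram extraction (lowercasing, word padding, 3-character windows) but is an illustration, not pg_trgm's exact C implementation:

```python
import re

def trigrams(text: str) -> set:
    """Approximate pg_trgm trigram extraction: lowercase, split into
    alphanumeric words, pad each word with two leading and one trailing
    space, then take every 3-character window."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def trigram_similarity(a: str, b: str) -> float:
    """pg_trgm-style similarity: shared trigrams / all distinct trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)
```

A one-character typo still shares most trigrams with the intended symbol, so it scores well above pg_trgm's default 0.3 match threshold, while unrelated symbols score near zero.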

3. Combining FTS with Time-Series Data Queries

Since our AI agent often needs to search for an instrument and then immediately query its historical data, we created a specialized function that combines both operations:

CREATE OR REPLACE FUNCTION search_and_get_recent_ticks(
    search_term TEXT, 
    time_from TIMESTAMPTZ, 
    time_to TIMESTAMPTZ
)
RETURNS TABLE (
    token VARCHAR,
    symbol VARCHAR,
    price NUMERIC,
    volume BIGINT,
    timestamp TIMESTAMPTZ
) AS $$
DECLARE
    instrument_token VARCHAR;
BEGIN
    -- First find the instrument
    SELECT i.token INTO instrument_token
    FROM instruments i
    WHERE i.search_vector @@ to_tsquery('english', search_term)
    ORDER BY ts_rank(i.search_vector, to_tsquery('english', search_term)) DESC
    LIMIT 1;
    
    -- Then return its recent tick data
    RETURN QUERY
    SELECT 
        t.token,
        i.symbol,
        t.price,
        t.volume,
        t.timestamp
    FROM 
        ticks t
    JOIN 
        instruments i ON t.token = i.token
    WHERE 
        t.token = instrument_token
        AND t.timestamp BETWEEN time_from AND time_to
    ORDER BY 
        t.timestamp DESC;
END
$$ LANGUAGE plpgsql;

The AI agent interacts with the search functionality through a simple API. When researching market opportunities, it follows this workflow:

  1. Initial research: The agent identifies potential instruments based on news, trends, or strategic analysis
  2. Symbol search: It performs a full-text search to find the exact instrument identifier
  3. Data retrieval: Once it has the identifier, it can retrieve historical tick data for backtesting
  4. Strategy creation: Based on the analysis, it generates trading strategy code

For example, if the agent reads news about Reliance Industries and wants to create a strategy, it might:

  1. Search for "Reliance Industries" using search_instruments('reliance & industries')
  2. Get back NSE:RELIND:EQ as the top result
  3. Retrieve historical data using the token
  4. Generate and test a strategy specific to this instrument

Future Improvements

As we expand to international markets and commodities, we'll need to enhance our search capabilities:

1. Multi-Language Support

For international markets, we plan to implement multilingual search support:

-- Create configurations for different languages
CREATE TEXT SEARCH CONFIGURATION trading_de (COPY = german);

-- Note: PostgreSQL ships no built-in Japanese configuration; CJK
-- tokenization needs an extension such as PGroonga or pg_bigm

-- Add language-specific columns
ALTER TABLE instruments ADD COLUMN search_vector_en tsvector;
ALTER TABLE instruments ADD COLUMN search_vector_jp tsvector;
ALTER TABLE instruments ADD COLUMN search_vector_de tsvector;

2. Real-Time Instrument Updates

For markets with frequently changing instruments (like options with new expiries):

-- Create a hypertable for the instrument updates log
-- (assumes an instrument_updates table with an update_time column exists)
SELECT create_hypertable('instrument_updates', 'update_time');

-- Create a materialized view for the latest instrument data
CREATE MATERIALIZED VIEW latest_instruments AS
SELECT DISTINCT ON (token, exchange) *
FROM instrument_updates
ORDER BY token, exchange, update_time DESC;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX latest_instruments_pk ON latest_instruments (token, exchange);

-- Refresh schedule
CREATE OR REPLACE FUNCTION refresh_instruments_mv()
RETURNS void AS $$
BEGIN
  REFRESH MATERIALIZED VIEW CONCURRENTLY latest_instruments;
END
$$ LANGUAGE plpgsql;

-- Run every hour (requires the pg_cron extension)
SELECT cron.schedule('0 * * * *', 'SELECT refresh_instruments_mv()');

3. Synonym Expansion

To handle different naming conventions across markets:

-- Create a synonyms table
CREATE TABLE instrument_synonyms (
    word TEXT PRIMARY KEY,
    synonyms TEXT[]
);

-- Example data
INSERT INTO instrument_synonyms VALUES 
('stock', ARRAY['equity', 'share', 'scrip']),
('index', ARRAY['indices', 'benchmark']),
('futures', ARRAY['fut', 'forward', 'forwards']);

-- Function to expand search terms with synonyms
CREATE OR REPLACE FUNCTION expand_search_terms(search_term TEXT)
RETURNS TEXT AS $$
DECLARE
    result TEXT := search_term;
    term_record RECORD;
BEGIN
    FOR term_record IN 
        SELECT word, synonyms FROM instrument_synonyms
        WHERE search_term ~* ('\m' || word || '\M')
    LOOP
        result := result || ' | ' || array_to_string(term_record.synonyms, ' | ');
    END LOOP;
    
    RETURN result;
END
$$ LANGUAGE plpgsql;
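
The same expansion logic is easy to prototype outside the database. This sketch mirrors the expand_search_terms() SQL function above, with an in-memory map standing in for the instrument_synonyms table (Python's \b word boundary plays the role of the \m...\M regex in the SQL version):

```python
import re

# In-memory stand-in for the instrument_synonyms table (sketch only)
SYNONYMS = {
    "stock": ["equity", "share", "scrip"],
    "index": ["indices", "benchmark"],
    "futures": ["fut", "forward", "forwards"],
}

def expand_search_terms(search_term: str) -> str:
    """OR-expand any whole word in the search term that has known synonyms,
    matching the SQL function's ' | '-append behavior."""
    result = search_term
    for word, synonyms in SYNONYMS.items():
        # \b mimics the \m...\M word-boundary anchors in the SQL version
        if re.search(r"\b" + re.escape(word) + r"\b", search_term, re.IGNORECASE):
            result += " | " + " | ".join(synonyms)
    return result
```

Note that tsquery gives & higher precedence than |, so for multi-word queries it may be worth parenthesizing the expanded alternatives before passing the result to to_tsquery().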

Conclusion

Implementing full-text search in PostgreSQL with TimescaleDB has significantly improved our AI agent's ability to identify trading instruments quickly and accurately. This capability is crucial for our automated strategy creation workflow, enabling the agent to move seamlessly from market research to strategy implementation.

While the current implementation meets our needs for the Indian markets, we've laid the groundwork for expanding to international markets and commodities with minimal changes to the underlying architecture. The combination of TimescaleDB's time-series strengths with PostgreSQL's powerful text search capabilities gives us a robust foundation for our algorithmic trading platform.

As we continue to refine our AI agent's capabilities, having fast and accurate instrument search remains a cornerstone of the system's architecture, enabling more sophisticated trading strategies and faster response to market opportunities.