[BUG] text search results wrong sorting  #5972

Open
@cancerberoSgx

Description

I don't understand the logic behind the text search scorers when the query is an OR expression such as "word1 | word2". As a filter it works fine, but the order in which the matching documents are returned does not follow the number of matches in any way I can make sense of, which is why this looks like a bug to me.

TL;DR: I have a set of similar strings and search them for "word1 | word2". Some strings contain only one of the words and some contain both. The number of occurrences per word is the same, and the words appear in very similar positions in each string. Nevertheless, documents containing only one of the words get higher scores than documents containing both. Even among documents containing only one word, the document with a single occurrence scores higher than documents with more occurrences. This happens with all built-in scorers.

In my case I'm searching for words entered by a user in an ecommerce product store. I expect the documents with the most word matches to be returned first, and ideally the ones with the most unique word matches. The actual behavior is neither.

In the following example I search for the two words "bicicleta | maldonado" (sorry for the Spanish). Each example document's id encodes the number of occurrences of the two words: for example, id="1_0" means 1 occurrence of "bicicleta" and 0 occurrences of "maldonado".

Observations:

  • The order does not follow the total number of word matches, even when the matched words are in similar positions in the text <- this is why I think there's an issue
  • The order does not favor unique word matches over total word matches <- nice to have
    • In other words, a document containing both words (once each) should always score higher than one containing only one word (once)
  • For scorers like TFIDF and BM25, the documents with the lowest scores are the ones with the most matches (2_2, 2_1). Maybe the scores should be inverted for those?
  • All of the above is observed with every built-in scorer I tried: TFIDF, BM25, TFIDF.DOCNORM, DISMAX
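
The ordering expected in the observations above can be sketched in a few lines of Python. This only illustrates the expectation; it is not how RediSearch computes scores. It assumes the ids encode the occurrence counts as described earlier, and the helper names `expected_key` and `expected_order` are hypothetical.

```python
def expected_key(doc_id):
    """Rank key for an id like "2_1": (distinct words matched, total occurrences)."""
    # Assumption: the id encodes <count of "bicicleta">_<count of "maldonado">.
    counts = [int(part) for part in doc_id.split("_")]
    distinct = sum(1 for c in counts if c > 0)
    return (distinct, sum(counts))


def expected_order(doc_ids):
    """Sort ids so more distinct matches come first, then more total matches."""
    return sorted(doc_ids, key=expected_key, reverse=True)


if __name__ == "__main__":
    ids = ["1_0", "0_1", "2_0", "0_2", "2_1", "2_2"]
    print(expected_order(ids))
```

Documents that tie on both criteria (for example 2_0 and 0_2) keep their input order, since the expectation says nothing about how to break such ties.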

Questions:

  • Is there a bug in the full-text scorers, given that even a simple count of total word occurrences is not reflected in the order?
  • Is there a bug in the order of the results? (Should they be sorted by score ASC instead of DESC?)
  • Or am I missing something important, such as the scorers taking the position of the match into account, in which case none of these scorers is useful for my case?
  • Is there a way to implement a custom scorer as a Lua function instead of having to compile it as a C module?

Results observed:

TFIDF scorer order: 0_1, 2_0, 0_2, 1_0, 2_2, 2_1
DISMAX scorer order: 2_0, 0_2, 2_1, 2_2, 1_0, 0_1
BM25 scorer order: 2_0, 0_2, 1_0, 0_1, 2_2, 2_1
TFIDF.DOCNORM scorer order: 0_2, 2_0, 0_1, 1_0, 2_2, 2_1
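
One way to quantify how far each observed order is from the expectation stated in the observations is to count the pairs that are ranked the wrong way round. This is a minimal sketch under the same assumption that ids encode the occurrence counts; `rank_key` and `inversions` are hypothetical helper names, and the orders are copied from the results listed above.

```python
from itertools import combinations

# Orders copied from the observed results, best-scored document first.
OBSERVED = {
    "TFIDF": ["0_1", "2_0", "0_2", "1_0", "2_2", "2_1"],
    "DISMAX": ["2_0", "0_2", "2_1", "2_2", "1_0", "0_1"],
    "BM25": ["2_0", "0_2", "1_0", "0_1", "2_2", "2_1"],
    "TFIDF.DOCNORM": ["0_2", "2_0", "0_1", "1_0", "2_2", "2_1"],
}


def rank_key(doc_id):
    """(distinct words matched, total occurrences), parsed from an id like "2_1"."""
    counts = [int(part) for part in doc_id.split("_")]
    return (sum(1 for c in counts if c > 0), sum(counts))


def inversions(order):
    """Count pairs where a weaker match is ranked above a stronger one."""
    return sum(
        1
        for above, below in combinations(order, 2)
        if rank_key(above) < rank_key(below)
    )


if __name__ == "__main__":
    for scorer, order in OBSERVED.items():
        print(scorer, inversions(order))
```

An order that fully matched the expectation would report zero inversions; none of the four observed orders does.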

Redis script

FT.DROPINDEX myIndex DD
FT.CREATE myIndex 
    ON HASH 
    PREFIX 1 myprefix: 
    SCHEMA 
        text TEXT 
EVAL "local keys = redis.call('KEYS', 'myprefix:*') for i,k in ipairs(keys) do redis.call('DEL', k) end return #keys" 0
HSET myprefix:1_0 text "bicicleta tornado roma mtb rodado price category actividades al aire libre description tornado roma mtb rodado excelente estado location montevideo"
HSET myprefix:0_1 text "bolsa tote tejida price category eventos description disponible tambin para retirar location departamento de maldonado"
HSET myprefix:2_0 text "rack porta bicicleta category accesorios bicicleta location montevideo"
HSET myprefix:2_1 text "soporte bicicleta auto price category accesorios vehculos description se pueden llevar hasta bicicleta location maldonado department"
HSET myprefix:2_0 text "bicicleta specialized price description bicicleta hbrida urbana y ruta specialized sirrus con parrilla luces traseras y delanteras con candado u includo location montevideo"
HSET myprefix:0_2 text "bolsa tote tejida price category eventos description disponible tambin para retirar por maldonado location departamento de maldonado"
HSET myprefix:2_1 text "bicicleta trinx m pro price category actividades al aire libre description bicicleta rodado estamos ubicados en punta del este horario de atencin a hs location maldonado"
HSET myprefix:2_2 text "bicicleta roma tornado price category actividades al aire libre description bicicleta rodado estamos ubicados en punta del este, maldonado horario de atencin a hs location maldonado"
FT.SEARCH myIndex "bicicleta | maldonado" WITHSCORES SCORER TFIDF DIALECT 2 RETURN 1 id

Environment:

  • using redis-stack docker image latest
  • redis_version:7.4.1
  • redis-search version: 21012
