cancel
Showing results for 
Search instead for 
Did you mean: 

How Fuzzy score and Score() function works in HANA?

Former Member
0 Kudos

Hi,

I read fuzzy developer guide of HANA, but i am not getting how HANA calculate score() and fuzzy score?

As per developer guide, Score() is calculate using TF/IDF, and I also try to calculate TF/IDF as per WIKI page, but it gives different values. and Score() value is changed as per x value of fuzzy(x) .

See example


select score() as sc, *

from COMPANIES2

where contains(Companyname,'IBM',fuzzy(0.7))

it returns


SC;                             ID; COMPANYNAME;  CONTACT

0.7599999904632568;   6;  IBM Corp;               M. Master

and for


select score() as sc, *

from COMPANIES2

where contains(Companyname,'IBM',fuzzy(0.2))

it return


SC;                               ID;  COMPANYNAME;       CONTACT

0.16945946216583252;   2;   SAP in Walldorf Corp;  Master Mister

0.8392000198364258;     6;   IBM Corp;                   M. Master

and table content of Companies2 is


ID; Companyname;           contact

1;  SAP Corp;                   Mister Master

2;  SAP in Walldorf Corp;   Master Mister

3;  ASAP;                         Nister Naster

4;  ASAP Corp;                 Mixter Maxter

5;  BSAP orp;                   Imster Marter

6;  IBM Corp;                    M. Master

Please provide any formula or algorithm for above.

Thanks,

Somnath A. Kadam

Accepted Solutions (1)

Accepted Solutions (1)

Former Member
0 Kudos

Hi Somnath,

It seems that the column "Companyname" has data type "SHORTTEXT" and here is the quote from SAP HANA Developer Guide Ch. 10.2.4.8 (p659)

"Text types support a more sophisticated kind of fuzzy search. Texts are tokenized (split into terms), and the fuzzy comparison is performed term by term.

When searching with 'SAP' for example, a record like 'SAP Deutschland AG & Co. KG' gets a high score, because the term 'SAP' exists in both texts. A record like 'SAPPHIRE NOW Orlando' gets a lower score, because 'SAP' is just a part of the longer term 'SAPPHIRE' (3 of 8 characters match)."

So for text columns the score calculation is much more complex than tf-idf.

As for the different fuzzy score, there is an explanation in the FAQ section ( Ch. 10.2.4.14, p736 "Is the score between request and result always stable for TEXT columns?")

Basically, for each token, its similarity score will be used to calculate the overall result only if it is higher than the threshold given in fuzzy(). Any token with a lesser similarity score will be excluded. Therefore, slight change in the threshold may influence the overall score greatly.

Here is an example.

I added id 7 "SAP ASAP" to the data you used.

Note that the similarity score between "ASAP" and "BSAP" is slightly over 0.74 and similarity score between "SAP" and "BSAP" is 0.75:

For

    select score() as sc, * from COMPANIES2  where contains(COMPANYNAME,'BSAP',fuzzy(0.74))

We get:

<...omitted...>

0.7474510073661804;    7;    SAP ASAP;        M. Master

Now change the  threshold to 0.75 and the result is:

<...omitted...>

0.5588234663009644;    7;    SAP ASAP;        M. Master

ID 7 now gets a lower score because "ASAP" is excluded and only "SAP" is used to calculate the overall result.

As for tf-idf, it is used in the so-called freestyle search across multiple columns.

An example from the same guide:

     select score() as sc, * from companies2 where contains((companyname,contact), 'IBM Master', FUZZY(0.7));

Result:

0.8103122115135193;    6;    IBM Corp;    M. Master

Regards

Roger Tao

Answers (0)