Solved: How Fuzzy score and Score() function works in HANA...

Former Member · ‎05-13-2014

Hi,

I read fuzzy developer guide of HANA, but i am not getting how HANA calculate score() and fuzzy score?

As per developer guide, Score() is calculate using TF/IDF, and I also try to calculate TF/IDF as per WIKI page, but it gives different values. and Score() value is changed as per x value of fuzzy(x) .

See example


select score() as sc, *
from COMPANIES2
where contains(Companyname,'IBM',fuzzy(0.7))

it returns


SC;                             ID; COMPANYNAME;  CONTACT
0.7599999904632568;   6;  IBM Corp;               M. Master

and for


select score() as sc, *
from COMPANIES2
where contains(Companyname,'IBM',fuzzy(0.2))

it return


SC;                               ID;  COMPANYNAME;       CONTACT
0.16945946216583252;   2;   SAP in Walldorf Corp;  Master Mister
0.8392000198364258;     6;   IBM Corp;                   M. Master

and table content of Companies2 is


ID; Companyname;           contact
1;  SAP Corp;                   Mister Master
2;  SAP in Walldorf Corp;   Master Mister
3;  ASAP;                         Nister Naster
4;  ASAP Corp;                 Mixter Maxter
5;  BSAP orp;                   Imster Marter
6;  IBM Corp;                    M. Master

Please provide any formula or algorithm for above.

Thanks,

Somnath A. Kadam

Former Member · ‎05-20-2014

Hi Somnath,

It seems that the column "Companyname" has data type "SHORTTEXT" and here is the quote from SAP HANA Developer Guide Ch. 10.2.4.8 (p659)

"Text types support a more sophisticated kind of fuzzy search. Texts are tokenized (split into terms), and the fuzzy comparison is performed term by term.

When searching with 'SAP' for example, a record like 'SAP Deutschland AG & Co. KG' gets a high score, because the term 'SAP' exists in both texts. A record like 'SAPPHIRE NOW Orlando' gets a lower score, because 'SAP' is just a part of the longer term 'SAPPHIRE' (3 of 8 characters match)."

So for text columns the score calculation is much more complex than tf-idf.

As for the different fuzzy score, there is an explanation in the FAQ section ( Ch. 10.2.4.14, p736 "Is the score between request and result always stable for TEXT columns?")

Basically, for each token, its similarity score will be used to calculate the overall result only if it is higher than the threshold given in fuzzy(). Any token with a lesser similarity score will be excluded. Therefore, slight change in the threshold may influence the overall score greatly.

Here is an example.

I added id 7 "SAP ASAP" to the data you used.

Note that the similarity score between "ASAP" and "BSAP" is slightly over 0.74 and similarity score between "SAP" and "BSAP" is 0.75:

For

select score() as sc, * from COMPANIES2 where contains(COMPANYNAME,'BSAP',fuzzy(0.74))

We get:

<...omitted...>

0.7474510073661804; 7; SAP ASAP; M. Master

Now change the threshold to 0.75 and the result is:

<...omitted...>

0.5588234663009644; 7; SAP ASAP; M. Master

ID 7 now gets a lower score because "ASAP" is excluded and only "SAP" is used to calculate the overall result.

As for tf-idf, it is used in the so-called freestyle search across multiple columns.

An example from the same guide:

select score() as sc, * from companies2 where contains((companyname,contact), 'IBM Master', FUZZY(0.7));

Result:

0.8103122115135193; 6; IBM Corp; M. Master

Regards

Roger Tao

How Fuzzy score and Score() function works in HANA?

Accepted Solutions (1)

Accepted Solutions (1)

Answers (0)

Re: Replication jobs running under my user ID. Sec...

Replication jobs running under my user ID. Securit...

Re: Couldn't create an instance of Cloud Identity ...

Re: Couldn't create an instance of Cloud Identity ...

Re: SAP CRYSTAL REPORT 2020 support ODBC DRIVER 18...