on 05-13-2014 12:41 PM
Hi,
I read fuzzy developer guide of HANA, but i am not getting how HANA calculate score() and fuzzy score?
As per developer guide, Score() is calculate using TF/IDF, and I also try to calculate TF/IDF as per WIKI page, but it gives different values. and Score() value is changed as per x value of fuzzy(x) .
See example
select score() as sc, *
from COMPANIES2
where contains(Companyname,'IBM',fuzzy(0.7))
it returns
SC; ID; COMPANYNAME; CONTACT
0.7599999904632568; 6; IBM Corp; M. Master
and for
select score() as sc, *
from COMPANIES2
where contains(Companyname,'IBM',fuzzy(0.2))
it return
SC; ID; COMPANYNAME; CONTACT
0.16945946216583252; 2; SAP in Walldorf Corp; Master Mister
0.8392000198364258; 6; IBM Corp; M. Master
and table content of Companies2 is
ID; Companyname; contact
1; SAP Corp; Mister Master
2; SAP in Walldorf Corp; Master Mister
3; ASAP; Nister Naster
4; ASAP Corp; Mixter Maxter
5; BSAP orp; Imster Marter
6; IBM Corp; M. Master
Please provide any formula or algorithm for above.
Thanks,
Somnath A. Kadam
Hi Somnath,
It seems that the column "Companyname" has data type "SHORTTEXT" and here is the quote from SAP HANA Developer Guide Ch. 10.2.4.8 (p659)
"Text types support a more sophisticated kind of fuzzy search. Texts are tokenized (split into terms), and the fuzzy comparison is performed term by term.
When searching with 'SAP' for example, a record like 'SAP Deutschland AG & Co. KG' gets a high score, because the term 'SAP' exists in both texts. A record like 'SAPPHIRE NOW Orlando' gets a lower score, because 'SAP' is just a part of the longer term 'SAPPHIRE' (3 of 8 characters match)."
So for text columns the score calculation is much more complex than tf-idf.
As for the different fuzzy score, there is an explanation in the FAQ section ( Ch. 10.2.4.14, p736 "Is the score between request and result always stable for TEXT columns?")
Basically, for each token, its similarity score will be used to calculate the overall result only if it is higher than the threshold given in fuzzy(). Any token with a lesser similarity score will be excluded. Therefore, slight change in the threshold may influence the overall score greatly.
Here is an example.
I added id 7 "SAP ASAP" to the data you used.
Note that the similarity score between "ASAP" and "BSAP" is slightly over 0.74 and similarity score between "SAP" and "BSAP" is 0.75:
For
select score() as sc, * from COMPANIES2 where contains(COMPANYNAME,'BSAP',fuzzy(0.74))
We get:
<...omitted...>
0.7474510073661804; 7; SAP ASAP; M. Master
Now change the threshold to 0.75 and the result is:
<...omitted...>
0.5588234663009644; 7; SAP ASAP; M. Master
ID 7 now gets a lower score because "ASAP" is excluded and only "SAP" is used to calculate the overall result.
As for tf-idf, it is used in the so-called freestyle search across multiple columns.
An example from the same guide:
select score() as sc, * from companies2 where contains((companyname,contact), 'IBM Master', FUZZY(0.7));
Result:
0.8103122115135193; 6; IBM Corp; M. Master
Regards
Roger Tao
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
User | Count |
---|---|
87 | |
10 | |
10 | |
10 | |
7 | |
6 | |
6 | |
5 | |
5 | |
4 |
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.