on 06-11-2012 6:41 PM
As SAP claims, HANA supports both structured and unstructured data. Could any of you, ideally someone from SAP, explain how HANA supports unstructured data? I would like to know how to
1) load unstructured data into HANA, for example scanned invoices as JPEG or PDF files
2) search for, say, a customer name within these files
3) show a file when it is clicked in the search results
Cheers
Tansu
I have the same question. We are trying to load wave files from call centers into HANA (thousands of them) and need a voice-to-text translator as well as a load mechanism into HANA.
Ideas?
Dr. Berg
Hi, below you will find a sample script that lets you upload any type of file to a BLOB column in a column table in the HANA database. I was able to build this script with help from Juergen Schmerder. You can use any programming language that can establish a connection through ODBC or JDBC, like .NET, Java, etc.
from hdbcli import dbapi #Assumption: current SAP HANA Python client package; older client installs may use "import dbapi"
con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********') #Open connection to SAP HANA
cur = con.cursor() #Open a cursor
file = open('doc.pdf', 'rb') #Open file in read-only and binary
content = file.read() #Save the content of the file in a variable
cur.execute("INSERT INTO BLOBTEST VALUES(?,?)", (2,content)) #Save the content to a table
file.close() #Close the file
cur.close() #Close the cursor
con.close() #Close the connection
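For point 3 of the question (showing a file once it is clicked in the search results), the same approach can be turned around to read the bytes back. The following is only a minimal sketch, not a tested HANA implementation: it takes any DB-API connection that uses `?` placeholders, the helper names `store_file` and `fetch_file` are made up for illustration, and the key column name "ID" is an assumption (only "File_Content" appears in this thread). `sqlite3` is imported purely as a stand-in so the sketch can be tried without a HANA system.

```python
import sqlite3  # stand-in driver only, so the sketch runs without a HANA system


def store_file(con, doc_id, path):
    """Insert the raw bytes of the file at `path` under the key `doc_id`."""
    with open(path, "rb") as fh:
        content = fh.read()
    cur = con.cursor()
    try:
        # Same positional insert as in the script above
        cur.execute("INSERT INTO BLOBTEST VALUES(?,?)", (doc_id, content))
        con.commit()
    finally:
        cur.close()


def fetch_file(con, doc_id):
    """Return the stored bytes for `doc_id`, or None if no such row exists."""
    cur = con.cursor()
    try:
        # "ID" is an assumed column name; adjust it to the real table definition
        cur.execute('SELECT "File_Content" FROM BLOBTEST WHERE "ID" = ?', (doc_id,))
        row = cur.fetchone()
    finally:
        cur.close()
    if row is None:
        return None
    blob = row[0]
    # Some drivers return a LOB object rather than bytes; read it if so
    return blob.read() if hasattr(blob, "read") else bytes(blob)


if __name__ == "__main__":
    # Local demo against an in-memory SQLite table with the assumed layout
    demo = sqlite3.connect(":memory:")
    demo.execute('CREATE TABLE BLOBTEST ("ID" INTEGER, "File_Content" BLOB)')
    print(fetch_file(demo, 2))  # nothing stored yet
```

Against HANA you would pass the connection from dbapi.connect(...) instead of the SQLite one, and the application would then hand the returned bytes to whatever viewer opens the PDF or image.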
Now, to search within the content of the files, you will need to use Fuzzy Search. Here's an example of a query that looks for the word "march" in the content of the files. The score that you get back is a TF/IDF score (Term Frequency/Inverse Document Frequency): it is calculated from the number of times the word "march" is found in the content of each file, so the file with the most matches gets the highest score.
SELECT TO_DECIMAL(SCORE(),3,2) AS score, *
FROM BLOBTEST
WHERE CONTAINS("File_Content", 'march',
FUZZY(0.5, 'textSearch=fulltext'))
ORDER BY "Year", "Month";
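To make the TF/IDF idea a bit more concrete, here is a small sketch of one common textbook variant of the formula. This is not HANA's actual scoring code, which is not shown in this thread; the function name `tf_idf`, the whitespace tokenization and the sample documents are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(term, docs):
    """Return one textbook TF-IDF score per document for a single term."""
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: how many documents contain the term at least once
    doc_freq = sum(1 for tokens in tokenized if term in tokens)
    if doc_freq == 0:
        return [0.0] * len(docs)
    # Inverse document frequency; the +1 keeps the weight positive
    # even when every document contains the term
    idf = math.log(len(docs) / doc_freq) + 1.0
    scores = []
    for tokens in tokenized:
        # Term frequency: share of the document's words that are the term
        tf = Counter(tokens)[term] / len(tokens) if tokens else 0.0
        scores.append(tf * idf)
    return scores


if __name__ == "__main__":
    sample = ["march invoice march", "invoice april", "march report"]
    for text, score in zip(sample, tf_idf("march", sample)):
        print(f"{score:.3f}  {text}")
```

In this variant a document that mentions "march" more often relative to its length ranks higher, and a document without the word scores zero.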
Hope it helps, Lucas.
Hello Lucas,
I could upload files in various formats (PDF, DOC, HTML and plain text).
I also created a column that specifies the MIME type and created a full text index as follows:
CREATE FULLTEXT INDEX <Index_name> ON MYTABLE(BLOBFIELD) MIME TYPE COLUMN MIME_COLUMN;
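If the files are loaded from Python, the value for MIME_COLUMN could be derived from the file name at upload time. This is only a sketch using the standard-library `mimetypes` module; the helper name `guess_mime_type` is made up, and whether a given HANA revision accepts and processes each of these MIME type strings is an assumption that would need to be checked on the system.

```python
import mimetypes


def guess_mime_type(filename, default="application/octet-stream"):
    """Guess a MIME type from the file extension, falling back to `default`."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or default


if __name__ == "__main__":
    # Print the guessed MIME type for the kinds of files mentioned in this thread
    for name in ("invoice.pdf", "letter.doc", "page.html", "notes.txt"):
        print(name, guess_mime_type(name))
```

The returned string can then be inserted next to the BLOB content in the same INSERT statement.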
However, in my system the search works only for HTML and plain text; it does not work for PDF or DOC files.
I also checked the table M_TEXT_ANALYSIS_MIME_TYPES, and only 2 MIME types are visible there: text and HTML.
Is there anything else needed to search PDF or DOC files?
I checked this on a REV34 system.
Thanks.
Hi Sagar, to search within the content of Word or PDF files you need to use Fuzzy Search. The score that you get back is a TF/IDF score (Term Frequency/Inverse Document Frequency): it is calculated from the number of times the word that you are looking for is found in the content of each file, so the file with the most matches gets the highest score. Here's an example:
SELECT TO_DECIMAL(SCORE(),3,2) AS score, *
FROM <TABLE_NAME>
WHERE CONTAINS("<BLOB_COLUMN_NAME>", '<TERM_THAT_YOU_ARE_LOOKING_FOR>',
FUZZY(0.7, 'textSearch=fulltext'));
Hope this helps.