How do you load UNSTRUCTURED data into HANA?

Former Member
0 Kudos

SAP claims that HANA supports both structured and unstructured data. Can any of you (hopefully someone from SAP) explain how HANA supports unstructured data, please? I would like to know how to:

1) load unstructured data into HANA, for example scanned invoices as JPEG or PDF files

2) search for, say, a customer name within these files

3) show a file when it is clicked in the search results

Cheers

Tansu

Accepted Solutions (0)

Answers (2)


Former Member
0 Kudos

I'm looking for an answer to the same question. We are trying to load wave files from call centers into HANA (thousands of them) and need a voice-to-text translator, as well as a load mechanism into HANA.

Ideas?

Dr. Berg

Former Member
0 Kudos

Hi, below you will find a sample script that lets you upload any type of file to a BLOB column in a column table in the HANA database. I was able to build this script with help from Juergen Schmerder. You can use any programming language that can establish a connection through ODBC or JDBC, such as .NET, Java, etc.


from hdbcli import dbapi  # Assumes the hdbcli package; older HANA client installs expose the same module as plain "import dbapi"

con = dbapi.connect('hanahost', 30015, 'SYSTEM', '********')  # Open connection to SAP HANA
cur = con.cursor()  # Open a cursor

file = open('doc.pdf', 'rb')  # Open the file in read-only binary mode
content = file.read()  # Read the content of the file into a variable

cur.execute("INSERT INTO BLOBTEST VALUES(?,?)", (2, content))  # Save the content to the table

file.close()  # Close the file
cur.close()  # Close the cursor
con.close()  # Close the connection
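
To load many files at once (for example a folder of scanned invoices), the same INSERT can be run in a loop. This is only a minimal sketch: it assumes the two-column BLOBTEST table from the script above and any DB-API cursor that accepts `?` placeholders. The helper name `load_folder` and the sequential ID scheme are illustrative choices, not part of the original script.

```python
from pathlib import Path


def load_folder(cur, folder, pattern="*.pdf", first_id=1):
    """Insert every file in `folder` matching `pattern` into BLOBTEST.

    `cur` is an open DB-API cursor (e.g. the result of con.cursor()).
    IDs are assigned sequentially starting at `first_id`; this is an
    assumption, use whatever key your table really needs.
    Returns the number of files inserted.
    """
    count = 0
    for offset, path in enumerate(sorted(Path(folder).glob(pattern))):
        content = path.read_bytes()  # Whole file as bytes
        cur.execute(
            "INSERT INTO BLOBTEST VALUES(?,?)",
            (first_id + offset, content),
        )
        count += 1
    return count
```

With an open connection you would call something like `load_folder(con.cursor(), 'invoices')` and then commit if autocommit is switched off.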

Now, to search within the content of the files, you can use Fuzzy Search. Here's an example of a query that looks for the word "march" in the content of the files. The score you get back is a TF/IDF score (Term Frequency/Inverse Document Frequency): it is calculated from the number of times the word "march" is found in the content of each file, so the file with the most matches gets the highest score.

SELECT TO_DECIMAL(SCORE(),3,2) AS score, *
FROM BLOBTEST
WHERE CONTAINS("File_Content", 'march',
FUZZY(0.5, 'textSearch=fulltext'))
ORDER BY "Year", "Month";
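
To make the TF/IDF idea a bit more concrete, here is a toy calculation in plain Python. It is a textbook variant (raw term count times log inverse document frequency), not HANA's actual scoring formula, and the three sample "documents" are made up for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(docs, term):
    """Return one textbook TF/IDF score per document for `term`.

    tf  = number of times the term occurs in the document (raw count)
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in docs]
    containing = sum(1 for words in tokenized if term in words)
    if containing == 0:
        return [0.0] * len(docs)  # Term appears nowhere
    idf = math.log(len(docs) / containing)
    return [Counter(words)[term] * idf for words in tokenized]


# Made-up sample texts standing in for extracted file content
docs = [
    "invoice dated march and due in march",
    "invoice dated march",
    "invoice dated april",
]
scores = tf_idf_scores(docs, "march")
# The document with two matches ranks first; the one without "march" scores 0.
```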

Hope it helps, Lucas.

Practice your SAP HANA™ development skills:

www.GetYourHandsOn.it


sagarjoshi
Advisor
0 Kudos

Hello Lucas,

I was able to upload files in various formats (PDF, DOC, HTML, and plain text).

I also created a column that specifies the MIME type and created a full text index as follows:

CREATE FULLTEXT INDEX <Index_name> ON MYTABLE(BLOBFIELD) MIME TYPE COLUMN MIME_COLUMN;

However, in my system the search only works for HTML and plain text; it does not work for PDF or DOC files.

I also checked the table M_TEXT_ANALYSIS_MIME_TYPES, and only 2 MIME types, text and HTML, are visible there.

Is there anything else needed to search PDF or DOC files?

I checked on a REV34 system.

Thanks.
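
For anyone who wants to reproduce this setup, filling a MIME type column like the one described above could look roughly like the sketch below. It guesses the type from the file extension with Python's standard `mimetypes` module. The `ID` column and the helper name `insert_with_mime` are assumptions; only MYTABLE, BLOBFIELD and MIME_COLUMN come from the post. Whether HANA can then extract text for a given type still depends on what M_TEXT_ANALYSIS_MIME_TYPES lists on your revision.

```python
import mimetypes
from pathlib import Path


def insert_with_mime(cur, file_id, path):
    """Insert one file plus its guessed MIME type into MYTABLE.

    `cur` is an open DB-API cursor using `?` placeholders.
    The ID column is an assumption; adjust the column list to your table.
    Returns the MIME type that was stored.
    """
    path = Path(path)
    mime_type, _encoding = mimetypes.guess_type(path.name)
    if mime_type is None:
        mime_type = "application/octet-stream"  # Fallback for unknown extensions
    cur.execute(
        "INSERT INTO MYTABLE (ID, BLOBFIELD, MIME_COLUMN) VALUES (?,?,?)",
        (file_id, path.read_bytes(), mime_type),
    )
    return mime_type
```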

Former Member
0 Kudos

Hi Sagar, to search within the content of Word or PDF files you need to use Fuzzy Search. The score you get back is a TF/IDF score (Term Frequency/Inverse Document Frequency): it is calculated from the number of times the word you are looking for is found in the content of each file, so the file with the most matches gets the highest score. Here's an example:

SELECT TO_DECIMAL(SCORE(),3,2) AS score, *
FROM <TABLE_NAME>
WHERE CONTAINS("<BLOB_COLUMN_NAME>", '<TERM_THAT_YOU_ARE_LOOKING_FOR>',
FUZZY(0.7, 'textSearch=fulltext'));
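
If you want to run this template from a script, a small helper can fill in the placeholders. The sketch below makes a few assumptions that are not in the query above: the search term is passed as a bound `?` parameter (assumed to be accepted inside CONTAINS), an `ORDER BY score DESC` is added so the best matches come first, and the function names are invented for illustration. Table and column names are pasted into the SQL as-is, so only pass trusted values.

```python
def build_search_sql(table, blob_column, fuzziness=0.7):
    """Fill in the query template; the search term stays a `?` parameter."""
    if not 0.0 <= fuzziness <= 1.0:
        raise ValueError("fuzziness must be between 0.0 and 1.0")
    return (
        "SELECT TO_DECIMAL(SCORE(),3,2) AS score, * "
        f"FROM {table} "
        f'WHERE CONTAINS("{blob_column}", ?, '
        f"FUZZY({fuzziness}, 'textSearch=fulltext')) "
        "ORDER BY score DESC"  # Added here: highest score first
    )


def search(cur, table, blob_column, term, fuzziness=0.7):
    """Run the fuzzy search on an open DB-API cursor and return all rows."""
    cur.execute(build_search_sql(table, blob_column, fuzziness), (term,))
    return cur.fetchall()
```

For example, `search(cur, 'BLOBTEST', 'File_Content', 'march', 0.5)` would correspond to the earlier "march" query, apart from the ordering.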

Hope this helps.