Extracting specific text from pdf files (unstructured data) to a HANA table

Former Member
0 Kudos

Hi,

I have some PDF files that contain data and images. Each of these PDF files contains a reference number in the form (Ref: 00.00.00001).

I need to extract this reference number from the various PDF files placed in a directory into a column of a HANA table.

For this purpose, I have uploaded the PDF files into HANA using a Python script. The entire content of each PDF file goes into a single column of data type BLOB in a HANA table.

Now I need to search within this BLOB column (which holds the PDF file content), extract the reference number, and put it in another column.

I am not sure how to do this. Can you please guide me on how this can be done?

Is it possible to do this in HANA via some text mining or text analysis technique, or in any other way? I am new to text mining and text analysis in HANA.

Regards,

Amandeep Singh

Accepted Solutions (0)

Answers (2)

pfefferf
Active Contributor
0 Kudos

You can use text analysis with CGUL rules.

A very similar case is described in a blog post.
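
To make the target pattern concrete, here is a minimal sketch in plain Python (not CGUL) of the kind of match such a rule would need to capture. It assumes the reference always follows the layout of the example in the question (two digits, two digits, five digits) and that the text has already been extracted from the PDF; the raw BLOB bytes are usually compressed, so a pattern search directly on them is unlikely to work. The names REF_PATTERN and extract_ref are made up for illustration.

```python
import re

# Assumed format, taken from the example "Ref: 00.00.00001" in the question:
# two digits, a dot, two digits, a dot, five digits.
# Adjust the pattern if the real documents differ.
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def extract_ref(text):
    """Return the first reference number found in the text, or None."""
    match = REF_PATTERN.search(text)
    return match.group(1) if match else None


# Hypothetical sample standing in for text already extracted from a PDF.
sample = "Some document content (Ref: 00.00.00001) followed by more content"
print(extract_ref(sample))
```

The same function could be called once per document in the Python upload script, with the result written to the second column alongside the BLOB.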

Regards,

Florian

SergioG_TX
Active Contributor
0 Kudos

Amandeep,

I have not done this myself, but I took one of the openSAP courses on text mining. Here is the official documentation:

SAP HANA Advanced Data Processing – SAP Help Portal Page

For another reference, please check out the openSAP course on text analytics:

Text Analytics with SAP HANA Platform - Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wie...