Extracting specific text from pdf files (unstructured data) to a HANA table

Former Member · ‎08-22-2016

Hi,

I have some pdf files which contain some data and images. In each of these pdf files, there is a reference number maintained like (Ref: 00.00.00001).

I need to extract this reference number from the various PDF files placed in a directory and store it in a column of a HANA table.

For this purpose, I have uploaded the PDF files to HANA using a Python script. The entire content of each PDF file goes into a single BLOB column of a HANA table.

Now, I need to search within this BLOB column (which is pdf file content), extract the reference number and put it in another column.

I am not sure how to do this. Can you please guide me on how this can be done?

Is it possible to do this in HANA via some text mining or text analysis technique, or in any other way? I am new to text mining and text analysis in HANA.

Regards,

Amandeep Singh

pfefferf · ‎08-22-2016

You can use text analysis with CGUL rules.

A very similar case is described in a blog.

Regards,

Florian

SergioG_TX · ‎08-22-2016

Amandeep,

I have not done this myself, but I took one of the openSAP courses on text mining. Here is the official documentation:

SAP HANA Advanced Data Processing – SAP Help Portal Page

For another reference, please check out the openSAP course on text analytics:

Text Analytics with SAP HANA Platform - Anthony Waite, Yolande Meessen, Bill Miller, and Michael Wie...
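
As an alternative to text analysis inside HANA, the reference number could possibly be picked out in the Python upload script itself. The following is only a minimal sketch under several assumptions: the text of the PDF has already been extracted into a plain string (the raw BLOB is binary and its content is often compressed, so matching on the bytes directly is unreliable), and every reference follows the shape of the single example given in the question, (Ref: 00.00.00001). The name extract_ref and the pattern are hypothetical and not taken from the thread.

```python
import re
from typing import Optional

# Assumed pattern: "Ref:" followed by digit groups of width 2, 2 and 5,
# modelled only on the example "Ref: 00.00.00001" from the question.
REF_PATTERN = re.compile(r"Ref:\s*(\d{2}\.\d{2}\.\d{5})")


def extract_ref(text: str) -> Optional[str]:
    """Return the first reference number found in text, or None if absent."""
    match = REF_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical sample standing in for text extracted from one PDF.
    sample = "Some data and images (Ref: 00.00.00001) and further content"
    print(extract_ref(sample))
```

The returned value could then be written to the second column by the same script that loads the BLOB. If the real reference numbers vary in length or layout, the pattern would need to be adjusted.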

