Read PDF file to XML via Java mapping

vinaymittal
Contributor
0 Kudos

Hi

The scenario is File to Proxy. I have to read all the text content of a PDF file, and I have written the following code:

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

class ReadPdf
{
  public static void main(String args[])
  {
    PDDocument pd = null;
    BufferedWriter wr = null;
    try {
      File input = new File("original.pdf");    // The PDF file to extract text from
      File output = new File("SampleText.txt"); // The text file that receives the extracted text
      pd = PDDocument.load(input);
      System.out.println(pd.getNumberOfPages()); // prints the number of pages
      System.out.println(pd.isEncrypted());      // false, as the file is not encrypted
      pd.save("CopyOfOriginal.pdf");             // creates a copy called "CopyOfOriginal.pdf"
      PDFTextStripper stripper = new PDFTextStripper();
      stripper.setStartPage(1); // start extracting from page 1
      stripper.setEndPage(1);   // extract up to and including page 1
      wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
      stripper.writeText(pd, wr);
    }
    catch (Exception e)
    {
      e.printStackTrace();
    }
    finally
    {
      // close() also flushes the writer; closing here covers the error case too
      try { if (wr != null) wr.close(); } catch (Exception e) { e.printStackTrace(); }
      try { if (pd != null) pd.close(); } catch (Exception e) { e.printStackTrace(); }
    }
  }
}

It works. I then modified it to run as a Java mapping:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

import com.sap.aii.mapping.api.AbstractTransformation;
import com.sap.aii.mapping.api.StreamTransformationException;
import com.sap.aii.mapping.api.TransformationInput;
import com.sap.aii.mapping.api.TransformationOutput;

public class PdftoXml extends AbstractTransformation
{
  public void transform(TransformationInput in, TransformationOutput out) throws StreamTransformationException
  {
    PDDocument pd = null;
    try {
      // Get the payload as an InputStream and let PDDocument.load() read the PDF from it.
      pd = PDDocument.load(in.getInputPayload().getInputStream());
      //System.out.println(pd.getNumberOfPages()); // prints the number of pages

      PDFTextStripper stripper = new PDFTextStripper();
      stripper.setStartPage(1); // start extracting from page 1
      stripper.setEndPage(1);   // extract up to and including page 1

      String str = stripper.getText(pd);
      String content[] = str.split("\n");

      String result ="<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
      result = result.concat("<ns0:MTPdf xmlns:ns0=\"urn:mmm-com:pi:Vinay:10\">");
      result = result.concat("<field1>"+content[0]+"</field1>");
      result = result.concat("<field2>"+content[1]+"</field1>");
      result = result.concat("<field3>"+content[2]+"</field1>");
      result = result.concat("<field4>"+content[3]+"</field1>");
      result = result.concat("</ns0:MTPdf>");

      out.getOutputPayload().getOutputStream().write(result.getBytes("UTF-8")); // write to the output payload
    }
    catch (Exception e)
    {
      // Fail the mapping instead of silently producing an empty payload
      throw new StreamTransformationException(e.getMessage(), e);
    }
    finally
    {
      try { if (pd != null) pd.close(); } catch (Exception e) { /* nothing more to do here */ }
    }
  }
}

I am using the third-party Apache API "PDFBox". Where do I import this API in the ESR so that my Java mapping works?

Accepted Solutions (1)


former_member181985
Active Contributor
0 Kudos

Hi Vinay,

The external API JAR files should be part of your Java development archive, under the root folder.

You could also use the concept from my blog to test your Java mapping code directly from the interface/operation mapping.
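To check whether the library is actually visible to the mapping at runtime, a small class-lookup helper can be useful. This is only a sketch: the class name ClasspathCheck and the method isClassAvailable are made up for illustration, and it assumes the helper is loaded by the same class loader as the mapping. The PDFBox class name is the one imported in the code above.

```java
class ClasspathCheck {

    // Returns true if the named class can be found by the class loader
    // that loaded this helper (hypothetical helper, not part of any SAP or PDFBox API).
    static boolean isClassAvailable(String className) {
        try {
            // 'false' = only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // The class was found, but one of its own dependencies is missing or incompatible
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("PDFBox visible: "
                + isClassAvailable("org.apache.pdfbox.pdmodel.PDDocument"));
    }
}
```

If the lookup returns false inside the mapping, the JAR is probably not in an imported archive that the mapping can see.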

Best Regards,

Praveen Gujjeti

vinaymittal
Contributor
0 Kudos

Thanks sirji

Your blog was awesome. I could test my binary file directly in the OM and got the result.

I placed the Apache API JAR as it is, under the name pdfbox, in an imported archive, and my second archive, named pdf, was able to access it.

The Java mapping program above (the second one) is slightly wrong: the closing tags do not match the opening tags. These lines

  result = result.concat("<field1>"+content[0]+"</field1>");
  result = result.concat("<field2>"+content[1]+"</field1>");
  result = result.concat("<field3>"+content[2]+"</field1>");
  result = result.concat("<field4>"+content[3]+"</field1>");

have to be

  result = result.concat("<field1>"+content[0]+"</field1>");
  result = result.concat("<field2>"+content[1]+"</field2>");
  result = result.concat("<field3>"+content[2]+"</field3>");
  result = result.concat("<field4>"+content[3]+"</field4>");

It currently reads only the first 4 lines from the PDF.
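To go beyond four fixed lines, the XML could be built in a loop over all extracted lines. The following is a minimal sketch, not the mapping itself. The class PdfLinesToXml and the methods toXml and escapeXml are hypothetical names. It assumes the target structure accepts an arbitrary number of fieldN elements. It also escapes &, < and >, and tolerates Windows line endings, which the code above does not do.

```java
class PdfLinesToXml {

    // Escape the characters that must not appear as-is in XML element text.
    static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build one <fieldN> element per non-empty line of the extracted text.
    static String toXml(String text) {
        String[] lines = text.split("\\r?\\n"); // accept both \n and \r\n
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        xml.append("<ns0:MTPdf xmlns:ns0=\"urn:mmm-com:pi:Vinay:10\">");
        int n = 0;
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i].trim();
            if (line.length() == 0) {
                continue; // skip blank lines
            }
            n++;
            xml.append("<field").append(n).append(">")
               .append(escapeXml(line))
               .append("</field").append(n).append(">");
        }
        xml.append("</ns0:MTPdf>");
        return xml.toString();
    }

    public static void main(String[] args) {
        // In the mapping, the argument would be the result of stripper.getText(pd)
        System.out.println(toXml("Invoice 4711\r\nA & B Ltd\n\nTotal < 100"));
    }
}
```

In the mapping, the string returned by toXml would be written to the output payload in place of the hand-built result string.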

Regards

Vinay

former_member181985
Active Contributor
0 Kudos

Glad to know Vinay

Answers (1)


former_member184720
Active Contributor
0 Kudos

You just need to add those JARs to the project root folder (in Eclipse/NWDS):

Right-click the project (root) folder -> Import -> General -> Archive File -> select your JAR file

vinaymittal
Contributor
0 Kudos

Hi Hareesh

I am not using NWDS. We are still using ESR/ID; the system is on 7.31 single stack.

Regards

Vinay

engswee
Active Contributor
0 Kudos

Vinay

Which IDE are you using to develop the Java mapping if you are not using NWDS? Or are you using this technique?

Rgds

Eng Swee