
Duplicate file check

Former Member
0 Kudos

Hello

I have been asked to investigate methods of preventing a file being submitted to PI multiple times. The files are picked up by an NFS file adapter.

We are already making checks on filenames. However, this still allows a file to be submitted multiple times if the file name is changed.

The solution to this might be to take a hash of the file contents and compare against previous files. However, this would be a significant load on the PI server.
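To illustrate, the sort of content hash I have in mind would be something like the following (a minimal sketch using the standard java.security.MessageDigest API; the class and method names are just placeholders):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FileContentHash {

    // Compute an MD5 digest of the full file contents, returned as a hex
    // string. The result could then be compared against the hashes of
    // previously processed files.
    public static String md5Of(String path)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        InputStream in = new FileInputStream(path);
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > -1) {
                md.update(buf, 0, n);
            }
        } finally {
            in.close();
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}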

Has anyone got any suggestions on an efficient way to prevent duplicate files being submitted?

Kind regards

Steve

Accepted Solutions (0)

Answers (5)

Former Member
0 Kudos

The file adapter has an option, "Msecs to Wait Before Modification Check", which waits the specified number of milliseconds and then re-checks the file size to make sure the file is complete before it is processed. Try this; it should work.

Former Member
0 Kudos

Hi,

Regarding the duplicate file check, see whether the code below is useful:

package com.sap.pi.deploy;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.Writer;
import java.util.Vector;

import com.sap.aii.mapping.api.AbstractTransformation;
import com.sap.aii.mapping.api.StreamTransformationException;
import com.sap.aii.mapping.api.TransformationInput;
import com.sap.aii.mapping.api.TransformationOutput;

public class checkDuplicateFileData extends AbstractTransformation {

    public void transform(TransformationInput arg0, TransformationOutput arg1)
            throws StreamTransformationException {

        String inputPayload = convertInputStreamToString(arg0.getInputPayload()
                .getInputStream());
        String outputPayload;

        try {
            // Flat file on the PI host that stores one hash code per processed payload
            String hashCodeDb = "//sapmnt//AX1/global//POC//dFileNameAB.txt";
            File fileDB = new File(hashCodeDb);
            String sourceFileData = Integer.toString(inputPayload.hashCode());

            if (!(fileDB.exists() && fileDB.canWrite() && fileDB.canRead())) {
                fileDB.createNewFile();
            }

            // Read all previously stored hash codes into a list
            Vector<String> hashList = new Vector<String>();
            BufferedReader br = new BufferedReader(new FileReader(hashCodeDb));
            String line;
            while ((line = br.readLine()) != null) {
                hashList.add(line);
            }
            br.close();

            boolean dataAlreadyProcessed = hashList.contains(sourceFileData);

            if (!dataAlreadyProcessed) {
                // New payload: append its hash to the store and pass it through unchanged
                Writer output = new BufferedWriter(
                        new FileWriter(new File(hashCodeDb), true));
                output.write(sourceFileData + "\r\n");
                output.flush();
                output.close();
                outputPayload = inputPayload;
            } else {
                // Duplicate payload: emit an error document instead of the payload
                outputPayload = "<?xml version=\"1.0\"?>"
                        + "<Error>Mapping failed in module due to duplicate file</Error>";
            }

            // The output payload is returned via the TransformationOutput stream
            arg1.getOutputPayload().getOutputStream()
                    .write(outputPayload.getBytes("UTF-8"));
        } catch (IOException e) {
            throw new StreamTransformationException(e.getMessage(), e);
        }
    }

    public String convertInputStreamToString(InputStream in) {
        StringBuffer sb = new StringBuffer();
        try {
            Reader reader = new BufferedReader(new InputStreamReader(in));
            int ch;
            while ((ch = reader.read()) > -1) {
                sb.append((char) ch);
            }
            reader.close();
        } catch (IOException e) {
            // Return whatever could be read before the failure
        }
        return sb.toString();
    }
}
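For what it's worth, a class like this would run as a Java mapping: compile it against the PI mapping libraries, add it to an imported archive, and reference it in the operation mapping for the interface.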

Former Member
0 Kudos

Thank you all for your suggestions. There are some very good ideas there.

I had been concerned about handling large volumes of data, so I like the idea of checking the first 'X' characters, combination of fields, or file size. Using specific fields would tie it to a specific message type, so I'm inclined to use the first chunk of a file. Saving a hash of this text would save space at the expense of more processor utilisation.
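For example, hashing only the first chunk might look roughly like this (a sketch; the 64 kB limit and the names are only illustrative):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class FirstChunkHash {

    // Hash only the first maxBytes of the payload, so very large files
    // cost a bounded amount of CPU regardless of their total size.
    public static String hashFirstChunk(byte[] payload, int maxBytes)
            throws NoSuchAlgorithmException {
        int len = Math.min(payload.length, maxBytes);
        MessageDigest md = MessageDigest.getInstance("MD5");
        md.update(payload, 0, len);
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }
}

Called as hashFirstChunk(payloadBytes, 64 * 1024), this stores a short hex string per file rather than the chunk itself.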

I especially like the suggestion that I ask the providers not to send duplicates. I'd be interested to know if anyone has ever had any success with that one.

I had seen the blog post about preventing duplicates. That was where I saw the warning about large amounts of IO and the effect this might have on PI.

I liked Anupam's idea about linking the filename and a unique field in the file. Unfortunately I have no control over the filenames for some of our interfaces.

Mickael has obviously been giving this some thought already, and he sums up the options very well. I agree that it is important to make the process reusable, and I think option 3 is the most reusable.

Kind regards

Steve

Former Member
0 Kudos

Thank you Vishal. I don't think that would solve this problem because the file size would be the same if it was submitted more than once.

However, this would be very useful to solve another problem I have where files are picked up before the source system has finished writing to them.

Edited by: PI Stream Lead on Nov 29, 2011 11:40 AM

Former Member
0 Kudos

Thanks, Dsravan. That is a useful piece of code.

I hadn't realised that hashCode could be used in that way. It would save me using CRC or MD5.

I can't seem to allocate you any points. Perhaps I've been too generous already.

Kind regards

Steve

anupam_ghosh2
Active Contributor
0 Kudos

Hi Steve,

Say your file "GTM.txt" has content like this in its first or last line:

Header,20110808,"xyz",PN00223,10000

You need to choose a value in a row, preferably the first or the last row, which is unique for each file: say the value "PN00223". The sender should also send this value as part of the filename, so that the file name becomes "GTM.PN00223.txt". Whenever a file is received, your validation code in a message mapping UDF or Java mapping needs to check that the first/last line of the file carries the value that is also part of the filename. If there is a match you can process the file further; otherwise reject it (you can send rejection emails to the business).
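For illustration, the check inside the mapping could look roughly like this (a sketch only; it assumes the GTM.<key>.txt naming pattern and the comma-separated first line from the example above):

public class FilenameFieldCheck {

    // Return true if the unique value embedded in the filename
    // (e.g. "GTM.PN00223.txt" -> "PN00223") also appears as a field
    // in the first line, e.g. Header,20110808,"xyz",PN00223,10000
    public static boolean isConsistent(String fileName, String firstLine) {
        String[] parts = fileName.split("\\.");
        if (parts.length < 3) {
            return false; // name does not follow the GTM.<key>.txt pattern
        }
        String keyFromName = parts[1]; // e.g. "PN00223"
        for (String field : firstLine.split(",")) {
            if (field.trim().replace("\"", "").equals(keyFromName)) {
                return true;
            }
        }
        return false;
    }
}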

If you follow this process, then simply renaming the file won't get it processed by the PI server. This method on its own won't prevent posting of exact duplicate files, though, so you need to follow the link provided by Baskar in addition to this method.

If no unique field value is present in the file, you can add a sequence number field to the first line and increment it with each file. The sequence number must be alphanumeric so that its length does not grow too much over time.

regards

Anupam

Former Member
0 Kudos

Hi,

A few days ago I did such a search, and it seems that on SDN most of the controls are based only on filenames, not on file content.

Do not forget that with ASMA you can also store/retrieve the source file size, in addition to the source filename.

For a control on the file content, there are mainly these techniques:

1. Store the object keys (like payment number, bank account) in a Z-table, but be sure to capture all the relevant keys which distinguish two sendings (see the sketch after this list).

2. Store the whole file content itself inside a Z-table.

3. Store a part of the file content inside a Z-table. Indeed, depending on your data volumes and file sizes, a solution may be to store only the first 100 lines, or the first 64 kB (for example).

4. Another solution, which I suggested (with some reserves...), is to limit access to the FTP or NFS server to the admin team only, and not to business employees. That is, change the business process!
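For technique 1, the check could look roughly like this (a sketch only; it assumes a JDBC connection to the database and a hypothetical Z-table ZDUP_KEYS with a single KEY_VALUE column):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ObjectKeyCheck {

    // Returns true if the key was newly recorded, false if it was
    // already present, i.e. the file is a duplicate.
    public static boolean recordIfNew(Connection con, String objectKey)
            throws SQLException {
        PreparedStatement sel = con.prepareStatement(
                "SELECT 1 FROM ZDUP_KEYS WHERE KEY_VALUE = ?");
        sel.setString(1, objectKey);
        ResultSet rs = sel.executeQuery();
        boolean exists = rs.next();
        rs.close();
        sel.close();
        if (exists) {
            return false;
        }
        PreparedStatement ins = con.prepareStatement(
                "INSERT INTO ZDUP_KEYS (KEY_VALUE) VALUES (?)");
        ins.setString(1, objectKey);
        ins.executeUpdate();
        ins.close();
        return true;
    }
}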

Personally, for the moment, we have not yet taken a decision on which is the best way, for our current need (mainly one) but also perhaps for the future (a reusable method/process).

Mickael

baskar_gopalakrishnan2
Active Contributor
0 Kudos

Try this wiki page for a solution:

http://wiki.sdn.sap.com/wiki/display/XI/Different+ways+to+keep+your+Interface+from+processing+duplicate+files

Former Member
0 Kudos

Hi Steve,

if you think of a file as a transaction, each file should have a transaction ID. It can be any combination of fields in the file or a dedicated ID which is unique for the content in the file. Then you can log the ID, implement a check and reject the file if the transaction has been processed already.
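A minimal sketch of such an ID, assuming the relevant fields have already been extracted from the file (the field names are only examples):

public class TransactionId {

    // Build a transaction ID from a combination of fields that together
    // uniquely identify the file's content.
    public static String of(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (String f : fields) {
            sb.append(f).append('|');
        }
        return Integer.toHexString(sb.toString().hashCode());
    }
}

For example, TransactionId.of(paymentNumber, bankAccount, valueDate). Whether a plain hashCode is collision-safe enough depends on your volumes; a stronger digest such as MD5 could be substituted.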

Regards, Martin