bods script to delete duplicate input files

former_member186160
Contributor
0 Kudos

hi all,

I have a set of CSV files that is sent via FTP to a folder daily.

I need to check whether duplicate files are present, delete one of the copies, and then read the remaining files.

I have not been able to find a script that identifies the duplicate files.

Can anyone please suggest one?

Accepted Solutions (1)

severin_thelen
Contributor
0 Kudos

Hello Swetha,

I am a little bit confused; maybe I do not understand your problem correctly.

So you have a folder and load files to this folder every day.

  1. Is it possible to have 2 files with the same name in one folder? In my opinion that's not possible.
  2. If you have 2 files with the same name, which file do you want to delete? Would both files have the same content?

Regards

Severin

former_member186160
Contributor
0 Kudos

hi,

Yes, it is possible to have 2 files with the same properties (name, size, last modified date, etc.) in the same folder if the file names differ slightly in case.

e.g. A.csv and a.csv

Any one of the copies can be deleted.

This folder is the source folder from which my BODS job reads the data, which is then loaded into a table in a SQL database.

severin_thelen
Contributor
0 Kudos

Then you have to use the EXEC command, because it is the only way to communicate with the OS.

With the EXEC command you can run an OS command. An example for Windows: exec('cmd','del C:\myfile.txt')

Because you will need several commands to detect the duplicate files and delete them, I would create a batch script and execute this script from DS with the EXEC command. I know that this is possible, but I am not a shell expert, so I cannot tell you how.

Another way is to avoid duplicate files altogether. I do not know how the files are created, but if a system exports its data to this folder, you could delete all files in the folder before loading.

former_member211387
Contributor
0 Kudos

Hi,

Which operating system are you using this on? I don't think it is feasible on Windows to have the same file name in different cases, like A.csv and a.csv as you mentioned in the example.

kind regards

Raghu

former_member186160
Contributor
0 Kudos

Hi, our job server runs on Linux. I just tried copying two files, A.csv and a.csv, and both were copied successfully.

severin_thelen
Contributor
0 Kudos

But then, that is not the same filename. Linux is case-sensitive, Windows is not case-sensitive.

Therefore you cannot delete a random file, because DS needs the same file every time (either the file with the "a" or the one with the "A"). Because of this, you can never have 2 files with the same name and will not have a problem.

In addition, you should make sure the source files are always created with exactly the same name, to avoid additional problems.

former_member211387
Contributor
0 Kudos

Hi Swetha,

In the case of Linux, you can delete a specific file as long as you can establish which file you want to delete. To do this:

Let's consider, for example, that there are six files: a.csv, A.csv, ab.csv, Ab.csv, aB.csv, AB.csv. You only need a.csv and ab.csv.

1. Read in all the file names into BODS.

2. Keep one column with the actual file name (call this col1) and another column where the file names are converted to either lower case or upper case (call this col2).

3. Sort the file names by col2 and generate a row number within each col2 group (call this col3) using the gen_row_num_by_group function, then load this data into a table. The table should have columns such as OriginalFileName (col1), the case-normalized file name (col2), and NumberByGroup (col3).

Once you have this data in a table, you can use a script that calls the EXEC function to run the command-line statement that deletes the files (OriginalFileName) where NumberByGroup is greater than 1.

You will then end up with just a.csv and ab.csv out of the six files that were originally in the file share. It can even be A.csv and AB.csv. It doesn't really matter which case combination survives if the data for a given file name is the same.
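To illustrate the numbering logic outside of BODS, here is a minimal Python sketch of steps 2 and 3. It is not BODS script, and the function names `number_by_group` and `files_to_delete` are made up for this example. It assumes that files whose names differ only in case hold the same data, and it prefers to keep the all-lowercase name, which is just one possible choice.

```python
from itertools import groupby


def number_by_group(file_names):
    """Return rows of (original_name, normalized_name, number_by_group)."""
    # Sort by the lowercased name so equal names are adjacent. Within a
    # group, an all-lowercase name sorts first (False < True), then the
    # remaining names in plain string order, so the result is deterministic.
    ordered = sorted(file_names, key=lambda n: (n.lower(), n != n.lower(), n))
    rows = []
    for normalized, group in groupby(ordered, key=str.lower):
        for number, original in enumerate(group, start=1):
            rows.append((original, normalized, number))
    return rows


def files_to_delete(file_names):
    """Return the original names whose number within the group is above 1."""
    return [original
            for original, _, number in number_by_group(file_names)
            if number > 1]


names = ["a.csv", "A.csv", "ab.csv", "Ab.csv", "aB.csv", "AB.csv"]
for row in number_by_group(names):
    print(row)
print("delete:", files_to_delete(names))
```

With the six example names, the rows numbered 1 are a.csv and ab.csv, and the other four names are returned for deletion.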

The assumption is that AB.csv, ab.csv, Ab.csv and aB.csv all contain the same data but stored under different file names and the same for a.csv and A.csv. If however you have different datasets, then I would suggest you import all the files into the database with the DI_FILENAME column enabled so that you can isolate the data based on the file name at a later stage of the job.

kind regards

Raghu

Former Member
0 Kudos

I must admit that if the data held in the files is different, then deleting them will result in data loss. However, the approach should work if the contents of the files with the same name but different case (a.csv or A.csv) are the same.

I think it is best to look for something like a checksum, if one is supplied with the file, to be sure that the data is the same and so confirm that the file can be deleted.
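If no checksum is supplied with the files, one option is to compute one yourself before deleting anything. The following is only a sketch in Python, not something from BODS: the helper names `file_digest` and `same_content` are hypothetical, and SHA-256 is an arbitrary choice of hash.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True if both files have the same SHA-256 digest."""
    return file_digest(path_a) == file_digest(path_b)
```

A file would then only be deleted when `same_content` returns True for it and the copy you keep; otherwise both files should be loaded or checked manually.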

former_member186160
Contributor
0 Kudos

Thanks. This is the best approach if we don't need to consider checksums etc. and only need to delete based on the file names.

thank you.

Answers (2)

former_member186160
Contributor
0 Kudos

thanks all for your suggestions.

i will try out all the approaches and post the results shortly.

vnovozhilov
Employee
0 Kudos

This can easily be achieved at the OS level, e.g.: bash - How to find duplicate files with same name but in different case that exist in same directory...
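For comparison, an OS-level check could also be written as a small standalone script. This is a Python sketch rather than the bash approach from the linked answer; `find_case_duplicates` and `remove_case_duplicates` are hypothetical names, and which copy is kept (the first in sorted order) is an arbitrary choice. It defaults to a dry run so nothing is deleted by accident.

```python
import os
from collections import defaultdict


def find_case_duplicates(directory):
    """Map each lowercased name to the files sharing it (groups of 2+)."""
    groups = defaultdict(list)
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                groups[entry.name.lower()].append(entry.name)
    return {key: sorted(names)
            for key, names in groups.items() if len(names) > 1}


def remove_case_duplicates(directory, dry_run=True):
    """Keep the first name of each group and remove (or just list) the rest."""
    removed = []
    for names in find_case_duplicates(directory).values():
        for name in names[1:]:
            if not dry_run:
                os.remove(os.path.join(directory, name))
            removed.append(name)
    return sorted(removed)
```

Calling `remove_case_duplicates(folder)` only reports what would be removed; passing `dry_run=False` actually deletes the files. A script like this could be run before the job starts, or from DS through the EXEC function mentioned earlier in the thread.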

Do you specifically need to implement this in SAP Data Services?

Thank you,

Viacheslav.