BODS script to delete duplicate input files
I have a list of CSV files being sent via FTP daily to a folder.
I need to check whether duplicate files exist, delete one of each pair, and read the remaining files.
I am not able to find a script that identifies the duplicate files.
Can anyone please suggest an approach?
Raghunathan Balasubramanian replied
On Linux, you can delete a specific file as long as you can establish which file you want to delete. To do this:
For example, suppose there are six files: a.csv, A.csv, ab.csv, Ab.csv, aB.csv and AB.csv, and you only need a.csv and ab.csv.
1. Read in all the file names into BODS.
2. Keep one column with the actual file name (call this Col1) and another column where the file name is converted to either lower case or upper case (call this Col2).
3. Sort the file names by Col2, generate a row number per group of Col2 using the gen_row_num_by_group function, and load this data into a table. The table should have a structure like the one below:

OriginalFileName (Col1) | LowerCaseFileName (Col2) | NumberByGroup
a.csv                   | a.csv                    | 1
A.csv                   | a.csv                    | 2
ab.csv                  | ab.csv                   | 1
Ab.csv                  | ab.csv                   | 2
aB.csv                  | ab.csv                   | 3
AB.csv                  | ab.csv                   | 4
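Outside BODS, the lowercase-grouping and row-numbering in the steps above can be sketched in Python (a rough stand-in for gen_row_num_by_group; the file names are just the example ones):

```python
from itertools import groupby

# The six example file names from the scenario above.
files = ["a.csv", "A.csv", "ab.csv", "Ab.csv", "aB.csv", "AB.csv"]

# Sort case-insensitively, then number each row within its
# lowercase group -- the same effect as gen_row_num_by_group.
rows = []
for key, group in groupby(sorted(files, key=str.lower), key=str.lower):
    for i, name in enumerate(group, start=1):
        # (OriginalFileName, LowerCaseFileName, NumberByGroup)
        rows.append((name, key, i))

for row in rows:
    print(row)
```

Rows with NumberByGroup equal to 1 are the copies you keep; everything else is a case-variant duplicate.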
Once you have this data in a table, you can use a script that calls the EXEC function to run a command-line statement deleting each file (OriginalFileName) whose NumberByGroup is greater than 1.
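As an illustration of the deletion step, here is the same logic in plain Python rather than a BODS EXEC call (the folder path and helper names are hypothetical):

```python
import os
from itertools import groupby

def duplicates_to_delete(file_names):
    """Return the case-variant duplicates, keeping the first name
    (in case-insensitive sorted order) of each lowercase group."""
    doomed = []
    for _, group in groupby(sorted(file_names, key=str.lower), key=str.lower):
        for i, name in enumerate(group, start=1):
            if i > 1:  # NumberByGroup > 1: a duplicate
                doomed.append(name)
    return doomed

def dedupe_folder(folder):
    """Delete every case-variant duplicate file in the folder."""
    for name in duplicates_to_delete(os.listdir(folder)):
        os.remove(os.path.join(folder, name))
```

On a case-sensitive filesystem, `dedupe_folder("/path/to/ftp/drop")` would leave one file per lowercase name, matching the outcome described below.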
You will then end up with just a.csv and ab.csv out of the six files originally in the fileshare (it could equally be A.csv and AB.csv). Which case combination survives does not really matter, as long as the data is the same for a given file name.
The assumption is that AB.csv, ab.csv, Ab.csv and aB.csv all contain the same data stored under different file names, and likewise for a.csv and A.csv. If, however, the files contain different datasets, then I would suggest importing all of them into the database with the DI_FILENAME column enabled, so that you can isolate the data by file name at a later stage of the job.
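If you take the DI_FILENAME route instead, the effect is roughly the following (a Python sketch, not the BODS file-format option itself; the folder path is hypothetical and the CSVs are assumed to share a layout):

```python
import csv
import glob
import os

def load_with_filename(folder):
    """Read every CSV in the folder, tagging each record with its
    source file name -- the same idea as enabling DI_FILENAME in a
    BODS file format, so rows can be isolated per file later."""
    rows = []
    for path in glob.glob(os.path.join(folder, "*.csv")):
        with open(path, newline="") as f:
            for record in csv.reader(f):
                rows.append({"DI_FILENAME": os.path.basename(path),
                             "data": record})
    return rows
```

With the file name carried on every row, a downstream query can still separate (or deduplicate) the data per source file even after all files have been loaded into one table.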