on 01-13-2015 7:09 AM
Hi all,
I have a list of CSV files being sent via FTP to a folder daily.
I need to check whether there are duplicate files, delete one of the copies, and then read the remaining files.
I have not been able to find a script to identify the duplicate files.
Can anyone please suggest an approach?
Hello Swetha,
I am a little confused; maybe I do not understand your problem correctly.
So you have a folder, and you load files into this folder every day. How can that folder contain duplicate files?
Regards
Severin
Hi,
Yes, it is possible to have two files with the same properties (name, size, last-modified date, etc.) in the same folder if there is a slight case difference in the file names,
e.g. A.csv and a.csv.
Either one of the copies can be deleted.
This folder is the source folder for my BODS job, which reads the data and loads it into a table in a SQL database.
Then you have to use the EXEC function, because it is the only way to communicate with the OS.
With EXEC you can run an OS command. An example for Windows: exec('cmd', 'del C:\myfile.txt')
Because you will need several commands to detect the duplicated files and delete them, I would create a batch script and execute that script from DS with the EXEC function. I know that this is possible, but I am not a shell expert, so I cannot tell you how.
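A minimal sketch of what such a script could look like on Linux (the function name `dedupe_dir` and the paths below are made up for the example, and a case-sensitive filesystem is assumed): keep one copy per case-insensitive name and delete the rest.

```shell
#!/bin/sh
# Hypothetical cleanup helper: for every group of files in a folder whose
# names differ only by case, keep the first one (in case-insensitive sort
# order) and delete the rest. Assumes a case-sensitive filesystem (Linux).
dedupe_dir() {
    dir="$1"
    prev_lower=""
    # A case-insensitive sort puts case-variants of the same name next to
    # each other, so a duplicate is any name whose lower-cased form equals
    # the previous one.
    ls "$dir" | sort -f | while IFS= read -r f; do
        lower=$(printf '%s' "$f" | tr '[:upper:]' '[:lower:]')
        if [ "$lower" = "$prev_lower" ]; then
            rm -- "$dir/$f"        # case-duplicate: delete this copy
        fi
        prev_lower="$lower"
    done
}
```

From Data Services this could then be called with something like exec('sh', '/path/to/dedupe_csv.sh /data/inbox', 8), where the script path and folder are placeholders for your environment.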
Another way is to avoid the duplicate files in the first place. I do not know how the files are created, but if you have a system that exports its data, you could delete all files in the folder before loading.
But then, strictly speaking, those are not the same file name: Linux is case-sensitive, Windows is not.
Therefore you cannot delete a random one of the two, because DS needs the same file name every time (so either the file with "a" or the one with "A").
In addition, you should make sure the source files are always created with exactly the same name; then you can never have two case-variants of the same file and will avoid these problems.
Hi Swetha,
On Linux you can delete a specific file, as long as you can establish which file you want to delete. To do this:
Let's consider, for example, six files: a.csv, A.csv, ab.csv, Ab.csv, aB.csv and AB.csv. You only need a.csv and ab.csv.
1. Read in all the file names into BODS.
2. Keep one column with the actual file name (call this col1) and another column where the file names are set to either lower case or upper case (call this col2).
3. Sort the file names by col2 and generate a row number within each col2 group (call this col3) using the gen_row_num_by_group function, then load this data into a table. The table should have a structure like: OriginalFileName (col1), LowerCaseFileName (col2), NumberByGroup (col3).
Once you have this data in a table, you can use a script with the EXEC function to run a command-line statement that deletes the files (OriginalFileName) where NumberByGroup > 1.
You will then end up with just a.csv and ab.csv out of the six files that were originally in the file share. It could just as well be A.csv and AB.csv; it does not really matter which case combination survives, as long as a given file name holds the same data regardless of case.
The assumption is that AB.csv, ab.csv, Ab.csv and aB.csv all contain the same data stored under different file names, and likewise for a.csv and A.csv. If the datasets differ, however, I would suggest importing all the files into the database with the DI_FILENAME column enabled, so that you can isolate the data by file name at a later stage of the job.
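Outside of BODS, the same col2 / gen_row_num_by_group logic can be sketched in a few lines of shell (the function name `list_case_dupes` is made up for the example; awk's per-key counter plays the role of NumberByGroup, and a case-sensitive filesystem is assumed):

```shell
#!/bin/sh
# Sketch of the steps above at the OS level: fold each name to lower case
# (col2) and count occurrences per folded name (NumberByGroup). The names
# printed are the rows where NumberByGroup > 1, i.e. the delete candidates.
list_case_dupes() {
    ls "$1" | sort -f | awk '{
        key = tolower($0)          # col2: file name folded to lower case
        n[key]++                   # row number within the col2 group
        if (n[key] > 1) print      # OriginalFileName with NumberByGroup > 1
    }'
}
```

For the six-file example above, this would print four names: one case-variant of a.csv and three case-variants of ab.csv.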
kind regards
Raghu
I must admit that if the data held in the files differs, then deleting them will result in data loss. The approach only works if the contents of files with the same name in different case (a.csv and A.csv) are the same.
I think it is best to check something like a checksum, if one is supplied with the file, to be sure the data is the same before confirming the delete.
Thanks all for your suggestions.
I will try out the approaches and post the results shortly.
This can easily be achieved at the OS level, e.g.: bash - How to find duplicate files with same name but in different case that exist in same directory...
Do you specifically need to implement this in SAP Data Services?
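For reference, with GNU coreutils the detection part is a one-liner (the function name `show_case_dupes` is made up; `-i` makes uniq ignore case and `-D` prints every member of a duplicate group, which requires GNU uniq and a case-sensitive filesystem):

```shell
#!/bin/sh
# List every file name in a directory that collides with another name
# once case is ignored (GNU uniq is needed for -D).
show_case_dupes() {
    ls "$1" | sort -f | uniq -iD
}
```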
Thank you,
Viacheslav.