
Data cleansing and de-duplication ideas

Former Member
0 Kudos

Hello:

I've been working with MDM, and clients usually find the Import Manager's de-duplication features useful when matching on a specific field. The bigger problem is how to find the duplicate records that are not so evident, such as addresses, names, etc.

I usually suggest ideas such as sorting the data, keyword free-text search, etc. However, I was wondering if anyone has other ideas for data de-duplication in, let's say, a 10,000,000-record scenario, where sorting and searching do not seem that appealing.
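To give an idea of the kind of approach I mean: a common generic technique for volumes where full pairwise comparison is impossible is "blocking" — derive a coarse key per record and only compare records that share a key. A rough Python sketch with made-up records (nothing here is an MDM API call):

```python
from itertools import combinations

# Made-up sample records; in a real scenario these come from the repository.
records = [
    {"id": 1, "name": "Acme Corp", "city": "Boston"},
    {"id": 2, "name": "ACME Corporation", "city": "Boston"},
    {"id": 3, "name": "Globex Inc", "city": "Springfield"},
]

def block_key(rec):
    # Crude blocking key: first 4 alphanumeric chars of the name, plus city.
    name = "".join(ch for ch in rec["name"].lower() if ch.isalnum())
    return (name[:4], rec["city"].lower())

# Bucket records by key; only pairs inside a bucket get compared in detail.
blocks = {}
for rec in records:
    blocks.setdefault(block_key(rec), []).append(rec)

candidate_pairs = [
    (a["id"], b["id"])
    for group in blocks.values()
    for a, b in combinations(group, 2)
]
print(candidate_pairs)  # [(1, 2)] -- only the two "Acme" records share a block
```

With 10,000,000 records this cuts the comparison count from trillions of pairs down to pairs within each block, at the cost of missing duplicates whose keys happen to differ.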

Thanks

Alejandro

Accepted Solutions (0)

Answers (4)


lawrencegray
Discoverer
0 Kudos

Hi

How do the matching capabilities of MDM 5.5 SP4 compare to a dedicated third-party dedup tool such as Trillium?

Do features such as fuzzy matching or phonetic matching, found in some of the third-party tools, also exist in native MDM matching?

When would you need to consider one over the other?

Thanks

Lawrence

Former Member
0 Kudos

At the risk of delivering a shameless plug, there are third-party certified connectors for data quality and deduplication. My company, Trillium Software, makes one of them.

The tools can be used as an interim step when migrating data, to cleanse data in place in SAP applications, or in real time as users enter name and address data.

http://www.trilliumsoftware.com/site/content/products/sap-data-migrations.asp

Former Member
0 Kudos

Hi Steve,

I am working on CRM 5.0. Does Trillium integrate with CRM 5.0? If not, what are the plans for integration?

I think Trillium can also be used for address validation in addition to checking data for duplicates, right?

thanks,

LSP

Mark63
Product and Topic Expert
Product and Topic Expert
0 Kudos

Support Package 04 for SAP NetWeaver MDM (scheduled for August 2006) will offer enhanced de-duplication, matching and merging capabilities in the standard delivery scope.

Markus

Former Member
0 Kudos

Hi,

Also check Athanor, very good tool for data profiling and matching/merging.

Regards,

Dirk

Former Member
0 Kudos

I should check back more often. Yes, Trillium does offer CRM 5.0 integration.

Most companies use it (metaphorically) to clean the pond first, then the river: a batch process can clean source systems during instance consolidation or legacy migration to CRM 5, and the real-time integration then keeps the rivers of data clean.

Trillium has address validation/standardization AND fuzzy matching, and both happen in sub-second time. In a real-time environment we've made it highly scalable, so if you have a big call center with many concurrent transactions, you can add servers to keep it fast.
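For readers unfamiliar with the terms: fuzzy matching scores how similar two strings are, while phonetic matching codes names by how they sound. A toy standard-library Python illustration of both (this simplified Soundex is my own sketch for explanation, not anything like an actual product's engine):

```python
import difflib

def soundex(word):
    # Toy Soundex: first letter + up to three digits. Real implementations
    # handle more edge cases (H/W rules, coding of the first letter, etc.).
    codes = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3",
             "l": "4", "mn": "5", "r": "6"}
    word = word.upper()
    digits = []
    for ch in word[1:]:
        for letters, code in codes.items():
            if ch.lower() in letters:
                if not digits or digits[-1] != code:
                    digits.append(code)
                break
        else:
            digits.append("")  # vowels and other letters break repeated codes
    result = word[0] + "".join(d for d in digits if d)
    return (result + "000")[:4]

def fuzzy_ratio(a, b):
    # Similarity in [0, 1] based on longest matching subsequences.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(soundex("Robert"), soundex("Rupert"))  # R163 R163 -- phonetic match
print(fuzzy_ratio("Smith & Sons", "Smith and Sons") > 0.8)  # True
```

Phonetic codes catch sound-alike spellings that character-level similarity misses, and vice versa, which is why matching engines usually combine both signals.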

Hope that helps.

Former Member
0 Kudos

Hey Guys

We exported from R/3 into 15 to 20 XML files. Can we still consolidate on import, seeing as we will import the files separately and the consolidation needs to happen across the data in all of the files?

C

Former Member
0 Kudos

Hello,

We developed a matching strategy using multiple iterations of the Import Manager. We have successfully found duplicates and have even created a small program that automatically merges data with a match rate over a certain threshold.

The system works by normalizing the data while importing it into MDM, and then also tokenizing fields where required.

We run this as a batch process and create groups of similar items.
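To give a rough feel for what I mean by normalizing and scoring (this is a simplified illustration with an assumed stopword list and a Jaccard-style match rate, not our actual program):

```python
import re

# Assumed noise-word list for illustration; a real list is domain-specific.
STOPWORDS = {"inc", "corp", "corporation", "ltd", "gmbh", "co", "and", "the"}

def normalize(name):
    # Lowercase, keep alphanumeric tokens, drop legal-form noise words.
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return [t for t in tokens if t not in STOPWORDS]

def match_rate(a, b):
    # Jaccard overlap of the two token sets, as a percentage.
    sa, sb = set(normalize(a)), set(normalize(b))
    return 100 * len(sa & sb) / len(sa | sb) if sa | sb else 0.0

print(match_rate("Smith & Sons Ltd.", "SMITH AND SONS"))      # 100.0
print(match_rate("Smith & Sons Ltd.", "Jones Trading GmbH"))  # 0.0
```

Records whose pairwise rate clears a threshold land in the same group, and groups above a higher threshold are merged automatically.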

Contact me for more information if required.

Stanley Levin.

Former Member
0 Kudos

Hello, I'd like to hear more about your normalization and implementation approach. Did you actually reach a high level of automation/reliability in your data cleansing?

Thanks

Alejandro

Former Member
0 Kudos

Hi Stanley,

We tried a similar approach, using about 10 different import maps (steps) which run one after the other, but in the end we did not get a proper solution.

So can you give more detailed information here, or would you prefer I contact you directly?

Regards

Nico

Former Member
0 Kudos

Hi Nicolas and Alejandro,

We are currently completing and documenting the process. I would like to delay for about two weeks and then do a complete session where I can show you the process, along with some documentation, so that you can run it on your own data.

I will update you as soon as we are ready.

Regards,

Stanley.

Former Member
0 Kudos

That would be great. Thanks, Stanley. Please post your progress in this thread; I'll be watching it.

Former Member
0 Kudos

Thanks a lot Stanley! We'll be looking forward to it!

Alejandro

former_member192347
Participant
0 Kudos

Hi Stanley,

I am new to MDM. We are also looking for a process to identify the duplicates and determine the data quality.

Would you share the process that you have developed and your experiences with data cleansing? How can I contact you?

My email is abhay_mhatre@colpal.com

Thanks and Regards,

Abhay

Former Member
0 Kudos

Hello Abhay:

Even though Stanley hasn't been able to reply yet, an SAP methodology for this is posted in this forum: each field is given a weight and the records are then sorted, in order to produce duplicate candidates and consolidated data. Please check out this thread:

Greetings

Alejandro

former_member192347
Participant
0 Kudos

Hi Alejandro,

Thanks for your response. When I click on the link you sent, it opens a new browser page with "Forum: SDN Suggestions". Maybe I am not navigating right. Could you please send the correct link, or the topic name in the SDN Suggestions forum?

Regards,

Abhay

Former Member
0 Kudos

I'm sorry, yes, it's my fault. The real link is:

The thread is called: MDM 5.5 SP03 - Updated documentation, by Markus Ganser.

Regards

Alejandro

Former Member
0 Kudos

Hi Alejandro,

I guess the best way to find duplicates is to work with a strategy that calculates scores for the similarity of records.

You could then define an upper threshold, so that the system automatically merges records with a score above it. If the score is below a defined lower threshold, the compared records are not duplicates. For scores between the two thresholds, user interaction is necessary to decide, which of course can be a lot of work.
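Spelled out as code, the decision logic is nothing more than this (the threshold values 60 and 90 are arbitrary placeholders, not values from any delivered strategy):

```python
def decide(score, lower=60.0, upper=90.0):
    # Route a candidate pair based on its similarity score (0-100).
    if score >= upper:
        return "auto-merge"
    if score < lower:
        return "not a duplicate"
    return "manual review"

for s in (95, 75, 40):
    print(s, decide(s))
# 95 auto-merge / 75 manual review / 40 not a duplicate
```

Tuning the two thresholds trades off automation against the size of the manual-review queue: widening the gap between them means fewer wrong auto-merges but more pairs for users to inspect.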

The matching strategies that were delivered for MDM 3.00 give a good example on how such a strategy could work.

There is some excellent documentation available for these matching strategies in the Service Marketplace:

https://websmp204.sap-ag.de/instguides -> SAP NetWeaver -> Release 04 -> Operations -> Component SAP MDM -> MDM 3.00 - Operations Guides

Br

Lars

Former Member
0 Kudos

Hi Lars,

We implemented exactly this scenario with MDM 3.0.

Now the customer wants the same functionality in MDM5.5

Since there is no Content Integrator to define any duplication logic, and since the de-duplication part promised for SP03 was not delivered, we have to find a workaround to mirror the old functionality.

Do you have any advice on how to define a threshold with the MDM 5.5 standard?

Regards

Nico