Solved: CU&UC vocabulary maintenance

Former Member · ‎11-04-2013

Hello Experts,

We are upgrading our system from 4.6C to ECC EHP5. It is an MDMP system with 6 European languages and 4 Asian languages.

We still have around 30,000 Asian words to which a language has to be assigned. In the vocabulary's Assigned Language tab there is an "Unknown" option, which assigns the language as "?".

1 - What is the purpose and the impact of the Unknown option?

2 - I would like to leave those 30,000 Asian characters blank and deal with them later (only for the quality system operations). Could you confirm that I will be able to assign those 30,000 characters during manual repair after all operations?

Thank you

Mayuresh

nils_buerckel · ‎11-05-2013

Hi Mayuresh,

1. The Unknown option is defined as follows:

The reprocess scan treats "?" as if no language/code page has been assigned.

Customers can, for example, explicitly set this option for some short words (one-byte and/or two-byte words) if they know that these words are ambiguous (used in at least two different code pages in one table) and hence cannot be assigned consistently. The "standard select: language = <EMPTY>" used for assigning the rest of the vocabulary then does not show those entries.

2. In theory you should be able to assign those entries. But there is a very high risk that you end up with a lot of entries (many more than 30000) in SUMG and hence do not have the time to repair them in downtime. SUMG is not designed to handle mass data. Therefore the number of entries in SUMG should be minimized as much as possible. A strategy to leave the vocabulary / reprocess log maintenance empty (or maintain only a small part) and repair the data after the conversion will be at least problematic or might not work at all (e.g. too many XML files).

Hence, before relying on SUMG, you should exhaust all capabilities of the vocabulary (first choice) and of the reprocessing assignment (second choice).

Best regards,

Nils Buerckel

Former Member · ‎11-05-2013

Hello Nils,

Thank you again for your valuable reply.

In your point 1 you said "if they know that these words are ambiguous (used in at least two different code pages in one table) and hence cannot be assigned consistently." This usually happens for vocabulary collision words. So should we leave the language blank for the words for which a "Vocabulary collision" has occurred?

Secondly, we are not able to set language = <EMPTY> again once we have assigned "Unknown" (?) to a particular word.

Is there any solution for this?

Thank you

Mayuresh

nils_buerckel · ‎11-05-2013

Hi Mayuresh,

From a functional view in SPUMG there is basically no difference between leaving the language empty and assigning a "?" to it. The difference matters for the native speakers who are maintaining the vocabulary. If you assign a "?", everyone knows that this is an ambiguous word and no one will change it. If you leave it empty and there are still native speakers working on the vocabulary, there is the risk that they assign it to their language (because they probably do not know about the ambiguity).

Hence in your case (if you do not want to assign the 30000 words, which I do NOT recommend) it does not make much difference ...

Secondly: What I meant here was the fact that native speakers usually work on the vocabulary with the selection mode "language = <EMPTY>" - and in this case in fact they do not see the words with "?" (see above) - which is the desired behavior.
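As a rough illustration of the behavior described above, here is a minimal Python sketch. It is not SAP code and does not call any SAP API: `VocabEntry`, `select_empty` and `needs_reprocess_assignment` are hypothetical names, and the sample words are made up. It only models the two rules stated in this thread: the "language = <EMPTY>" selection hides words marked "?", while the reprocess scan treats "?" as if nothing were assigned.

```python
from dataclasses import dataclass

EMPTY = ""      # no language assigned yet
UNKNOWN = "?"   # explicitly marked as ambiguous ("Unknown" option)


@dataclass
class VocabEntry:
    """Hypothetical stand-in for one vocabulary word and its language."""
    word: str
    language: str = EMPTY


def select_empty(vocab):
    """Model of the 'language = <EMPTY>' selection: only unassigned words."""
    return [e for e in vocab if e.language == EMPTY]


def needs_reprocess_assignment(entry):
    """Per the thread, the reprocess scan treats '?' like no assignment."""
    return entry.language in (EMPTY, UNKNOWN)


vocab = [
    VocabEntry("abc", "DE"),      # already assigned
    VocabEntry("x1"),             # still unassigned
    VocabEntry("zz", UNKNOWN),    # flagged as ambiguous
]

# What a native speaker sees in the standard worklist.
worklist = [e.word for e in select_empty(vocab)]

# What still counts as unresolved for the reprocess scan.
unresolved = [e.word for e in vocab if needs_reprocess_assignment(e)]

print(worklist)
print(unresolved)
```

In this toy model the "?" word drops out of the worklist but stays in the unresolved set, which is why the marker protects ambiguous words from accidental assignment without resolving them.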

Best regards,

Nils Buerckel

Former Member · ‎11-05-2013

Hello Nils,

I understood your point.

We are in the process of assigning all 30000 words.

But my question is: we have around 578 words with a "vocabulary collision".

Should we mark them as "?" or should we assign one of the active languages?

What is the best practice for dealing with a "vocabulary collision" as well as a "conversion collision"?

Thank you

Mayuresh

nils_buerckel · ‎11-06-2013

Hi Mayuresh,

1) Vocabulary collision

Vocabulary collisions are described in the Unicode conversion guide (UCG). For the specific word in the vocabulary, which has the collision flag, the correct language can be assigned manually. This however does not resolve the problem regarding incorrect language keys in the language dependent data. This can be fixed only via manual effort either before the conversion (e.g. via transcription of the data – please note that this might lead to inconsistencies, if dependent data is not changed) or after it via SUMG (this would be the recommended way).

Hence for the vocabulary collision you need to assign the language in the vocabulary (as for most of the other words).

2) Conversion collision

This is another name for the issue described before:

Those words are ambiguous (used in at least two different code pages in one table) and hence cannot be assigned in the vocabulary consistently. In that case I would recommend to use the "?" and assign the correct language in the reprocessing log (still before the conversion).

Best regards,

Nils Buerckel

Former Member · ‎11-06-2013

Hello Nils,

OK, it is clear to me now. We will not assign any language to the vocabulary collision words; we will keep the language <EMPTY> for these 578 words. As I understand it, we can deal with them later in SUMG manual repair.

Also for conversion collision we will try to assign languages in Reprocess logs.

We still have 19000 words which have not been recognized by any of the native speakers.

And these words seem to be junk characters in the system. We simply assigned EN to these 19000 words. Is that fine? Or should we leave these words also as <Empty> language.

Thanks once again for your help

Mayuresh

nils_buerckel · ‎11-07-2013

Hi Mayuresh,

my proposal actually was to manually assign the words, which do have a vocabulary collision ...

Regarding the 19000 words: I would leave them empty and go on with the reprocessing scan. After that, you should be able to analyze the tables which contribute most to this problem. Looking directly at the table content (you can see the key fields in the logs of this scan), it is usually easier to find out whether they are junk characters. In some cases it might make sense to maintain the vocabulary and reset (and rerun) some of the tables in the reprocessing scan.
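To make the "which tables contribute most" step concrete, here is a minimal Python sketch. It assumes you have exported the log entries of the reprocessing scan yourself into simple rows; the row layout `(table, key, word)`, the table names and the function `tables_by_unassigned` are all hypothetical and not part of SPUMG.

```python
from collections import Counter

# Hypothetical export: (table name, record key, word without a language).
log_rows = [
    ("ZTAB_A", "0001", "x1"),
    ("ZTAB_A", "0002", "x2"),
    ("ZTAB_B", "0001", "x1"),
    ("ZTAB_A", "0003", "x3"),
]


def tables_by_unassigned(rows):
    """Rank tables by how many unresolved log entries they contribute."""
    counts = Counter(table for table, _key, _word in rows)
    return counts.most_common()  # highest count first


ranking = tables_by_unassigned(log_rows)

for table, count in ranking:
    print(table, count)
```

The tables at the top of such a ranking are the ones worth checking directly, using the key fields from the log to look up the records.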

If you are sure that vocabulary entries are junk, it does not make much difference whether you assign EN or leave them empty. In the latter case, those entries will always re-appear in the reprocessing scan and in SUMG. In case you assign EN, there is the chance that the entries will not show up in those phases anymore ...

Best regards,

Nils Buerckel

Former Member · ‎11-07-2013

Hello Nils,

Thank you so much again for your valuable answer.

It is always a pleasure to hear from you in this Unicode section.

Things are much clearer to me now.

Thank you,

Mayuresh