UNICODE - UTF-8 on AIX pSeries vs UTF-16 on iSeries question

Former Member
0 Kudos

Hi,

I ran across an SAP PowerPoint presentation indicating that iSeries DB2 databases are UTF-16 while pSeries AIX DB2 databases are UTF-8. Since UTF-16 appears to need about 100% more storage, does anyone know why? I know it can't be the processor/memory hardware, since both platforms are identical now, so I was wondering whether there is a performance benefit or whether it was just easier because the ASCII database was already UTF-16 compliant.

It would seem that pSeries has the upper hand for backup/restore time and disk investment, but might be at a disadvantage for the CPU needed to "unpack" from single-byte to double-byte.

In case you are wondering where my question is coming from: it's end-of-lease time again, so I'm looking at disk sizing.

Thanks,

Craig

Accepted Solutions (0)

Answers (2)

dorothea_stein
Participant
0 Kudos

Hi Craig,

It's really the design of the database (plus the history) which makes the difference, not the hardware.

At the time SAP decided to support Unicode, iSeries offered only UCS-2, which made the original decision easy. Nowadays, UTF-8 is available as well.

UTF-8 and UCS-2 bring along different characteristics. Neither one is "perfect", both types have good and bad sides:

UTF-8 allocates 1 to 4 bytes per character, depending on the character. Asian languages usually consume more space than languages that take their characters mainly from the 7-bit ASCII range, as English does. (Volker pointed that out already.) Characters outside the 7-bit ASCII range already take 2 or more bytes (German umlauts, for example, take 2).

UCS-2 (ignoring surrogates as a very exotic case) uses a fixed amount of space per character (2 bytes), which makes it easy to integrate into a table's base row for fast data access and to position and search efficiently within the data.
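To illustrate the positioning argument, here is a minimal Python sketch (not how DB2 implements it; the helper names `ucs2_char_at`, `utf8_char_at` and `utf8_width` are made up for this example). It assumes valid input and, for the UCS-2 case, no surrogates. With a fixed width, the byte offset of character i is simply 2*i; with UTF-8, the preceding characters have to be walked first.

```python
def ucs2_char_at(buf: bytes, index: int) -> str:
    """Fixed width: character i occupies bytes [2*i, 2*i + 2)."""
    return buf[2 * index : 2 * index + 2].decode("utf-16-le")


def utf8_width(lead: int) -> int:
    """Length in bytes of a UTF-8 sequence, derived from its lead byte."""
    if lead < 0x80:
        return 1
    if lead < 0xE0:
        return 2
    if lead < 0xF0:
        return 3
    return 4


def utf8_char_at(buf: bytes, index: int) -> str:
    """Variable width: skip over the preceding characters one by one."""
    pos = 0
    for _ in range(index):
        pos += utf8_width(buf[pos])
    return buf[pos : pos + utf8_width(buf[pos])].decode("utf-8")


text = "Grüße 世界"
ucs2 = text.encode("utf-16-le")
utf8 = text.encode("utf-8")
print(ucs2_char_at(ucs2, 6), utf8_char_at(utf8, 6))
```

Both calls return the same character; the difference is that the UCS-2 lookup is a single offset calculation, while the UTF-8 lookup costs time proportional to the position.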

Last but not least, UCS-2 naturally fulfills the sort order requirement of the SAP application server, while UTF-8 has to apply a sort sequence before passing data to the SAP AppServer. (To get a UTF-8-like type with UTF-16 sort order, you'd really have to use the CESU-8 type, which is not available on most databases.)
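As a small illustration of the sort order point, the following Python sketch compares plain binary ordering of UTF-8 bytes with binary ordering of big-endian UTF-16 code units. The helper names are hypothetical, and this only shows the encoding-level effect, not what any database or the SAP kernel actually does. As far as the encodings themselves go, the two orders diverge only when supplementary characters (stored as surrogate pairs in UTF-16) are compared against characters in the range U+E000 to U+FFFF.

```python
def utf8_key(s: str) -> bytes:
    """Sort key: raw UTF-8 bytes (equivalent to code point order)."""
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    """Sort key: big-endian UTF-16 code units compared byte by byte."""
    return s.encode("utf-16-be")


values = [
    "\uFF21",      # FULLWIDTH LATIN CAPITAL LETTER A, in the BMP above the surrogate range
    "\U00010400",  # DESERET CAPITAL LETTER LONG I, a surrogate pair in UTF-16
    "A",
]

by_utf8 = sorted(values, key=utf8_key)
by_utf16 = sorted(values, key=utf16_key)

print([f"U+{ord(c):04X}" for c in by_utf8])
print([f"U+{ord(c):04X}" for c in by_utf16])
```

In UTF-16 order the supplementary character sorts before U+FF21, because its leading surrogate (0xD801) is smaller than 0xFF21; in UTF-8 order it sorts after. CESU-8 avoids this by encoding each surrogate separately.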

So, to put it in (very much) simplified words, UCS-2 is more CPU optimized, while UTF-8 is more space optimized.

Since DB2 for i has a long tradition of working well with fixed-length types, it seemed natural that IBM would offer UCS-2 as their first choice.

Also: converting an existing database to UTF-8 would basically be "another codepage conversion"...

However, since both types are now available, the SAP on i team has recently started a research project to get a more profound understanding of what it would mean to run SAP on i using UTF-8 in comparison to our current solution.

Kind regards,

Dorothea

PS. You mention a "100% higher storage requirement". What exactly are you comparing? Comparing the IBM i EBCDIC solution with the IBM i Unicode solution, we estimate a disk storage increase of about 60%, depending on how much character data there is. It is usually less for BW, because there is more numeric data.

Former Member
0 Kudos

Hi Craig,

The PowerPoint is correct.

MS SQL Server and DB2 for iSeries have chosen UTF-16, while Oracle, DB2 for the other platforms, and MaxDB have chosen UTF-8.

If you store English in UTF-8, it consumes 1 byte per character, where it needs 2 bytes in UTF-16. If you store Chinese, it typically uses 3 bytes per character in UTF-8 and still 2 bytes in UTF-16.

So, for English you are correct, but not for Chinese.
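These per-character figures are easy to check with a few lines of Python. This is only a sketch: the sample strings and the helper name `encoded_sizes` are made up for illustration, and real database storage also depends on column types, padding, and how much of the data is numeric.

```python
def encoded_sizes(text: str) -> dict:
    """Return the character count and the encoded size in bytes for both encodings."""
    return {
        "chars": len(text),
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),  # "-le" avoids counting a 2-byte BOM
    }


samples = {
    "English": "Customer order",
    "German": "Größe",
    "Chinese": "客户订单",
}

for language, text in samples.items():
    print(language, encoded_sizes(text))
```

For the English sample UTF-16 takes twice the bytes of UTF-8; for the Chinese sample UTF-8 takes 3 bytes per character against 2 in UTF-16; the German sample lands in between.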

The decision is made by each DB vendor and cannot be changed by you or SAP. There is NO penalty for converting the data into UTF-16 within the SAP server.

For English or other Latin-1 databases, as you might expect, a space difference of 30-50% is realistic.

Regards

Volker Gueldenpfennig, consolut international ag

http://www.consolut.de - http://www.4soi.de - http://www.easymarketplace.de