When migrating from Db2 mainframe (z/OS) to Amazon Relational Database Service (Amazon RDS) for Db2 on Linux, selecting the right code page and collation sequence is critical for ensuring data compatibility. The right choice helps prevent data truncation and character expansion issues, and maintains consistent display of foreign characters (such as accented Latin letters) and consistent sorting behavior.
While Amazon RDS for Db2 defaults to Unicode, you can achieve mainframe compatibility through careful configuration. This post guides you through selecting the appropriate code page for Amazon RDS for Db2 to help ensure a seamless data migration.
Code page, collation, and territory parameters in Amazon RDS for Db2 can't be changed after database creation. Careful selection upfront avoids the need for database recreation and remigration.
Understanding mainframe CCSID and code page
In Db2 for z/OS®, each character is mapped to a number depending on the CCSID (coded character set identifier) and code page associated with that character. A code page is a numeric identifier for a specific mapping of bytes; it is simply the byte-to-glyph mapping. An example is code page 037, known as Extended Binary Coded Decimal Interchange Code (EBCDIC) for the US and Canada. The CCSID is a full identifier describing the character set, encoding, and conversion rules. The CCSID includes the code page and locale along with conversion metadata. An example is CCSID 0037 (EBCDIC for the US and Canada, together with conversion rules). Every CCSID implies a code page, but a code page doesn't imply a complete CCSID. Db2 for z/OS performs character conversion internally and needs the CCSID to know exactly how to convert data.
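To make the byte-to-glyph idea concrete, the following Python sketch uses Python's built-in cp037 and latin-1 codecs as stand-ins for the Db2 conversion tables. The same byte value decodes to different characters under different code pages:

```python
# The same byte maps to different glyphs depending on the code page.
byte = b"\x5b"

print(byte.decode("cp037"))    # EBCDIC 037: prints "$"
print(byte.decode("latin-1"))  # ISO-8859-1: prints "["
```

This is why the conversion rules carried by the CCSID matter: without them, raw bytes are ambiguous.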
Regional code pages
North American and Western European code pages
- IBM-37 (CCSID 37, CP037)
- IBM-500 (CCSID 500, CP500)
- IBM-1047 (CCSID 1047, CP1047)
German and Austrian code pages
- IBM-273 (CCSID 273, CP273): Traditional German and Austrian EBCDIC
- IBM-1141 (CCSID 1141, CP1141): German and Austrian EBCDIC with Euro symbol support
Japanese code pages
- IBM-930 (CCSID 930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS
- IBM-939 (CCSID 939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS
- IBM-1390 (CCSID 1390): Japanese EBCDIC mixed SBCS/DBCS character CCSID with Euro (€) symbol support
- IBM-1399 (CCSID 1399): Japanese EBCDIC DBCS-only graphic CCSID
- IBM-5026 (CCSID 5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and Katakana-oriented collation
- IBM-5035 (CCSID 5035): Japanese EBCDIC DBCS-only graphic CCSID
These EBCDIC code pages encode character sets optimized for specific regions and languages. While many share similar Latin-1 character repertoires, they differ in the assignment of specific code points to improve compatibility with local requirements and open systems. On IBM platforms, character CCSIDs (CHAR and VARCHAR) and graphic CCSIDs (GRAPHIC and VARGRAPHIC) are distinct. DBCS-only CCSIDs such as 1399 and 5035 are intended for graphic data and can't represent SBCS characters.
Regional code page details
North American and Western European:
- CCSID 37 (CP037): Traditional EBCDIC for the U.S., Canada, the Netherlands, and Portugal. Optimized for legacy mainframe applications with standard symbol placements. Originated in the 1960s for System/360 mainframes.
- CCSID 500 (CP500): International EBCDIC variant for Belgium, French Canada, and Switzerland. Features adjustments for local symbols (for example, currency symbols like CHF). Shares most code points with CCSID 37 but includes swaps for multilingual support.
- CCSID 1047 (CP1047): Open Systems EBCDIC introduced in the 1990s for OS/390 USS (Unix System Services). A variant of CCSID 37 with six special character swaps to align with POSIX/C standards.
- CCSID 1140 (CP1140): Enhanced U.S. and Canada (CP037) that includes the Euro symbol (€).
German and Austrian:
- CCSID 273 (CP273): Traditional German and Austrian EBCDIC encoding with German-specific characters such as ä, ö, ü, and ß. Widely used in German-speaking countries for legacy mainframe systems.
- CCSID 1141 (CP1141): Enhanced German and Austrian EBCDIC that includes the Euro symbol (€). Introduced to support European monetary union requirements while maintaining German character support.
Japanese:
- CCSID 930 (CP930): Japanese EBCDIC mixed SBCS and DBCS code page using Katakana SBCS and JIS X 0208 DBCS.
- CCSID 939 (CP939): Japanese EBCDIC mixed SBCS and DBCS code page using Latin (Roman) SBCS and JIS X 0208 DBCS. It supports both SBCS (Katakana, Latin, and digits) and DBCS Kanji. It's very common on z/OS and older mainframe environments.
- CCSID 1390 (CP1390): Japanese EBCDIC mixed SBCS and DBCS code page with Euro (€) symbol support.
- CCSID 1399 (CP1399): Japanese EBCDIC DBCS graphic CCSID (no SBCS).
- CCSID 5026 (CP5026): Japanese EBCDIC Katakana SBCS code page with uppercase-only Latin letters and a Katakana-oriented collation sequence.
- CCSID 5035 (CP5035): Japanese EBCDIC DBCS graphic CCSID (DBCS only).
Migration compatibility and ISO standards
For Latin-based code pages (37, 500, 1047, 273): ISO-8859-1 (Latin-1 or CCSID 819) serves as the direct ASCII equivalent, enabling lossless conversion for most characters.
Euro symbol consideration: ISO-8859-1 doesn't include the Euro symbol (€) because it was standardized before the Euro's introduction. For mainframe systems using Euro-enabled code pages (1140, 1141, 1390), consider:
- ISO-8859-15 (Latin-9): Includes the Euro symbol and replaces some rarely used characters from ISO-8859-1
- UTF-8: Comprehensive Unicode support, including the Euro and all international characters
For Japanese code pages (930, 939, 1390, 1399): UTF-8 is the recommended target encoding because ISO-8859-1 can't represent Japanese characters. Note that either 930 or 939 can be converted to UTF-8, but the conversion is one-way. If you try to convert back from UTF-8 to either 930 or 939, almost half of the characters in UTF-8 have no representation in 930 and 939. Because of this, you won't be able to convert back and forth without issues.
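You can verify these repertoire limits with Python's standard codecs, which include ISO-8859-1 and ISO-8859-15 tables:

```python
# The Euro sign exists in ISO-8859-15 (at 0xA4) but not in ISO-8859-1.
print("€".encode("iso8859-15"))  # prints b'\xa4'
try:
    "€".encode("iso8859-1")
except UnicodeEncodeError:
    print("no Euro in Latin-1")

# Japanese characters have no Latin single-byte representation at all,
# which is why UTF-8 is the only practical target for 930/939 data.
try:
    "あ".encode("iso8859-1")
except UnicodeEncodeError:
    print("no Japanese in Latin-1")
```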
Understanding Db2 LUW code pages
Db2 LUW (Linux, Unix, and Windows) doesn't use CCSIDs the way Db2 for z/OS does.
- Db2 LUW is built on a code page instead of CCSIDs.
- A code page is immutable and defined at database creation time.
- Db2 LUW relies on client/server code page conversion.
- Db2 LUW doesn't have a CCSID catalog column; the platform uses International Components for Unicode (ICU) and OS locales rather than IBM CCSID definitions.
Code page support: Db2 LUW supports various code pages, including UTF-8 (Unicode) for comprehensive character support.
Character compatibility: ISO-8859-1 serves as the direct ASCII counterpart to the IBM-37, IBM-500, and IBM-1047 character repertoires, enabling lossless conversion for Latin-1 characters.
Collation behavior:
- Standard collations are locale-based (for example, en_US)
- Custom 256-byte sequences can simulate EBCDIC sorting for single-byte databases
- Unicode databases use CLDR (Common Locale Data Repository) or UCA (Unicode Collation Algorithm) collations, making EBCDIC simulation more complex
Character display consistency: Foreign characters (non-ASCII) display correctly when data is properly converted during transfer. Without conversion, EBCDIC bytes appear as invalid characters in LUW (for example, 0xC1 represents A in EBCDIC but is invalid in ASCII). Db2's DRDA protocol and tools like db2move handle automatic conversion.
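A quick Python check (using the built-in cp037 codec) illustrates the point about raw EBCDIC bytes:

```python
# 0xC1 is the letter A in EBCDIC (code page 037)...
print(b"\xc1".decode("cp037"))  # prints "A"

# ...but it is not a valid ASCII byte at all (ASCII is 7-bit).
try:
    b"\xc1".decode("ascii")
except UnicodeDecodeError:
    print("0xC1 is invalid in ASCII")
```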
Code page selection strategy for Amazon RDS for Db2
In this section, we compare code page flexibility on the mainframe with that of Amazon RDS.
Mainframe Db2 allows code page definition at multiple levels:
- Subsystem level: Defining CCSIDs (SBCS_CCSID, MIXED_CCSID, and DBCS_CCSID) establishes defaults for objects and applications that don't explicitly specify a CCSID
- Tablespace and table level: A CCSID clause during creation sets the default encoding for all character columns unless overridden at the column level
- Column level: A specific CCSID for individual columns (for example, VARCHAR(50) CCSID 1208 for UTF-8)
Db2 LUW, in contrast, defines the code page only at database creation time, and it can't be changed afterward.
Key considerations
When creating an Amazon RDS for Db2 instance, avoid specifying the default database name during instance creation, because it defaults to UTF-8 with the US territory.
Code page equivalency:
- IBM code page 819 (ISO-8859-1) is the exact equivalent of mainframe code pages 37, 500, 1047, and 273 (excluding the Euro symbol).
- For Euro symbol support, ISO-8859-15 (Latin-9) provides compatibility with code page 1141 and other Euro-enabled variants.
- Japanese code pages (930, 939, 1390, and 1399) require UTF-8 encoding for proper character representation. However, the counterpart of Japanese code pages 930 and 939 in Db2 LUW is IBM-943. It appears that most Japanese customers choose IBM-943 for Japanese code pages 930 and 939.
Selection framework
Choose the appropriate code page based on your specific migration requirements.
Option 1a: ISO-8859-1 code set
Use ISO-8859-1 when:
- Using only one CCSID (37, 500, or 1047) for all tables and columns in mainframe Db2.
- Zero data truncation is required.
- Exact mainframe sorting order must be preserved. The sorting order in Db2 is set at database creation time through the collation parameter.
- Multi-language support isn't needed.
- The Euro symbol isn't required (ISO-8859-1 predates the Euro's introduction).
Implementation:
Replace the placeholder values with your actual values.
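As a sketch, the call might look like the following. The rdsadmin.create_database parameter order shown here (database name, page size, code set, territory, collation) is an assumption based on the description later in this post; verify the exact signature in the Amazon RDS for Db2 documentation.

```sql
-- Hypothetical parameter order; confirm the rdsadmin.create_database
-- signature in the Amazon RDS for Db2 documentation.
CALL rdsadmin.create_database('MYDB', 32768, 'ISO8859-1', 'US', 'EBCDIC_819_037');
```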
Option 1b: ISO-8859-15 code set (Euro support)
Use ISO-8859-15 when:
- Migrating from Euro-enabled mainframe code pages (1141 and 1390)
- Euro symbol (€) support is required
- Maintaining a single-byte character set (SBCS) encoding
- Multi-language requirements are limited
Implementation:
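A sketch for the ISO-8859-15 case follows. The parameter order and the SYSTEM collation shown are assumptions; confirm the rdsadmin.create_database signature and the collations supported with this code set in the Amazon RDS for Db2 documentation.

```sql
-- Hypothetical sketch: ISO-8859-15 code set with a Euro-capable territory.
CALL rdsadmin.create_database('MYDB', 32768, 'ISO8859-15', 'DE', 'SYSTEM');
```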
Collation sequence options:
The fifth parameter specifies the collation sequence. Choose based on your mainframe CCSID:
| Collation value | Target code page | Mainframe CCSID | Description |
| --- | --- | --- | --- |
| EBCDIC_819_037 | ISO-8859-1 | 37 | EBCDIC US English |
| EBCDIC_819_500 | ISO-8859-1 | 500, 1047 | EBCDIC International |
| EBCDIC_850_037 | ASCII Latin | 37 | EBCDIC US English |
| EBCDIC_850_500 | ASCII Latin | 500 | EBCDIC International |
| EBCDIC_932_5026 | Japanese | 932 (collation 5026) | EBCDIC US English |
| EBCDIC_932_5035 | Japanese | 932 (collation 5035) | EBCDIC International |
| EBCDIC_1252_037 | Windows Latin | 37 | EBCDIC US English |
| EBCDIC_1252_500 | Windows Latin | 500 | EBCDIC International |
Selection guide:
- Mainframe CCSID 37: Use EBCDIC_819_037
- Mainframe CCSID 500 or 1047: Use EBCDIC_819_500
- Mainframe CCSID 273: Use EBCDIC_819_500
- Mainframe CCSID 930: Use EBCDIC_932_5026 (Katakana collation)
- Mainframe CCSID 939: Use EBCDIC_932_5035 (Latin collation)
- Euro-enabled CCSID (1141, 1390, 1399): Consider UTF-8 or ISO-8859-15
The CREATE_DATABASE command creates a Db2 database with:
- The recommended 32 K page size. The default page size is 8 K.
- ISO-8859-1 code set (equivalent to IBM code page 819)
- US territory
- EBCDIC collation for the specified mainframe code page
Supported territory codes for ISO-8859-1 and ISO-8859-15: AL, AU, AT, BE, BR, CA, CH, CN, DE, DK, ES, ET, FI, GB, ID, IE, IN, IS, IT, JP, KE, KR, Lat, MY, NL, NZ, NO, PH, PT, TW, TZ, US, and ZA
Consult the IBM documentation for valid territory and code page combinations.
Option 2: UTF-8 code page
Use UTF-8 when:
- Multi-language data support is required
- Migrating from Japanese mainframe code pages (930, 939, or 1390)
- Euro symbol (€) and comprehensive international character support are needed
- Future-proofing for global character handling is important
- Mainframe sorting order preservation isn't critical
Important considerations:
- Character expansion: UTF-8 might require more storage space
- Accented vowels (à, é, î), currency symbols (¢, £, ¥), fractions (¼, ½, ¾), and special characters (ß, ¬, µ) require 1 byte in mainframe code pages but 2 bytes in UTF-8
- DDL changes required: CHAR and VARCHAR lengths must be adjusted, which isn't trivial
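Python's codecs make the expansion easy to quantify, with cp037 standing in for a mainframe Latin code page:

```python
# Each of these characters takes 1 byte in EBCDIC 037 but 2 bytes in UTF-8.
for ch in ("é", "£", "½", "ß"):
    print(ch, len(ch.encode("cp037")), len(ch.encode("utf-8")))
```

This doubling is exactly why CHAR and VARCHAR lengths sized in bytes on the mainframe can truncate data after conversion to UTF-8.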
Implementation:
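A sketch for the UTF-8 case follows; as with the earlier options, the rdsadmin.create_database parameter order is an assumption to verify against the Amazon RDS for Db2 documentation.

```sql
-- Hypothetical sketch: UTF-8 code set with the default SYSTEM collation.
CALL rdsadmin.create_database('MYDB', 32768, 'UTF-8', 'US', 'SYSTEM');
```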
UTF-8 is supported in all territories. Consult the IBM documentation for available territory options.
Understanding CODEUNITS32 in UTF-8 Db2 databases
When using UTF-8 in Amazon RDS for Db2, mainframe migration inevitably encounters data truncation issues because of character expansion: some characters grow from 1 byte on the mainframe to 2 bytes or more in UTF-8.
To avoid DDL changes, Db2 provides a mechanism to change the default string unit from OCTETS to CODEUNITS32 by modifying the STRING_UNITS database configuration parameter.
This change requires an instance restart because STRING_UNITS isn't a dynamic parameter. The DDL must be created after the parameter is updated for it to take effect.
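A minimal sketch of the parameter change, issued through the ADMIN_CMD procedure (the database name and your permissions to run this on RDS are assumptions to verify):

```sql
-- Switch the default string units from OCTETS to CODEUNITS32.
-- STRING_UNITS is not dynamic: restart the instance, then create the DDL.
CALL SYSPROC.ADMIN_CMD('UPDATE DB CFG USING STRING_UNITS CODEUNITS32');
```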
Practical example: ISO-8859-1 database
Alternatively, you can specify CODEUNITS32 at the column level when creating objects, without changing the STRING_UNITS parameter.
Create an ISO-8859-1 database:
Test with sample data:
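As a sketch (the table name and the sample value, a single accented character padded to the column width, are assumptions chosen to illustrate the byte counts):

```sql
-- CHAR(2) in an ISO-8859-1 database: 'é' is 1 byte plus 1 pad byte.
CREATE TABLE t1 (c1 CHAR(2));
INSERT INTO t1 VALUES ('é');
SELECT c1, LENGTH(c1) AS bytes FROM t1;
```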
Result:
Notice that each column uses exactly 2 bytes in the ISO-8859-1 database. Run the same test with a UTF-8 database.
Create a UTF-8 database:
Test with CODEUNITS32:
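A corresponding sketch for the UTF-8 database (same assumed sample value as before):

```sql
-- CHAR(2 CODEUNITS32) in a UTF-8 database: 'é' is 2 bytes plus 1 pad byte.
CREATE TABLE t1 (c1 CHAR(2 CODEUNITS32));
INSERT INTO t1 VALUES ('é');
SELECT c1, LENGTH(c1) AS bytes FROM t1;
```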
Result:
Key observation: Each column uses 3 bytes in the UTF-8 database, despite being defined as CHAR(2 CODEUNITS32). The CODEUNITS32 specification allows for two characters of up to 4 bytes each (8 bytes of total allocation).
CODEUNITS32 trade-offs and recommendations
Impact of database-level CODEUNITS32:
- The default CHAR and VARCHAR allocation changes from 1 byte to 4 bytes per character
- Maximum lengths are reduced:
  - CHAR: from 255 to 63 characters
  - VARCHAR: from 32,704 to 8,174 bytes (32 KB page size)
- For databases with 90% ASCII characters, size can expand to 3.8 times the original
It's recommended to avoid database-level CODEUNITS32. One option is to use selective column-level CODEUNITS32, but do so with caution, because it can waste significant space.
For example, CREATE TABLE t1 (c1 CHAR(2 CODEUNITS32)) allocates 8 bytes.
A better approach is to use OCTETS with an adjusted length. An octet represents one byte, and in non-Unicode code pages there is a one-to-one mapping between a character and a byte. CREATE TABLE t1 (c1 CHAR(4 OCTETS)) allocates exactly 4 bytes.
Best practice: Use CODEUNITS32 only when you're certain that you'll store 3–4 byte characters (for example, East Asian languages). For most international characters, use OCTETS and adjust the length accordingly.
Collation differences: EBCDIC compared to SYSTEM
Sort order comparison:
EBCDIC collation order: Special characters, lowercase letters, uppercase letters, then numerals
SYSTEM collation order: Numerals, uppercase letters, lowercase letters, then special characters
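Because an EBCDIC-simulating collation simply orders strings by their code-page byte values, you can preview the two orders in Python by sorting on encoded bytes (cp037 for EBCDIC, plain byte order standing in for SYSTEM/ASCII):

```python
data = ["a", "A", "1", "$"]

# EBCDIC byte order: specials, lowercase, uppercase, numerals.
print(sorted(data, key=lambda s: s.encode("cp037")))  # ['$', 'a', 'A', '1']

# ASCII byte order: the numeral sorts before the letters.
print(sorted(data))                                   # ['$', '1', 'A', 'a']
```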
Practical examples
EBCDIC collation test:
Result:
The result shows the expected sort order of special characters, lowercase, uppercase, and numerals.
SYSTEM collation test:
Result:
The result shows the expected sort order of numerals, uppercase, lowercase, and special characters.
Handling unsupported characters in ISO-8859-1
When applications attempt to insert characters that aren't available in ISO-8859-1 (CCSID 819), the database performs silent character substitution.
Example test:
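A sketch of such a test in an ISO-8859-1 database (the table name and sample value are hypothetical):

```sql
-- A Japanese character has no ISO-8859-1 representation,
-- so Db2 substitutes it silently on insert.
CREATE TABLE subtest (c1 CHAR(4));
INSERT INTO subtest VALUES ('あ');
SELECT HEX(c1) FROM subtest;
```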
Result:
What happened:
- No error or warning was generated
- The Japanese character was replaced by the SUB (substitute) character (0x1A)
- The remaining bytes were filled with spaces (0x20)
Key takeaway: ISO-8859-1 databases can't store unsupported characters. Unknown characters are silently replaced with 0x1A, followed by spaces (0x20) for the remaining byte positions.
Ensuring character consistency across platforms
Data migration best practices
Conversion tools: Use Db2 federation (catalog the z/OS database in LUW) or tools like db2move/EXPORT with ASC/DEL formats for automatic character conversion.
Check the mainframe CCSID configuration
Table-level CCSID:
Column-level CCSID:
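These checks can be sketched with catalog queries like the following. The column names shown (ENCODING_SCHEME, CCSID) and the schema filter are assumptions; exact catalog columns vary by Db2 for z/OS version, so verify them against the catalog reference.

```sql
-- Table-level encoding scheme (Db2 for z/OS catalog); hypothetical schema name.
SELECT NAME, CREATOR, ENCODING_SCHEME
  FROM SYSIBM.SYSTABLES
 WHERE CREATOR = 'MYSCHEMA';

-- Column-level CCSID.
SELECT NAME, TBNAME, CCSID
  FROM SYSIBM.SYSCOLUMNS
 WHERE TBCREATOR = 'MYSCHEMA';
```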
Example result:
Validation and testing
Character testing: Insert test data with international characters (for example, café, niño) on the mainframe, export it, import it into LUW, and verify display consistency using GUI clients such as:
- DBeaver
- DataGrip
- IBM Data Studio
- IBM Db2 Data Management Console
Conversion risks: Round-trip conversions (EBCDIC → ASCII → EBCDIC) might lose character variants that have no exact mappings. Verify the results using IBM's official CCSID tables.
Conclusion
Code page, collation, and territory parameters in Db2 are immutable after database creation, unlike in other database systems. This makes the initial selection critical for migration success.

Key considerations:
- ISO-8859-1: Choose for exact mainframe compatibility and zero data loss. Use an Amazon RDS for Db2 defined collation sequence to preserve mainframe sorting behavior
- ISO-8859-15: Consider for Euro symbol support when migrating from Euro-enabled mainframe code pages (1141, 1390), and verify Amazon RDS support
- UTF-8: Choose for multi-language support, but be aware of character expansion and potential DDL changes
- Early planning: Address code page and collation requirements early in the migration cycle to prevent data quality issues
Proper code page selection prevents data truncation, manages character expansion effectively, and helps ensure consistent application behavior across platforms.
Acknowledgements
Thanks to Chiranjeev Mukherjee, Hajime Minagawa, and Akira Shimosako for carefully reviewing this blog post from a mainframe migration perspective, especially regarding Japanese character sets.
About the authors
