Thursday, 14 March 2013

Multibyte Offline Data Move

Moving data from a non Oracle database to Oracle can be a bit tricky when different character sets are at play.
For this example Ill move data from a Russian (Cyrillic_General_CI_AS) SQL Server  database on windows  to UTF8 (AL32UTF8) Oracle  database on linux.

SQL Server 2008 Cyrillic_General_CI_AS database
CREATE TABLE multibyte1(col1 VARCHAR(10),col2 CHAR(10));
CREATE TABLE multibyte2(col2 NVARCHAR(10),col2 NCHAR(100));
INSERT INTO multibyte1 VALUES('фис','фис');
INSERT INTO multibyte2 VALUES('фис','фис');
SELECT * FROM multibyte1;
SELECT * FROM multibyte2;

SQL Developer can migrate the two tables to Oracle

CREATE TABLE multibyte1(col1 VARCHAR2(10),col2 CHAR(10));
CREATE TABLE multibyte2(col2 NVARCHAR2(10),col2 NCHAR(100));

SQL Developer will also generate offline data move scripts which can perform the data move using SQL Servers BCP and Oracle SQL*Loader.

There are two issues with the scripts generated by SQL Developer (Ill fix them from the next release).

  • The shell files are not in linux format
  • The NLS_LANG setting is specified incorrectly as NLS_LANGUAGE
The solution is to 
  • run dos2unix on the .sh files
  • change NLS_LANGUAGE to NLS_LANG in the "bottom" file
Once that's done, the data move is performed in two steps. 

Dump the Data From SQL Server using BCP
The data from SQL Server can be dumped out to dat files using the  or unload_script.bat.
The data inside the dat files can be verified as correct. 
The dat files will be encoded in  Cyrillic_General_CI_AS (Cyrillic Window 1251).

You can now move all the files to your linux machine to load into Oracle.

Load the Data To Oracle using SQL*Loader
SQL*Loader needs to be able to recognize the characters in the dat files.
In the "bottom",  export NLS_LANG should be set to RUSSIAN_CIS.CL8MSWIN1251 .
SQL*Loader will recognize the characters and insert them correctly.
The oracle database is setup to save them in UTF8.

Verify the data in Oracle
In linux, if you open SQLPlus and query the tables, you may get ? , question marks instead of the Russian characters. This is because Oracle has saved the characters in UTF-8.

In linux, set the NLS_LANG to UTF8 and query the tables again.

export NLS_LANG=_.UTF8