Friday, September 02, 2016

Dealing with encoding issues in clinical trial data: WLATIN1 and UTF-8

Clinical trials are now increasingly global and usually multinational. Data collection has also moved to electronic data capture (EDC), with trial data entered directly by the investigational sites, whether those sites are in English-speaking or non-English-speaking countries. One issue we often run into is data encoding.

Encoding is the process of transforming a sequence of Unicode characters into a sequence of bytes. In contrast, decoding is the process of transforming a sequence of encoded bytes back into a sequence of Unicode characters.
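The definition can be made concrete outside SAS. The following is a minimal Python sketch, assuming that Python's cp1252 codec is an acceptable stand-in for SAS WLATIN1 (both refer to Windows code page 1252); the sample name is made up for illustration.

```python
# Encode the same text under two encodings and decode it back.
text = "Müller"  # hypothetical sample value

utf8_bytes = text.encode("utf-8")      # ü becomes two bytes: 0xC3 0xBC
wlatin1_bytes = text.encode("cp1252")  # ü becomes one byte: 0xFC

print(utf8_bytes)     # b'M\xc3\xbcller'
print(wlatin1_bytes)  # b'M\xfcller'

# Decoding with the matching encoding recovers the original text.
assert utf8_bytes.decode("utf-8") == text
assert wlatin1_bytes.decode("cp1252") == text
```

The same six characters occupy seven bytes in UTF-8 and six bytes in WLATIN1, which is why the two encodings cannot be read interchangeably.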

To accommodate multinational trials and the need to handle non-English characters, EDC vendors may choose UTF-8 as the encoding for their data sets. However, when we use SAS on Windows, the session encoding is usually WLATIN1.

In the Windows environment, if we try to read a data set encoded in UTF-8, we may get messages such as the following:

NOTE: Data file xxxxx is in a format that is native to another host, or the file encoding does not match the session encoding. Cross Environment Data Access will be used, which might require additional CPU resources and might reduce performance.
ERROR: Some character data was lost during transcoding in the dataset xxxxx. Either the data contains characters that are not representable in the new encoding or truncation occurred during transcoding.
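
The two causes named in the error message can be reproduced at the byte level. This is only an illustrative Python sketch, not SAS behaviour itself: cp1252 is assumed as a stand-in for WLATIN1, and the variable length of 6 bytes and the sample values are hypothetical.

```python
# Cause 1: a character that has no representation in WLATIN1 (cp1252).
try:
    "东京".encode("cp1252")
    representable = True
except UnicodeEncodeError:
    representable = False
print(representable)  # False

# Cause 2: truncation. A value that exactly fills a 6-byte character
# variable in WLATIN1 needs 7 bytes in UTF-8, so it no longer fits.
VAR_LENGTH = 6  # hypothetical variable length in bytes
value = "Müller"
fits_wlatin1 = len(value.encode("cp1252")) <= VAR_LENGTH
fits_utf8 = len(value.encode("utf-8")) <= VAR_LENGTH
print(fits_wlatin1, fits_utf8)  # True False
```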

There is also a discussion of this issue on the SAS website.

There are several ways to ensure that data is transcoded correctly from one encoding to another, and three papers provide very good explanations of them.

According to the paper by Song, there are three ways to change the encoding:
1.    Force the transcoding by specifying the target encoding, WLATIN1, with the data set option ENCODING=.
data x(encoding='WLATIN1');
set x;
run;
2.    Use PROC DATASETS
The second approach is to use PROC DATASETS as below:
proc datasets lib=libname;
   modify x / correctencoding='WLATIN1';
run;
quit;
However, this approach is NOT recommended: it only changes the encoding indicator and does not actually transcode the data itself!

3.    Use PROC MIGRATE
When you would like to convert multiple SAS data sets from WLATIN1 to UTF-8, you can use PROC MIGRATE.
proc migrate in=inlib out=outlib;
run;
This migrates all SAS data sets in the library inlib to the library outlib. It retains the data set labels as well. Note that inlib and outlib should point to two different locations.
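
The warning attached to the second approach (changing the encoding indicator without transcoding) can be illustrated with a short Python sketch. It is not SAS code and only mimics the idea: cp1252 is assumed as a stand-in for WLATIN1, and the sample value is hypothetical.

```python
# Bytes as they would be stored by a UTF-8 source.
stored = "Müller".encode("utf-8")

# Relabelling only: the bytes stay the same, but they are now
# interpreted as cp1252. Each UTF-8 byte of ü shows up as its own character.
relabelled = stored.decode("cp1252")
print(relabelled)  # MÃ¼ller

# Real transcoding: decode with the true encoding, then re-encode
# in the target encoding. The text survives intact.
transcoded = stored.decode("utf-8").encode("cp1252")
print(transcoded.decode("cp1252"))  # Müller
```

Relabelling is only appropriate when the stored indicator is wrong and the bytes are already in the encoding being stamped.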

Also, we can use the following approaches:
1.    Use the INENCODING= option in the LIBNAME statement.
libname in 'directory\' inencoding=asciiany;
data x;
   set in.x;
run;
2.    Specify the ENCODING= data set option directly after the data set name.
proc sort data=RAWDM.AE(encoding='wlatin1') out=OUTSTATS.AE;
   by subject;
run;

Here are some approaches and examples for resolving data encoding issues, taken from the SAS National Language Support (NLS) reference guide:

Example 1: Creating a SAS Data Set with Mixed Encodings and with Transcoding Suppressed
By specifying the data set option ENCODING=ANY, you can create a SAS data set that contains mixed encodings, and suppress transcoding for either input or output processing.
In this example, the new data set MYFILES.MIXED contains some data that uses the Latin1 encoding, and some data that uses the Latin2 encoding. When the data set is processed, no transcoding occurs. For example, the correct Latin1 characters in a Latin1 session encoding and correct Latin2 characters in a Latin2 session encoding are displayed.
libname myfiles 'SAS data-library';
data myfiles.mixed (encoding=any);
set work.latin1;
set work.latin2;
run;
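
Why the same stored value displays differently in a Latin1 and a Latin2 session can be seen from a single byte. A minimal Python sketch, assuming that SAS Latin1 and Latin2 correspond to ISO 8859-1 and ISO 8859-2; the byte value is chosen only for illustration.

```python
# One stored byte, left untouched (no transcoding), read under two encodings.
raw = bytes([0xE6])

print(raw.decode("latin_1"))    # æ
print(raw.decode("iso8859_2"))  # ć
```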

Example 2: Creating a SAS Data Set with a Particular Encoding
For output processing, you can override the current session encoding. This action might be necessary, for example, if the normal access to the file uses a different session encoding.
For example, if the current session encoding is Wlatin1, you can specify ENCODING=WLATIN2 in order to create the data set that uses the encoding Wlatin2. The following statements tell SAS to write the data to the new data set using the Wlatin2 encoding instead of the session encoding. The encoding is also specified in the descriptor portion of the file.
libname myfiles 'SAS data-library';
data myfiles.difencoding (encoding=wlatin2);
run;

Example 3: Using the FILE Statement to Specify an Encoding for Writing to an External File
This example creates an external file from a SAS data set. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8. By default, SAS writes the external file using the current session encoding.
To specify what encoding to use for writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file';
data _null_;
set myfiles.cars;
file outfile encoding="utf-8";
put Make Model Year;
run;
When you tell SAS that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding.

Example 4: Using the FILENAME Statement to Specify an Encoding for Reading an External File
This example creates a SAS data set from an external file. The external file is in UTF-8 character-set encoding, and the current SAS session is in the Wlatin1 encoding. By default, SAS assumes that an external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.
To specify which encoding to use when reading the external file, specify the ENCODING= option: 
libname myfiles 'SAS data-library';
filename extfile 'external-file' encoding="utf-8";
data myfiles.unicode;
infile extfile;
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.

Example 5: Using the FILENAME Statement to Specify an Encoding for Writing to an External File
This example creates an external file from a SAS data set. By default, SAS writes the external file using the current session encoding. The current session encoding is Wlatin1, but the external file's encoding needs to be UTF-8.
To specify which encoding to use when writing data to the external file, specify the ENCODING= option:
libname myfiles 'SAS data-library';
filename outfile 'external-file' encoding="utf-8";
data _null_;
set myfiles.cars;
file outfile;
put Make Model Year;
run;
When you specify that the external file is to be in UTF-8 encoding, SAS then transcodes the data from Wlatin1 to the specified UTF-8 encoding when writing to the external file.

Example 6: Using the INFILE Statement to Specify an Encoding for Reading from an External File
This example creates a SAS data set from an external file. The external file's encoding is in UTF-8, and the current SAS session encoding is Wlatin1. By default, SAS assumes that the external file is in the same encoding as the session encoding, which causes the character data to be written to the new SAS data set incorrectly.
To specify which encoding to use when reading the external file, specify the ENCODING= option: 
libname myfiles 'SAS data-library';
filename extfile 'external-file';
data myfiles.unicode;
infile extfile encoding="utf-8";
input Make $ Model $ Year;
run;
When you specify that the external file is in UTF-8, SAS then transcodes the external file from UTF-8 to the current session encoding when writing to the new SAS data set. Therefore, the data is written to the new data set correctly in Wlatin1.
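
The effect of declaring, or failing to declare, the encoding of an external file can be sketched in Python as well. This is an analogy rather than SAS behaviour: cp1252 is assumed as a stand-in for the Wlatin1 session encoding, and the file name and contents are hypothetical.

```python
import os
import tempfile

# Write a small external file in UTF-8 (hypothetical content).
path = os.path.join(tempfile.mkdtemp(), "cars.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Citroën C3 2016\n")

# Reading it as if it were in the session encoding garbles the text.
with open(path, encoding="cp1252") as f:
    wrong = f.read()

# Declaring the file's real encoding gives the intended characters.
with open(path, encoding="utf-8") as f:
    right = f.read()

print(wrong.strip())  # CitroÃ«n C3 2016
print(right.strip())  # Citroën C3 2016
```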

Incorrect encoding can be stamped on a SAS 7 or SAS 8 data set if it is copied or replaced in a SAS 9 session with a different session encoding from the data. The incorrect encoding stamp can be corrected with the CORRECTENCODING= option in the MODIFY statement in PROC DATASETS. If a character variable contains binary data, transcoding might corrupt the data.
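
The risk to binary data can be shown with a few arbitrary bytes. A minimal Python sketch, assuming cp1252 as a stand-in for WLATIN1; the byte values are made up and do not come from any real data set.

```python
# Four bytes of binary content held in a character variable (hypothetical).
binary = bytes([0x00, 0x9F, 0xE9, 0xFF])

# Transcoding treats the bytes as cp1252 text and rewrites them as UTF-8.
transcoded = binary.decode("cp1252").encode("utf-8")

print(len(binary), len(transcoded))  # 4 7
print(transcoded == binary)          # False
```

Every byte above 0x7F is expanded to two bytes, so the original binary value is no longer recoverable by reading the variable as-is.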
