One of the main design criteria for a globalized software application is
that it works regardless of the language or the region in which it is
used. A system that can represent text in any of the world's languages
reduces the time and effort required to customize the application for
local regional settings. Encoding and decoding are the techniques that
provide this multi-lingual support for applications.
Encoding and Decoding Data
Encoding is usually needed in scenarios such as working with legacy systems,
creating HTML pages, and manually generating e-mail messages. In all
these cases, the string resources may have to be encoded so as to support
multiple languages. ASCII (American Standard Code for Information
Interchange) became one of the most widely used encodings, but it has its
own drawbacks. ASCII assigns characters to 7-bit values, using the numbers
0 to 127. These are mapped to the English alphabet, the digits, punctuation
and control characters. This number range is insufficient for the
characters needed by non-English alphabets. Vendors extended ASCII to
8 bits in different ways, so the same value above 127 could stand for
different characters in different languages, leading to conflicts. To
bring order to this, code pages were defined, each supporting a group of
languages that share a common writing system. Text is displayed correctly
only when it is read with the code page it was written with. Single-byte
Windows code pages contain 256 code points (values) and are zero based.
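The 7-bit limit described above can be seen with a short C# sketch. The
class and method names (AsciiLimitDemo, RoundTripThroughAscii) are
hypothetical and only serve the illustration; Encoding.ASCII, GetBytes and
GetString are the standard System.Text members.

```csharp
using System;
using System.Text;

public static class AsciiLimitDemo
{
    // Encodes text with the 7-bit ASCII encoding and decodes it again.
    // Characters above U+007F cannot be represented and come back as '?'.
    public static string RoundTripThroughAscii(string text)
    {
        byte[] bytes = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(bytes);
    }

    public static void Main()
    {
        Console.WriteLine(RoundTripThroughAscii("Hello"));      // Hello
        Console.WriteLine(RoundTripThroughAscii("Caf\u00E9"));  // Caf?
    }
}
```

The accented letter is written as the escape \u00E9 so that the example
does not depend on the encoding of the source file itself.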
The use of encoding can be illustrated by the way a web page is
generated. Since the content of a web page depends on the region where
it is rendered, each web page is tagged with an encoding type, which
names the encoding format that must be used for displaying its data.
This is declared in a meta tag of the HTML, for example:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
In the above sample, the Unicode Transformation Format UTF-8 is being
used for encoding.
Unicode is a single character set that covers most languages and scripts
in the world. Every character in a Unicode supported script is assigned
a code point (basically a unique number). The conversion of Unicode
characters to a sequence of bytes is called Encoding, while the conversion
of a sequence of bytes back to Unicode characters is called Decoding. A
scheme for encoding code points as bytes is termed a Unicode
Transformation Format (UTF). Some of the popular UTFs are as below:
UTF-8: Each code point is represented as a sequence of one to four bytes
UTF-16: Each code point is represented as one or two 16-bit code units
UTF-32: Each code point is represented as a 32-bit integer
UTF-7: Each code point is represented as a sequence of 7-bit ASCII
characters. It was designed for cases like mail and newsgroups, but it
is rarely used since it is not robust
The .NET Framework itself internally uses the UTF-16 format to store and
retrieve text data.
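The size differences between the formats listed above can be checked with
a small C# sketch. The class and method names (UtfSizeDemo, ByteCounts)
and the three sample characters are chosen for illustration only;
GetByteCount is the standard Encoding method and does not count a byte
order mark.

```csharp
using System;
using System.Text;

public static class UtfSizeDemo
{
    // Returns the number of bytes the text needs in UTF-8, UTF-16 and UTF-32.
    public static int[] ByteCounts(string text)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(text),
            Encoding.Unicode.GetByteCount(text),  // UTF-16, little-endian
            Encoding.UTF32.GetByteCount(text)
        };
    }

    public static void Main()
    {
        // U+0041 (A), U+20AC (euro sign), and U+1F600, which needs
        // two 16-bit code units in UTF-16.
        string[] samples = { "A", "\u20AC", "\U0001F600" };
        foreach (string s in samples)
        {
            int[] n = ByteCounts(s);
            Console.WriteLine($"UTF-8: {n[0]}, UTF-16: {n[1]}, UTF-32: {n[2]}");
        }
        // Prints:
        // UTF-8: 1, UTF-16: 2, UTF-32: 4
        // UTF-8: 3, UTF-16: 2, UTF-32: 4
        // UTF-8: 4, UTF-16: 4, UTF-32: 4
    }
}
```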
.NET classes for Encoding and Decoding
The .NET Framework implements the encoding and decoding of characters in
the Encoding class and the classes derived from it. To use them, the
System.Text namespace has to be included in the code. Following are the
different Unicode encodings supported by the .NET Framework:
ASCII encoding: encodes Unicode characters as 7-bit values; its code
page is 20127. Hence, it supports only character values from U+0000 to
U+007F. The .NET class ASCIIEncoding can be used to convert characters
to and from ASCII encoding.
UTF-8 encoding: supports all Unicode character values; its code page
is 65001. The .NET class UTF8Encoding can be used to convert characters
to and from UTF-8 encoding.
UTF-7 encoding: supports all Unicode character values; its code page
is 65000. The .NET class UTF7Encoding can be used to convert characters
to and from UTF-7 encoding.
UTF-16 encoding: supports all Unicode character values; its code pages
are 1200 (little-endian) and 1201 (big-endian). The .NET class
UnicodeEncoding can be used to convert characters to and from UTF-16
encoding.
UTF-32 encoding: supports all Unicode character values; its code pages
are 12000 (little-endian) and 12001 (big-endian). The .NET class
UTF32Encoding can be used to convert characters to and from UTF-32
encoding.
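The code page of each class can be read from its CodePage property, as in
the following sketch. The CodePageDemo class and its Describe helper are
hypothetical; the values printed are whatever the installed runtime
reports. UTF7Encoding is left out because recent .NET versions mark it as
obsolete.

```csharp
using System;
using System.Text;

public static class CodePageDemo
{
    // Builds a one-line description of an encoding object.
    public static string Describe(Encoding e)
    {
        return $"{e.GetType().Name}: {e.WebName}, code page {e.CodePage}";
    }

    public static void Main()
    {
        Encoding[] encodings =
        {
            new ASCIIEncoding(),
            new UTF8Encoding(),
            new UnicodeEncoding(bigEndian: false, byteOrderMark: true),
            new UnicodeEncoding(bigEndian: true, byteOrderMark: true),
            new UTF32Encoding(bigEndian: false, byteOrderMark: true),
            new UTF32Encoding(bigEndian: true, byteOrderMark: true)
        };

        foreach (Encoding e in encodings)
        {
            Console.WriteLine(Describe(e));
        }
    }
}
```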
The choice of encoding class depends on the encodings used by the legacy
applications with which the new code is expected to work. When there is
a free choice of encoding type, it is recommended to use the
UTF8Encoding or UnicodeEncoding class. For ASCII content, UTF8Encoding
is preferred over ASCIIEncoding since the former can provide error
detection and hence better security.
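The error detection mentioned above can be sketched as follows. This is a
minimal illustration, not a complete validation routine: the
StrictDecodingDemo class and IsValidUtf8 helper are hypothetical names,
while the UTF8Encoding constructor with throwOnInvalidBytes and the
DecoderFallbackException type are part of System.Text.

```csharp
using System;
using System.Text;

public static class StrictDecodingDemo
{
    // Returns true when the bytes are well-formed UTF-8.
    public static bool IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes decoding fail instead of
        // silently substituting a replacement character.
        var strict = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                                      throwOnInvalidBytes: true);
        try
        {
            strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        byte[] good = { 0x48, 0x69 };        // "Hi"
        byte[] bad = { 0x48, 0xFF, 0x69 };   // 0xFF never occurs in UTF-8

        Console.WriteLine(IsValidUtf8(good));  // True
        Console.WriteLine(IsValidUtf8(bad));   // False

        // ASCIIEncoding does not report the problem; it substitutes '?'.
        Console.WriteLine(Encoding.ASCII.GetString(bad));  // H?i
    }
}
```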
Using the Encoding class
The three important methods of the .NET class System.Text.Encoding that
help in encoding and decoding data are as below:
GetEncoding - returns an Encoding object for a specified encoding name
or code page number.
GetBytes - converts a Unicode string to its byte representation in the
encoding of the Encoding object on which it is called
GetEncodings - used to fetch the details of the code pages supported by
the .NET Framework. Details such as the number, official name and friendly
name of each code page are stored in the EncodingInfo objects that are
returned on calling this method.
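The methods above might be combined as in the following sketch. The
EncodingLookupDemo class and its Encode helper are hypothetical names, and
GetString is used here as the decoding counterpart of GetBytes. The list
printed by GetEncodings depends on the runtime on which the code is run.

```csharp
using System;
using System.Text;

public static class EncodingLookupDemo
{
    // Converts text to bytes using the encoding registered under the given name.
    public static byte[] Encode(string text, string encodingName)
    {
        Encoding encoding = Encoding.GetEncoding(encodingName);
        return encoding.GetBytes(text);
    }

    public static void Main()
    {
        // U+00E9 is the letter e with an acute accent.
        byte[] utf8 = Encode("\u00E9", "utf-8");
        byte[] utf16 = Encode("\u00E9", "utf-16");
        Console.WriteLine(BitConverter.ToString(utf8));   // C3-A9
        Console.WriteLine(BitConverter.ToString(utf16));  // E9-00

        // Decoding reverses the conversion; 65001 is the UTF-8 code page.
        string decoded = Encoding.GetEncoding(65001).GetString(utf8);
        Console.WriteLine(decoded == "\u00E9");           // True

        // List the encodings this runtime knows about.
        foreach (EncodingInfo info in Encoding.GetEncodings())
        {
            Console.WriteLine($"{info.CodePage}: {info.Name} ({info.DisplayName})");
        }
    }
}
```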
While reading a file, the StreamReader class automatically detects the
most common Unicode encoding types from the byte order mark at the start
of the file, and hence there is usually no need to specify the encoding
type. If it is necessary to do so, an Encoding object can be passed as a
parameter to an overloaded constructor of the StreamReader class, which
is used for reading the file. Similarly, to specify an encoding type while
writing to a file, pass an Encoding object as a parameter to an overloaded
constructor of the StreamWriter class, which is used for writing to the
file. By default (without passing an Encoding object), StreamReader and
StreamWriter use UTF-8.
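Passing an Encoding object explicitly to both constructors could look like
the sketch below. The FileEncodingDemo class and its WriteThenRead helper
are hypothetical, and the temporary file is used only for the
demonstration; the StreamWriter and StreamReader constructor overloads
are the standard System.IO ones.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileEncodingDemo
{
    // Writes text with the given encoding, then reads it back with the same one.
    public static string WriteThenRead(string path, string text, Encoding encoding)
    {
        // The second argument (append: false) overwrites any existing content.
        using (var writer = new StreamWriter(path, false, encoding))
        {
            writer.Write(text);
        }

        using (var reader = new StreamReader(path, encoding))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            string text = "Gr\u00FC\u00DFe";  // a German word with non-ASCII letters
            string result = WriteThenRead(path, text, Encoding.Unicode);
            Console.WriteLine(result == text);  // True
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```

Using the same Encoding object on both sides avoids relying on automatic
detection when the file is read back.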
Note:
While reading files with the UTF-7 encoding type, the encoding type
has to be specified for the file to be read correctly.
The Notepad application should not be used for reading UTF-7 and UTF-32
files.
Thus, while the .NET Framework makes it easier to develop world-ready
applications by providing rich classes, it is necessary to understand
the Unicode standard thoroughly and to implement it carefully in the
application so as to avoid issues that can arise later.