clipped from: www.yoda.arachsys.com   

Unicode and .NET


If all of this sounds rather confusing, don't worry. It's worth being aware of the distinctions above, but they rarely come to the fore. Most of the time you just want to convert some bytes into some characters, and vice versa. This is where the System.Text.Encoding class comes in, along with the System.Char structure (aka char in C#) and the System.String class (aka string in C#).


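The char is the most basic character type. Each char is a single 16-bit UTF-16 code unit, which for most everyday text means a single Unicode character. It takes 2 bytes in memory, and can take a value of 0-65535. Note that not every value in that range is a valid Unicode character on its own: the surrogate values 0xD800-0xDFFF, for instance, are only meaningful in pairs, which is how characters above U+FFFF are represented.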
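The size and range described above can be checked directly. The following is a minimal sketch, assuming C# 9 or later (for top-level statements); the variable names and the choice of U+1F600 as a sample character are purely illustrative.

```csharp
using System;

// A char occupies 2 bytes, so its values run from 0 to 65535.
int sizeInBytes = sizeof(char);
int maxValue = char.MaxValue;

// U+1F600 does not fit in 16 bits, so it is stored as two chars (a surrogate pair).
string sample = char.ConvertFromUtf32(0x1F600);
int charCount = sample.Length;
bool isPair = char.IsSurrogatePair(sample[0], sample[1]);

// Half of such a pair is a legal char value, but not a character in its own right.
char lone = '\uD83D';
bool loneIsSurrogate = char.IsSurrogate(lone);

Console.WriteLine($"{sizeInBytes} bytes per char, maximum value {maxValue}");
Console.WriteLine($"U+1F600 occupies {charCount} chars, surrogate pair: {isPair}");
Console.WriteLine($"0xD83D alone is a surrogate: {loneIsSurrogate}");
```

This is why the Length of a string counts chars rather than what a user would think of as characters.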


A string is fundamentally just a sequence of chars. It's immutable, which means that once you've created a string instance (however you've done it) you can't change it - the various methods in the string class which sound as if they change the string in fact return a new string containing the original character sequence with the changes applied.
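To see immutability in action, here is a small sketch (again assuming C# 9 or later for top-level statements; the variable names are illustrative). Methods such as ToUpperInvariant and Replace hand back a new string and leave the original untouched.

```csharp
using System;

string original = "hello";

// Each call returns a new string; neither modifies 'original'.
string upper = original.ToUpperInvariant();
string replaced = original.Replace('l', 'L');

Console.WriteLine(original); // hello
Console.WriteLine(upper);    // HELLO
Console.WriteLine(replaced); // heLLo
```

A common slip is calling such a method and discarding the result, which leaves the variable holding the unchanged original.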


The System.Text.Encoding class provides facilities for converting arrays of bytes to arrays of characters, or strings, and vice versa. The class itself is abstract; various implementations are provided by .NET and can easily be instantiated, and users can write their own derived classes if they wish. (This is quite a rare requirement, however - most of the time you'll be fine with the built-in implementations.)


An encoding can also provide separate encoders and decoders, which maintain state between calls. This is necessary for multi-byte character encoding schemes, where you may not be able to decode all the bytes you have so far received from a stream. For instance, if a UTF-8 decoder receives 0x41 0xc2, it can return the first character (a capital A) straight away, but 0xc2 is the first byte of a two-byte sequence, so the decoder must wait for the third byte before it can determine what the second character is.
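The difference between a one-shot conversion and a stateful decoder can be sketched as follows. This assumes C# 9 or later (top-level statements), and it assumes, purely for illustration, that the third byte to arrive is 0xa3, which together with 0xc2 encodes the pound sign (U+00A3). The variable names are hypothetical.

```csharp
using System;
using System.Text;

// One-shot conversion: string -> bytes -> string.
byte[] encoded = Encoding.UTF8.GetBytes("A\u00A3");
string roundTripped = Encoding.UTF8.GetString(encoded);

// Stateful decoding: the bytes arrive in two separate chunks.
Decoder decoder = Encoding.UTF8.GetDecoder();
char[] buffer = new char[4];

// First chunk: 0x41 decodes to 'A'; 0xc2 is held back inside the decoder.
int firstCount = decoder.GetChars(new byte[] { 0x41, 0xC2 }, 0, 2, buffer, 0);

// Second chunk (assumed byte): 0xa3 completes the pending two-byte sequence.
int secondCount = decoder.GetChars(new byte[] { 0xA3 }, 0, 1, buffer, firstCount);

string decoded = new string(buffer, 0, firstCount + secondCount);

Console.WriteLine(BitConverter.ToString(encoded));
Console.WriteLine($"First call produced {firstCount} char(s), second call {secondCount}");
Console.WriteLine(decoded == roundTripped);
```

Calling Encoding.UTF8.GetString on each chunk separately would not work here, since that method keeps no state between calls and so cannot join a character split across two chunks.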