Character encoding problem

Unicode is a character set.
It assigns every character a unique ID (a code point); that is, every character in the world gets its own unique ID.
There are several encoding rules that implement the Unicode character set, such as UTF-8, UTF-16, and UTF-32.
For example, the Unicode character set only specifies that a character's code point is, say, 12345; how many bytes that character actually occupies depends on the encoding rule you choose.
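
To make the split between code point and encoding concrete, here is a minimal Python sketch. The sample character 你 and the variable names are chosen only for illustration; the `-le` codec variants are used so that no byte-order mark is added to the byte counts.

```python
# One character has one code point, but its size in bytes depends on the encoding.
ch = "你"                    # sample character, chosen for illustration
code_point = ord(ch)         # the ID that Unicode assigns to this character

# Encode the same code point with three encoding rules and count the bytes.
# The "-le" variants are used so that no byte-order mark is prepended.
sizes = {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

print(f"U+{code_point:04X}")  # the code point in the usual U+XXXX notation
print(sizes)                  # bytes used under each encoding rule
```

The code point is the same in all three cases; only the byte representation changes.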

ANSI is also a character set.
ASCII assigns unique IDs to 128 characters: English letters, digits, and common symbols.
ANSI is an extension of ASCII. The first 128 codes still represent English letters, digits, and common symbols. The codes after that represent the characters of one particular country.
For example, the ANSI encoding for China is GB2312 and the ANSI encoding for Japan is Shift_JIS; that is, each country has its own standard.
The biggest drawback is that the ANSI encodings of different languages are mutually incompatible and cannot be converted into one another directly, so text that mixes several languages ends up garbled.
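
The garbling can be reproduced with a short Python sketch that writes text in one country's encoding and reads it back in another's. The sample string is arbitrary; `gb2312` and `shift_jis` are the names of Python's built-in codecs, and `errors="replace"` is there only so the sketch does not raise if a byte sequence happens to be invalid in the other encoding.

```python
# Bytes written in one national encoding and read with another come out garbled.
text = "中文"                       # sample Chinese text, chosen for illustration
raw = text.encode("gb2312")         # bytes as a GB2312 system would store them

# Reading the same bytes as Shift_JIS yields different characters.
garbled = raw.decode("shift_jis", errors="replace")
# Reading them with the encoding they were written in restores the text.
restored = raw.decode("gb2312")

print(garbled)
print(restored)

# The ASCII range is shared, so plain English letters survive either way.
ascii_same = "abc".encode("gb2312") == "abc".encode("shift_jis") == b"abc"
print(ascii_same)
```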

UTF-8 is an encoding rule: one way of encoding the Unicode character set.
It is a variable-length encoding that uses 8-bit code units. A code point is encoded as 1 to 4 bytes.
English letters, digits, and common symbols take 1 byte.
Most Chinese characters take 3 bytes; a few rarely used Chinese characters take 4 bytes.
For a single-byte character, the first bit of the byte is set to 0. For English text, UTF-8 therefore needs only one byte per character, the same size as ASCII.
For a character of n bytes (1 < n <= 4), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. The remaining bits of these n bytes are filled with the character's Unicode code point, padded with leading zeros.

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
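
The bit patterns above can be turned into a small hand-written encoder. This is only a sketch for illustration: `encode_utf8` is a hypothetical helper name, not a standard library function, and it does not reject surrogate code points the way a real encoder does. Each result is checked against Python's built-in `str.encode`.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by hand, following the bit patterns above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x80:
        return bytes([code_point])         # 0xxxxxxx
    if code_point < 0x800:
        n = 2                              # 110xxxxx 10xxxxxx
    elif code_point < 0x10000:
        n = 3                              # 1110xxxx 10xxxxxx 10xxxxxx
    else:
        n = 4                              # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    out = bytearray()
    remaining = code_point
    # Build the continuation bytes from last to first: 10 + the low 6 bits.
    for _ in range(n - 1):
        out.append(0b10000000 | (remaining & 0b00111111))
        remaining >>= 6
    # First byte: n one-bits, a zero bit, then the leftover high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    out.append(prefix | remaining)
    out.reverse()
    return bytes(out)


for ch in "Aé中😀":
    encoded = encode_utf8(ord(ch))
    # Compare with the built-in encoder and show the bits of each byte.
    print(ch, encoded == ch.encode("utf-8"), " ".join(f"{b:08b}" for b in encoded))
```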

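UTF-16 encodes a code point as 2 bytes or 4 bytes.
UTF-32 uniformly encodes every code point as 4 bytes.
These two encodings generally use space less efficiently than UTF-8, especially for English text (although UTF-16 needs only 2 bytes for most Chinese characters, where UTF-8 needs 3), but their encoding rules are simpler, particularly the fixed-width UTF-32.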
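
A short Python sketch of what the fixed width buys: with UTF-32 the i-th character always sits at byte offset 4*i, while UTF-16 needs 2 or 4 bytes depending on the character. The helper names `utf16_len` and `utf32_char_at` are made up for this example, and the little-endian codecs are used so that no byte-order mark is added.

```python
def utf16_len(ch: str) -> int:
    """Number of bytes UTF-16 needs for one character (2 or 4)."""
    return len(ch.encode("utf-16-le"))


def utf32_char_at(data: bytes, index: int) -> str:
    """Read the character at position `index` from little-endian UTF-32 bytes."""
    unit = data[4 * index : 4 * index + 4]   # fixed width: plain arithmetic
    return chr(int.from_bytes(unit, "little"))


text = "a中😀"                               # sample text, chosen for illustration
data = text.encode("utf-32-le")

print([utf16_len(c) for c in text])          # per-character size in UTF-16
print(len(data))                             # always 4 bytes per character
print(utf32_char_at(data, 1))                # direct lookup, no scanning needed
```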

Copyright notice
This article was written by [Every day without dancing is a betrayal of life]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/203/202207220356575063.html