Python 3


About the UTF-8 Standard and Python Language

What is UTF-8?

UTF-8 (Unicode Transformation Format, 8-bit) is a way of representing the characters of all the world's writing systems as sequences of bytes. Before Unicode and UTF-8, programmers had to manage a different character set for each language. Now all of these characters can appear together in a single piece of code.

Why is it important?

Imagine developing an application to be used worldwide. People from various countries will want to use your application and input text in their native language. UTF-8 allows you to do this without worrying that the input text will be misinterpreted or that some characters will not display correctly.

How do we work with UTF-8 in Python?

In Python, things are surprisingly simple. Strings in Python 3 are Unicode, so you can use characters from any language directly in a string literal, and Python will handle them without problems:

text = "Привет, Hello, 你好!"
print(text)

This code displays all the characters correctly, regardless of the language they come from, provided the console itself can show them. This HTML page is also encoded as UTF-8, which is why the characters appear correctly here.

You can also declare that a Python source file is in UTF-8 by adding a comment on the first or second line of the file. In Python 3 this is optional, because UTF-8 is already the default source encoding:

# -*- coding: utf-8 -*-

If we want to introduce the copyright symbol (©) in Python using its Unicode code, we will type the following:

s = '\u00A9'

Above, we created a string using the Unicode escape sequence "\u00A9". Python 3 strings are sequences of Unicode code points, so displaying the value of s shows the corresponding symbol (©). Note that the "\u" prefix is required and must be followed by exactly four hexadecimal digits. Without it, Python treats the text as the ordinary characters 00A9 rather than as a code point.
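
As a small illustration of the escape sequence described above, the sketch below builds the same symbol in three ways and converts it back to its code point. The variable names are only illustrative.

```python
import unicodedata

# Three equivalent ways to obtain the copyright sign (U+00A9).
from_escape = '\u00A9'             # escape with four hexadecimal digits
from_name = '\N{COPYRIGHT SIGN}'   # escape using the official Unicode name
from_number = chr(0xA9)            # build the character from its code point

same = from_escape == from_name == from_number

# ord() goes the other way: from a character to its code point.
code_point = ord(from_escape)

# unicodedata.name() returns the official Unicode name of a character.
name = unicodedata.name(from_escape)

# Without the "\u" prefix the text is just four ordinary characters.
plain = '00A9'

print(from_escape, same, hex(code_point), name, len(plain))
```

Here chr() and ord() are the built-in functions for converting between a code point and a one-character string.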


More Details

UTF-8 is a character encoding standard used in computing to represent text in human languages. This standard associates each character (letter, digit, punctuation mark, symbol, etc.) with a unique sequence of bits (0 and 1), allowing computers to understand and display various languages and symbols from around the world.

UTF-8 is a variable-length encoding: each character takes between one and four bytes. ASCII characters (such as the letters of the English alphabet) take a single byte, while characters from other writing systems, symbols, and emojis take two, three, or four bytes. This keeps UTF-8 compact for text that is mostly ASCII, while still covering the full range of Unicode characters.
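
To see the variable length in practice, the following sketch counts the bytes that a few characters occupy once encoded. The four sample characters are chosen here only for illustration.

```python
# Sample characters, written as escapes so the code points are explicit.
samples = [
    'A',            # U+0041, Latin capital letter A
    '\u00e9',       # U+00E9, e with acute accent
    '\u20ac',       # U+20AC, euro sign
    '\U0001F30C',   # U+1F30C, Milky Way emoji
]

# Map each character to the number of bytes in its UTF-8 encoding.
byte_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in byte_lengths.items():
    print(f'U+{ord(ch):04X} {ch!r}: {n} byte(s)')
```

Note that len() applied to the string counts characters, while len() applied to the encoded bytes object counts bytes.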

For example, let's take the letter "A". In Unicode it has a numeric code point, in this case: "U+0041". In UTF-8 representation, this letter is encoded with the bit sequence 01000001. This representation system allows for the unification of how computers interpret and display letters, even if they come from different languages.

Another important point is that UTF-8 is not limited to letters and alphabetic characters. It can represent mathematical symbols, emojis, punctuation marks, and much more. For example, let's take the Milky Way emoji ("🌌"), whose Unicode code point is "U+1F30C". In UTF-8, it is encoded with the bit sequence 11110000 10011111 10001100 10001100.
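
The bit sequences quoted above can be reproduced with a few lines of Python. This is a minimal sketch; utf8_bits is a helper name invented for this example, not a built-in.

```python
def utf8_bits(text):
    """Return the UTF-8 bytes of text as space-separated groups of 8 bits."""
    return ' '.join(f'{byte:08b}' for byte in text.encode('utf-8'))

print(utf8_bits('A'))            # letter A, U+0041
print(utf8_bits('\U0001F30C'))   # Milky Way emoji, U+1F30C
```

Iterating over a bytes object yields integers from 0 to 255, which is why each byte can be formatted directly as an 8-digit binary number.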

Over the years, the widespread adoption of UTF-8 in programming and on the internet has made global communication and collaboration possible without the need for complex conversions and adaptations between different character sets. Therefore, in today's digital world, knowledge of UTF-8 is vital for creating intercultural and interoperable applications and websites.

Encoding and Decoding Strings

The Python programming language provides built-in methods for encoding and decoding strings. For example, the encode() method of a string converts (encodes) it into a bytes object:

emoji = '😅'
print(emoji.encode('utf-8'))

The console output is as follows:

b'\xf0\x9f\x98\x85'

Let's break it down a bit. The displayed expression represents a sequence of bytes in UTF-8 encoding format, which is a common way to represent Unicode characters in binary form:

\xf0 (in binary, "11110000")

This is the start byte of a multi-byte sequence. Its leading bits, 11110, indicate that the character is encoded in four bytes; the remaining three bits hold the highest bits of the code point.

\x9f ("10011111")

The second byte in the sequence. Like every continuation byte, it begins with 10, and its remaining six bits carry part of the character's code point.

\x98 ("10011000")

The third byte in the sequence, a continuation byte carrying the next six bits of the code point.

\x85 ("10000101")

The last byte in the sequence, a continuation byte carrying the lowest six bits of the code point.

Thus, the emoji can be written in bits (4 bytes) as: "11110000 10011111 10011000 10000101".

The prefix "b" in the expression indicates that the value is a bytes object (a sequence of 8-bit bytes) rather than a text string.
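
To make the role of each byte concrete, the sketch below rebuilds the code point by hand from the four bytes discussed above. It handles only this four-byte case and is meant as an illustration, not as a general decoder; in real code, decode() does this work.

```python
data = b'\xf0\x9f\x98\x85'

# Start byte 11110xxx: keep only the low 3 bits.
code_point = data[0] & 0b00000111

# Each continuation byte 10xxxxxx contributes its low 6 bits.
for byte in data[1:]:
    if byte & 0b11000000 != 0b10000000:
        raise ValueError('expected a continuation byte')
    code_point = (code_point << 6) | (byte & 0b00111111)

# Compare the manual result with Python's own decoder.
matches_builtin = chr(code_point) == data.decode('utf-8')

print(f'U+{code_point:X}', chr(code_point), matches_builtin)
```

The 3 + 6 + 6 + 6 payload bits give 21 bits in total, which is enough to hold any Unicode code point.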


Similarly, decoding is done using the decode() method of a bytes object:

binary = b'\xf0\x9f\x8c\x8d'
print(binary.decode('utf-8'))

In the console, the corresponding character (the "Earth Globe Europe-Africa" emoji, 🌍) is displayed.
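
Encoding and decoding are inverse operations, which the sketch below checks with a round trip. It also shows, as an assumption about a situation you may meet in practice, what happens when the bytes are not valid UTF-8. The sample string and the truncated byte sequence are illustrative choices.

```python
original = 'Привет, Hello, 你好!'

# Encode to bytes, then decode back to text.
encoded = original.encode('utf-8')
round_trip = encoded.decode('utf-8')
print(round_trip == original, len(original), len(encoded))

# Cutting a four-byte sequence short makes the data invalid UTF-8.
truncated = b'\xf0\x9f\x8c'
try:
    truncated.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors='replace' substitutes U+FFFD (the replacement character)
# for the invalid bytes instead of raising an exception.
lenient = truncated.decode('utf-8', errors='replace')
print(strict_failed, repr(lenient))
```

By default decode() is strict and raises UnicodeDecodeError on invalid input; the errors parameter lets you choose a more tolerant behavior when that is acceptable.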