In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are unicode. The unicode function does not exist anymore. See answer from @Noumenon
In Python 2
>>> plain_string = "Hi!"
>>> unicode_string = u"Hi!"
>>> type(plain_string), type(unicode_string)
(<type 'str'>, <type 'unicode'>)
^ This is the difference between a byte string (plain_string) and a unicode string.
>>> s = "Hello!"
>>> u = unicode(s, "utf-8")
^ Converting to unicode and specifying the encoding.
In Python 3
All strings are unicode. The unicode function does not exist anymore. See answer from @Noumenon
If the methods above don't work, you can also tell Python to ignore portions of a string that it can't convert to utf-8:
stringnamehere.decode('utf-8', 'ignore')
Videos
Literal strings are unicode by default in Python3.
Assuming that text is a bytes object, just use text.decode('utf-8')
unicode of Python2 is equivalent to str in Python3, so you can also write:
str(text, 'utf-8')
if you prefer.
What's new in Python 3.0 says:
All text is Unicode; however encoded Unicode is represented as binary data
If you want to ensure you are outputting utf-8, here's an example from this page on unicode in 3.0:
b'\x80abc'.decode("utf-8", "strict")
Here's a class that overrides the representation of the string it wraps:
>>> class OctUTF8:
... def __init__(self,s):
... self.s = s.encode()
... def __repr__(self):
... return "b'" + ''.join(f'\\{n:03o}' for n in self.s) + "'"
...
>>> s='õ'
>>> OctUTF8(s)
b'\303\265'
This representation can be evaluated as a byte string and decoded back to the original:
>>> eval(repr(OctUTF8(s))).decode()
'õ'
First, you can use ord() to convert a character in a string
to it's Unicode form, then, you can use oct():
print(oct(ord("õ")))
Output:
0o365
The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.
You cannot make a unicode value from a byte string by adding u in front of it. But if you called str.decode() with the right encoding, you get a unicode value. Vice-versa, you can encode unicode objects to byte strings with unicode.encode().
Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back in to a Python interpreter and get an object with the same value.
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
>>> chr(0x24E1)
'ⓡ'
>>> chr(0x24E9)
'ⓩ'
>>> chr(0x24E7)
'ⓧ'
doc: https://docs.python.org/3/howto/unicode.html