decode the bytes to produce a str:
b = b'1234'
print(b.decode('utf-8')) # '1234'
Answer from hiro protagonist on Stack Overflowdecode the bytes to produce a str:
b = b'1234'
print(b.decode('utf-8')) # '1234'
The object you are printing is not a string, but rather a bytes object as a byte literal.
Consider creating a byte object by typing a byte literal (literally defining a byte object without actually using a byte object e.g. by typing b'') and converting it into a string object encoded in utf-8. (Note that converting here means decoding)
byte_object= b"test" # byte object by literally typing characters
print(byte_object) # Prints b'test'
print(byte_object.decode('utf8')) # Prints "test" without quotations
We simply applied the .decode(utf8) function.
String literals are described by the following lexical definitions:
https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
stringliteral ::= stringprefix
stringprefix ::= "r" | "u" | "R" | "U"
shortstring ::= "'" shortstringitem* "'" | '"' shortstringitem* '"'
longstring ::= "'''" longstringitem* "'''" | '"""' longstringitem* '"""'
shortstringitem ::= shortstringchar | stringescapeseq
longstringitem ::= longstringchar | stringescapeseq
shortstringchar ::= <any source character except "\" or newline or the quote>
longstringchar ::= <any source character except "\">
stringescapeseq ::= "\" <any source character>
bytesliteral ::= bytesprefix(shortbytes | longbytes)
bytesprefix ::= "b" | "B" | "br" | "Br" | "bR" | "BR" | "rb" | "rB" | "Rb" | "RB"
shortbytes ::= "'" shortbytesitem* "'" | '"' shortbytesitem* '"'
longbytes ::= "'''" longbytesitem* "'''" | '"""' longbytesitem* '"""'
shortbytesitem ::= shortbyteschar | bytesescapeseq
longbytesitem ::= longbyteschar | bytesescapeseq
shortbyteschar ::= <any ASCII character except "\" or newline or the quote>
longbyteschar ::= <any ASCII character except "\">
bytesescapeseq ::= "\" <any ASCII character>
input:
s = 'ਅ'
a = s.encode('ascii', 'backslashreplace')
print(a)
output:
b'\\u0a05'
how do I get rid of the b'\ ? i just want it to say \u0a05
python - Is there any way to remove the b'' from decoded base64 output? - Stack Overflow
Removing certain strings from base64 output python - Stack Overflow
python - Why do I need 'b' to encode a string with Base64? - Stack Overflow
Hot Linked Questions - Stack Overflow
base64 encoding takes 8-bit binary byte data and encodes it uses only the characters A-Z, a-z, 0-9, +, /* so it can be transmitted over channels that do not preserve all 8-bits of data, such as email.
Hence, it wants a string of 8-bit bytes. You create those in Python 3 with the b'' syntax.
If you remove the b, it becomes a string. A string is a sequence of Unicode characters. base64 has no idea what to do with Unicode data, it's not 8-bit. It's not really any bits, in fact. :-)
In your second example:
>>> encoded = base64.b64encode('data to be encoded')
All the characters fit neatly into the ASCII character set, and base64 encoding is therefore actually a bit pointless. You can convert it to ascii instead, with
>>> encoded = 'data to be encoded'.encode('ascii')
Or simpler:
>>> encoded = b'data to be encoded'
Which would be the same thing in this case.
* Most base64 flavours may also include a = at the end as padding. In addition, some base64 variants may use characters other than + and /. See the Variants summary table at Wikipedia for an overview.
Short Answer
You need to push a bytes-like object (bytes, bytearray, etc) to the base64.b64encode() method. Here are two ways:
>>> import base64
>>> data = base64.b64encode(b'data to be encoded')
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Or with a variable:
>>> import base64
>>> string = 'data to be encoded'
>>> data = base64.b64encode(string.encode())
>>> print(data)
b'ZGF0YSB0byBiZSBlbmNvZGVk'
Why?
In Python 3, str objects are not C-style character arrays (so they are not byte arrays), but rather, they are data structures that do not have any inherent encoding. You can encode that string (or interpret it) in a variety of ways. The most common (and default in Python 3) is utf-8, especially since it is backwards compatible with ASCII (although, as are most widely-used encodings). That is what is happening when you take a string and call the .encode() method on it: Python is interpreting the string in utf-8 (the default encoding) and providing you the array of bytes that it corresponds to.
Base-64 Encoding in Python 3
Originally the question title asked about Base-64 encoding. Read on for Base-64 stuff.
base64 encoding takes 6-bit binary chunks and encodes them using the characters A-Z, a-z, 0-9, '+', '/', and '=' (some encodings use different characters in place of '+' and '/'). This is a character encoding that is based off of the mathematical construct of radix-64 or base-64 number system, but they are very different. Base-64 in math is a number system like binary or decimal, and you do this change of radix on the entire number, or (if the radix you're converting from is a power of 2 less than 64) in chunks from right to left.
In base64 encoding, the translation is done from left to right; those first 64 characters are why it is called base64 encoding. The 65th '=' symbol is used for padding, since the encoding pulls 6-bit chunks but the data it is usually meant to encode are 8-bit bytes, so sometimes there are only two or 4 bits in the last chunk.
Example:
>>> data = b'test'
>>> for byte in data:
... print(format(byte, '08b'), end=" ")
...
01110100 01100101 01110011 01110100
>>>
If you interpret that binary data as a single integer, then this is how you would convert it to base-10 and base-64 (table for base-64):
base-2: 01 110100 011001 010111 001101 110100 (base-64 grouping shown)
base-10: 1952805748
base-64: B 0 Z X N 0
base64 encoding, however, will re-group this data thusly:
base-2: 011101 000110 010101 110011 011101 00(0000) <- pad w/zeros to make a clean 6-bit chunk
base-10: 29 6 21 51 29 0
base-64: d G V z d A
So, 'B0ZXN0' is the base-64 version of our binary, mathematically speaking. However, base64 encoding has to do the encoding in the opposite direction (so the raw data is converted to 'dGVzdA') and also has a rule to tell other applications how much space is left off at the end. This is done by padding the end with '=' symbols. So, the base64 encoding of this data is 'dGVzdA==', with two '=' symbols to signify two pairs of bits will need to be removed from the end when this data gets decoded to make it match the original data.
Let's test this to see if I am being dishonest:
>>> encoded = base64.b64encode(data)
>>> print(encoded)
b'dGVzdA=='
Why use base64 encoding?
Let's say I have to send some data to someone via email, like this data:
>>> data = b'\x04\x6d\x73\x67\x08\x08\x08\x20\x20\x20'
>>> print(data.decode())
>>> print(data)
b'\x04msg\x08\x08\x08 '
>>>
There are two problems I planted:
- If I tried to send that email in Unix, the email would send as soon as the
\x04character was read, because that is ASCII forEND-OF-TRANSMISSION(Ctrl-D), so the remaining data would be left out of the transmission. - Also, while Python is smart enough to escape all of my evil control characters when I print the data directly, when that string is decoded as ASCII, you can see that the 'msg' is not there. That is because I used three
BACKSPACEcharacters and threeSPACEcharacters to erase the 'msg'. Thus, even if I didn't have theEOFcharacter there the end user wouldn't be able to translate from the text on screen to the real, raw data.
This is just a demo to show you how hard it can be to simply send raw data. Encoding the data into base64 format gives you the exact same data but in a format that ensures it is safe for sending over electronic media such as email.
Instead of encodestring consider using b64encode. Later does not add \n characters. e.g.
In [11]: auth = b'[email protected]:passWORD'
In [12]: base64.encodestring(auth)
Out[12]: b'dXNlcm5hbWVAZG9tYWluLmNvbTpwYXNzV09SRA==\n'
In [13]: base64.b64encode(auth)
Out[13]: b'dXNlcm5hbWVAZG9tYWluLmNvbTpwYXNzV09SRA=='
It produces identical encoded string except the \n
Following code would work
auth_base64 = auth_base64.decode('utf-8').replace('\n', '')