It's probably a consequence of the parsing algorithm used. A simple mental model is that the tokenizer attempts to match all the token patterns there are, and recognizes the longest match it finds. At a lower level, the tokenizer works character by character, and makes each decision based only on the current state and the input character – there shouldn't be any backtracking or re-reading of input.
After joining patterns with common prefixes – in this case, the pattern for int literals and the integral part of the pattern of float literals – what happens in the tokenizer is that it:
- Reads the 1, and enters the state that indicates "reading either a float or an int literal"
- Reads the ., and enters the state "reading a float literal"
- Reads the _, which cannot be part of a float literal. The tokenizer emits 1. as a float literal token.
- Carries on tokenizing starting with the _, and eventually emits __class__ as an identifier token.
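The steps above can be sketched as a toy, hand-written tokenizer. This is only an illustration of the single-pass, no-backtracking idea, not CPython's actual tokenizer; the function name `tokenize_sketch` and the token names are made up for this example, and exponents, complex suffixes, and strings are deliberately left out.

```python
def tokenize_sketch(source):
    """Toy tokenizer: one pass, one character at a time, no backtracking."""
    tokens = []
    i = 0
    n = len(source)
    while i < n:
        ch = source[i]
        if ch.isspace():
            # Whitespace only separates tokens; it is never part of one here.
            i += 1
        elif ch.isdigit():
            start = i
            kind = "IntLiteral"  # state: "reading either a float or an int"
            while i < n and source[i].isdigit():
                i += 1
            if i < n and source[i] == ".":
                kind = "FloatLiteral"  # state: "reading a float"
                i += 1
                while i < n and source[i].isdigit():
                    i += 1
            # Whatever character ended the loop decides what was just read.
            tokens.append((kind, source[start:i]))
        elif ch.isalpha() or ch == "_":
            start = i
            while i < n and (source[i].isalnum() or source[i] == "_"):
                i += 1
            tokens.append(("Identifier", source[start:i]))
        else:
            # Everything else is treated as a one-character operator.
            tokens.append(("Operator", ch))
            i += 1
    return tokens


print(tokenize_sketch("1.__class__"))
```

Note that the `.` is consumed as soon as it follows the digits; the sketch never looks ahead to check whether an identifier comes next.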
Aside: This tokenizing approach is also the reason why common languages have the syntax restrictions they have. E.g. identifiers contain letters, digits, and underscores, but cannot start with a digit. If that was allowed, 123abc could be intended as either an identifier, or the integer 123 followed by the identifier abc. A lex-like tokenizer would recognize this as the former since it leads to the longest single token, but nobody likes having to keep details like this in their head when trying to read code. Or when trying to write and debug the tokenizer for that matter.
The parser then tries to process the token stream:
<FloatLiteral: '1.'> <Identifier: '__class__'>
In Python, a literal directly followed by an identifier – without an operator between the tokens – makes no sense, so the parser bails. This also means that the reason why Python would complain about 123abc being invalid syntax isn't the tokenizer error "the character a isn't valid in an integer literal", but the parser error "the identifier abc cannot directly follow the integer literal 123".
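A quick way to see the rejection is to hand the source text to the built-in `compile()`. The helper name `is_valid_expression` is made up for this sketch; it deliberately ignores the error message, since the exact wording (and which stage reports it) differs between Python versions.

```python
def is_valid_expression(source):
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# "123 + abc" is included as a control: with an operator between the
# literal and the identifier, the expression compiles fine.
for snippet in ("1.__class__", "123abc", "123 + abc"):
    print(repr(snippet), is_valid_expression(snippet))
```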
The reason why the tokenizer can't recognize the 1 as an int literal is that the character that makes it leave the float-or-int state determines what it just read. If it's ., it was the start of a float literal, which might continue afterwards. If it's something else, it was a complete int literal token.
It's not possible for the tokenizer to "go back" and re-read the previous input as something else. In fact, the tokenizer is at too low a level to care about what an "attribute access" is and handle such ambiguities.
Now, your second example is valid because the tokenizer knows a float literal can only have one . in it. More precisely: the first . makes it transition from the float-or-int state to the float state. In this state, it only expects digits (or an E for scientific/engineering notation, a j for complex numbers…) to continue the float literal. The first character that's not a digit etc. (i.e. the second .) is definitely no longer part of the float literal, and the tokenizer can emit the finished token. The token stream for your second example will thus be:
<FloatLiteral: '1.'> <Operator: '.'> <Identifier: '__class__'>
Which, of course, the parser then recognizes as valid Python. We now also know enough to see why the suggested workarounds help. In Python, separating tokens with whitespace is optional – unlike, say, in Lisp. Conversely, whitespace does separate tokens. (That is, no tokens except string literals may contain whitespace; it's merely skipped between tokens.) So the code:
1 .__class__
is always tokenized as
<IntLiteral: '1'> <Operator: '.'> <Identifier: '__class__'>
And since a closing parenthesis cannot appear in an int literal, this:
(1).__class__
gets read as this:
<Operator: '('> <IntLiteral: '1'> <Operator: ')'> <Operator: '.'> <Identifier: '__class__'>
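The two workaround token streams can be checked against Python's own tokenizer through the standard-library `tokenize` module. This is a small sketch; the helper name `significant_tokens` is made up here, and the module reports its own token names (`NUMBER`, `OP`, `NAME`) rather than the labels used above.

```python
import io
import tokenize


def significant_tokens(source):
    """Return (token name, text) pairs, skipping line-end bookkeeping tokens."""
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    readline = io.StringIO(source + "\n").readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in skip
    ]


for snippet in ("1 .__class__", "(1).__class__"):
    print(snippet, "->", significant_tokens(snippet))
```

In both cases the 1 comes out as its own `NUMBER` token, followed by a separate `.` operator.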
The above implies that, amusingly, the following is also valid:
1..__class__ # => <type 'float'>
The decimal part of a float literal is optional, and the second . read will make the preceding input be recognized as one.
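As a sanity check, the three accepted spellings can be evaluated side by side. This is just a sketch using `eval()` on fixed strings; the variable names are made up for the example.

```python
snippets = ["1 .__class__", "(1).__class__", "1..__class__"]

# Evaluate each spelling and record which class the literal belongs to.
results = {s: eval(s) for s in snippets}

for s, cls in results.items():
    print(f"{s!r:18} -> {cls.__name__}")
```

The first two spellings give int, while the double-dot form gives float, because 1. has already been read as a float literal before the attribute access.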
It is a tokenization issue... the . is parsed as the beginning of the fractional part of a floating point number.
You can use
(1).__class__
to avoid the problem.
Try this instead,
print(
"{:.3f}% {} ({} sentences)".format(pcent, gender, nsents)
)
Refer to the latest docs for more examples, and check your Python version!
You could also use {:.3%} instead of {:.3f}%.
It will transform the value into percentages automatically.
That means "{:.3%}".format(0.3) will print "30.000%", while you have to write "{:.3f}%".format(0.3 * 100) to get the same "30.000%". If you want no decimals at all, "{:.0%}".format(0.3) prints "30%".
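A quick check of what each format spec actually produces, as a minimal sketch (the variable names are made up for the example):

```python
value = 0.3

# The % presentation type multiplies by 100 and appends the percent sign.
as_percent = "{:.3%}".format(value)

# The f presentation type needs the scaling and the sign done by hand.
manual = "{:.3f}%".format(value * 100)

# Precision 0 drops the decimals entirely.
no_decimals = "{:.0%}".format(value)

print(as_percent)
print(manual)
print(no_decimals)
```

The number after the dot is the count of decimal places in both cases, so .3 always gives three digits after the decimal point.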