Possible solutions to codepoint slicing

I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

If you know the exact byte indices, you can slice a string:

let text = "Hello привет";
println!("{}", &text[2..10]);

This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):

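let text = "Hello привет";
// Byte offset at which the char with index 8 starts. nth() returns None
// (so unwrap() panics) if the string has fewer than 9 chars.
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);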
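
The chars() plus char::len_utf8() alternative mentioned above could look roughly like the sketch below. The names byte_offset and slice_chars are made up for this illustration and are not std APIs. Out-of-range indices are clamped to the end of the string (similar to Python slicing), which is a design choice of this sketch.

```rust
/// Byte offset at which the char with index `char_idx` starts.
/// Indices past the end are clamped to `text.len()`.
/// (Hypothetical helper, not part of std.)
fn byte_offset(text: &str, char_idx: usize) -> usize {
    // Sum the UTF-8 widths of all chars before `char_idx`: O(n).
    text.chars().take(char_idx).map(char::len_utf8).sum()
}

/// Substring covering the codepoints `start..end` (char indices).
/// Returns `None` if `start > end`.
fn slice_chars(text: &str, start: usize, end: usize) -> Option<&str> {
    // `str::get` is the non-panicking form of `&text[a..b]`.
    text.get(byte_offset(text, start)..byte_offset(text, end))
}

fn main() {
    let text = "Hello привет";
    println!("{:?}", slice_chars(text, 2, 8)); // Some("llo пр")
    println!("{:?}", slice_chars(text, 6, 100)); // Some("привет")
    println!("{:?}", slice_chars(text, 8, 2)); // None
}
```

Returning a borrowed &str keeps the helper allocation-free; the price is walking the string twice, once per boundary.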

As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.

let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
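
If an owned String is acceptable and only one substring is needed, the intermediate Vec<char> can probably be skipped by walking the chars() iterator directly. This is a minimal sketch; substring is a hypothetical helper name, and it takes a start index and a length (both counted in codepoints) rather than a range.

```rust
/// Copies `len` codepoints starting at codepoint index `start`
/// into a new `String`. Shorter input simply yields a shorter result.
/// (Hypothetical helper, not part of std.)
fn substring(text: &str, start: usize, len: usize) -> String {
    text.chars().skip(start).take(len).collect()
}

fn main() {
    let text = "Hello привет";
    println!("{}", substring(text, 2, 6)); // llo пр
}
```

Collecting into a Vec<char> first is still the better fit when many different ranges of the same text are needed, since the O(n) walk then happens only once.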

Why is this not easier?

As you can see, neither of these solutions is all that great. This is intentional, for two reasons:

As str is simply a UTF-8 buffer, indexing by Unicode codepoint is an O(n) operation. People usually expect the [] operator to be O(1). Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
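
To make the cost concrete, here is a small sketch of the difference between byte positions and codepoint positions. nth_char is a name invented for this example; it wraps the std call chars().nth(), which has to decode the string from the start.

```rust
/// The codepoint with index `n`, found by walking from the start: O(n).
/// (Hypothetical helper, not part of std.)
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

fn main() {
    let text = "Hello привет";
    println!("{}", text.len()); // 18 (bytes, known in O(1))
    println!("{}", text.chars().count()); // 12 (codepoints, needs a full walk)
    println!("{:?}", nth_char(text, 6)); // Some('п')
    // Byte 7 is in the middle of the two-byte 'п', so it is not a valid
    // slice boundary and `&text[..7]` would panic.
    println!("{}", text.is_char_boundary(7)); // false
}
```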

But the more important reason:

Unicode codepoints are generally not a useful unit

What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices by Unicode codepoint, which is what a Rust char represents. A char is 32 bits wide (21 bits would suffice, but the size is rounded up to a power of two).
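
The size claim can be checked with a few std calls. In this sketch, utf8_widths is a name made up for illustration; everything else is standard library.

```rust
use std::mem::size_of;

/// UTF-8 encoded width in bytes of each codepoint in `text`.
/// (Hypothetical helper, not part of std.)
fn utf8_widths(text: &str) -> Vec<usize> {
    text.chars().map(char::len_utf8).collect()
}

fn main() {
    // A `char` always occupies four bytes in memory...
    println!("{}", size_of::<char>()); // 4
    // ...although the largest valid value fits in 21 bits.
    println!("{:#X}", char::MAX as u32); // 0x10FFFF
    // Inside a `str`, the same codepoints take one to four bytes each.
    println!("{:?}", utf8_widths("aп€😀")); // [1, 2, 3, 4]
}
```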

But what you actually want to do is slice user-perceived characters. That term is deliberately loosely defined: different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster", which can consist of one or more Unicode codepoints. Consider this Python 3 code:

>>> s = "Jürgen"
>>> s[0:2]
'Ju'

Surprising, right? This is because the string above is:

0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...

This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.

Another example:

>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'

Also not what you'd expect. This time, fi is actually the ligature ﬁ, which is one codepoint.
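
Both Python results can be reproduced in Rust by slicing on codepoints. The sketch below uses escape sequences so that the exact codepoints are visible; first_codepoints is a hypothetical helper name.

```rust
/// The first `n` codepoints of `text`, as a new `String`.
/// (Hypothetical helper, not part of std.)
fn first_codepoints(text: &str, n: usize) -> String {
    text.chars().take(n).collect()
}

fn main() {
    // "Jürgen" written with U+0308 COMBINING DIAERESIS after the 'u'.
    let name = "Ju\u{0308}rgen";
    println!("{}", name.chars().count()); // 7, although it looks like 6
    println!("{}", first_codepoints(name, 2)); // Ju (the diaeresis is cut off)

    // "ﬁre" written with U+FB01 LATIN SMALL LIGATURE FI.
    let word = "\u{FB01}re";
    println!("{}", word.chars().count()); // 3, although it looks like 4
    println!("{}", first_codepoints(word, 2)); // ﬁr
}
```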

There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.

So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.

Further resources on this topic:

Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions"
http://utf8everywhere.org/

Answer from Lukas Kalbertodt on Stack Overflow

The Slice Type - The Rust Programming Language

Internally, the slice data structure stores the starting position and the length of the slice, which corresponds to ending_index minus starting_index. So, in the case of let world = &s[6..11];, world would be a slice that contains a pointer to the byte at index 6 of s with a length value of 5. Figure 4-7 shows this in a diagram.

Figure 4-7: A string slice referring to part of a String
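
The pointer-plus-length layout described above can be observed from safe code. This is a sketch that assumes s holds "hello world", as in the Book's example; offset_in is a hypothetical helper that compares raw pointer addresses.

```rust
use std::mem::size_of;

/// Byte offset of `part` within `whole`, computed from the raw pointers.
/// Assumes `part` was sliced out of `whole`. (Hypothetical helper.)
fn offset_in(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}

fn main() {
    let s = String::from("hello world");
    let world = &s[6..11];

    println!("{}", offset_in(&s, world)); // 6: points at byte 6 of s
    println!("{}", world.len()); // 5: ending_index - starting_index
    // A &str is a pointer and a length, so it is two machine words wide.
    println!("{}", size_of::<&str>() == 2 * size_of::<usize>()); // true
}
```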

r/rust on Reddit: Newbie Question: String Slices

December 13, 2020

I am coming from the world of Python, so pointers and references are kind of a new thing for me. I was going through the Book and got to the slices chapter, specifically the "String Slices as Parameters" part. I just don't get why fn first_word(s: &String) -> &str { is worse than fn first_word(s: &str) -> &str {. Is it because it is a pointer to a pointer, if I understood things correctly?

Top answer

1 of 5

Any string data can be viewed as a &str, so with a string slice as the parameter the function can be used with Strings, string literals, and slices alike. A &str is just a fat pointer to your string, so it is cheap to pass around. I like this blog post about slicing vs string vs referencing: https://blog.thoughtram.io/string-vs-str-in-rust/ . It explains the subject perfectly. It may help you a bit with this. It helped me.

2 of 5

It's because you can make a &str that points to a string literal or to part of a String without having to make a copy of the data. &String can only point to the entirety of a String and, to feed a string literal into it, you first have to create a String from it, which makes a copy because string literals are baked into the compiled code itself.
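
A short sketch of the point both answers make: with a &str parameter, the same function accepts a whole String, a string literal, and a slice of a String, with no copies. The body of first_word below follows the byte-scanning approach from the Book's chapter, written from memory rather than quoted.

```rust
/// Returns the first space-delimited word of `s`, or all of `s` if it
/// contains no space.
fn first_word(s: &str) -> &str {
    for (i, &byte) in s.as_bytes().iter().enumerate() {
        if byte == b' ' {
            return &s[..i];
        }
    }
    s
}

fn main() {
    let owned = String::from("hello world");

    // &String coerces to &str automatically (deref coercion).
    println!("{}", first_word(&owned)); // hello
    // A string literal already is a &str; no String has to be allocated.
    println!("{}", first_word("goodbye world")); // goodbye
    // A slice of part of a String works too.
    println!("{}", first_word(&owned[6..])); // world
}
```

With the signature fn first_word(s: &String) -> &str, only the first of the three calls would compile as written.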