Possible solutions to codepoint slicing

I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

If you know the exact byte indices, you can slice a string:

let text = "Hello привет";
println!("{}", &text[2..10]);

This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):

let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
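The byte-offset lookup above can be wrapped in a small helper. This is only a sketch: the names `byte_offset` and `char_slice` are made up for illustration and are not part of the standard library. Both bounds are counted in chars, and the result is `None` when the range does not fit the string.

```rust
/// Byte offset at which the char with index `char_idx` starts.
/// An index equal to the char count maps to `text.len()`, so a slice may
/// end at the end of the string. Hypothetical helper, not a std API.
fn byte_offset(text: &str, char_idx: usize) -> Option<usize> {
    text.char_indices()
        .map(|(pos, _)| pos)
        .chain(std::iter::once(text.len()))
        .nth(char_idx)
}

/// Substring covering chars `start..end`, or `None` if out of range.
/// Walks the string once per bound, so it is O(n) like the code above.
fn char_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    let from = byte_offset(text, start)?;
    let to = byte_offset(text, end)?;
    // `str::get` returns `None` instead of panicking when `from > to`.
    text.get(from..to)
}
```

With these definitions, `char_slice("Hello привет", 2, 8)` yields `Some("llo пр")`, and a range that runs past the end yields `None` rather than a panic.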

As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.

let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());

Why is this not easier?

As you can see, neither of these solutions is all that great. This is intentional, for two reasons:

As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
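A small sketch of where the cost comes from. The helper names `nth_char` and `byte_len` are invented for this example; the methods they call (`chars`, `nth`, `len`) are standard.

```rust
/// Finds the `n`-th char by decoding the UTF-8 buffer from the start.
/// The work grows with `n`, which is why `str` offers no `[]` for char indices.
fn nth_char(text: &str, n: usize) -> Option<char> {
    text.chars().nth(n)
}

/// The byte length is stored next to the pointer, so reading it is O(1).
fn byte_len(text: &str) -> usize {
    text.len()
}
```

For "Hello привет", `byte_len` returns 18 while the string has only 12 chars, so a char index cannot be turned into a byte position without walking the buffer.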

But the more important reason:

Unicode codepoints are generally not a useful unit

What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices Unicode codepoints. This is what a Rust char represents. It is 32 bits wide (21 bits would suffice, but the size is rounded up to a power of 2).

What you actually want is to slice user-perceived characters, but that is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more Unicode codepoints. Consider this Python 3 code:

>>> s = "Jürgen"
>>> s[0:2]
'Ju'

Surprising, right? This is because the string above is:

  • 0x004A LATIN CAPITAL LETTER J
  • 0x0075 LATIN SMALL LETTER U
  • 0x0308 COMBINING DIAERESIS
  • ...

This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
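The same effect can be reproduced in Rust with codepoint-based slicing. This is a sketch under the assumption that the string is stored in decomposed form (u followed by U+0308); `take_codepoints` is a made-up helper name.

```rust
/// Takes the first `n` codepoints (`char`s) of `text`, much like Python's
/// `s[0:n]`. Hypothetical helper for illustration only.
fn take_codepoints(text: &str, n: usize) -> String {
    text.chars().take(n).collect()
}
```

Calling `take_codepoints("Ju\u{308}rgen", 2)` returns "Ju": the combining diaeresis is a separate char, so it is cut off from the letter it belongs to. The string has 7 chars even though a reader sees 6 characters.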

Another example:

>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'

Also not what you'd expect. This time, ﬁ is actually the ligature U+FB01 (LATIN SMALL LIGATURE FI), which is one codepoint.

There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.

So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.


Further resources on this topic:

  • Blogpost "Let's stop ascribing meaning to unicode codepoints"
  • Blogpost "Breaking our Latin-1 assumptions"
  • http://utf8everywhere.org/
Answer from Lukas Kalbertodt on Stack Overflow
🌐
The Rust Programming Language
doc.rust-lang.org › book › ch04-03-slices.html
The Slice Type - The Rust Programming Language
Internally, the slice data structure stores the starting position and the length of the slice, which corresponds to ending_index minus starting_index. So, in the case of let world = &s[6..11];, world would be a slice that contains a pointer to the byte at index 6 of s with a length value of 5. Figure 4-7 shows this in a diagram. Figure 4-7: A string slice referring to part of a String
🌐
Programiz
programiz.com › rust › slice
Rust Slice (With Examples)
A Rust slice is a data type used to access portions of data stored in collections like arrays, vectors and strings.
🌐
DEV Community
dev.to › itachiuchiha › a-slice-of-rust-working-with-the-slice-type-4p8i
A Slice of Rust: Working with the Slice Type - DEV Community
August 3, 2020 - For example, we're using slices in Python like that; py_string = 'Python' # contains indices 0, 1 and 2 print(py_string[0:3]) # Pyt · We're calling this as indexing syntax. Let's assume we have a function like below. In this section, we'll use Rust's example.
2 of 4
12

A UTF-8 encoded string may contain characters which consist of multiple bytes. In your case, п starts at index 6 (inclusive) and ends at position 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.
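A non-panicking way to probe this is `str::get`, which returns `None` when a bound is not on a character boundary. The sketch below assumes the string in question is "Hello привет" (the question's text is not shown here), and `checked_byte_slice` is a hypothetical name.

```rust
/// Slices by byte range without panicking: returns `None` when a bound
/// falls inside a multi-byte character or past the end of the string.
/// Hypothetical helper name; `str::get` itself is a standard method.
fn checked_byte_slice(text: &str, start: usize, end: usize) -> Option<&str> {
    text.get(start..end)
}
```

Under that assumption, `checked_byte_slice("Hello привет", 6, 8)` gives `Some("п")`, while the range 6..7 gives `None` because `is_char_boundary(7)` is false.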

You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):

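fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    // Byte offset of every char, followed by the string length so that a
    // slice may end at the end of the string.
    let mut offsets = string
        .char_indices()
        .map(|(pos, _)| pos)
        .chain(std::iter::once(string.len()));
    let start_pos = offsets.nth(start)?;
    // `offsets` now continues at char index `start + 1`.
    let end_pos = match end - start {
        0 => start_pos,
        n => offsets.nth(n - 1)?,
    };
    Some(&string[start_pos..end_pos])
}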

You may use str::chars() if you are fine with getting a String:

let string: String = text.chars().take(end).skip(start).collect();
Top answer
1 of 3
6

You can't return a reference to a locally allocated String because the string is dropped when the function returns. There's no way to finagle your way around that. A &str is simply a bad match for the type of data you want to return.

The most straightforward fix is to return an owned String.

fn my_func(input: &str) -> String {
    match input {
        "a" => "Alpha".to_string(),
        _ => format!("'{}'", "Quoted" ), 
    }
}

Another is to return a Cow<'_, str>, which can hold either a borrowed or owned string depending on which you have. It's a bit fussy, but it does avoid unnecessary allocations. I only recommend this if efficiency is of utmost importance; otherwise, just return String.

use std::borrow::Cow;

fn my_func(input: &str) -> Cow<'_, str> {
    match input {
        "a" => "Alpha".into(),
        _ => format!("'{}'", "Quoted" ).into(), 
    }
}
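To see which arm allocates, the two Cow variants can be inspected directly. This sketch repeats the Cow version of `my_func` with the variants spelled out; `is_borrowed` is a made-up helper for the demonstration.

```rust
use std::borrow::Cow;

/// Same behavior as the `Cow` version above, with explicit variants.
fn my_func(input: &str) -> Cow<'_, str> {
    match input {
        // A string literal is borrowed: no allocation happens here.
        "a" => Cow::Borrowed("Alpha"),
        // `format!` builds a new String, so this arm owns its data.
        _ => Cow::Owned(format!("'{}'", "Quoted")),
    }
}

/// Reports whether the value borrows its data (hypothetical helper).
fn is_borrowed(value: &Cow<'_, str>) -> bool {
    matches!(value, Cow::Borrowed(_))
}
```

Callers can treat the result like a `&str` through deref, or call `into_owned()` when they need a `String`; only the borrowed case avoids the allocation.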

I'll also mention a third option -- for educational purposes, not for actual use, since it leaks memory. You can get a 'static reference to an owned object if you leak it. Leaked memory is valid for the remainder of the program since it's never freed, and thus you can in fact get a reference to it.

// Warning: Do not use! Leaks memory.
fn my_func(input: &str) -> &'static str {
    match input {
        "a" => "Alpha",
        _ => Box::leak(format!("'{}'", "Quoted").into_boxed_str()), 
    }
}
2 of 3
1

The problem is that the arm with format!().as_str() produces an owned String. As soon as your function returns, that String is dropped and the &str reference would become invalid.

You can use std::borrow::Cow to let a function return either an owned or a borrowed string.

🌐
GeeksforGeeks
geeksforgeeks.org › rust › rust-slices
Rust - Slices - GeeksforGeeks
March 15, 2021 - Slices are also present in Python which is similar to slice here in Rust. Slice is used when you do not want the complete collection, or you want some part of it. ... In slicing first element is at 0 index and the last is index-1. //gfg is String or Array &gfg[0..2] //from 0 to 1 index &gfg[0..3] //from 0 to 2 index
🌐
Medium
medium.com › @python-javascript-php-html-css › understanding-string-slices-in-rust-why-str-needs-an-ampersand-7b99deff3b71
Understanding String Slices in Rust: Why &str Needs an Ampersand
November 18, 2024 - Rust enforces strict borrowing rules, ensuring that references to a string remain valid while the original string exists. When you slice a string slice like `&str`, you are creating a borrowed reference to a subset of the original string.
Top answer
1 of 2
2

Your code example is not very complete. The part that actually causes the error can't be seen in your example.

I guess that your code looks something like this:

pub fn function1(s: String) -> i32 {
    let index: &i32 = &1;
    let substring = (&s[index..]).to_string();
    let counter = function1(substring);
    10
}
error[E0277]: the type `String` cannot be indexed by `RangeFrom<&i32>`
 --> src/main.rs:3:23
  |
3 |     let substring = (&s[index..]).to_string();
  |                       ^^^^^^^^^^ `String` cannot be indexed by `RangeFrom<&i32>`
  |
  = help: the trait `Index<RangeFrom<&i32>>` is not implemented for `String`

Problems

  • index must be a usize, but it is an &i32. This is the main error that you see.
  • You cannot slice a string by character index directly; you need to convert from char-based indices to byte-based indices first. This can be done by iterating through char_indices().

Here is a rough sketch of how this might look:

pub fn function1(s: String) -> i32 {
    println!("s: {}", s);

    let index: &i32 = &1;

    // Try to convert the index to a byte position
    let substring = match s.char_indices().nth(*index as usize) {
        // If a position with the given index was found in the string, create a substring
        Some((pos, _)) => (&s[pos..]).to_string(),
        // Else, create an empty string
        None => "".to_string(),
    };

    // Break if the substring is empty, otherwise we would have an infinite recursion
    if substring.is_empty() {
        return 0;
    }

    let counter = function1(substring);
    counter + 1
}

fn main() {
    let input_str = "".to_string();
    let result = function1(input_str);
    println!("Result: {}", result);
}
s: 
s: 
s: 
s: 
Result: 3

Slicing vs copying

With every iteration of your function, you are creating a new copy of the string. This is quite slow, and I don't see a reason why this would be necessary in your case.

What you really want is a slice of the input string. This doesn't copy any data, it simply references a part of the original string.

To achieve that, change your parameter type from String to &str. Your function has no need to take ownership, and even if it later needs an owned copy, to_string() provides one, since it copies the data. So there really is no reason to use String as the parameter type.

pub fn function1(s: &str) -> i32 {
    println!("s: {}", s);

    let index: &i32 = &1;

    // Try to convert the index to a byte position
    let substring = match s.char_indices().nth(*index as usize) {
        // If a position with the given index was found in the string, create a substring slice
        Some((pos, _)) => &s[pos..],
        // Else, use an empty string
        None => "",
    };

    // Break if the substring is empty, otherwise we would have an infinite recursion
    if substring.is_empty() {
        return 0;
    }

    let counter = function1(substring);
    counter + 1
}

fn main() {
    let input_str = "".to_string();
    let result = function1(&input_str);
    println!("Result: {}", result);
}
s: 
s: 
s: 
s: 
Result: 3
2 of 2
1

You can't index a string by integer in Rust, because strings are encoded in UTF-8. You can use the methods chars and/or char_indices instead.

From the code you gave, I can't tell which method you should use. Have a look at the Rust docs.

For further information:

https://doc.rust-lang.org/std/string/struct.String.html

https://doc.rust-lang.org/std/string/struct.String.html#method.chars

https://doc.rust-lang.org/std/string/struct.String.html#method.char_indices

https://doc.rust-lang.org/std/string/struct.String.html#method.split_whitespace

🌐
Rust
docs.rs › slicestring
slicestring - Rust
slicestring is a crate for slicing Strings. It provides the slice() method for String and &str. It takes the index-range as an argument, whereby also a negative value can be passed for the second index.
🌐
Wduquette
wduquette.github.io › parsing-strings-into-slices
Parsing Rust Strings into Slices
The Chars iterator can return a &str slice containing the remainder of the source string. It’s easy to compute a slice from two slices one of which completely contains the other. For the record, I found the solution at users.rust-lang.org.
🌐
Reddit
reddit.com › r/rust › take a string slice and returns a reference to its first character
r/rust on Reddit: Take a string slice and returns a reference to its first character
May 10, 2024 -

Hey all,

I am learning Rust, and was not sure how to approach the following:

Write a function `first_char` that takes a string slice and returns a reference to its first character.

My attempt:

fn first_char(s: &str) -> Option<&char> {
    s.chars().next().as_ref()
}

However, Rust complains that it "cannot return value referencing temporary value
returns a value referencing data owned by the current function"

Is there any way to solve the above? Any pointers would be appreciated
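One possible direction, offered as a sketch and not as the exercise's intended answer: a &str stores UTF-8 bytes, so there is no char value in memory to point at, and chars() decodes each char into a temporary. Two workarounds are returning the char by value, or returning a one-character sub-slice. The name `first_char_str` is made up here.

```rust
/// Returns the first character by value. `char` is `Copy`, so nothing
/// needs to be borrowed from the iterator's temporary.
fn first_char(s: &str) -> Option<char> {
    s.chars().next()
}

/// Returns the first character as a sub-slice borrowed from `s`.
/// Hypothetical helper name for this sketch.
fn first_char_str(s: &str) -> Option<&str> {
    let c = s.chars().next()?;
    // `len_utf8` is the number of bytes this char occupies in the buffer.
    Some(&s[..c.len_utf8()])
}
```

The second variant is the closest thing to "a reference to the first character", since the returned slice really does borrow from the input.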

🌐
Rust Programming Language
users.rust-lang.org › help
How to get a substring of a String - help - The Rust Programming Language Forum
May 14, 2015 - Hi, what is the best way to get a substring of a String? I couldn't find a substr method or similar. Let's assume I have a String like "Golden Eagle" and I want to get the first 6 characters, that is "Golden". How ca…
🌐
Rust
rust-lang.github.io › rfcs › 0198-slice-notation.html
0198-slice-notation - The Rust RFC Book
Some other languages (like Python and Go – and Fortran) use : rather than .. in slice notation. The choice of .. here is influenced by its use elsewhere in Rust, for example for fixed-length array types [T, ..n]. The ..
🌐
TutorialsPoint
tutorialspoint.com › rust › rust_slices.htm
Rust - Slices
fn main() { let n1 = "Tutorials".to_string(); println!("length of string is {}",n1.len()); let c1 = &n1[4..9]; // fetches characters at 4,5,6,7, and 8 indexes println!("{}",c1); } ... The main() function declares an array with 5 elements. It invokes the use_slice() function and passes to it ...
🌐
Codecademy
codecademy.com › docs › rust › slices
Rust | Slices | Codecademy
May 15, 2024 - Note: Slices are often used in Rust for tasks like substring extraction, working with subarrays, and allowing multiple parts of a data structure to be manipulated separately without copying the entire data. They offer a flexible and memory-efficient way to handle data subsets.