python dtype string list

What is the default dtype for str like input in numpy?

stackoverflow.com › questions › 46051977 › what-is-the-default-dtype-for-str-like-input-in-numpy

b'...' means it is a byte string, and the default dtype for arrays of strings depends on the kind of string. Unicode strings (Python 3 str) get the dtype U, while Python 2 str and Python 3 bytes get the dtype S. The NumPy documentation explains the dtype codes:

Array-protocol type strings

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

'?' boolean

'b' (signed) byte

'B' unsigned byte

'i' (signed) integer

'u' unsigned integer

'f' floating-point

'c' complex floating-point

'm' timedelta

'M' datetime

'O' (Python) objects

'S', 'a' zero-terminated bytes (not recommended)

'U' Unicode string

'V' raw data (void)

However in your first case you actually forced NumPy to convert it to bytes because you specified dtype='S'.
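A minimal sketch of the three cases described above, assuming Python 3 and a current NumPy. The variable names and sample strings are made up for illustration:

```python
import numpy as np

# Python 3 str input: NumPy picks a Unicode dtype ('U'), sized to the longest string.
unicode_arr = np.array(["a", "bc"])

# bytes input: NumPy picks a byte-string dtype ('S').
bytes_arr = np.array([b"a", b"bc"])

# str input with dtype='S' requested: the strings are converted to bytes.
forced_arr = np.array(["a", "bc"], dtype="S")

print(unicode_arr.dtype.kind, bytes_arr.dtype.kind, forced_arr.dtype.kind)
print(forced_arr)
```

The printed kinds show that only the explicit dtype='S' turns str input into bytes; without it, the input type decides.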

Answer from MSeifert on Stack Overflow

NumPy

numpy.org › doc › stable › reference › arrays.dtypes.html

Data type objects (dtype) — NumPy v2.4 Manual

This style has two required and three optional keys. The names and formats keys are required. Their respective values are equal-length lists with the field names and the field formats. The field names must be strings and the field formats can be any object accepted by dtype constructor.
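As a rough illustration of the dictionary style with only the two required keys, the sketch below builds a structured dtype. The field names and sample rows are hypothetical:

```python
import numpy as np

# 'names' and 'formats' are equal-length lists: one format per field name.
dt = np.dtype({"names": ["name", "score"], "formats": ["U10", "f8"]})

# Each record is a tuple matching the field order.
arr = np.array([("n1", 1.2), ("n2", 3.4)], dtype=dt)

print(dt.names)       # field names in order
print(arr["score"])   # access one field as a plain float64 array
```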

W3Schools

w3schools.com › python › numpy › numpy_data_types.asp

NumPy Data Types

Create an array with data type string:

import numpy as np
arr = np.array([1, 2, 3, 4], dtype='S')
print(arr)
print(arr.dtype)

For i, u, f, S and U we can define size as well. Create an array with data type 4 bytes integer:

import numpy as np
arr = np.array([1, 2, 3, 4], dtype='i4')
print(arr)
print(arr.dtype)

If a type is given to which the elements cannot be cast, NumPy raises a ValueError. In Python, ValueError is raised when an argument passed to a function has the right type but an inappropriate value.
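The failing cast can be sketched like this. The input lists are invented for the example, and the exact error message may differ between NumPy versions, so it is printed rather than assumed:

```python
import numpy as np

# 'a' cannot be parsed as an integer, so the conversion fails.
try:
    np.array(["a", "2", "3"], dtype="i4")
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Strings that do look like integers convert without error.
converted = np.array(["1", "2", "3"], dtype="i4")
print(converted, converted.dtype)
```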

Pandas

pandas.pydata.org › docs › user_guide › text.html

Working with text data — pandas 3.0.2 documentation

At runtime, these can be checked via the StringDtype.storage and StringDtype.na_value attributes. ... This is the same as dtype='str' when PyArrow is not installed. The implementation uses a NumPy object array, which directly stores the Python string objects, hence why the storage here is called 'python'.
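A small sketch of checking those attributes at runtime, assuming a pandas version that accepts a storage argument in StringDtype (1.3 or later). The sample values are made up:

```python
import pandas as pd

# Explicitly request the Python-object-backed string storage.
dtype = pd.StringDtype("python")
arr = pd.array(["a", None, "ccc"], dtype=dtype)

print(arr.dtype.storage)   # which backend holds the strings
print(arr.dtype.na_value)  # the missing-value sentinel this dtype uses
print(arr.isna())          # None was stored as a missing value
```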

Stack Overflow

stackoverflow.com › questions › 46051977 › what-is-the-default-dtype-for-str-like-input-in-numpy

python - What is the default dtype for str like input in numpy? - Stack Overflow

Top answer

1 of 4

208

The dtype object comes from NumPy; it describes the type of the elements in an ndarray. Every element in an ndarray must have the same size in bytes. For int64 and float64, that is 8 bytes. But the length of a string is not fixed. So instead of saving the bytes of the strings in the ndarray directly, Pandas uses an object ndarray, which saves pointers to objects; because of this, the dtype of this kind of ndarray is object.

Here is an example:

The int64 array contains 4 int64 values.
The object array contains 4 pointers to 3 string objects.
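The same situation can be reproduced in NumPy directly. This is only a sketch with made-up values, where one string is deliberately reused so that 4 slots point at 3 objects:

```python
import numpy as np

ints = np.array([1, 2, 3, 4], dtype=np.int64)

shared = "apple"
# Four slots, but the first and last hold a reference to the same str object.
strs = np.array([shared, "banana", "cherry", shared], dtype=object)

print(ints.dtype, ints.itemsize)   # 8 bytes per int64 value
print(strs.dtype, strs.itemsize)   # one pointer per slot, not the string bytes
print(strs[0] is strs[3])          # both slots refer to one object
```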

2 of 4

@HYRY's answer is great. I just want to provide a little more context.

Arrays store data as contiguous, fixed-size memory blocks. The combination of these properties together is what makes arrays lightning fast for data access. For example, consider how your computer might store an array of 32-bit integers, [3,0,1].

If you ask your computer to fetch the 3rd element in the array, it'll start at the beginning and then jump across 64 bits to get to the 3rd element. Knowing exactly how many bits to jump across is what makes arrays fast.

Now consider the sequence of strings ['hello', 'i', 'am', 'a', 'banana']. Strings are objects that vary in size, so if you tried to store them in contiguous memory blocks, it'd end up looking like this.

Now your computer doesn't have a fast way to access a randomly requested element. The key to overcoming this is to use pointers. Basically, store each string in some random memory location, and fill the array with the memory address of each string. (Memory addresses are just integers.) So now, things look like this

Now, if you ask your computer to fetch the 3rd element, just as before, it can jump across 64 bits (assuming the memory addresses are 32-bit integers) and then make one extra step to go fetch the string.

The challenge for NumPy is that there's no guarantee the pointers are actually pointing to strings. That's why it reports the dtype as 'object'.
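To make the trade-off concrete, here is a hedged sketch comparing NumPy's fixed-width Unicode storage with an object array for the same word list. The mixed array at the end is an invented example of why NumPy can only call the dtype "object":

```python
import numpy as np

words = ["hello", "i", "am", "a", "banana"]

# Fixed-width storage: every slot is padded to the longest word (6 characters).
fixed = np.array(words)
print(fixed.dtype, fixed.itemsize, fixed.nbytes)

# Pointer storage: every slot is one pointer, whatever it points to.
boxed = np.array(words, dtype=object)
print(boxed.dtype, boxed.itemsize)

# Nothing stops an object array from holding non-strings.
mixed = np.array(["hello", 3, None], dtype=object)
print([type(x).__name__ for x in mixed])
```

With fixed-width storage NumPy spends 4 bytes per character on every slot, so short words waste space; the object array avoids that at the cost of an extra pointer hop per access.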

Shamelessly gonna plug my own course on NumPy where I originally discussed this.

Practical Business Python

pbpython.com › pandas_dtypes.html

Overview of Pandas Data Types - Practical Business Python

Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object

For another example of using lambda vs. a function, we can look at the process for fixing the Percent Growth column. ...

def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    - Remove %
    - Divide by 100 to make decimal
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

df['Percent Growth'].apply(convert_percent)
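A self-contained sketch of that conversion on a small, made-up Series (the article's DataFrame is not reproduced here). The vectorized variant using the .str accessor is an alternative added for comparison, not something the snippet itself shows:

```python
import pandas as pd


def convert_percent(val):
    """Strip the '%' sign and scale the number to a fraction."""
    new_val = val.replace("%", "")
    return float(new_val) / 100


# Hypothetical percentage strings standing in for the 'Percent Growth' column.
growth = pd.Series(["30.00%", "10.00%", "-25.00%"])

via_apply = growth.apply(convert_percent)

# Same result without a Python-level function call per element.
via_str = growth.str.rstrip("%").astype(float) / 100

print(via_apply.tolist())
print(via_str.tolist())
```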

Stack Overflow

stackoverflow.com › questions › 3410147 › define-dtypes-in-numpy-using-a-list

python - Define dtypes in NumPy using a list? - Stack Overflow

Top answer

1 of 1

The following code might help:

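import numpy as np

# Structured dtype: a 10-byte string field and a little-endian float64 field
dt = np.dtype([('name1', '|S10'), ('name2', '<f8')])
tuplelist = [
    ('n1', 1.2),
    ('n2', 3.4),
]
arr = np.array(tuplelist, dtype=dt)

print(arr['name1'])
# [b'n1' b'n2']  (Python 3; Python 2 printed ['n1' 'n2'])
print(arr['name2'])
# [1.2 3.4]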

Your immediate problem was that np.dtype expects the format specifiers to be numpy types, such as '|S10' or '<f8' and not Python types, such as str or float. If you type help(np.dtype) you'll see many examples of how np.dtypes can be specified. (I've only mentioned a few.)

Note that np.array expects a list of tuples. It's rather particular about that.

A list of lists raises TypeError: expected a readable buffer object.

A (tuple of tuples) or a (tuple of lists) raises ValueError: setting an array element with a sequence.
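If the data arrives as a list of lists, one way around this is to convert each row to a tuple first. A minimal sketch, where the variable name rows and the sample data are hypothetical:

```python
import numpy as np

dt = np.dtype([("name1", "S10"), ("name2", "<f8")])

# Data in the shape np.array does not accept for structured dtypes.
rows = [["n1", 1.2], ["n2", 3.4]]

# Convert every inner list to a tuple so each one is read as a single record.
arr = np.array([tuple(row) for row in rows], dtype=dt)

print(arr.shape)
print(arr["name2"])
```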

Park

park.is › notebooks › comparing-pandas-string-dtypes

An In-depth Comparison of Pandas String dtypes

May 27, 2023 - Checking the data types displays string for both columns. ... Print columns as Python lists to check whether each value has single quotes around it. ...

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       1 non-null      string
 1   B       3 non-null      string
dtypes: string(2)
memory usage: 181.0 bytes

Python for Data Science

python4data.science › en › latest › workspace › numpy › dtype.html

dtype - Python for Data Science

array([[ 0.44477868,  1.7366465 , -2.0396285 ],
       [ 0.65273875, -0.11706501, -2.3253074 ],
       [-1.3416812 , -1.1469622 ,  0.04803479],
       [-0.08298384, -0.02865864,  1.0284923 ],
       [ 0.59293705,  0.5345401 , -1.717722  ],
       [-1.1971567 , -0.4091349 , -0.03829814],
       [ 1.030255  ,  0.9890015 , -0.4749484 ]], dtype=float32)

GitHub

github.com › dask › dask › issues › 11117

Column with object dtype get converted to string when selecting the column · Issue #11117 · dask/dask

May 13, 2024 -

import dask.dataframe as dd
import pandas as pd
import numpy as np

x = np.zeros((20, 10))
df = pd.DataFrame({"X": x.tolist()})
ddf = dd.from_pandas(df)

print("df['X'].dtype: ")
print(df["X"].dtype)  # returns "X" has dtype object
print()
print("ddf['X'].dtype")
print(ddf["X"].dtype)  # returns dtype string / the lists get converted to a string
print()
print("ddf['X'].compute().dtype")
print(ddf["X"].compute().dtype)  # returns dtype string / the lists get converted to a string
print()

Published May 13, 2024

Author felix0097

Towards Data Science

towardsdatascience.com › home › latest › why we need to use pandas new string dtype instead of object for textual data

Why We Need to Use Pandas New String Dtype Instead of Object for Textual Data | Towards Data Science

January 19, 2025 - Select_dtypes(include="object") will return any column with object data type. On the other hand, if we use "string" data type for textual data, select_dtypes(include="string") will give just what we need.

Quansight

quansight.com › home › post › my numpy year: creating a dtype for the next generation of scientific computing

My NumPy Year: Creating a DType for the Next Generation of Scientific Computing | Quansight Consulting

October 30, 2024 - Python 2 also had this Unicode type, where you could create an array with the contents 'hello', 'world', but as Unicode strings, and that creates an array with the DType 'U5'. This works, and it’s exactly what Python 2 did with Unicode strings.

Python Course

python-course.eu › numerical-programming › numpy-data-objects-dtype.php

3. Numpy Data Objects, dtype | Numerical Programming

Some may have noticed that the strings in our previous array have been prefixed with a lower case "b". This means that we have created binary strings with the definition "('country', 'S20')". To get unicode strings we exchange this with the definition "('country', 'U20')". We will redefine our population table now:

dt = np.dtype([('country', 'U20'), ('density', 'i4'), ('area', 'i4'), ('population', 'i4')])
population_table_2025 = np.array([
    ('Netherlands', 544, 33720, 18_346_819),
    ('Belgium', 383, 30510, 11_700_000),
    ('United Kingdom', 287, 243610, 69_800_000),
    ('Germany', 241, 348560, 84_075_

NumPy

numpy.org › doc › 2.1 › reference › arrays.dtypes.html

Data type objects (dtype) — NumPy v2.1 Manual

This style does not accept align in the dtype constructor as it is assumed that all of the memory is accounted for by the array interface description. ... This style has two required and three optional keys. The names and formats keys are required. Their respective values are equal-length lists with the field names and the field formats. The field names must be strings and the field formats can be any object accepted by dtype constructor.

University of Texas at Austin

het.as.utexas.edu › HET › Software › Numpy › reference › generated › numpy.dtype.html

numpy.dtype — NumPy v1.9 Manual

>>> np.dtype([('hello',(np.int,3)),('world',np.void,10)])
dtype([('hello', '<i4', 3), ('world', '|V10')])
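The np.int alias used in this v1.9 example was deprecated in NumPy 1.20 and removed in 1.24, so the line fails on current releases. A hedged modern equivalent, assuming a fixed 32-bit integer is what is wanted and spelling the void field as the type string 'V10':

```python
import numpy as np

# 'hello' is a sub-array of three int32 values; 'world' is 10 raw bytes.
dt = np.dtype([("hello", (np.int32, 3)), ("world", "V10")])

print(dt)
print(dt["hello"].shape, dt["hello"].base)
print(dt["world"].itemsize)
```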

Python⇒Speed

pythonspeed.com › articles › pandas-string-dtype-memory

Saving memory with Pandas 1.3’s new string dtype

January 6, 2023 - By default, Pandas will store strings using the object dtype, meaning it stores strings as a NumPy array of pointers to normal Python objects.

GeeksforGeeks

geeksforgeeks.org › numpy › python-dtype-object-length-of-numpy-array-of-strings

Python | dtype object length of Numpy array of strings - GeeksforGeeks

March 14, 2019 -

# Print the dtype
print(arr.dtype)

Output: As we can see in the output, the dtype of the given array object is '<U9', where 9 is the length of the longest string in the given array object.
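One practical consequence of the width being fixed by the longest string is that later assignments of longer strings are silently truncated. A short sketch with made-up values:

```python
import numpy as np

arr = np.array(["apple", "fig"])
print(arr.dtype)          # width is 5, the length of 'apple'

# The new value does not fit in 5 characters and is cut off without warning.
arr[1] = "watermelon"
print(arr[1])

# Widening the dtype first keeps the whole string.
wide = arr.astype("U10")
wide[1] = "watermelon"
print(wide[1])
```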

Kaggle

kaggle.com › general › 188478

What is the difference between Pandas Object & String dtype

Pandas

pandas.pydata.org › docs › reference › api › pandas.DataFrame.dtypes.html

pandas.DataFrame.dtypes — pandas 3.0.2 documentation

Return the dtype object of the underlying data. ... >>> df = pd.DataFrame( ... { ... "float": [1.0], ... "int": [1], ... "datetime": [pd.Timestamp("20180310")], ... "string": ["foo"], ...