First off, the code you're learning from is flawed. It almost certainly doesn't do what the original author thought it did based on the comments in the code.
What the author probably meant was this:
import numpy as np

def to_1d(array):
    """prepares an array into a 1d real vector"""
    return array.astype(np.float64).ravel()
However, if array is always going to be an array of complex numbers, then the original code makes some sense.
The only cases where viewing the array as float64 (a.dtype = 'float64' is equivalent to doing a = a.view('float64')) would double its length are a complex array (numpy.complex128) or a 128-bit floating point array. For any other dtype, it doesn't make much sense.
For the specific case of a complex array, the original code would convert something like np.array([0.5+1j, 9.0+1.33j]) into np.array([0.5, 1.0, 9.0, 1.33]).
A cleaner way to write that would be:
def complex_to_interleaved_real(array):
    """prepares a complex array into an "interleaved" 1d real vector"""
    return array.copy().view('float64').ravel()
(I'm ignoring the part about returning the original dtype and shape, for the moment.)
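As a quick check of the interleaving idea, here is a minimal sketch of the round trip. It assumes the input is complex128; the `interleaved_real_to_complex` helper and the sample array are hypothetical, added only for illustration.

```python
import numpy as np

def complex_to_interleaved_real(array):
    """Prepares a complex array into an "interleaved" 1d real vector."""
    return array.copy().view('float64').ravel()

def interleaved_real_to_complex(flat, shape):
    """Hypothetical inverse: reinterpret pairs of floats as complex128."""
    return flat.view('complex128').reshape(shape)

z = np.array([[0.5 + 1j, 9.0 + 1.33j]])   # complex128, shape (1, 2)
flat = complex_to_interleaved_real(z)     # real, imag, real, imag, ...
back = interleaved_real_to_complex(flat, z.shape)

print(flat)
print(np.array_equal(back, z))
```

Because the function copies first, writing into `flat` does not touch `z`.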
Background on numpy arrays
To explain what's going on here, you need to understand a bit about what numpy arrays are.
A numpy array consists of a "raw" memory buffer that is interpreted as an array through "views". You can think of all numpy arrays as views.
Views, in the numpy sense, are just a different way of slicing and dicing the same memory buffer without making a copy.
A view has a shape, a data type (dtype), an offset, and strides. Where possible, indexing/reshaping operations on a numpy array will just return a view of the original memory buffer.
This means that things like y = x.T or y = x[::2] don't use any extra memory, and don't make copies of x.
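A small sketch of what "no copy" means in practice. The array contents are arbitrary, and int64 is chosen here only so the stride values are predictable.

```python
import numpy as np

x = np.arange(10, dtype=np.int64)
y = x[::2]                       # every second element, no copy

print(x.strides, y.strides)      # bytes to step from one element to the next
print(np.shares_memory(x, y))    # both arrays use the same buffer

y[0] = 99                        # writing through the view...
print(x[0])                      # ...changes the original array too
```

The slice only changes the stride (16 bytes instead of 8), which is why it costs no extra memory.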
So, if we have an array similar to this:
import numpy as np
x = np.array([1,2,3,4,5,6,7,8,9,10])
We could reshape it by doing either:
x = x.reshape((2, 5))
or
x.shape = (2, 5)
For readability, the first option is better. They're (almost) exactly equivalent, though. Neither one will make a copy that will use up more memory (the first will result in a new python object, but that's beside the point at the moment).
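The "where possible" caveat matters for reshaping. The sketch below (sample values are arbitrary) shows a reshape that is a view and one that has to copy because the requested layout can't be described with strides over the original buffer.

```python
import numpy as np

x = np.arange(1, 11)
r = x.reshape((2, 5))              # new array object, same buffer
print(r is x, np.shares_memory(r, x))

t = r.T                            # transposed view, no longer contiguous
flat = t.reshape(-1)               # flattening it forces a copy
print(np.shares_memory(flat, x))
print(flat)
```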
Dtypes and views
The same thing applies to the dtype. We can view an array as a different dtype by either setting x.dtype or by calling x.view(...).
So we can do things like this:
import numpy as np
x = np.array([1, 2, 3], dtype=np.int64)  # 64-bit integers: 8 bytes per element
print('The original array')
print(x)
print('\n...Viewed as unsigned 8-bit integers (notice the length change!)')
y = x.view(np.uint8)
print(y)
print('\n...Doing the same thing by setting the dtype')
x.dtype = np.uint8
print(x)
print('\n...And we can set the dtype again and go back to the original.')
x.dtype = np.int64
print(x)
Which yields:
The original array
[1 2 3]
...Viewed as unsigned 8-bit integers (notice the length change!)
[1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0]
...Doing the same thing by setting the dtype
[1 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0]
...And we can set the dtype again and go back to the original.
[1 2 3]
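The length change follows directly from the element sizes. A minimal sketch, assuming 64-bit integers as in the output above:

```python
import numpy as np

x = np.array([1, 2, 3], dtype=np.int64)
y = x.view(np.uint8)

print(x.itemsize, y.itemsize)    # bytes per element in each view
print(x.nbytes, y.nbytes)        # total buffer size is unchanged
print(len(y) == len(x) * x.itemsize // y.itemsize)
```

Three 8-byte elements become twenty-four 1-byte elements; the buffer itself stays at 24 bytes.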
Keep in mind, though, that this is giving you low-level control over the way that the memory buffer is interpreted.
For example:
import numpy as np
x = np.arange(10, dtype=np.int64)
print('An integer array:', x)
print('But if we view it as a float:', x.view(np.float64))
print("...It's probably not what we expected...")
This yields:
An integer array: [0 1 2 3 4 5 6 7 8 9]
But if we view it as a float: [ 0.00000000e+000 4.94065646e-324
9.88131292e-324 1.48219694e-323 1.97626258e-323
2.47032823e-323 2.96439388e-323 3.45845952e-323
3.95252517e-323 4.44659081e-323]
...It's probably not what we expected...
So, we're interpreting the underlying bits of the original memory buffer as floats, in this case.
If we wanted to make a new copy with the ints recast as floats, we'd use x.astype(np.float64).
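To put the two side by side, here is a short sketch contrasting conversion with reinterpretation (variable names are illustrative):

```python
import numpy as np

x = np.arange(10, dtype=np.int64)
converted = x.astype(np.float64)      # new buffer, values converted
reinterpreted = x.view(np.float64)    # same buffer, bits reinterpreted

print(converted[:3])
print(np.shares_memory(x, converted), np.shares_memory(x, reinterpreted))

# The integer 1 has the same bit pattern as the smallest positive float64.
print(reinterpreted[1] == np.nextafter(0.0, 1.0))
```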
Complex Numbers
Complex numbers are stored (in C, Python, and NumPy) as two floats. The first is the real part and the second is the imaginary part.
So, if we do:
import numpy as np
x = np.array([0.5+1j, 1.0+2j, 3.0+0j])
We can see the real (x.real) and imaginary (x.imag) parts. If we convert this to a float, we'll get a warning about discarding the imaginary part, and we'll get an array with just the real part.
print(x.real)
print(x.astype(float))
astype makes a copy and converts the values to the new type.
However, if we view this array as a float, we'll get a sequence of item1.real, item1.imag, item2.real, item2.imag, ....
print(x)
print(x.view(float))
yields:
[ 0.5+1.j 1.0+2.j 3.0+0.j]
[ 0.5 1. 1. 2. 3. 0. ]
Each complex number is essentially two floats, so if we change how numpy interprets the underlying memory buffer, we get an array of twice the length.
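One way to confirm the interleaved layout is to compare the even and odd positions of the float view with the real and imaginary parts. This is a sketch using the same sample values:

```python
import numpy as np

x = np.array([0.5 + 1j, 1.0 + 2j, 3.0 + 0j])
f = x.view(np.float64)           # same memory, twice as many elements

print(np.array_equal(f[0::2], x.real))   # even positions are real parts
print(np.array_equal(f[1::2], x.imag))   # odd positions are imaginary parts

f[1] = -1.0      # overwrite the imaginary part of the first element
print(x[0])
```

Since `f` is a view rather than a copy, the assignment shows up in `x` as well.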
Hopefully that helps clear things up a bit...
Answer from Joe Kington on Stack Overflow
By changing the dtype in this way, you are changing the way a fixed block of memory is being interpreted.
Example:
>>> import numpy as np
>>> a=np.array([1,0,0,0,0,0,0,0],dtype='int8')
>>> a
array([1, 0, 0, 0, 0, 0, 0, 0], dtype=int8)
>>> a.dtype='int64'
>>> a
array([1])
Note how the change from int8 to int64 turned an 8-element array of 8-bit integers into a 1-element array of 64-bit integers. It is the same 8-byte block, however. On my i7 machine, which is little-endian, that byte pattern is the same as 1 in int64 format.
Change the position of the 1:
>>> a=np.array([0,0,0,1,0,0,0,0],dtype='int8')
>>> a.dtype='int64'
>>> a
array([16777216])
Another example:
>>> a=np.array([0,0,0,0,0,0,1,0],dtype='int32')
>>> a.dtype='int64'
>>> a
array([0, 0, 0, 1])
Change the position of the 1 in the 32-byte array of 32-bit integers:
>>> a=np.array([0,0,0,1,0,0,0,0],dtype='int32')
>>> a.dtype='int64'
>>> a
array([ 0, 4294967296, 0, 0])
It is the same block of bits reinterpreted.
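The results above depend on byte order. As a sketch, NumPy's explicit byte-order dtype strings ('<i8' for little-endian, '>i8' for big-endian) make that visible on any machine:

```python
import numpy as np

a = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype='int8')

little = a.view('<i8')[0]   # first byte is the least significant
big = a.view('>i8')[0]      # first byte is the most significant

print(little)
print(big)
```

The same 8 bytes read as 1 in little-endian order and as 2**56 in big-endian order.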
NumPy arrays are stored as contiguous blocks of memory. They usually have a single datatype (e.g. integers, floats or fixed-length strings) and then the bits in memory are interpreted as values with that datatype.
Creating an array with dtype=object is different. The memory taken by the array now is filled with pointers to Python objects which are being stored elsewhere in memory (much like a Python list is really just a list of pointers to objects, not the objects themselves).
Arithmetic operators such as * don't work with arrays such as ar1 which have a string_ datatype (there are special functions instead - see below). NumPy is just treating the bits in memory as characters and the * operator doesn't make sense here. However, the line
np.array(['avinash','jay'], dtype=object) * 2
works because now the array is an array of (pointers to) Python strings. The * operator is well defined for these Python string objects. New Python strings are created in memory and a new object array with references to the new strings is returned.
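A short sketch of the difference between the two kinds of array, using the same sample strings (variable names are illustrative):

```python
import numpy as np

fixed = np.array(['avinash', 'jay'])                # fixed-width unicode dtype
objs = np.array(['avinash', 'jay'], dtype=object)   # pointers to Python str

print(fixed.dtype.kind, fixed.dtype.itemsize)   # 'U', 4 bytes per character
print(objs.dtype.kind, type(objs[0]))           # 'O', plain Python str

# Multiplication is delegated to each Python string in the object array.
print((objs * 2).tolist())
```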
If you have an array with string_ or unicode_ dtype and want to repeat each string, you can use np.char.multiply:
In [52]: np.char.multiply(ar1, 2)
Out[52]: array(['avinashavinash', 'jayjay'],
dtype='<U14')
NumPy has many other vectorised string methods too.
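For instance, a few of the `np.char` functions applied to the same strings (a sketch; the choice of functions is arbitrary):

```python
import numpy as np

names = np.array(['avinash', 'jay'])

upper = np.char.upper(names)         # element-wise str.upper
lengths = np.char.str_len(names)     # element-wise len
a_counts = np.char.count(names, 'a') # element-wise str.count

print(upper)
print(lengths)
print(a_counts)
```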
There are 3 main dtypes to store strings in numpy:
object: Stores pointers to Python objects
str: Stores fixed-width strings
numpy.dtypes.StringDType(): New in numpy 2.0 and stores variable-width strings
str consumes more memory than object; StringDType is better
Depending on the length of the fixed-length string and the size of the array, the ratio differs, but as long as the longest string in the array is longer than 2 characters, str consumes more memory (they are equal when the longest string in the array is 2 characters long). In the example below, str consumes almost 8 times more memory.
On the other hand, the new (in numpy>=2.0) numpy.dtypes.StringDType stores variable width strings, so consumes much less memory.
import numpy as np
from pympler.asizeof import asizeof  # third-party package: pympler
ar1 = np.array(['this is a string', 'string']*1000, dtype=object)
ar2 = np.array(['this is a string', 'string']*1000, dtype=str)
ar3 = np.array(['this is a string', 'string']*1000, dtype=np.dtypes.StringDType())
asizeof(ar2) / asizeof(ar1) # 7.944444444444445
asizeof(ar3) / asizeof(ar1) # 1.992063492063492
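The ratio for the fixed-width case can be roughly cross-checked without pympler. This sketch uses `nbytes`, which counts only the array's own buffer (for the object array, just the pointers, not the Python strings they refer to), so it is an approximation of the measurement above rather than a replacement for it.

```python
import numpy as np

words = ['this is a string', 'string'] * 1000
as_obj = np.array(words, dtype=object)
as_str = np.array(words, dtype=str)

# Fixed-width unicode: 4 bytes per character, padded to the longest string.
print(as_str.dtype.itemsize)
# Object dtype: one pointer per element.
print(as_obj.dtype.itemsize)
print(as_str.nbytes / as_obj.nbytes)
```

With 8-byte pointers, a 2-character string (2 x 4 bytes) is the break-even point, which matches the claim above.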
For numpy 1.x, str is slower than object
For numpy>=2.0.0, str is faster than object
Numpy 2.0 has introduced a new numpy.strings API that has much more performant ufuncs for string operations. A simple test (on numpy 2.2.0) below shows that vectorized string operations on an array of str or StringDType dtype are faster than the same operations on an object dtype array.
import timeit
t1 = min(timeit.repeat(lambda: ar1*2, number=1000))
t2a = min(timeit.repeat(lambda: np.strings.multiply(ar2, 2), number=1000))
t2b = min(timeit.repeat(lambda: np.strings.multiply(ar3, 2), number=1000))
print(t2a / t1) # 0.8786601958427778
print(t2b / t1) # 0.7311586933668037
t3 = min(timeit.repeat(lambda: np.array([s.count('i') for s in ar1]), number=1000))
t4a = min(timeit.repeat(lambda: np.strings.count(ar2, 'i'), number=1000))
t4b = min(timeit.repeat(lambda: np.strings.count(ar3, 'i'), number=1000))
print(t4a / t3) # 0.13328748153237377
print(t4b / t3) # 0.3365874412749679
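To check what these calls return rather than how fast they are, here is a small sketch. `np.strings` only exists in numpy 2.0 and later, so the fallback to `np.char` (and the name `strmod`) is an assumption added here to keep the snippet runnable on older versions.

```python
import numpy as np

# Use np.strings when available (numpy >= 2.0), otherwise fall back to np.char.
strmod = getattr(np, "strings", np.char)

arr = np.array(['this is a string', 'string'])
doubled = strmod.multiply(arr, 2)   # repeat each string twice
counts = strmod.count(arr, 'i')     # occurrences of 'i' in each string

print(doubled)
print(counts)
```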
For numpy<2.0.0 (tested on numpy 1.26.0)
Numpy 1.x's vectorized string methods are not optimized, so operating on the object array is often faster. For example, in the example in the OP where each string is repeated, a simple * (aka multiply()) is not only more concise but also over 10 times faster than char.multiply().
import timeit
setup = "import numpy as np; from __main__ import ar1, ar2"
t1 = min(timeit.repeat("ar1*2", setup, number=1000))
t2 = min(timeit.repeat("np.char.multiply(ar2, 2)", setup, number=1000))
t2 / t1 # 10.650433758517027
Even for functions that cannot readily be applied to the array, it is faster to loop over the object array and work on the Python strings than to use the vectorized char method on the str array.
For example, iterating over the object array and calling str.count() on each Python string is over 3 times faster than the vectorized char.count() on the str array.
f1 = lambda: np.array([s.count('i') for s in ar1])
f2 = lambda: np.char.count(ar2, 'i')
setup = "import numpy as np; from __main__ import ar1, ar2, f1, f2"
t3 = min(timeit.repeat("f1()", setup, number=1000))
t4 = min(timeit.repeat("f2()", setup, number=1000))
t4 / t3 # 3.251369161574832
On a side note, when it comes to an explicit loop, iterating over a list is faster than iterating over a numpy array. So in the previous example, a further performance gain can be made by iterating over the list:
# convert the array to a list before looping over it
f3 = lambda: np.array([s.count('i') for s in ar1.tolist()])
# f3 is imported only now that it has been defined
setup3 = "import numpy as np; from __main__ import ar1, f3"
t5 = min(timeit.repeat("f3()", setup3, number=1000))
t3 / t5 # 1.2623498005294627