Reducing memory usage in Python is difficult, because Python does not actually release memory back to the operating system. If you delete objects, then the memory is available to new Python objects, but not free()'d back to the system (see this question).
If you stick to numeric numpy arrays, those are freed, but boxed objects are not.
>>> import os, psutil, numpy as np # psutil may need to be installed
>>> def usage():
...     process = psutil.Process(os.getpid())
...     return process.memory_info()[0] / float(2 ** 20)
...
>>> usage() # initial memory usage
27.5
>>> arr = np.arange(10 ** 8) # create a large array without boxing
>>> usage()
790.46875
>>> del arr
>>> usage()
27.52734375 # numpy just free()'d the array
>>> arr = np.arange(10 ** 8, dtype='O') # create lots of objects
>>> usage()
3135.109375
>>> del arr
>>> usage()
2372.16796875 # numpy frees the array, but python keeps the heap big
Reducing the Number of Dataframes
Python keeps our memory at its high watermark, but we can reduce the total number of dataframes we create. When modifying your dataframe, prefer inplace=True, so you don't create copies.
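For example (a minimal sketch; the frame and the dropped labels are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({'foo': [1, 2, 3, 4]})
>>> df.drop(0, inplace=True)  # modifies df in place, no second copy
>>> df = df.drop(1)  # by contrast, this briefly holds two frames at once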
Another common gotcha is holding on to copies of previously created dataframes in ipython:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'foo': [1,2,3,4]})
In [3]: df + 1
Out[3]:
foo
0 2
1 3
2 4
3 5
In [4]: df + 2
Out[4]:
foo
0 3
1 4
2 5
3 6
In [5]: Out # Still has all our temporary DataFrame objects!
Out[5]:
{3: foo
0 2
1 3
2 4
3 5, 4: foo
0 3
1 4
2 5
3 6}
You can fix this by typing %reset Out to clear your history. Alternatively, you can adjust how much history ipython keeps with ipython --cache-size=5 (default is 1000).
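A related IPython trick: ending a line with a semicolon suppresses both the display of the result and (to the best of my knowledge) its storage in Out:
In [6]: df + 3;  # result is neither printed nor cached in Out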
Reducing Dataframe Size
Wherever possible, avoid using object dtypes.
>>> df.dtypes
foo float64 # 8 bytes per value
bar int64 # 8 bytes per value
baz object # at least 48 bytes per value, often more
Values with an object dtype are boxed, which means the numpy array just contains a pointer and you have a full Python object on the heap for every value in your dataframe. This includes strings.
Whilst numpy supports fixed-size strings in arrays, pandas does not (it's caused user confusion). This can make a significant difference:
>>> import numpy as np
>>> arr = np.array(['foo', 'bar', 'baz'])
>>> arr.dtype
dtype('S3')
>>> arr.nbytes
9
>>> import sys; import pandas as pd
>>> s = pd.Series(['foo', 'bar', 'baz'])
>>> s.dtype
dtype('O')
>>> sum(sys.getsizeof(x) for x in s)
120
You may want to avoid using string columns, or find a way of representing string data as numbers.
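One way to do that (a sketch using pd.factorize, with illustrative data) is to encode each distinct string as an integer code:
>>> s = pd.Series(['foo', 'bar', 'foo', 'baz'])
>>> codes, uniques = pd.factorize(s)
>>> codes  # compact integer codes instead of one boxed string per row
array([0, 1, 0, 2])
>>> uniques
Index(['foo', 'bar', 'baz'], dtype='object')
In newer pandas versions, the category dtype (s.astype('category')) achieves a similar effect automatically.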
If you have a dataframe that contains many repeated values (NaN is very common), then you can use a sparse data structure to reduce memory usage:
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo float64
dtypes: float64(1)
memory usage: 605.5 MB
>>> df1.shape
(39681584, 1)
>>> df1.foo.isnull().sum() * 100. / len(df1)
20.628483479893344 # so 20% of values are NaN
>>> df1.to_sparse().info()
<class 'pandas.sparse.frame.SparseDataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 1 columns):
foo float64
dtypes: float64(1)
memory usage: 543.0 MB
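Note that to_sparse() was removed in later pandas releases (1.0 and up); if you are on a modern version, the equivalent (to the best of my knowledge) is the SparseDtype extension type:
>>> import numpy as np
>>> sparse_df = df1.astype(pd.SparseDtype('float64', np.nan))
>>> sparse_df.info()  # memory usage shrinks roughly in proportion to the NaN share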
Viewing Memory Usage
You can view the memory usage (docs):
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 39681584 entries, 0 to 39681583
Data columns (total 14 columns):
...
dtypes: datetime64[ns](1), float64(8), int64(1), object(4)
memory usage: 4.4+ GB
As of pandas 0.17.1, you can also do df.info(memory_usage='deep') to see memory usage including objects.
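Relatedly, df.memory_usage(deep=True) gives the same deep accounting as a per-column breakdown (exact byte counts vary by platform and pandas version):
>>> df.memory_usage(deep=True)  # bytes per column, counting boxed objects too
>>> df.memory_usage(deep=True).sum() / 2 ** 20  # total, in MiB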
As noted in the comments, there are some things to try: gc.collect() (@EdChum) may clear stuff, for example. In my experience at least, these things sometimes work and often don't.
There is one thing that always works, however, because it is done at the OS, not language, level.
Suppose you have a function that creates an intermediate huge DataFrame, and returns a smaller result (which might also be a DataFrame):
def huge_intermediate_calc(something):
    ...
    huge_df = pd.DataFrame(...)
    ...
    return some_aggregate
Then, if you do something like this:
import multiprocessing
result = multiprocessing.Pool(1).map(huge_intermediate_calc, [something_])[0]
The function is executed in a separate process. When that process completes, the OS reclaims all the resources it used. There's really nothing Python, pandas, or the garbage collector can do to stop that.
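Here is a self-contained sketch of that pattern (the frame contents and sizes are made up for illustration):
import multiprocessing

import numpy as np
import pandas as pd

def huge_intermediate_calc(n_rows):
    # the big intermediate frame exists only inside the worker process
    huge_df = pd.DataFrame({'x': np.random.rand(n_rows)})
    return huge_df['x'].sum()  # only the small aggregate crosses back

if __name__ == '__main__':  # guard required for multiprocessing on some platforms
    with multiprocessing.Pool(1) as pool:
        result = pool.map(huge_intermediate_calc, [10 ** 7])[0]
    # the worker has exited here, so the OS has reclaimed its memory
    print(result)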
Delete and release memory of a single pandas dataframe
I have several big csv files. I want to extract the column "item id" from each of them, then combine them all and keep only the unique values.
My code is as follows:
import pandas as pd

list_df = []
for csv_file in folder:  # folder: iterable of csv paths
    df = pd.read_csv(csv_file)
    list_df.append(df['item id'])
df_all_itemNo = pd.concat(list_df, ignore_index=True)
df_all_itemNo = df_all_itemNo.drop_duplicates()
It works when there are only a few csv files. The problem is that when several big csvs are read, all of my computer's memory is used up.
From the memory usage graph, I can see that memory keeps increasing. It is never released each time df = pd.read_csv(csv_file) is executed; the old df is stuck in memory.
Are there any solutions?
The del statement does not delete an instance, it merely deletes a name.
When you do del i, you are deleting just the name i - but the instance is still bound to some other name, so it won't be garbage-collected.
If you want to release memory, your dataframes have to be garbage-collected, i.e. you must delete all references to them.
If you created your dataframes dynamically in a list, then removing that list will trigger garbage collection.
>>> lst = [pd.DataFrame(), pd.DataFrame(), pd.DataFrame()]
>>> del lst # memory is released
If you created some variables, you have to delete them all.
>>> a, b, c = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()
>>> lst = [a, b, c]
>>> del a, b, c # dfs still in list
>>> del lst # memory released now
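Applied to the question above, you can also avoid holding the full frames in the first place: read only the needed column, and let each loop iteration rebind the name so the previous frame becomes unreferenced (a sketch; usecols is a standard read_csv parameter, and folder is the asker's iterable of paths):
import pandas as pd

list_df = []
for csv_file in folder:  # folder: iterable of csv paths, as in the question
    # only the 'item id' column is ever materialised
    s = pd.read_csv(csv_file, usecols=['item id'])['item id']
    list_df.append(s)
    # the previous iteration's full frame is now unreferenced and collectable
df_all_itemNo = pd.concat(list_df, ignore_index=True).drop_duplicates()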
In Python, automatic garbage collection deallocates unreferenced objects (a pandas DataFrame is just another object as far as Python is concerned). There are different garbage collection strategies that can be tweaked (this requires significant learning).
You can manually trigger the garbage collection using
import gc
gc.collect()
But frequent calls to garbage collection are discouraged, as it is a costly operation that may affect performance.