df.memory_usage() will return how many bytes each column occupies:
>>> df.memory_usage()
Row_ID 20906600
Household_ID 20906600
Vehicle 20906600
Calendar_Year 20906600
Model_Year 20906600
...
The values are in units of bytes.
To include the index, pass index=True (note this is already the default).
So to get overall memory consumption:
>>> df.memory_usage(index=True).sum()
731731000
As before, the value is in units of bytes.
Also, passing deep=True enables a more accurate memory usage report that accounts for the full footprint of the contained objects.
This matters because with deep=False (the default), the figure for an object column covers only the array of 8-byte pointers, not the Python objects those pointers reference.
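To make the deep=False vs. deep=True difference concrete, here is a small sketch (column names and data are invented for illustration). Only the object column's figure changes, since the shallow report counts just its pointers:

```python
import pandas as pd

# Hypothetical frame: one int column, one object (string) column.
df = pd.DataFrame({
    "ints": range(1000),
    "strings": ["row-%d" % i for i in range(1000)],
})

shallow = df.memory_usage(index=True).sum()
deep = df.memory_usage(index=True, deep=True).sum()

# With deep=False the "strings" column is counted only as 8-byte
# pointers; deep=True adds the size of each string object, so the
# deep total is noticeably larger.
print(shallow, deep)
```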
Here's a comparison of the different methods - sys.getsizeof(df) is simplest.
For this example, df is a dataframe with 814 rows and 11 columns (2 ints, 9 objects), read from a 427 KB shapefile.
sys.getsizeof(df)
>>> import sys
>>> sys.getsizeof(df)  # result in bytes
462456
df.memory_usage()
>>> df.memory_usage()
...                                   # lists each column at 8 bytes/row
>>> df.memory_usage().sum()
71712                                 # roughly rows * cols * 8 bytes
>>> df.memory_usage(deep=True)
...                                   # lists each column's full memory usage
>>> df.memory_usage(deep=True).sum()
462432                                # result in bytes
df.info()
Prints dataframe info to stdout. Technically these are kibibytes (KiB), not kilobytes; as the docstring says, "Memory usage is shown in human-readable units (base-2 representation)." So to get bytes, multiply by 1024, e.g. 451.6 KiB = 462,438 bytes.
>>> df.info()
...
memory usage: 70.0+ KB
>>> df.info(memory_usage='deep')
...
memory usage: 451.6 KB
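The close agreement between sys.getsizeof(df) and df.memory_usage(deep=True).sum() above is no accident: pandas implements __sizeof__ in terms of its own deep accounting, and getsizeof adds only a small constant for object-header overhead. A runnable sketch on a small made-up frame (standing in for the shapefile, which isn't reproducible here):

```python
import sys
import pandas as pd

# Invented stand-in for the 814-row, mixed-dtype shapefile example.
df = pd.DataFrame({
    "a": range(814),
    "b": ["text-%d" % i for i in range(814)],
})

via_getsizeof = sys.getsizeof(df)
via_deep = int(df.memory_usage(index=True, deep=True).sum())

# getsizeof builds on pandas' deep accounting plus a small fixed
# overhead, so the two figures differ by only a handful of bytes.
print(via_getsizeof, via_deep)
```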
I ran this with Pandas 1.0.3 and Python 3.7.4 on CentOS 7. I get the same results. Seems df.memory_usage(index=True,deep=True) and getsizeof are both buggy. If I check process.memory_info()[0] (RSS Resident Set Size) before and after the dataframe creation, the difference is 191 MB.
I think this post answers this issue well: https://pythonspeed.com/articles/pandas-dataframe-series-memory-usage/
In short, there are memory optimisations in the Python implementation that neither pandas nor sys accounts for in their calculations, so the usage these methods report is typically higher than the actual figure.
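One concrete case from that article: when many cells point at the same Python string object, deep=True counts the string's payload once per reference, so the reported total can exceed what the process actually allocated. A minimal sketch with invented data:

```python
import pandas as pd

# Every cell in both columns references the SAME single string object.
shared = ["shared-string-value"] * 100_000
df = pd.DataFrame({"a": shared, "b": shared})

per_col = df.memory_usage(deep=True)

# deep=True sums sys.getsizeof over each element, so the one shared
# string is charged 200,000 times even though it was allocated once.
print(per_col)
```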
I have a huge dataset for one state in the US.
There are 50 csv files, each containing 232717 rows.
I concatenated them by mapping pd.read_csv over the file list (len(csv_files) == 50):
df = pd.concat(map(pd.read_csv, csv_files), axis=1)
The shape of the concatenated dataframe is (232717, 2027).
Its memory usage is around 3.6+ GB, and this was just one state; I have to do this for all the states in the US.
So how can I effectively reduce memory?
I read about changing the datatype of each column and I'm planning on doing that, but there are mixed datatypes in some columns.
What else can I do? Let me know if y'all can provide any inputs.
I'm doing this for the first time, so thanks!
Edit: I took one of the 50 csv files; df.shape is (232717, 6). One row across all columns was text and is not required, so I merged it into the header.

- memory usage of this df: 10.7 MB
- converted 4 columns to numeric and downcast the dtype from int64 to int8
- memory usage now: 4.4 MB

Questions:

- If I do this for all 50 CSV files, will it reduce a substantial amount of memory?
- What else should I be doing?
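A sketch of the per-column downcasting the edit describes, with invented column names and values: pd.to_numeric(..., downcast="integer") picks the smallest integer type that fits, and converting a low-cardinality string column to category stores each distinct value only once.

```python
import pandas as pd

# Hypothetical columns standing in for the state data.
df = pd.DataFrame({
    "year": [2005, 2006, 2007] * 10_000,   # small ints stored as int64
    "flag": [0, 1, 0] * 10_000,            # fits in int8
    "state": ["NY", "NY", "CA"] * 10_000,  # few distinct strings
})

before = df.memory_usage(deep=True).sum()

# Numeric columns: let pandas pick the smallest fitting integer type.
for col in ["year", "flag"]:
    df[col] = pd.to_numeric(df[col], downcast="integer")

# Low-cardinality strings: category replaces repeated string objects
# with small integer codes plus one copy of each distinct value.
df["state"] = df["state"].astype("category")

after = df.memory_usage(deep=True).sum()
print(before, after)
```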