Use splitlines to get a list of lines, then slice off the first 5 lines and the last 2, and split each remaining line on whitespace inside the DataFrame constructor:
import io
import pandas as pd
buffer = io.StringIO()
df.info(buf=buffer)
lines = buffer.getvalue().splitlines()
df = (pd.DataFrame([x.split() for x in lines[5:-2]], columns=lines[3].split())
.drop('Count',axis=1)
.rename(columns={'Non-Null':'Non-Null Count'}))
print(df)
Answer from jezrael on Stack Overflow.
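As a quick self-contained sanity check of that answer, here is the same buffer trick run against a small made-up DataFrame (the sample data is invented for illustration; any DataFrame works):

```python
import io

import pandas as pd

# Hypothetical sample frame for demonstration.
df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "z"]})

buffer = io.StringIO()
df.info(buf=buffer)
lines = buffer.getvalue().splitlines()

# Line 3 of info()'s output is the header row; the per-column rows start
# at line 5, and the last two lines are the dtype/memory summary.
info_df = (pd.DataFrame([x.split() for x in lines[5:-2]], columns=lines[3].split())
           .drop('Count', axis=1)
           .rename(columns={'Non-Null': 'Non-Null Count'}))
print(info_df)
```

Note that because the header literally reads "Non-Null Count", splitting it on whitespace yields two column names, which is why the spurious 'Count' column is dropped and 'Non-Null' is renamed afterwards.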
Building off of the previous answer, the solution below places the string collected from the buffer directly into a pandas DataFrame, without having to save a temporary file to disk:
import io
import pandas as pd
buf = io.StringIO()
df.info(buf=buf)
s = buf.getvalue()
lines = [line.split() for line in s.splitlines()[3:-2]]
pd.DataFrame(lines)
Explanation:
s.splitlines() - creates a list from the string, splitting wherever a newline character is found
[3:-2] - drops the first three lines and the last two, so that the remainder fits neatly into the data frame's columns
The info() method prints its information directly to the output, and even when you capture it with a buffer, you get it as plain text lines, which are hardly useful for further processing.
The solutions above do work, but they break down when your column names contain spaces or are inconsistently named, because line.split() then has no reliable separator to split on.
I couldn't find any way to do this with the default info() method, so I wrote my own function. It is not that complicated:
def infoOut(data, details=False):
    dfInfo = data.columns.to_frame(name='Column')
    dfInfo['Non-Null Count'] = data.notna().sum()
    dfInfo['Dtype'] = data.dtypes
    dfInfo.reset_index(drop=True, inplace=True)
    if details:
        rangeIndex = (dfInfo['Non-Null Count'].min(), dfInfo['Non-Null Count'].max())
        totalColumns = dfInfo['Column'].count()
        dtypesCount = dfInfo['Dtype'].value_counts()
        totalMemory = dfInfo.memory_usage().sum()
        return dfInfo, rangeIndex, totalColumns, dtypesCount, totalMemory
    else:
        return dfInfo
Usage:
variable = infoOut(yourDataFrameObject)
# or
var1, var2, var3, var4, var5 = infoOut(yourDataFrameObject, details=True)
This function returns the same table that the info method prints for your supplied dataframe, but as a DataFrame. Furthermore, if you pass details=True, it also returns the other information that info() gives out, such as memory usage, summary counts, etc.
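The core idea of the function can be exercised on its own with a tiny invented DataFrame whose column names contain spaces (the case that breaks the text-parsing approaches above), since nothing here is parsed from printed output:

```python
import pandas as pd

# Hypothetical sample data; note the spaces in the column names.
df = pd.DataFrame({"col a": [1, None, 3], "col b": ["x", "y", None]})

# Build the info-style table directly from the DataFrame's metadata.
info_df = df.columns.to_frame(name='Column')
info_df['Non-Null Count'] = df.notna().sum()   # aligns on column names
info_df['Dtype'] = df.dtypes
info_df.reset_index(drop=True, inplace=True)
print(info_df)
```

Because the non-null counts and dtypes are Series indexed by column name, the assignments align automatically, with no string splitting involved.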
Modify the function as you like.
Good day.
Call to DataFrame.info() causes huge memory consumption. Why?
Hello experts,
I was happily hacking around with a csv file containing logs which I want to analyse in pandas, when I noticed something funny about the memory consumption: while the RAM consumption of the loaded dataframe roughly matches the size of the csv file (≈3.1 GB), a call to DataFrame.info() shoots up the system's memory usage by another few GB. Looking at htop, the call to DataFrame.info() quickly consumes another 2.5 GB of free RAM.
Here's a self-contained script (minus the data, which I can provide should there be more mystery to this than expected) and its stdout.
import pandas as pd
import psutil
df = pd.read_csv("./logs-unique.tsv", sep="\t")
# Convert the timestamps to their proper datetime equivalent. We truncate the
# input dates at the 1s-resolution.
def convert_time(series):
    datetime_series = series.str.extract(r"(^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}\.[0-9]{2}\.[0-9]{2})")[0]
    return pd.to_datetime(datetime_series, format="%Y-%m-%dT%H.%M.%S")
df["Time"] = convert_time(df["Time"])
# Convert the following categories to a proper category type. Note that by
# default, the pandas category type is unordered and its categories are inferred
# from the data. Hence the simple .astype("category") will do just fine.
df["Category"] = df["Category"].astype("category")
df["User"] = df["User"].astype("category")
df["Tenant"] = df["Tenant"].astype("category")
df["Account"] = df["Account"].astype("category")
df["Application"] = df["Application"].astype("category")
df["Message"] = df["Message"].astype("category")
# Drop the 'InstanceId' column, as it's completely empty, as well as the
# 'FormatVersion' column, whose values are all '2.2'.
df.drop(columns=["InstanceId", "FormatVersion"], inplace=True)
print(f"Free RAM before call to 'info()': {psutil.virtual_memory().available / 1e9} GB", end="\n\n")
print(df.info(memory_usage="deep"), end="\n\n")
print(f"Free RAM after call to 'info()' : {psutil.virtual_memory().available / 1e9} GB")

Output:

Free RAM before call to 'info()': 5.57312 GB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2884211 entries, 0 to 2884210
Data columns (total 9 columns):
 #   Column       Dtype
---  ------       -----
 0   Uuid         object
 1   Category     category
 2   User         category
 3   Tenant       category
 4   Account      category
 5   Application  category
 6   Time         datetime64[ns]
 7   Message      category
 8   Duplicity    int64
dtypes: category(6), datetime64[ns](1), int64(1), object(1)
memory usage: 3.0 GB
None

Free RAM after call to 'info()' : 3.046608896 GB
Notice how 2.5 GB of free RAM seem to vanish. This happens every time, inside and outside of my Jupyter notebook. Also, the rather similar command DataFrame.memory_usage() does not swallow any memory.
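For reference, a small self-contained sketch (with made-up sample data) of how DataFrame.memory_usage reports per-column sizes; deep=True measures the Python strings referenced by object columns rather than just the pointer array, which is presumably the same deep inspection that info(memory_usage="deep") performs:

```python
import pandas as pd

# Invented sample frame: one object column of UUID-length strings,
# one categorical column, loosely mirroring the log frame above.
df = pd.DataFrame({
    "Uuid": ["a" * 36, "b" * 36],
    "Category": pd.Categorical(["x", "y"]),
})

shallow = df.memory_usage()           # counts 8 bytes per object pointer
deep = df.memory_usage(deep=True)     # also counts the string objects themselves
print(deep)
print(f"total (deep): {deep.sum() / 1e9} GB")
```

The deep figure for the object column is noticeably larger than the shallow one, since each stored Python string carries its own object overhead.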
May I kindly ask you to help me find out what's going on here?
Thanks