Building on the previous answer, the solution below puts the string collected from the buffer directly into a pandas DataFrame, without saving a temporary file to disk.
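import io
import pandas as pd

buf = io.StringIO()
df.info(buf=buf)  # write the info() report into the in-memory buffer
s = buf.getvalue()
# Drop the first three and last two lines, then split each remaining line on whitespace
lines = [line.split() for line in s.splitlines()[3:-2]]
pd.DataFrame(lines)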
Explanation:
s.splitlines() splits the string into a list of lines, breaking at each newline character.
Indexing with [3:-2] removes the first three lines and the last two, so that the remaining lines fit neatly into columns for the DataFrame.
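To make the slicing concrete, here is a small sketch that shows which lines [3:-2] throws away. The sample frame and the variable names are made up for illustration. Depending on your pandas version, the kept lines may still start with a header row and a row of dashes, so check the output before relying on it.

```python
import io
import pandas as pd

# Hypothetical sample frame, only for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", None, "z"]})

buf = io.StringIO()
df.info(buf=buf)
all_lines = buf.getvalue().splitlines()

dropped_head = all_lines[:3]   # class line, index summary, "Data columns" line
dropped_tail = all_lines[-2:]  # dtypes summary and memory usage
kept = all_lines[3:-2]         # the per-column rows (plus any table header)

for line in kept:
    print(line.split())
```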
The info() method prints its information directly to the output, and even when you use a buffer to capture it, you only get lines of text, which are hardly useful for further processing.
The solution above does work, but it breaks down when your column names contain spaces, or are named so inconsistently that line.split() cannot separate the fields reliably, even with some other separator character.
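A quick way to see the problem, as a sketch with a made-up frame: a column called "first name" is split into two tokens, so its row ends up one field longer than the row of a column without a space.

```python
import io
import pandas as pd

# Hypothetical frame; the column "first name" contains a space
df = pd.DataFrame({"first name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})

buf = io.StringIO()
df.info(buf=buf)
rows = buf.getvalue().splitlines()[3:-2]

# Tokenize the two column rows the same way the parsing approach does
name_tokens = next(line.split() for line in rows if "first name" in line)
age_tokens = next(line.split() for line in rows if "age" in line.split())

print(name_tokens)
print(age_tokens)
```

Because the rows have different lengths, the values no longer line up under the same columns once they are placed in a DataFrame.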
I couldn't find any way to do this with the default info() method, so I wrote my own function. It is not that complicated.
def infoOut(data, details=False):
    # One row per column: name, non-null count and dtype, as in the info() table
    dfInfo = data.columns.to_frame(name='Column')
    dfInfo['Non-Null Count'] = data.notna().sum()
    dfInfo['Dtype'] = data.dtypes
    dfInfo.reset_index(drop=True, inplace=True)
    if details:
        # Smallest and largest index labels of the source frame
        rangeIndex = (data.index.min(), data.index.max())
        totalColumns = dfInfo['Column'].count()
        dtypesCount = dfInfo['Dtype'].value_counts()
        # Memory used by the source frame in bytes, index included
        totalMemory = data.memory_usage().sum()
        return dfInfo, rangeIndex, totalColumns, dtypesCount, totalMemory
    else:
        return dfInfo
Usage:
variable = infoOut(yourDataFrameObject)
#or
var1, var2, var3, var4, var5 = infoOut(yourDataFrameObject,details=True)
This function returns the same per-column table that info() produces for the supplied DataFrame, but as a DataFrame. If you also pass details=True, it returns the other information that info() reports, such as memory usage and summary counts.
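As a rough check that this column-wise approach copes with spaces in column names, here is a sketch that repeats the same three steps inline on a made-up frame (the frame and the name table are only for illustration):

```python
import pandas as pd

# Hypothetical frame with a space in a column name and one missing value
data = pd.DataFrame({"first name": ["Ann", None, "Cy"], "age": [31, 45, 27]})

# Same steps as in infoOut: column names, non-null counts, dtypes
table = data.columns.to_frame(name="Column")
table["Non-Null Count"] = data.notna().sum()
table["Dtype"] = data.dtypes
table = table.reset_index(drop=True)

print(table)
```

The column name stays in one cell, because nothing is parsed from text.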
Modify the function as you like.
Good day.
You need to open a file then pass the file handle to df.info:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)
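If you want to try this without leaving a file behind, a minimal sketch is to write into a temporary directory and read the report back. The sample frame and the temporary path are assumptions made for the example.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample frame
df = pd.DataFrame({"a": [1, 2, 3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "info_output.txt")
    # Pass the open file handle to info(), as in the answer above
    with open(path, "w") as file_out:
        df.info(buf=file_out)
    # Read the report back before the temporary directory is removed
    with open(path) as file_in:
        report = file_in.read()

print(report)
```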
You could avoid pandas.DataFrame.info() altogether and instead build the information you need as a pandas.DataFrame:
import pandas as pd
def get_info(df: pd.DataFrame):
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.count()
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info
Then write the result to CSV with get_info(df).to_csv('info_output.csv').
The memory usage information may also be useful, so you could do:
df.memory_usage().sum()
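Note that for columns holding Python objects such as strings, the default memory_usage() may not count the contents themselves. A small sketch, using a made-up frame, that compares it with deep=True and also gets per-column figures without the index:

```python
import pandas as pd

# Hypothetical sample frame with one text column
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})

shallow_bytes = df.memory_usage().sum()         # default estimate
deep_bytes = df.memory_usage(deep=True).sum()   # also inspects object contents
per_column = df.memory_usage(index=False, deep=True)  # one value per column

print(shallow_bytes, deep_bytes)
print(per_column)
```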