Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in a Jupyter notebook:
First define a program that will test this:
%%file df_memprofile.py
import numpy as np
import pandas as pd
def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns={'e': 'f'}, inplace=True)
    return y
Then load the memory profiler and run + profile the function
%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()
I get the following output:
Filename: /Users/jakevdp/df_memprofile.py
Line # Mem usage Increment Line Contents
================================================
4 66.1 MiB 66.1 MiB def foo():
5 104.2 MiB 38.2 MiB x = np.random.rand(1000000, 5)
6 104.4 MiB 0.2 MiB y = pd.DataFrame(x, columns=list('abcde'))
7 142.6 MiB 38.2 MiB y.rename(columns = {'e': 'f'}, inplace=True)
8 142.6 MiB 0.0 MiB return y
You can see a couple of things:

- When y is created, it is just a light wrapper around the original array: no data is copied.
- When the column in y is renamed, the entire data array is duplicated in memory (it's the same 38 MiB increment as when x was created in the first place).
So, unless I'm missing something, it appears that Pandas operates more like R's dataframes than R's data tables.
Edit: Note that rename() has an argument copy that controls this behavior, and defaults to True. For example, using this:
y.rename(columns = {'e': 'f'}, inplace=True, copy=False)
... results in an inplace operation without copying data.
Alternatively, you can modify the columns attribute directly:
y.columns = ['a', 'b', 'c', 'd', 'f']
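As a quick sanity check, here is a minimal sketch comparing the two rename approaches; whether rename() actually copies the underlying block depends on your pandas version and copy-on-write settings, but in either case the values stay numerically identical to the source array:

```python
import numpy as np
import pandas as pd

x = np.random.rand(1000, 5)
y = pd.DataFrame(x, columns=list('abcde'))

# Approach 1: reassign the columns attribute.
# This replaces only the (small) column index object, not the data.
y.columns = ['a', 'b', 'c', 'd', 'f']

# Approach 2: rename() returns a new frame (which may copy the data,
# depending on pandas version and copy-on-write settings).
z = y.rename(columns={'f': 'g'})

# Either way, the values match the original array.
assert np.array_equal(y.to_numpy(), x)
assert list(z.columns) == ['a', 'b', 'c', 'd', 'g']
```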
Answer from jakevdp on Stack Overflow: "Are Pandas' dataframes (Python) closer to R's dataframes or datatables?"
What actually defines a DataFrame?
I fear this is more a philosophical question than a technical one, but I am a bit confused. I've been thinking a lot about what makes something a DataFrame, not just in terms of syntax or library, but from a conceptual standpoint.
My current definition is as such:
A DataFrame is a language-native, programmable interface for querying and transforming tabular data. It's designed to be embedded directly in general-purpose programming workflows.
I like this because it focuses on what a DataFrame is for, rather than what specific tools or libraries implement it.
I think, however, that this definition is too general and can lead to anything tabular with an API being described as a DataFrame.
Properties I previously thought defined DataFrames, but which are not universal across them:

- Mutability
  - pandas: mutable; you can add/remove/overwrite columns directly.
  - Spark DataFrames: immutable; transformations return new logical plans.
  - Polars (lazy mode): immutable; transformations build a new plan.
- Execution model
  - pandas: eager, executes immediately.
  - Spark / Polars (lazy): lazy, builds DAGs and executes on trigger.
- In-memory
  - pandas / Polars: usually in-memory.
  - Spark: can spill to disk or operate on distributed data.
  - Ibis: abstract; the backend might not be memory-bound at all.
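To make the mutability point concrete, here is a minimal pandas-only sketch of the eager, in-place style; the Spark/Polars lazy equivalents would instead return a new plan object from each transformation rather than touching the data immediately:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# pandas is eager and mutable: each statement executes immediately
# and modifies df in place.
df['b'] = df['a'] * 2        # add a derived column
df.loc[0, 'a'] = 10          # overwrite a single cell
df = df.drop(columns=['a'])  # drop returns a new frame; we rebind the name

print(df['b'].tolist())  # [2, 4, 6]
```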
Curious how others would describe and define DataFrames.