I think that pandas offers better alternatives to what you're suggesting (rationale below).

For one, there's the pandas.Panel data structure, which was meant for exactly this kind of three-dimensional data (note, though, that Panel was deprecated and later removed in pandas 1.0).

However, as Wes McKinney (the pandas author) notes in his book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, hierarchical (multi-level) indices offer, to a large extent, a better alternative.

Consider the following alternative to your code:

import pandas as pd

dfs = []
for year in range(1967, 2014):
    # ... some code that generates df1, df2 and df3 for this year
    df1['year'] = year
    df1['origin'] = 'df1'
    df2['year'] = year
    df2['origin'] = 'df2'
    df3['year'] = year
    df3['origin'] = 'df3'
    dfs.extend([df1, df2, df3])
df = pd.concat(dfs)

This gives you a DataFrame with 4 columns: 'firm', 'price', 'year', and 'origin'.

It also gives you the flexibility to:

  • Organize the data hierarchically, e.g., by 'year' and 'origin' with df.set_index(['year', 'origin']), or by 'origin' and 'price' with df.set_index(['origin', 'price'])

  • Group by different columns or index levels

  • In general, slice and dice the data in many different ways.
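To make that flexibility concrete, here is a minimal sketch; the 'firm' and 'price' values are made up, and only the column layout matches the question:

```python
import pandas as pd

# Toy stand-in for the concatenated frame built in the loop above
df = pd.DataFrame({
    'firm':   ['A', 'B', 'A', 'B'],
    'price':  [10.0, 20.0, 11.0, 21.0],
    'year':   [1967, 1967, 1968, 1968],
    'origin': ['df1', 'df2', 'df1', 'df2'],
})

# Hierarchical organization: a two-level index on year and origin
indexed = df.set_index(['year', 'origin'])
print(indexed.loc[(1967, 'df1')])   # the row(s) for year 1967 originating from df1

# Group by any column, regardless of which source frame rows came from
mean_by_origin = df.groupby('origin')['price'].mean()
print(mean_by_origin)
```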

What you're suggesting in the question makes one dimension (origin) arbitrarily different, and it's hard to see an advantage to that. If a split along some dimension is necessary due to, e.g., performance, you can combine DataFrames better with standard Python data structures:

  • A dictionary mapping each year to a DataFrame holding the other three dimensions.

  • Three DataFrames, one for each origin, each having three dimensions.
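The first option might be sketched like this; the per-year frames are placeholder stand-ins for real data:

```python
import pandas as pd

years = range(1967, 2014)

# One DataFrame per year, each carrying the remaining dimensions
# ('firm', 'price', 'origin'); the contents here are placeholders.
frames_by_year = {
    year: pd.DataFrame({'firm': ['A'], 'price': [10.0], 'origin': ['df1']})
    for year in years
}

# Lookup by year is then a plain dict access
print(frames_by_year[1967])
```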

Answer from Ami Tavory on Stack Overflow
design - Designing interpretable/maintainable python code that use pandas DataFrames? - Software Engineering Stack Exchange
I am working with/writing a good amount of code in python using pandas dataframes. One thing I'm really struggling with is how to enforce a "schema" of sorts or make it apparent what data fields are …
Are Pandas' dataframes (Python) closer to R's dataframes or datatables? - Stack Overflow
To understand my question, I should first point out that R datatables aren't just R dataframes with syntactic sugar; there are important behavioral differences: column assignation/modification by …

Top answer
1 of 4

I had the same problem and resorted to dataclasses. In a nutshell, something like:

from dataclasses import dataclass, field
import pandas as pd


@dataclass
class OrderDataFields:
    """
    Data field names, aka columns.
    """
    order_id: str = 'order_id'
    order_amount: str = 'order_amount'
    order_date: str = 'order_date'


@dataclass
class OrderData:
    data: pd.DataFrame
    # default_factory avoids dataclasses' restriction on unhashable defaults
    columns: OrderDataFields = field(default_factory=OrderDataFields)

Now, when you are using your dataframe, put it in the class:

order_data = OrderData(data=order_df)

If you want to perform column checks every time you instantiate a Data object, you can use a BaseData class and inherit from it in, for example, your OrderData class:

from dataclasses import dataclass, field, asdict
import pandas as pd


@dataclass
class BaseData:
    """
    Base class for all Data dataclasses.

    Expected child attributes:
        `data` : pd.DataFrame -- data
        `columns`: object -- dataclass with column names

    Raises
    ------
    ValueError
        If columns in `columns` and the `data` dataframe don't match
    """
    data: pd.DataFrame
    columns: object

    def __post_init__(self):
        self._check_columns()

    def _check_columns(self):
        """
        Check whether the columns in the dataframe match the columns in the
        `columns` attribute.

        Raises
        ------
        ValueError
            If the columns don't match
        """
        data_columns = set(self.data.columns)
        expected_columns = set(asdict(self.columns).values())
        if data_columns != expected_columns:
            raise ValueError(
                f'dataframe columns {data_columns} do not match '
                f'expected columns {expected_columns}'
            )


@dataclass
class OrderData(BaseData):
    data: pd.DataFrame
    columns: OrderDataFields = field(default_factory=OrderDataFields)

And now, when you do some wrangling, you can use both the dataframe and its columns:

df = order_data.data
c = order_data.columns

df[c.order_amount]  # ... and so on

Adjust along those lines for your case.
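As a self-contained sketch of the column check in action (the classes above, reproduced in condensed form with only two hypothetical order fields):

```python
from dataclasses import dataclass, field, asdict
import pandas as pd


@dataclass
class OrderDataFields:
    order_id: str = 'order_id'
    order_amount: str = 'order_amount'


@dataclass
class BaseData:
    data: pd.DataFrame
    columns: object

    def __post_init__(self):
        # Compare the dataframe's columns with the declared field names
        data_columns = set(self.data.columns)
        expected = set(asdict(self.columns).values())
        if data_columns != expected:
            raise ValueError(
                f'columns {data_columns} do not match expected {expected}'
            )


@dataclass
class OrderData(BaseData):
    columns: OrderDataFields = field(default_factory=OrderDataFields)


good = pd.DataFrame({'order_id': ['a1'], 'order_amount': [3]})
order_data = OrderData(data=good)        # passes the column check

bad = pd.DataFrame({'order_id': ['a1'], 'typo_column': [3]})
try:
    OrderData(data=bad)                  # fails the column check
except ValueError as e:
    print('validation failed:', e)
```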

There is also the pandera library: https://pandera.readthedocs.io/en/stable/

2 of 4

My first idea would be to include type hints and descriptive docstrings to functions responsible for loading a pandas DataFrame, e.g.:

import pandas as pd


def load_client_data(input_path: str) -> pd.DataFrame:
    """Loads the client DataFrame from a csv file, performs data preprocessing and returns it.

    The client DataFrame is formatted as follows:

    customer_id (string): represents a customer.
    order_id (string): represents an order.
    order_amount (int): represents the number of items bought.
    order_date (string): the date on which the order was made (YYYY-MM-DD).
    order_time (string): the time at which the order was made (HH:mm:ss).
    """
    client_data = pd.read_csv(input_path)
    preprocessed_client_data = do_preprocessing(client_data)
    return preprocessed_client_data

Ideally, all functions responsible for loading the datasets would be bundled together in a module, so that at the very least you know where to look whenever you're in doubt. Good/consistent variable names for your datasets will also help you keep track of what dataset you're working with in a downstream function.

Of course, this all adds a bit of coupling: if you decide to change the columns of a dataset, you need to remember to update the docstring, too. At the end of the day, however, it's a choice between flexibility and reliability: once your program grows in size and becomes more stable, I think it's a fair compromise.

You'll also want to perform any operations to the dataset itself (adding new columns, parsing the date into day/month/year columns, etc) as soon as possible, so that the docstring reflects these in-memory changes as well. If your datasets are being transformed all the way down in another function, ask yourself if you could do this earlier. If that's not possible, at least initialize the dataframe with empty columns that expect future data, and reflect this information on the docstring.

If you want to take this a step further, you can wrap all functions related to loading datasets into a DatasetManager class, which unifies the information about the datasets' signatures. You could even add a helper function to quickly view the docstring for a specific dataset: writing dataset_manager.get_info('client_data') could print out the docstring of the load_client_data function, for example.
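A minimal sketch of that idea — DatasetManager, get_info, and load_client_data come from the text above, while the registry dict is an illustrative implementation choice:

```python
import pandas as pd


def load_client_data(input_path: str) -> pd.DataFrame:
    """Loads the client DataFrame from a csv file.

    Columns: customer_id (str), order_id (str), order_amount (int),
    order_date (str, YYYY-MM-DD), order_time (str, HH:mm:ss).
    """
    return pd.read_csv(input_path)


class DatasetManager:
    """Bundles dataset loaders and exposes their schema docstrings."""

    def __init__(self):
        # Map dataset names to their loader functions
        self._loaders = {'client_data': load_client_data}

    def load(self, name: str, input_path: str) -> pd.DataFrame:
        return self._loaders[name](input_path)

    def get_info(self, name: str) -> str:
        # The loader's docstring doubles as the dataset's schema description
        return self._loaders[name].__doc__


dataset_manager = DatasetManager()
print(dataset_manager.get_info('client_data'))
```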

Lastly, there are a couple third-party modules that help you enforce data types in pandas DataFrames, if you're okay with that. An example is dataenforce, but as a disclaimer I've never used it personally.

Top answer
1 of 1

Pandas operates more like data.frame in this regard. You can check this using the memory_profiler package; here's an example of its use in the Jupyter notebook:

First define a program that will test this:

%%file df_memprofile.py
import numpy as np
import pandas as pd

def foo():
    x = np.random.rand(1000000, 5)
    y = pd.DataFrame(x, columns=list('abcde'))
    y.rename(columns = {'e': 'f'}, inplace=True)
    return y

Then load the memory profiler and run + profile the function

%load_ext memory_profiler
from df_memprofile import foo
%mprun -f foo foo()

I get the following output:

Filename: /Users/jakevdp/df_memprofile.py

Line #    Mem usage    Increment   Line Contents
================================================
     4     66.1 MiB     66.1 MiB   def foo():
     5    104.2 MiB     38.2 MiB       x = np.random.rand(1000000, 5)
     6    104.4 MiB      0.2 MiB       y = pd.DataFrame(x, columns=list('abcde'))
     7    142.6 MiB     38.2 MiB       y.rename(columns = {'e': 'f'}, inplace=True)
     8    142.6 MiB      0.0 MiB       return y

You can see a couple things:

  1. When y is created, it is just a light wrapper around the original array; i.e., no data is copied.

  2. When the column in y is renamed, it results in duplication of the entire data array in memory (it's the same 38MB increment as when x is created in the first place).

So, unless I'm missing something, it appears that Pandas operates more like R's dataframes than R's data tables.


Edit: Note that rename() has an argument copy that controls this behavior and defaults to True. For example, using this:

y.rename(columns = {'e': 'f'}, inplace=True, copy=False)

... results in an inplace operation without copying data.

Alternatively, you can modify the columns attribute directly:

y.columns = ['a', 'b', 'c', 'd', 'f']
r/dataengineering on Reddit: What actually defines a DataFrame?
March 24, 2025

I fear this is more a philosophical question than a technical one, but I am a bit confused. I've been thinking a lot about what makes something a DataFrame, not just in terms of syntax or library, but from a conceptual standpoint.

My current definition is as such:

A DataFrame is a language-native, programmable interface for querying and transforming tabular data. It's designed to be embedded directly in general-purpose programming workflows.

I like this because it focuses on what a DataFrame is for, rather than what specific tools or libraries implement it.

I think, however, that this definition is too general and can lead to anything tabular with an API being described as a DF.

Properties that are not exclusive across DataFrames which I previously thought defined them:

  • mutability

    • pandas: mutable, you can add/remove/overwrite columns directly.

    • Spark DataFrames: immutable, transformations return new logical plans.

    • Polars (lazy mode): immutable, transformations build a new plan.

  • execution model

    • pandas: eager, executes immediately.

    • Spark / Polars (lazy): lazy, builds DAGs and executes on trigger.

  • in memory

    • pandas / polars: usually in-memory.

    • Spark: can spill to disk or operate on distributed data.

    • Ibis: abstract, backend might not be memory-bound at all.

Curious how others would describe and define DataFrames.
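The first bullet (pandas' in-place mutability) is easy to see in a tiny sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# pandas frames are mutable: assigning a new column modifies df in place,
# rather than returning a new frame (contrast with Spark/Polars lazy frames)
df['b'] = df['a'] * 2

print(df.columns.tolist())
```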
