I think that pandas offers better alternatives to what you're suggesting (rationale below).
For one, there was the pandas.Panel data structure, which was meant for things like you're doing here (note that Panel has since been deprecated and removed from pandas).
However, as Wes McKinney (the Pandas author) noted in his book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, multi-dimensional indices, to a large extent, offer a better alternative.
Consider the following alternative to your code:
import pandas as pd

dfs = []
for year in range(1967, 2014):
    # ... some code that generates df1, df2 and df3
    df1['year'] = year
    df1['origin'] = 'df1'
    df2['year'] = year
    df2['origin'] = 'df2'
    df3['year'] = year
    df3['origin'] = 'df3'
    dfs.extend([df1, df2, df3])
df = pd.concat(dfs)
This gives you a DataFrame with 4 columns: 'firm', 'price', 'year', and 'origin'.
This gives you the flexibility to:

- Organize hierarchically by, say, 'year' and 'origin' (df.set_index(['year', 'origin'])), or by, say, 'origin' and 'price' (df.set_index(['origin', 'price']))
- Do groupbys according to different levels
- In general, slice and dice the data along many different ways.
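For instance, here is a minimal sketch of the hierarchical indexing and level-based groupby (the firm/price values are made up for illustration):

```python
import pandas as pd

# Toy data standing in for the concatenated result above
df = pd.DataFrame({
    'firm': ['A', 'B', 'A', 'B'],
    'price': [10, 20, 11, 21],
    'year': [1967, 1967, 1968, 1968],
    'origin': ['df1', 'df1', 'df2', 'df2'],
})

# Organize hierarchically by year and origin
indexed = df.set_index(['year', 'origin'])

# Group by an index level, e.g. mean price per year
mean_price = indexed.groupby(level='year')['price'].mean()
print(mean_price)
```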
What you're suggesting in the question makes one dimension (origin) arbitrarily different, and it's hard to think of an advantage to this. If a split along some dimension is necessary due to, e.g., performance, you can combine DataFrames better with standard Python data structures:
A dictionary mapping each year to a Dataframe with the other three dimensions.
Three DataFrames, one for each origin, each having three dimensions.
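A minimal sketch of the first option, a dictionary mapping each year to a DataFrame (the data and names here are made up for illustration):

```python
import pandas as pd

# One DataFrame per year, keyed by year (toy data)
frames_by_year = {
    year: pd.DataFrame({'firm': ['A', 'B'],
                        'price': [10 + year % 5, 20 + year % 5],
                        'origin': ['df1', 'df2']})
    for year in range(1967, 1970)
}

# Recombine into a single DataFrame when needed:
# the dict keys become an index level, which we turn into a 'year' column
combined = pd.concat(frames_by_year, names=['year']).reset_index(level='year')
print(combined.columns.tolist())
```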
I had the same problem and resorted to dataclasses. In a nutshell, something like:
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class OrderDataFields:
    """
    data field names aka columns
    """
    order_id: str = 'order_id'
    order_amount: str = 'order_amount'
    order_date: str = 'order_date'


@dataclass
class OrderData:
    data: pd.DataFrame
    columns: OrderDataFields = OrderDataFields()
Now, when you are using your dataframe, wrap it in the class:
order_data = OrderData(data=order_df)
If you want to perform column checks every time you instantiate a Data object, you can define a BaseData class and inherit from it in, for example, your OrderData class:
@dataclass
class BaseData:
    """
    Base class for all Data dataclasses.

    Expected child attributes:
        `data` : pd.DataFrame -- data
        `columns`: object -- dataclass with column names

    Raises
    ------
    ValueError
        If columns in `columns` and `data` dataframe don't match
    """
    data: pd.DataFrame
    columns: object

    def __post_init__(self):
        self._check_columns()

    def _check_columns(self):
        """
        Check if columns in dataframe match the columns in `columns` attribute

        Raises
        ------
        ValueError
            If columns don't match
        """
        data_columns = set(self.data.columns)
        expected_columns = set(asdict(self.columns).values())
        if data_columns != expected_columns:
            raise ValueError(
                f"Dataframe columns {data_columns} don't match "
                f"expected columns {expected_columns}"
            )


@dataclass
class OrderData(BaseData):
    data: pd.DataFrame
    columns: OrderDataFields = OrderDataFields()
And now when you do some wrangling you can use the dataframe and its columns:
df = order_data.data
c = order_data.columns
df[c.order_amount] ....
....
Along those lines; adjust for your case.
There is also the library pandera: https://pandera.readthedocs.io/en/stable/
My first idea would be to include type hints and descriptive docstrings to functions responsible for loading a pandas DataFrame, e.g.:
import pandas as pd


def load_client_data(input_path: str) -> pd.DataFrame:
    """Loads the client DataFrame from a csv file, performs data preprocessing and returns it.

    The client DataFrame is formatted as follows:
        customer_id (string): represents a customer.
        order_id (string): represents an order.
        order_amount (int): represents the number of items bought.
        order_date (string): the date on which the order was made (YYYY-MM-DD).
        order_time (string): the time at which the order was made (HH:mm:ss).
    """
    client_data = pd.read_csv(input_path)
    preprocessed_client_data = do_preprocessing(client_data)
    return preprocessed_client_data
Ideally, all functions responsible for loading the datasets would be bundled together in a module, so that at the very least you know where to look whenever you're in doubt. Good/consistent variable names for your datasets will also help you keep track of what dataset you're working with in a downstream function.
Of course, this all adds a bit of coupling: if you decide to change the columns of a dataset, you need to remember to update the docstring, too. At the end of the day, however, it's a choice between flexibility and reliability: once your program grows in size and becomes more stable, I think it's a fair compromise.
You'll also want to perform any operations to the dataset itself (adding new columns, parsing the date into day/month/year columns, etc) as soon as possible, so that the docstring reflects these in-memory changes as well. If your datasets are being transformed all the way down in another function, ask yourself if you could do this earlier. If that's not possible, at least initialize the dataframe with empty columns that expect future data, and reflect this information on the docstring.
If you want to take this a step further, you can wrap all functions related to loading datasets into a DatasetManager class, which unifies the information of the datasets' signatures. You could even add a helper function to quickly view a docstring for a specific dataset: writing dataset_manager.get_info('client_data') could print out the docstring for the load_client_data function, for example.
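A minimal sketch of what such a manager could look like (all names here are assumptions for illustration, not an established API):

```python
import pandas as pd


class DatasetManager:
    """Bundles the dataset-loading functions and exposes their docstrings."""

    def __init__(self):
        # Map dataset names to their loader functions (hypothetical loaders)
        self._loaders = {'client_data': self.load_client_data}

    def load_client_data(self, input_path: str) -> pd.DataFrame:
        """Client DataFrame: customer_id, order_id, order_amount, order_date, order_time."""
        return pd.read_csv(input_path)

    def get_info(self, name: str) -> str:
        """Return the docstring of the loader for the given dataset."""
        return self._loaders[name].__doc__


manager = DatasetManager()
print(manager.get_info('client_data'))
```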
Lastly, there are a couple third-party modules that help you enforce data types in pandas DataFrames, if you're okay with that. An example is dataenforce, but as a disclaimer I've never used it personally.
I fear this is more a philosophical question than a technical one, but I am a bit confused. I've been thinking a lot about what makes something a DataFrame, not just in terms of syntax or library, but from a conceptual standpoint.
My current definition is as such:
A DataFrame is a language-native, programmable interface for querying and transforming tabular data. It's designed to be embedded directly in general-purpose programming workflows.
I like this because it focuses on what a DataFrame is for, rather than what specific tools or libraries implement it.
I think, however, that this definition is too general and could lead to anything tabular with an API being described as a DataFrame.
Properties that are not exclusive across DataFrames, which I previously thought defined them:

- Mutability
  - pandas: mutable, you can add/remove/overwrite columns directly.
  - Spark DataFrames: immutable, transformations return new logical plans.
  - Polars (lazy mode): immutable, transformations build a new plan.
- Execution model
  - pandas: eager, executes immediately.
  - Spark / Polars (lazy): lazy, builds DAGs and executes on trigger.
- In memory
  - pandas / Polars: usually in-memory.
  - Spark: can spill to disk or operate on distributed data.
  - Ibis: abstract, backend might not be memory-bound at all.
Curious how others would describe and define DataFrames.