train.Id is a pandas Series and is one-dimensional. train is a pandas DataFrame and is two-dimensional. shape is an attribute that both DataFrames and Series have, and it is always a tuple. For a Series the tuple has only one value, (x,). For a DataFrame, shape is a tuple with two values, (x, y). So train.Id.shape[0] returns the same row count as train.shape[0] (1437 in the example below). However, train.Id.shape[1] would raise an IndexError, while train.shape[1] gives you the number of columns in train.
Furthermore, pandas Panel objects were three-dimensional, and shape for a Panel returned a tuple (x, y, z). Note that Panel was deprecated and has been removed from pandas as of version 0.25.
import numpy as np
import pandas as pd
train = pd.DataFrame(dict(Id=np.arange(1437), A=np.arange(1437)))
print(train.shape)     # DataFrame: (rows, columns)
print(train.Id.shape)  # Series: (rows,)
(1437, 2)
(1437,)
Answer from piRSquared on Stack Overflow
Difference between .shape[0] and .shape[1]
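As a supplement to the answer above (not part of it), here is a minimal sketch of the same idea. It rebuilds the train frame from the answer and shows that indexing shape[1] on a Series fails with an IndexError, because the tuple has only one element. The variable names n_rows, n_cols and error_message are just illustrative.

```python
import numpy as np
import pandas as pd

# Same toy frame as in the answer above
train = pd.DataFrame({"Id": np.arange(1437), "A": np.arange(1437)})

# A DataFrame's shape unpacks into (rows, columns)
n_rows, n_cols = train.shape
print(n_rows, n_cols)

# A Series' shape is a one-element tuple, so only index 0 is valid
print(train.Id.shape[0])

error_message = None
try:
    train.Id.shape[1]
except IndexError as exc:
    # Plain Python tuple indexing error, nothing pandas-specific
    error_message = str(exc)
    print("IndexError:", error_message)
```

If you only need the number of dimensions, train.ndim and train.Id.ndim give 2 and 1 respectively.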
What does .isnull() and shape[0] do in Pandas?
Hey
I am pretty new to Pandas. Do you know what the first and second line do?
Specifically, I'm trying to understand what .isnull(), :] is doing,
as well as df.shape[0] and the 2 at the end? :S
Thanks for your help.
salary_nan_df = survey_df.loc[survey_df['ConvertedSalary'].isnull(), :]
percentage = round((salary_nan_df.shape[0] / survey_df.shape[0]) * 100, 2)
print(str(percentage)+"% ("+str(salary_nan_df.shape[0])+") of responders have not filled in their salary.")
51.75% (51153) of responders have not filled in their salary.
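To unpack the two lines asked about, here is a small sketch on a made-up four-row survey_df (the real survey data is not shown in the post, so the values below are purely illustrative). .isnull() produces a boolean Series, .loc[mask, :] keeps the rows where it is True and all columns, shape[0] counts rows, and the trailing 2 is the number of decimal places passed to round().

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real survey_df
survey_df = pd.DataFrame({
    "Respondent": [1, 2, 3, 4],
    "ConvertedSalary": [50000.0, np.nan, 72000.0, np.nan],
})

# Step 1: True wherever the salary is missing (NaN)
mask = survey_df["ConvertedSalary"].isnull()

# Step 2: keep rows where mask is True; ':' means "all columns"
salary_nan_df = survey_df.loc[mask, :]

# Step 3: shape[0] is the row count of each frame
n_missing = salary_nan_df.shape[0]
n_total = survey_df.shape[0]

# Step 4: percentage of missing rows, rounded to 2 decimal places
percentage = round((n_missing / n_total) * 100, 2)
print(str(percentage) + "% (" + str(n_missing) + ") of responders have not filled in their salary.")
```

If you only need the count, survey_df['ConvertedSalary'].isnull().sum() gives the same number without building the intermediate DataFrame.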
Get into an interactive Python session with numpy and pandas, and experiment.
Make a dataframe:
In [394]: df=pd.DataFrame(np.eye(3))
In [395]: df
Out[395]:
     0    1    2
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
Check its shape. That's a tuple (basic Python object):
In [396]: df.shape
Out[396]: (3, 3)
In [397]: df.shape[0] # first element of the tuple
Out[397]: 3
Calling repeat with df.shape[0] is just like calling it with the number 3:
In [398]: np.repeat('red', df.shape[0])
Out[398]: array(['red', 'red', 'red'], dtype='<U3')
Pandas and numpy are running in Python. So the regular evaluation order of Python applies.
The red_df.shape[0] part just returns an integer: the total number of rows in red_df. It is used to build the new 'Color' column with the same number of rows as red_df, so that when we append it to white_df later, the values stay aligned and no empty rows appear in the other columns.
You can simply replace that expression with the literal row count. Instead of
color_red = np.repeat('red', red_df.shape[0])
you can write
color_red = np.repeat('red', 1599)
The full program would be:
import pandas as pd
import numpy as np
df_red = pd.read_csv('winequality-red.csv',sep=';')
df_white = pd.read_csv('winequality-white.csv',sep=';')
print(df_red.info())
print(df_red.shape[0])
# shape[0] is the number of rows (1599 here); shape[1] is the number of columns (12 here)
# create color array for red dataframe
color_red = np.repeat('red', 1599)
# create color array for white dataframe
color_white = np.repeat('white', df_white.shape[0])
df_red['color'] = color_red
df_white['color'] = color_white
#combine data frame into one data frame called wine_df
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
wine_df = pd.concat([df_red, df_white])
print(wine_df.head())
wine_df.to_csv('winequality_edited.csv', index=False)
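A compact variant of the same idea, sketched on two tiny made-up frames instead of the wine CSV files (so the column name quality and the values are assumptions for illustration only). It shows np.repeat with shape[0] next to the simpler scalar assignment, which pandas broadcasts to every row, and then stacks the frames with pd.concat.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-ins for the red and white wine data
df_red = pd.DataFrame({"quality": [5, 6, 5]})
df_white = pd.DataFrame({"quality": [6, 7]})

# Option 1: explicit array with one label per row, sized by shape[0]
df_red["color"] = np.repeat("red", df_red.shape[0])

# Option 2: assign a scalar; pandas fills it into every row
df_white["color"] = "white"

# Stack the two frames; ignore_index=True renumbers the rows from 0
wine_df = pd.concat([df_red, df_white], ignore_index=True)
print(wine_df.shape)
print(wine_df)
```

Using shape[0] (or the scalar assignment) rather than a hard-coded 1599 means the code keeps working if the number of rows in the file changes.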
Hi all
I'm doing a course on Udemy, so this code probably looks familiar to some of you. I'm trying to take the list of prices from the original dataset (shape is (506,)) and convert it to log prices.
Converting it to log prices gives the same (506,) shape, but when I try to convert this into a DataFrame, the shape ends up being (0, 1), so I have issues later down the line trying to merge this with another DataFrame that's shaped (506, 11).
Someone in course comments mentioned that using np.log1p instead of np.log will fix the issue, but unfortunately I'm having the same issue.
Any help would be greatly appreciated. I've tried asking in the course comments, and even tried Google and troubleshooting with ChatGPT, but no dice 🥲
Code:
# Gather Data
data = pd.DataFrame(data=boston_dataset.data, columns=boston_dataset.feature_names)
data.head()
features = data.drop(['INDUS', 'AGE'], axis=1)
# Check the shape of the target values in the dataset
print("Shape of boston_dataset.target:", boston_dataset.target.shape)
# Convert to log prices
#log_prices = np.log(boston_dataset.target)
log_prices = np.log1p(boston_dataset.target)
# Ensure log_prices has the correct shape
print("Shape of log_prices:", log_prices.shape)
print("First 5 elements of log_prices:", log_prices[:5])
# Create a new DataFrame for the target prices
target = pd.DataFrame(log_prices, columns=['PRICE'])
# Check the shape and contents of the target DataFrame
print("Target DataFrame shape:", target.shape)
print("First 5 rows of target DataFrame:\n", target.head())
# Check if the DataFrames are empty
if features.empty:
    print("Error: Features DataFrame is empty")
if target.empty:
    print("Error: Target DataFrame is empty")
# If 'CHAS' is categorical, convert it to numeric
if features['CHAS'].dtype == 'category':
    features['CHAS'] = features['CHAS'].astype('int')
# Convert RAD to integer
if features['RAD'].dtype == 'category':
    features['RAD'] = features['RAD'].astype('int')

The outputs for the above print statements are:
Shape of boston_dataset.target: (506,)
Shape of log_prices: (506,)
First 5 elements of log_prices: 0    3.218876
1    3.117950
2    3.575151
3    3.538057
4    3.616309
Name: MEDV, dtype: float64
Target DataFrame shape: (0, 1)
First 5 rows of target DataFrame:
 Empty DataFrame
Columns: [PRICE]
Index: []
Error: Target DataFrame is empty
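A likely explanation, judging from the "Name: MEDV" line in the output: log_prices is a named pandas Series, not a plain NumPy array. When a named Series is passed to pd.DataFrame together with columns=['PRICE'], pandas appears to treat columns as a selection and looks for a column called PRICE, finds only MEDV, and returns an empty frame. The sketch below reproduces this on a three-value made-up Series (target_values is a hypothetical stand-in for boston_dataset.target) and shows two ways around it.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for boston_dataset.target: a Series named 'MEDV'
target_values = pd.Series([24.0, 21.6, 34.7], name="MEDV")
log_prices = np.log(target_values)  # still a Series named 'MEDV'

# Reproduces the problem: 'PRICE' does not match the Series name
broken = pd.DataFrame(log_prices, columns=["PRICE"])
print("broken:", broken.shape)

# Fix 1: rename while converting the Series to a one-column DataFrame
fixed = log_prices.to_frame(name="PRICE")
print("fixed:", fixed.shape)

# Fix 2: drop the pandas labels first, then name the column freely
also_fixed = pd.DataFrame({"PRICE": log_prices.to_numpy()})
print("also_fixed:", also_fixed.shape)
```

This would also explain why switching between np.log and np.log1p made no difference: both return a Series with the same name, so the column mismatch is unchanged.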