Here is one way: melt the DataFrame into long (id, day, name, value) format, then pivot on day.
from pyspark.sql import functions as F
# First combine price and units into a single map column
mydf = mydf.withColumn("price_units", F.create_map(F.lit("price"), F.col("price"), F.lit("units"), F.col("units")))
# Now explode the map to get a melted (long-format) dataframe: one row per (name, value) pair
mydf = mydf.select("id", "day", F.explode("price_units").alias("name", "value"))
+---+---+-----+-----+
| id|day| name|value|
+---+---+-----+-----+
|100|  1|price|   23|
|100|  1|units|   10|
|100|  2|price|   45|
|100|  2|units|   11|
|100|  3|price|   67|
etc
# Then pivot on day; each (id, name, day) cell holds a single value, so F.mean simply returns it
mydf.groupBy("id", "name").pivot("day").agg(F.mean("value")).show()
+---+-----+----+----+----+----+
| id| name| 1| 2| 3| 4|
+---+-----+----+----+----+----+
|100|price|23.0|45.0|67.0|78.0|
|101|price|23.0|45.0|67.0|78.0|
|102|units|10.0|11.0|16.0|18.0|
|100|units|10.0|11.0|12.0|13.0|
|101|units|10.0|13.0|14.0|15.0|
|102|price|23.0|45.0|67.0|78.0|
+---+-----+----+----+----+----+
Answer from ags29 on Stack Overflow