Answer from Jedi on Stack Overflow: here's a non-UDF way involving a single pivot (hence, just a single column scan to identify all the unique dates).
dff = mydf.groupBy('id').pivot('day').agg(F.first('price').alias('price'),F.first('units').alias('unit'))
Here's the result (apologies for the non-matching ordering and naming):
+---+-------+------+-------+------+-------+------+-------+------+
| id|1_price|1_unit|2_price|2_unit|3_price|3_unit|4_price|4_unit|
+---+-------+------+-------+------+-------+------+-------+------+
|100| 23| 10| 45| 11| 67| 12| 78| 13|
|101| 23| 10| 45| 13| 67| 14| 78| 15|
|102| 23| 10| 45| 11| 67| 16| 78| 18|
+---+-------+------+-------+------+-------+------+-------+------+
We just aggregate both the price and the units columns after pivoting on the day.
If you need the column naming as in the question:
dff.select([F.col(c).name('_'.join(x for x in c.split('_')[::-1])) for c in dff.columns]).show()
+---+-------+------+-------+------+-------+------+-------+------+
| id|price_1|unit_1|price_2|unit_2|price_3|unit_3|price_4|unit_4|
+---+-------+------+-------+------+-------+------+-------+------+
|100| 23| 10| 45| 11| 67| 12| 78| 13|
|101| 23| 10| 45| 13| 67| 14| 78| 15|
|102| 23| 10| 45| 11| 67| 16| 78| 18|
+---+-------+------+-------+------+-------+------+-------+------+
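For completeness, here is a self-contained sketch of the same approach; the SparkSession setup and sample rows below are assumptions made only to produce a runnable snippet, not the question's exact input.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical long-format input: one row per (id, day) with a price and units.
mydf = spark.createDataFrame(
    [(100, 1, 23, 10), (100, 2, 45, 11), (100, 3, 67, 12), (100, 4, 78, 13),
     (101, 1, 23, 10), (101, 2, 45, 13), (101, 3, 67, 14), (101, 4, 78, 15)],
    ['id', 'day', 'price', 'units'])

# Single pivot on day, keeping both price and units for each day.
dff = mydf.groupBy('id').pivot('day').agg(F.first('price').alias('price'),
                                          F.first('units').alias('unit'))

# Rename e.g. 1_price -> price_1 to match the question's naming.
dff.select([F.col(c).alias('_'.join(c.split('_')[::-1])) for c in dff.columns]).show()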
The solution in the question is the best I could get. The only improvement would be to cache the input dataset to avoid the double scan, i.e.
mydf.cache()
pivot_udf(mydf,'price','units').show()
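For context, the question's pivot_udf is not reproduced in this section, so the following is only a hypothetical reconstruction of such a helper (pivot each value column separately on a combined key, then join on id), shown with cache() applied so both pivots reuse the same scan:
from pyspark.sql import functions as F

def pivot_udf(df, *cols):
    # Hypothetical helper: pivot on a 'colname_day' key once per value column,
    # then join the per-column results back together on id.
    out = df.select('id').distinct()
    for c in cols:
        pivoted = (df.withColumn('combcol', F.concat(F.lit(c + '_'), F.col('day').cast('string')))
                     .groupBy('id').pivot('combcol').agg(F.first(c)))
        out = out.join(pivoted, 'id')
    return out

mydf.cache()  # keep the input in memory so each pivot reuses the same scan
pivot_udf(mydf, 'price', 'units').show()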
Pivot is an expensive shuffle operation and should be avoided if possible. Instead, you can collapse the columns dynamically with arrays_zip and explode and then group by and aggregate.
from pyspark.sql import functions as F
df.withColumn("cols", F.explode(F.arrays_zip(F.array([F.array(F.col(x),F.lit(x))\
for x in df.columns if x!='id']))))\
.withColumn("name", F.col("cols.0")[1]).withColumn("val", F.col("cols.0")[0]).drop("cols")\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
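Both the arrays_zip example above and the map-based one below assume a wide DataFrame with an id column and string-valued columns; the rows here are a hypothetical example consistent with the output shown:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the counts above (c1: 2/2, c2: 0/4, c3: 1/3).
df = spark.createDataFrame(
    [(1, 'diff', 'same', 'diff'),
     (2, 'same', 'same', 'same'),
     (3, 'diff', 'same', 'same'),
     (4, 'same', 'same', 'same')],
    ['id', 'c1', 'c2', 'c3'])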
You can also do this by dynamically building a map of column name to value with create_map and exploding that MapType column.
from pyspark.sql import functions as F
from itertools import chain
df.withColumn("cols", F.create_map(*(chain(*[(F.lit(name), F.col(name))\
for name in df.columns if name!='id']))))\
.select(F.explode("cols").alias("name","val"))\
.groupBy("name").agg(F.count(F.when(F.col("val")=='diff',1)).alias("diff"),\
F.count(F.when(F.col("val")=='same',1)).alias("same")).orderBy("name").show()
#+----+----+----+
#|name|diff|same|
#+----+----+----+
#| c1| 2| 2|
#| c2| 0| 4|
#| c3| 1| 3|
#+----+----+----+
from pyspark.sql.functions import *

# Sample data: an id column and three category columns holding 'diff'/'same'.
df = spark.createDataFrame([(1,'diff','same','diff'), (2,'same','same','same'),
                            (3,'diff','same','same'), (4,'same','same','same')],
                           ['idcol','C1','C2','C3'])
df.createOrReplaceTempView("MyTable")
# Unpivot: one row per (idcol, column name, value) via UNION ALL.
x1 = spark.sql("select idcol, 'C1' as col, C1 as val from MyTable "
               "union all select idcol, 'C2' as col, C2 as val from MyTable "
               "union all select idcol, 'C3' as col, C3 as val from MyTable")
# display(x1)  # optional: inspect the unpivoted rows
# Pivot on the value and count occurrences per original column name.
x2 = x1.groupBy('col').pivot('val').agg(count('val')).orderBy('col')
display(x2)
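As a side note, the manual UNION ALL unpivot can also be expressed with Spark SQL's built-in stack function; a brief sketch against the same MyTable view (not part of the original answer):
# stack(n, label1, col1, ...) emits n (col, val) rows per input row.
x1 = spark.sql("""
    select idcol, stack(3, 'C1', C1, 'C2', C2, 'C3', C3) as (col, val)
    from MyTable
""")
x1.groupBy('col').pivot('val').agg(count('val')).orderBy('col').show()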