You can try getItem(0):
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional.
Answer from Daniel de Paula on Stack Overflow
While you can use a UserDefinedFunction, it is very inefficient. Instead, it is better to use the concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
You can create a udf that joins the array into a single string and then apply it to the test_123 column:
from pyspark.sql.functions import udf, col
join_udf = udf(lambda x: ",".join(x))
df.withColumn("test_123", join_udf(col("test_123"))).show()
+----+----------------+
|uuid| test_123|
+----+----------------+
| 1|test,test2,test3|
| 2|test4,test,test6|
| 3|test6,test9,t55o|
+----+----------------+
The initial data frame is created from:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType
schema = StructType([StructField("uuid",IntegerType(),True),StructField("test_123",ArrayType(StringType(),True),True)])
rdd = sc.parallelize([[1, ["test","test2","test3"]], [2, ["test4","test","test6"]],[3,["test6","test9","t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+
Just:
from pyspark.sql.functions import col
# spark.table loads a table by name; spark.sql expects a full SQL statement
table = spark.table("table")
table.select([col(c).cast("string") for c in table.columns])
Here's a one-line solution in Scala:
df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
Let's see an example here:
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val data = Seq(
Row(1, "a"),
Row(5, "z")
)
val schema = StructType(
List(
StructField("num", IntegerType, true),
StructField("letter", StringType, true)
)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
schema
)
df.printSchema
//root
//|-- num: integer (nullable = true)
//|-- letter: string (nullable = true)
val newDf = df.select(df.columns.map(c => col(c).cast(StringType)) : _*)
newDf.printSchema
//root
//|-- num: string (nullable = true)
//|-- letter: string (nullable = true)
I hope it helps
A PySpark Row is just a tuple and can be used as such. All you need here is a simple map (or flatMap if you want to flatten the rows as well) with list:
data.map(list)
or if you expect different types:
data.map(lambda row: [str(c) for c in row])
The accepted answer is outdated. Since Spark 2.0, a PySpark DataFrame no longer exposes map directly, so you must convert it to an RDD explicitly by adding .rdd. The equivalent of this Spark 1.x statement:
data.map(list)
is now:
data.rdd.map(list)
This relates to the accepted answer in this post.