DynamicFrame is safer when handling memory-intensive jobs. "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," while on the other hand, a Spark DataFrame could hit an "Out of memory" issue on executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)
DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.
For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames track both types and give users a number of options for resolving these inconsistencies, providing fine-grained resolution options via the ResolveChoice transform.
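To make the choice-type idea concrete, here is a minimal pure-Python sketch, not the actual Glue API: the record shapes and the helper name `resolve_cast_to_string` are invented for illustration, mimicking what ResolveChoice's cast option does when a column holds both strings and structs.

```python
# Pure-Python sketch (not the Glue API) of a "choice" column:
# some records store `address` as a string, others as a struct (dict).
records = [
    {"name": "Ann", "address": "12 Oak St, Springfield"},
    {"name": "Bob", "address": {"street": "34 Elm St", "city": "Shelbyville"}},
]

# A DynamicFrame-style reader notes both observed types instead of failing.
observed_types = {type(r["address"]).__name__ for r in records}
assert observed_types == {"str", "dict"}

def resolve_cast_to_string(record):
    """Mimic a cast-to-string resolution: flatten the struct variant."""
    addr = record["address"]
    if isinstance(addr, dict):
        record = {**record, "address": ", ".join(str(v) for v in addr.values())}
    return record

resolved = [resolve_cast_to_string(r) for r in records]
```

After resolution every record carries a single, consistent type for the column, which is what downstream DataFrame-style processing requires.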
DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In addition, the ApplyMapping transform supports complex renames and casting in a declarative fashion.
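As a rough illustration of what Relationalize-style flattening produces, here is a pure-Python sketch, not the actual transform: the record shape is invented, but it shows the two moves involved, flattening structs into dotted columns and pivoting arrays out into a child table linked by a key.

```python
# Pure-Python sketch (not the actual Relationalize transform) of flattening
# one nested record into a parent row plus a child table joined by a key.
order = {
    "order_id": 7,
    "customer": {"name": "Ann", "city": "Springfield"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

# Structs are flattened into dotted columns on the parent row.
parent_row = {
    "order_id": order["order_id"],
    "customer.name": order["customer"]["name"],
    "customer.city": order["customer"]["city"],
}

# Arrays are pivoted into a separate table, joined back via order_id.
items_table = [{"order_id": order["order_id"], **item} for item in order["items"]]
```

The parent row and the items table are both flat, so each can be loaded directly into a relational database table.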
DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system allows the script to implicitly keep track of what was read and written. (https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)
(The answer above is from Fang Zhang on Stack Overflow.)
You can refer to the documentation here: DynamicFrame Class. It says,
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.
You want to use a DynamicFrame when:
- Your data does not conform to a fixed schema.
Note: You can also convert a DynamicFrame to a DataFrame using toDF().
- Refer here: def toDF
AWS Glue, is using toDF costly?
Yes, converting to/from a DataFrame is costly in terms of time, and you also lose some of the Glue-specific features, like better reliability in high-memory-pressure scenarios.
That being said, if you use Glue Studio to do "add a column," you'll see that even AWS uses toDF(), does a thing, then fromDF(), everywhere.
I know lots of people just skip over Glue's DynamicFrame entirely, or they use it to get a DyF from the Glue Data Catalog and then immediately convert to a DF and just use that.
DyFs also have the advantage of being forgiving when your data set has mixed data types for a single column. While this isn't typically an issue once you're past the data cleaning stage of an ETL pipeline, it can be invaluable for the data cleaning stage itself.
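To see why that forgiveness matters during cleaning, here is a small pure-Python sketch, not Glue code: the column name and helper are invented, but it shows a column arriving with mixed types being coerced to one type in a single cleaning pass rather than the load failing at the first odd value.

```python
# Pure-Python sketch: a `price` column arrives as int, str, and float
# across records, as often happens with raw, uncleaned input.
raw = [{"price": 10}, {"price": "12.50"}, {"price": 9.99}]

def clean_price(record):
    # Coerce every observed variant to float in one cleaning pass,
    # instead of the load failing on the first string value.
    return {**record, "price": float(record["price"])}

cleaned = [clean_price(r) for r in raw]
```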
DE intern here!
Hey, so I'm trying to figure out why AWS even created dynamic frames: what advantage does it give over a Spark data frame?
In an AWS Glue notebook, after creating a DynamicFrame, I wonder: is it costly to convert it to a normal Spark DF using .toDF()?
I mean, the Glue DynamicFrame is frustrating. For example, I want a simple .withColumn("new_col", lit("abc")) in PySpark, but DynamicFrame doesn't have a withColumn method, so I have to do it in a very complex way.
fromDF is a class method. Here is how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame

# test_df is an existing Spark DataFrame; "test_nest" names the resulting frame
dyf = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame into a DynamicFrame (the method fromDF doesn't exist in the Scala API of the DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!