DynamicFrame is safer when handling memory-intensive jobs. "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," while on the other hand, a Spark DataFrame could hit an "Out of memory" issue on executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)
DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.
For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames track both types and give users a number of options for resolving these inconsistencies, providing fine-grained resolution options via the ResolveChoice transform.
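To make the choice-type idea concrete, here is a minimal pure-Python sketch, not the actual Glue API: the record shapes and the helper name `resolve_cast_to_string` are invented for illustration, mimicking what ResolveChoice's cast option does when a column holds both strings and structs.

```python
# Pure-Python sketch (not the Glue API) of a "choice" column:
# some records store `address` as a string, others as a struct (dict).
records = [
    {"name": "Ann", "address": "12 Oak St, Springfield"},
    {"name": "Bob", "address": {"street": "34 Elm St", "city": "Shelbyville"}},
]

# A DynamicFrame-style reader notes both observed types instead of failing.
observed_types = {type(r["address"]).__name__ for r in records}
assert observed_types == {"str", "dict"}

def resolve_cast_to_string(record):
    """Mimic a cast-to-string resolution: flatten the struct variant."""
    addr = record["address"]
    if isinstance(addr, dict):
        record = {**record, "address": ", ".join(str(v) for v in addr.values())}
    return record

resolved = [resolve_cast_to_string(r) for r in records]
```

After resolution every record carries a single, consistent type for the column, which is what downstream DataFrame-style processing requires.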
DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In addition, the ApplyMapping transform supports complex renames and casting in a declarative fashion.
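As a rough illustration of what Relationalize-style flattening produces, here is a pure-Python sketch, not the actual transform: the record shape is invented, but it shows the two moves involved, flattening structs into dotted columns and pivoting arrays out into a child table linked by a key.

```python
# Pure-Python sketch (not the actual Relationalize transform) of flattening
# one nested record into a parent row plus a child table joined by a key.
order = {
    "order_id": 7,
    "customer": {"name": "Ann", "city": "Springfield"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B2", "qty": 1},
    ],
}

# Structs are flattened into dotted columns on the parent row.
parent_row = {
    "order_id": order["order_id"],
    "customer.name": order["customer"]["name"],
    "customer.city": order["customer"]["city"],
}

# Arrays are pivoted into a separate table, joined back via order_id.
items_table = [{"order_id": order["order_id"], **item} for item in order["items"]]
```

The parent row and the items table are both flat, so each can be loaded directly into a relational database table.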
DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system allows the script to implicitly keep track of what was read and written. (https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)
(The answer above is from Fang Zhang on Stack Overflow.)
You can refer to the documentation here: DynamicFrame Class. It says,
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.
You want to use a DynamicFrame when:
- Your data does not conform to a fixed schema.
Note: You can also convert a DynamicFrame to a DataFrame using toDF().
- Refer here: def toDF
AWS Glue, is using toDF costly?
Yes, converting to/from a DataFrame is costly in terms of time, and you also lose some of the Glue-specific features, like better reliability in high-memory-pressure scenarios.
That being said, if you use Glue Studio to do "add a column," you'll see that even AWS uses toDF(), does a thing, then fromDF(), everywhere.
I know lots of people just skip over Glue's DynamicFrame entirely, or they use it to get a DyF from the Glue Data Catalog and then immediately convert to a DF and just use that.
DyFs also have the advantage of being forgiving when your data set has mixed data types for a single column. While this isn't typically an issue once you're past the data cleaning stage of an ETL pipeline, it can be invaluable for the data cleaning stage itself.
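To see why that forgiveness matters during cleaning, here is a small pure-Python sketch, not Glue code: the column name and helper are invented, but it shows a column arriving with mixed types being coerced to one type in a single cleaning pass rather than the load failing at the first odd value.

```python
# Pure-Python sketch: a `price` column arrives as int, str, and float
# across records, as often happens with raw, uncleaned input.
raw = [{"price": 10}, {"price": "12.50"}, {"price": 9.99}]

def clean_price(record):
    # Coerce every observed variant to float in one cleaning pass,
    # instead of the load failing on the first string value.
    return {**record, "price": float(record["price"])}

cleaned = [clean_price(r) for r in raw]
```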
DE intern here!
Hey, so I'm trying to figure out why AWS even created dynamic frames: what advantage does it give over a Spark data frame?
In an AWS Glue notebook, after creating a DynamicFrame, I wonder: is it costly to convert it to a normal Spark DF using .toDF()?
I mean, the Glue DynamicFrame is frustrating. For example, I want a simple .withColumn("new_col", lit("abc")) in PySpark, but DynamicFrame doesn't have a withColumn method, so I have to do it in a very complex way.
fromDF is a class method. Here is how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame

# test_df is an existing Spark DataFrame; "test_nest" names the resulting frame
dyf = DynamicFrame.fromDF(test_df, glueContext, "test_nest")
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame into a DynamicFrame (the method fromDF doesn't exist in the Scala API of the DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!