DynamicFrame is safer for memory-intensive jobs. Per the AWS docs, "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," while a Spark DataFrame can run into out-of-memory (OOM) errors on executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)

DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.

For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames track both types and give users a number of options for resolving these inconsistencies, providing fine-grained resolution options via the ResolveChoice transform.
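To make the choice-type idea concrete, here is a minimal plain-Python sketch of what ResolveChoice's `make_cols` strategy does conceptually (the `resolve_choice_make_cols` helper and the sample records are hypothetical illustrations, not the awsglue API):

```python
# Toy illustration of ResolveChoice's "make_cols" strategy -- NOT the Glue API.
# A mixed-type "address" field is split into address_string / address_struct,
# one output column per type actually observed in the data.

def resolve_choice_make_cols(records, field):
    type_names = {str: "string", dict: "struct", int: "int", float: "double"}
    resolved = []
    for rec in records:
        new_rec = {k: v for k, v in rec.items() if k != field}
        if field in rec:
            suffix = type_names.get(type(rec[field]), "unknown")
            new_rec[f"{field}_{suffix}"] = rec[field]
        resolved.append(new_rec)
    return resolved

records = [
    {"id": 1, "address": "123 Main St"},                         # string form
    {"id": 2, "address": {"street": "Main St", "number": 123}},  # struct form
]
resolved = resolve_choice_make_cols(records, "address")
# resolved[0] carries "address_string"; resolved[1] carries "address_struct"
```

In a real Glue script the equivalent step would be the ResolveChoice transform with `choice = "make_cols"`; other strategies (`cast`, `make_struct`, `project`) resolve the ambiguity differently.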

DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In addition, the ApplyMapping transform supports complex renames and casts in a declarative fashion.
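The declarative rename-and-cast style of ApplyMapping can be sketched in plain Python (the `apply_mapping` helper below is hypothetical, not the awsglue API; it only mirrors the 4-tuple `(source, source_type, target, target_type)` mapping shape that `ApplyMapping.apply()` takes in Glue scripts):

```python
# Toy version of ApplyMapping's declarative rename+cast -- NOT the awsglue API.
# Each mapping is (source_field, source_type, target_field, target_type),
# the same 4-tuple shape used by ApplyMapping.apply() in Glue scripts.

CASTS = {"string": str, "int": int, "long": int, "double": float}

def apply_mapping(records, mappings):
    out = []
    for rec in records:
        new_rec = {}
        for src, _src_type, dst, dst_type in mappings:
            if src in rec and rec[src] is not None:
                new_rec[dst] = CASTS[dst_type](rec[src])  # rename + cast
        out.append(new_rec)
    return out

rows = [{"tpep_pickup_datetime": "2017-01-01 00:00:00", "trip_distance": "1.5"}]
mapped = apply_mapping(rows, [
    ("tpep_pickup_datetime", "string", "pickup_datetime", "string"),
    ("trip_distance", "string", "trip_distance", "double"),
])
```

The point of the declarative form is that renames and type casts are data, not code: the same mapping list can be generated, inspected, or stored independently of the transformation logic.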

DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system allows the script to implicitly keep track of what was read and written. (https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)
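To round out the picture, the Relationalize transform mentioned earlier can be sketched the same way: nested collections are pulled out into child tables keyed back to the parent row, which is the shape a relational database expects. The `relationalize` helper below is a hypothetical toy, not Glue's implementation (Glue also handles nested structs, arbitrary depth, and emits a DynamicFrameCollection):

```python
# Toy sketch of Relationalize -- NOT the Glue implementation. Nested list
# fields are pulled out into a child table linked to the parent row by id.

def relationalize(records, root="root"):
    parent, children = [], {}
    for i, rec in enumerate(records):
        flat = {"id": i}
        for key, value in rec.items():
            if isinstance(value, list):
                # spill the list into a child table named <root>_<field>
                table = children.setdefault(f"{root}_{key}", [])
                for item in value:
                    table.append({"parent_id": i, "value": item})
            else:
                flat[key] = value
        parent.append(flat)
    return {root: parent, **children}

orders = [{"customer": "ann", "items": ["apple", "pear"]},
          {"customer": "bob", "items": ["fig"]}]
tables = relationalize(orders, root="orders")
# tables["orders"] has 2 flat rows; tables["orders_items"] has 3 child rows
```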

Answer from Fang Zhang on Stack Overflow
GitHub
aws-glue-samples/FAQ_and_How_to.md at master · aws-samples/aws-glue-samples
Once all the choice types in your DynamicFrame are resolved, you can convert it to a data frame using the 'toDF()' method. b. How do I write to targets that do not handle ChoiceTypes?
Author   aws-samples
BMC Software
AWS Glue ETL Transformations – BMC Software | Blogs
August 21, 2020 - from pyspark.context import ... a new DynamicFrame by taking the fields in the paths list. We use toDF().show() to turn it into Spark Dataframe and print the results....
Discussions

amazon web services - DynamicFrame vs DataFrame - Stack Overflow
Note: You can also convert the DynamicFrame to DataFrame using toDF() ... A dataframe will have a set schema (schema on read). Your data can be nested, but it must be schema on read. In the case where you can't do schema on read a dataframe will not work. I have yet to see a case where someone is using dynamic frames ... More on stackoverflow.com
pyspark - convert spark dataframe to aws glue dynamic frame - Stack Overflow
I tried converting my spark dataframes to dynamic to output as glueparquet files but I'm getting the error 'DataFrame' object has no attribute 'fromDF'" My code uses heavily spark dataframes. Is... More on stackoverflow.com
pyspark - How to convert Dataframe to dynamic frame - Stack Overflow
I am new to AWS glue and I am trying to run some transformation process using pyspark. I successfully ran my ETL but I am looking for another way of converting dataframe to dynamic frame. import sy... More on stackoverflow.com
Glue DynamicFrame show method yields nothing
I would like to inform DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially. Instead, AWS Glue computes a schema on-the-fly when required. Basically Glue DynamicFrame is based on RDD due to which show() method does not work directly and you need to convert dynamic frame ... More on repost.aws
June 10, 2022
AWS re:Post
Using Pandas in Glue ETL Job ( How to convert Dynamic DataFrame or PySpark Dataframe to Pandas Dataframe) | AWS re:Post
April 29, 2022 - ... Would say convert Dynamic frame to Spark data frame using .ToDF() method and from spark dataframe to pandas dataframe using link https://sparkbyexamples.com/pyspark/convert-pyspark-dataframe-to-pandas/#:~:text=Convert PySpark Dataframe to ...
Top answer (1 of 4, 44 votes): the answer quoted in full at the top of this page.

Answer 2 of 4 (11 votes)

You can refer to the documentation here: DynamicFrame Class. It says,

A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.

You want to use a DynamicFrame when:

  • Your data does not conform to a fixed schema.

Note: You can also convert the DynamicFrame to DataFrame using toDF()

  • Refer here: def toDF
Jayendra's Cloud Certification Blog
Dynamic Frames Archives - Jayendra's Cloud Certification Blog
Conversion can be done between Dynamic frames and Spark dataframes, to take advantage of both AWS Glue and Spark transformations to do the kinds of analysis needed.
LinkedIn
Programmatically adding a column to a Dynamic DataFrame in AWS Glue
February 23, 2021 - from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.job import Job from awsglue.dynamicframe import DynamicFrame # Extra import to get the current_date function # from pyspark.sql.functions import current_date datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test_data", table_name = "test_file", transformation_ctx = "datasource0") # # The 2 extra lines below convert between the # original Glue Dataframe to a Spark Dataframe, add the new column # then co
Edlitera
PySpark Cheat Sheet | Edlitera
January 17, 2023 - dfg.toDF().write.parquet("s3://glue-sample-target/outputdir/dfg", partitionBy=["example_column"]) dfg = DynamicFrame.fromDF(df, glueContext, "dfg") glueContext.write_dynamic_frame.from_options( frame=dfg, connection_type="s3", connection_options={"path": "s3://glue-sample-target/outputdir/dfg"}, format="parquet") Display DataFrame content: df.show() Display DataFrame schema: df.schema() Display DataFrame as a Pandas DataFrame: df.toPandas() Return DataFrame columns: df.columns ·
Stack Overflow
pyspark - How to convert Dataframe to dynamic frame - Stack Overflow
import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from awsglue.context import GlueContext from awsglue.job import Job glueContext = GlueContext(SparkContext.getOrCreate()) # load data from crawler students = glueContext.create_dynamic_frame.from_catalog(database="example_db", table_name="samp_csv") # move data into a new variable for transformation students_trans = students # convert dynamicframe(students_trans) to dataframe students_= students_trans.toDF() # run transformation change column names/ drop columns
Medium
TIL: AWS Glue Dynamic Dataframe Tips toDf() — Use ResolveChoice for Mixed Data types in a column | by Satyaprakash Bommaraju | Today I Learnt | Medium
February 18, 2023 - Recently, I had to create a Dataframe that contained mixed data types. ... raw_data_dydf = glueContext.create_dynamic_frame.from_options( format_options={"multiline": False}, connection_type="s3", format="json", connection_options={ "paths": [input_path], "recurse": False, }, transformation_ctx="raw_data", )
AWS re:Post
Glue DynamicFrame show method yields nothing | AWS re:Post
June 10, 2022 - Converting the DynamicFrame into a Spark DataFrame actually yields a result (df.toDF().show()). Here the dummy code that I'm using · glueContext = GlueContext(spark.sparkContext) df = glueContext.create_dynamic_frame_from_options( connection_type="s3", connection_options = { "paths": [f"s3://bucketname/filename"]}, format="json", format_options={"multiline": True} ) df.printSchema() df.show()
GitHub
AWS Glue error converting data frame to dynamic frame · Issue #49 aws-samples/aws-glue-samples · GitHub
June 11, 2019 - dfs = sqlContext.read.format(SNOWFLAKE_SOURCE_NAME).options(**sfOptions).option("query", "SELECT hashkey as hash From randomtable").load() #Source datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test", table_name = "randomtable", transformation_ctx = "datasource0") #add hash value df = datasource0.toDF() df.cache() df = df.withColumn("hashkey", sha2(concat_ws("||", *df.columns), 256)) #drop dupes df1 = df.dropDuplicates(subset=['hashkey']) #read incremental data inc = df1.join(dfs, df1["hashkey"] == dfs["hash"], how='left').filter(col('hash').isNull()) #convert it back to glue context datasource1 = DynamicFrame.fromDF(inc, glueContext, "datasource1")
Author   shanmukhakota
Medium
AWS Glue DynamicFrame to Pandas DataFrame - Jaspal Singh Saluja - Medium
November 5, 2023 - dynamic_frame.toDF().toPandas(). “AWS Glue DynamicFrame to Pandas DataFrame” is published by Jaspal Singh Saluja.
Reddit
r/aws on Reddit: AWS Glue Tutorial: Not sure how to get the name of the dynamic frame that is being used to write out the data
September 27, 2017 -

Hi everyone! I'm trying to follow this tutorial https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/ to understand AWS Glue a bit better, but I'm having a hard time with one of the steps.

In the job generation, they have this step

Let’s now convert that to a DataFrame. Please replace the <DYNAMIC_FRAME_NAME> with the name generated in the script.

And this snippet

 ##----------------------------------
 #convert to a Spark DataFrame...
 customDF = <DYNAMIC_FRAME_NAME>.toDF()

But I can't seem to find where the <DYNAMIC_FRAME_NAME> can be found. I thought it was customDF = resolvechoice2.toDF(), but it didn't run correctly.

Here's my entire code (with the edited names of the buckets, of course)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "nycitytaxianalysis", table_name = "blog_yellow",    transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
##----------------------------------
#convert to a Spark DataFrame...
customDF = resolvechoice2.toDF() <<---- HERE'S MY CODE 

#add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))

# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
##----------------------------------
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = customDynamicFrame]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Where can I find the <DYNAMIC_FRAME_NAME> ?

Thanks!