DynamicFrame is safer when handling memory-intensive jobs. Per the AWS docs, "The executor memory with AWS Glue dynamic frames never exceeds the safe threshold," whereas a Spark DataFrame can hit an "Out of memory" error on the executors. (https://docs.aws.amazon.com/glue/latest/dg/monitor-profile-debug-oom-abnormalities.html)
DynamicFrames are designed to provide maximum flexibility when dealing with messy data that may lack a declared schema. Records are represented in a flexible self-describing way that preserves information about schema inconsistencies in the data.
For example, with changing requirements, an address column stored as a string in some records might be stored as a struct in later rows. Rather than failing or falling back to a string, DynamicFrames track both types and give users a number of options for resolving these inconsistencies, providing fine-grained resolution options via the ResolveChoice transform.
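For instance, a minimal sketch, assuming a DynamicFrame dyf whose address column carries both a string and a struct type:
from awsglue.transforms import ResolveChoice
# Cast the ambiguous column to a single type...
resolved = ResolveChoice.apply(frame = dyf, specs = [("address", "cast:string")])
# ...or keep both variants as separate address_string and address_struct columns
resolved = ResolveChoice.apply(frame = dyf, choice = "make_cols")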
DynamicFrames also provide a number of powerful high-level ETL operations that are not found in DataFrames. For example, the Relationalize transform can be used to flatten and pivot complex nested data into tables suitable for transfer to a relational database. In addition, the ApplyMapping transform supports complex renames and casting in a declarative fashion.
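As a hedged sketch (the staging path and field names are placeholders, and dyf is assumed to be an existing DynamicFrame):
from awsglue.transforms import Relationalize, ApplyMapping
# Relationalize returns a collection of flat tables keyed by name
tables = Relationalize.apply(frame = dyf, staging_path = "s3://my-bucket/tmp/", name = "root")
root = tables.select("root")
# ApplyMapping takes (source field, source type, target field, target type) tuples
mapped = ApplyMapping.apply(frame = root, mappings = [("address.city", "string", "city", "string")])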
DynamicFrames are also integrated with the AWS Glue Data Catalog, so creating frames from tables is a simple operation. Writing to databases can be done through connections without specifying the password. Moreover, DynamicFrames are integrated with job bookmarks, so running these scripts in the job system allows the script to implicitly keep track of what was read and written. (https://github.com/aws-samples/aws-glue-samples/blob/master/FAQ_and_How_to.md)
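A sketch of both integrations (the database, table, and connection names are placeholders):
# Read a catalog table; transformation_ctx is what job bookmarks key on
dyf = glueContext.create_dynamic_frame.from_catalog(database = "my_database", table_name = "my_table", transformation_ctx = "datasource0")
# Write through a catalog connection, so no password appears in the script
glueContext.write_dynamic_frame.from_jdbc_conf(frame = dyf, catalog_connection = "my-jdbc-connection", connection_options = {"dbtable": "my_table", "database": "my_db"})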
Answer from Fang Zhang on Stack Overflow: amazon web services - DynamicFrame vs DataFrame - Stack Overflow
pyspark - convert spark dataframe to aws glue dynamic frame - Stack Overflow
pyspark - How to convert Dataframe to dynamic frame - Stack Overflow
Glue DynamicFrame show method yields nothing
You can refer to the documentation here: DynamicFrame Class. It says,
A DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required initially.
You want to use a DynamicFrame when:
- Your data does not conform to a fixed schema.
Note: You can also convert the DynamicFrame to DataFrame using toDF()
- Refer here: def toDF
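For example, a quick sketch (dyf and the column name are illustrative):
# Convert a DynamicFrame to a Spark DataFrame to use the full DataFrame API
df = dyf.toDF()
df.filter(df["passenger_count"] > 1).show()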
fromDF is a class function. Here is how you can convert a DataFrame to a DynamicFrame:
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(test_df, glueContext, "test_nest")
AWS Docs
Just to consolidate the answers for Scala users too, here's how to transform a Spark DataFrame to a DynamicFrame (the method fromDF doesn't exist in the Scala API of DynamicFrame):
import com.amazonaws.services.glue.DynamicFrame
val dynamicFrame = DynamicFrame(df, glueContext)
I hope it helps!
I don't think AWS Glue provides any mapping method for it. After some struggling, I found the transformation was relatively easy in plain PySpark. Here is the pseudo code:
from pyspark.sql.functions import explode, col
from awsglue.dynamicframe import DynamicFrame
# Retrieve the datasource from the Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = ...)
# Convert it to a DataFrame and transform it in Spark
mapped_df = datasource0.toDF().select(explode(col("Datapoints")).alias("collection")).select("collection.*")
# Convert back to a DynamicFrame and continue the rest of the ETL process
mapped_datasource0 = DynamicFrame.fromDF(mapped_df, glueContext, "mapped_datasource0")
Thanks to this reference
Check the function split_rows in the following link:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-split_rows
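As a hedged sketch (the column name and threshold are made up), records that satisfy the comparison should land in the first frame and the rest in the second:
# Split one DynamicFrame into two based on a per-column comparison
split = dyf.split_rows({"price": {">": 100}}, "expensive", "cheap")
expensive = split.select("expensive")
cheap = split.select("cheap")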
Here is also an example of how to transform the DynamicFrame directly, in case you need to do that first:
datasource = glueContext.create_dynamic_frame.from_catalog(database = ...)
# Function to modify a single record
def process_record(record):
    # Change existing fields or add new ones
    record["timestamp"] = record["Datapoints"] + '_suffix'  # Any change you need
    ...
    return record
processed_datasource = datasource.map(process_record)
Hi everyone! I'm trying to follow this tutorial https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/ to understand AWS Glue a bit better, but I'm having a hard time with one of the steps.
In the job generation, they have this step
Let’s now convert that to a DataFrame. Please replace the <DYNAMIC_FRAME_NAME> with the name generated in the script.
And this snippet
##----------------------------------
#convert to a Spark DataFrame...
customDF = <DYNAMIC_FRAME_NAME>.toDF()
But I can't seem to find where the <DYNAMIC_FRAME_NAME> can be found. I thought it was customDF = resolvechoice2.toDF(), but it didn't run correctly.
Here's my entire code (with the edited names of the buckets, of course)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
##----------------------------------
#convert to a Spark DataFrame...
customDF = resolvechoice2.toDF() # <<---- HERE'S MY CODE
#add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))
# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
##----------------------------------
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = customDynamicFrame]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
Where can I find the <DYNAMIC_FRAME_NAME>?
Thanks!
Why do you want to convert from DataFrame to DynamicFrame? You can't do unit testing using the Glue APIs - there are no mocks for them.
I prefer following approach:
- Write two files per Glue job - job_glue.py and job_pyspark.py
- Write Glue-API-specific code in job_glue.py
- Write non-Glue-API-specific code in job_pyspark.py
- Write pytest test cases to test job_pyspark.py (see the sketch after this list)
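A minimal sketch of that split (the file names, function name, and "type" column are illustrative):
# job_pyspark.py -- pure Spark logic, no Glue imports, so plain pytest can run it
from pyspark.sql import DataFrame
from pyspark.sql.functions import lit
def add_type_column(df: DataFrame, value: str) -> DataFrame:
    # Tag every row with a constant "type" column
    return df.withColumn("type", lit(value))
# test_job_pyspark.py -- runs against a local SparkSession, no Glue environment needed
from pyspark.sql import SparkSession
from job_pyspark import add_type_column
def test_add_type_column():
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(1,)], ["id"])
    out = add_type_column(df, "yellow")
    assert out.columns == ["id", "type"]
    assert out.first()["type"] == "yellow"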
I think at present there is no alternative for us other than using Glue. For reference: Can I test AWS Glue code locally?
At a minimum you need pyspark.context, awsglue.context, and awsglue.dynamicframe. Here is an example:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
sc = SparkContext()
glueContext = GlueContext(sc)
NewDynamicFrame = DynamicFrame.fromDF(persons, glueContext, "nested")
"persons" is your DataFrame
Please check the following links:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-samples-medicaid.html
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF
You can create a DynamicFrame from a DataFrame using the fromDF function.
Basic Syntax
dyf = DynamicFrame.fromDF(dataframe, glue_ctx, name)
where,
dataframe – The Apache Spark SQL DataFrame to convert (required).
glue_ctx – The GlueContext Class object that specifies the context for this transform (required).
name – The name of the resulting DynamicFrame (required).
Reference: Dynamic frame from dataframe