Currently AWS Glue doesn't support 'overwrite' mode but they are working on this feature.
As a workaround you can convert DynamicFrame object to spark's DataFrame and write it using spark instead of Glue:
table.toDF() \
    .write \
    .mode("overwrite") \
    .format("parquet") \
    .partitionBy("var_1", "var_2") \
    .save(output_dir)
— Answer from Yuriy Bondaruk on Stack Overflow
As mentioned earlier, AWS Glue doesn't support mode="overwrite". However, converting a Glue DynamicFrame back to a PySpark DataFrame can cause a lot of issues with big data.
Instead, you just need to add a single call, purge_s3_path(), before writing the DynamicFrame to S3:
glueContext.purge_s3_path(s3_path, {"retentionPeriod": 0})
glueContext.write_dynamic_frame.from_options(
    frame=table,
    connection_type="s3",
    connection_options={"path": s3_path,
                        "partitionKeys": ["var1", "var2"]},
    format="parquet")
Please refer to the AWS documentation: GlueContext.write_dynamic_frame.from_options
Part 1: identifying the problem
The way to find what was causing the problem was to switch the output from .parquet to .csv and drop ResolveChoice and DropNullFields (both are automatically suggested by Glue for .parquet):
datasink2 = glueContext.write_dynamic_frame.from_options(
    frame = applymapping1,
    connection_type = "s3",
    connection_options = {"path": "s3://xxxx"},
    format = "csv",
    transformation_ctx = "datasink2")
job.commit()
This produced a more detailed error message:
An error occurred while calling o120.pyWriteDynamicFrame. Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 182, ip-172-31-78-99.ec2.internal, executor 15): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx1.csv.gz
The file xxxx1.csv.gz mentioned in the error message appeared to be too big for Glue (~100 MB gzipped and ~350 MB as uncompressed .csv).
Part 2: true source of the problem and fix
As mentioned in the first part, exporting to .csv made it possible to identify the bad file.
Further investigation by loading the .csv into R revealed that one of the columns contained a single string record, while all other values of this column were long or NULL.
After dropping this value in R and re-uploading the data to S3, the problem vanished.
Note #1: the column was declared string in Athena, so I consider this behaviour a bug.
Note #2: the nature of the problem was not the size of the data. I have successfully processed files up to 200 MB .csv.gz, which corresponds to roughly 600 MB of .csv.
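The same kind of type-inconsistent record can also be spotted without R. A minimal sketch in plain Python that scans one CSV column for values that are neither NULL-like nor parseable as integers (the function name, column index, and the assumption that the column should hold longs are all hypothetical, matching the situation described above):

```python
import csv

def find_non_integer_rows(path, column, delimiter=","):
    """Yield (line_number, value) for rows whose `column` is neither
    empty (NULL-like) nor parseable as an integer."""
    with open(path, newline="") as f:
        reader = csv.reader(f, delimiter=delimiter)
        for lineno, row in enumerate(reader, start=1):
            value = row[column].strip()
            if value == "":
                continue  # NULL-like value is fine for a long column
            try:
                int(value)
            except ValueError:
                yield lineno, value  # the offending string record
```

For example, `list(find_non_integer_rows("xxxx1.csv", 3))` would list every row where column 3 holds a stray string.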
Please use the updated table schema from the Data Catalog.
I have run into this same error. In my case, the crawler had created another table for the same file in the database, and I was referencing the old one. This can happen if the crawler crawls the same path repeatedly and creates tables with different schemas in the Data Catalog, so the Glue job can't find the right table name and schema, which produces this error.
Moreover, you can change the crawler's DeleteBehavior from "LOG" to "DELETE_FROM_DATABASE" so that deleted objects are also removed from the Data Catalog.
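That setting can also be changed through the API. A minimal sketch using boto3's `update_crawler` (the crawler name is hypothetical, and actually running the call requires AWS credentials):

```python
# Schema-change policy that removes tables/partitions for deleted
# objects from the Data Catalog instead of only logging them.
schema_change_policy = {
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_FROM_DATABASE",
}

def update_crawler_delete_behavior(crawler_name):
    """Apply the policy to an existing crawler (hypothetical name)."""
    import boto3  # imported here so the module loads without boto3 installed
    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name,
                        SchemaChangePolicy=schema_change_policy)
```

The same policy can be set on crawler creation via `create_crawler`.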
You have a few options:
- DynamicFrameWriter doesn't support overwriting data in S3 yet. Instead you can use Spark's native write() with partitionOverwriteMode set to "dynamic", so only the partitions present in the new data are replaced. However, for really large datasets it can be a bit inefficient, as a single worker will be used to overwrite existing data in S3. An example is below:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from pyspark.sql.functions import upper, col

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Overwrite only the partitions present in the new data
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
DataSource0 = glueContext.create_dynamic_frame.from_catalog(database = "mydatabase", table_name = "mydata", transformation_ctx = "DataSource0")
ds_df = DataSource0.toDF()
ds_df1 = ds_df.select("year", "month", upper(col('colA')), upper(col('colB')), upper(col('colC')), upper(col('colD')))
ds_df1 \
    .write.mode('overwrite') \
    .format('parquet') \
    .partitionBy('year', 'month') \
    .save('s3://<bucket>/mydata-transformed/')
job.commit()
- In a Lambda function, you could delete the data in S3 under a certain prefix. An example using Python and boto3:
import boto3

s3_res = boto3.resource('s3')
bucket = 'my-bucket-name'
# Add any logic to derive the required prefix based on year/month/day
prefix = 'mydata/year=2020/month=10/'
s3_res.Bucket(bucket).objects.filter(Prefix=prefix).delete()
- You can use Glue's purge_s3_path to delete data from a certain prefix. Link is here.
There is now a function in Glue to delete an S3 path or a Glue Catalog table:
AWS Glue doc
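A minimal sketch of both calls, assuming a job's glueContext is already set up as in the examples above (the path, database, and table names are hypothetical; retentionPeriod is in hours, so 0 purges everything):

```python
# Options shared by both purge calls: files newer than retentionPeriod
# hours are kept, so 0 removes everything.
purge_options = {"retentionPeriod": 0}

def purge_old_data(glueContext):
    """Purge an S3 prefix and a catalog table's files (hypothetical names)."""
    # Delete all objects under an S3 prefix.
    glueContext.purge_s3_path("s3://my-bucket/mydata/", options=purge_options)
    # Delete the files underlying a Data Catalog table.
    glueContext.purge_table("mydatabase", "mytable", options=purge_options)
```

Note that purge_table removes the underlying files, not the catalog entry itself.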
Hi everyone! I'm trying to follow this tutorial https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/ to understand AWS Glue a bit better, but I'm having a hard time with one of the steps.
In the job generation step, they have this:
Let’s now convert that to a DataFrame. Please replace the <DYNAMIC_FRAME_NAME> with the name generated in the script.
And this snippet
##----------------------------------
#convert to a Spark DataFrame...
customDF = <DYNAMIC_FRAME_NAME>.toDF()
But I can't seem to find where <DYNAMIC_FRAME_NAME> is defined. I thought it was customDF = resolvechoice2.toDF(), but it didn't run correctly.
Here's my entire code (with the edited names of the buckets, of course)
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
##----------------------------------
#convert to a Spark DataFrame...
customDF = resolvechoice2.toDF()  # <<---- HERE'S MY CODE
#add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))
# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
##----------------------------------
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = customDynamicFrame]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Where can I find the <DYNAMIC_FRAME_NAME>?
Thanks!