Currently, AWS Glue doesn't support an 'overwrite' write mode, but they are working on this feature.

As a workaround, you can convert the DynamicFrame object to a Spark DataFrame and write it out using Spark instead of Glue:

table.toDF()
  .write
  .mode("overwrite")
  .format("parquet")
  .partitionBy("var_1", "var_2")
  .save(output_dir)
Answer from Yuriy Bondaruk on Stack Overflow
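A note on the workaround above (my addition, not part of the original answer): by default, Spark's "overwrite" mode deletes everything under output_dir, including partitions absent from the new data. Since Spark 2.3 you can enable dynamic partition overwrite so that only the partitions being written are replaced:

```python
# Sketch, reusing `table` and `output_dir` from the answer above; requires
# Spark >= 2.3. With "dynamic" mode, .mode("overwrite") rewrites only the
# partitions present in the incoming data instead of truncating output_dir.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(table.toDF()
    .write
    .mode("overwrite")
    .format("parquet")
    .partitionBy("var_1", "var_2")
    .save(output_dir))
```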
DynamicFrameWriter class - AWS Glue
glueContext.write_dynamic_frame.from_options(
    frame = dyf_splitFields,
    connection_options = {'path': '/home/glue/GlueLocalOutput/'},
    connection_type = 's3',
    format = 'json')
Discussions

GlueContext.write_dynamic_frame.from_options
Is it possible to specify, when writing the dynamic frame out to S3, which storage class to put it in?
github.com · December 9, 2021
amazon web services - AWS Glue export to parquet issue using glueContext.write_dynamic_frame.from_options - Stack Overflow
I have the following problem. The code below is auto-generated by AWS Glue. Its mission is to read data from Athena (backed by .csv files on S3) and transform the data into Parquet. The code is working for...
stackoverflow.com · April 18, 2018
Duplicate Records in Parquet (Processed) Table after AWS Glue Job execution.
We have an AWS Glue pipeline where a crawler populates a raw database table from partitioned JSON files in S3 (structure: raw/org=21/221.json, raw/org=23/654...).
repost.aws · April 15, 2025
Overwrite or truncate existing data on jdbc connection Mysql with glue job
I want to overwrite or truncate a table in MySQL using an AWS Glue job in Python. I tried using preactions as with Redshift, but it doesn't work. Here is my code: datasink4 = glueContext.write_dynamic_...
repost.aws · May 25, 2023
DynamicFrame class - AWS Glue
If there is no matching record in the staging frame, all records (including duplicates) are retained from the source. If the staging frame has matching records, the records from the staging frame overwrite the records in the source in AWS Glue. stage_dynamic_frame – The staging DynamicFrame ...
GlueContext class - AWS Glue
__init__ · creating: getSource, create_dynamic_frame_from_rdd, create_dynamic_frame_from_catalog, create_dynamic_frame_from_options, create_sample_dynamic_frame_from_catalog, create_sample_dynamic_frame_from_options, add_ingestion_time_columns, create_data_frame_from_catalog, create_data_frame_from_options, forEachBatch · Amazon S3 datasets: purge_table, purge_s3_path, transition_table, transition_s3_path · extracting: extract_jdbc_conf · transactions: start_transaction, commit_transaction, cancel_transaction · writing: getSink, write_dynamic_frame_from_options, write_from_options, write_dynamic_frame_from_catalog, write_data_frame_from_catalog, write_dynamic_frame_from_jdbc_conf, write_from_jdbc_conf
GlueContext.write_dynamic_frame.from_options · Issue #108 · awslabs/aws-glue-libs
December 9, 2021 -

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": s3PathLatest, "StorageClass": "STANDARD_IA"},
    format="csv",
    format_options={"separator": ",", "writeHeader": True, "optimizePerformance": True},
    transformation_ctx=f"{table['Name']}_dataSink")

There doesn't seem to be any actual documentation on what the connection_options dict supports, and looking over the library code, it doesn't really care what you throw in there.
Author   theonlyway
Top answer · 1 of 2

Part 1: identifying the problem

The way to find what was causing the problem was to switch the output from .parquet to .csv and drop ResolveChoice or DropNullFields (which Glue suggests automatically for .parquet):

datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": "s3://xxxx"}, format = "csv", transformation_ctx = "datasink2")
job.commit()

This produced a more detailed error message:

An error occurred while calling o120.pyWriteDynamicFrame. Job aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 182, ip-172-31-78-99.ec2.internal, executor 15): com.amazonaws.services.glue.util.FatalException: Unable to parse file: xxxx1.csv.gz

The file xxxx1.csv.gz mentioned in the error message appears to be too big for Glue (~100 MB as .csv.gz, ~350 MB as uncompressed .csv).

Part 2: true source of the problem and fix

As mentioned in Part 1, exporting to .csv made it possible to identify the offending file.

Further investigation by loading the .csv into R revealed that one of the columns contained a single string record, while all other values in that column were long or NULL.

After dropping this value in R and re-uploading the data to S3, the problem vanished.

Note #1: the column was declared string in Athena, so I consider this behaviour a bug.

Note #2: the nature of the problem was not the size of the data. I have successfully processed files up to 200 MB of .csv.gz, which corresponds to roughly 600 MB of .csv.
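If fixing the data at the source isn't an option, a possible in-job alternative (my sketch, not from the answer above; the column name is hypothetical) is to force the ambiguous column to a single type with Glue's ResolveChoice transform before writing:

```python
from awsglue.transforms import ResolveChoice

# "my_col" is a hypothetical column mixing long and string values;
# "cast:long" coerces every value to long, leaving null where the cast
# fails, so a stray string record no longer aborts the Parquet write.
resolved = ResolveChoice.apply(
    frame = applymapping1,
    specs = [("my_col", "cast:long")],
    transformation_ctx = "resolved")
```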

2 of 2

Please use the updated table schema from the Data Catalog.

I have gone through this same error. In my case, the crawler had created another table for the same file in the database, and I was referencing the old one. This can happen when the crawler repeatedly crawls the same path and creates tables with different schemas in the Data Catalog; the Glue job then can't find the expected table name and schema, producing this error.

Moreover, you can change DeleteBehavior: "LOG" to DeleteBehavior: "DELETE_IN_DATABASE"
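For context (my addition, not part of the answer above): DeleteBehavior lives in the crawler's SchemaChangePolicy. A minimal sketch of the payload, with DELETE_IN_DATABASE removing Data Catalog entries for objects that have disappeared from S3 instead of merely logging them:

```python
# Sketch of a crawler SchemaChangePolicy; "LOG" is the default
# DeleteBehavior mentioned above, "DELETE_IN_DATABASE" drops stale
# catalog entries when the underlying S3 objects are gone.
schema_change_policy = {
    "UpdateBehavior": "UPDATE_IN_DATABASE",
    "DeleteBehavior": "DELETE_IN_DATABASE",
}

# Applied with boto3 (requires AWS credentials; crawler name is hypothetical):
#   boto3.client("glue").update_crawler(Name="my-crawler",
#                                       SchemaChangePolicy=schema_change_policy)
```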

Duplicate Records in Parquet (Processed) Table after AWS Glue Job execution. | AWS re:Post
April 15, 2025 - Use partitionKeys and ... overwrite mode is enabled }, format="parquet" ) But Glue doesn't support dynamic partition overwrite by default...
aws-glue-samples/examples/resolve_choice.py at master · aws-samples/aws-glue-samples
glueContext.write_dynamic_frame.from_options(frame = medicare_res_cast, connection_type = "s3", connection_options = {"path": medicare_cast}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_project, connection_type = "s3", connection_options = {"path": medicare_project}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_make_cols, connection_type = "s3", connection_options = {"path": medicare_cols}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_make_struct, connection_type = "s3", connection_options = {"path": medicare_struct}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_sql_dyf, connection_type = "s3", connection_options = {"path": medicare_sql}, format = "json")
Author   aws-samples
Using JDBC in an AWS Glue job
March 23, 2021 -

# truncate the table
# using something like the pg8000 external library
# NB you need to provide the connection_db function which
# gets your database connection details
conn = connection_db()
cursor = conn.cursor()
cursor.execute("truncate table myschema.mytable")
conn.commit()
cursor.close()
conn.close()
#
# Stage 4
#
# read the data we just stored in S3 back into a new dynamic frame
#
# see above Stage 2 comments as to why this is required
newdf = glueContext.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={"paths": ["s3://somewhere/you-have/read-write access/to"]})
aws-glue-samples/examples/join_and_relationalize.py at master · aws-samples/aws-glue-samples
# Write out the dynamic frame into parquet in the "legislator_history" directory
print("Writing to /legislator_history ...")
glueContext.write_dynamic_frame.from_options(frame = l_history, connection_type = "s3", connection_options = {"path": output_history_dir}, format = "parquet")
Author   aws-samples
AWS Glue DynamicFrame transformations with example code and output | by Swapnil Bhoite | Medium
April 28, 2022
r/aws on Reddit: AWS Glue Tutorial: Not sure how to get the name of the dynamic frame that is being used to write out the data
September 27, 2017 -

Hi everyone! I'm trying to follow this tutorial https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/ to understand AWS Glue a bit better, but I'm having a hard time with one of the steps

In the job generation, they have this step

Let’s now convert that to a DataFrame. Please replace the <DYNAMIC_FRAME_NAME> with the name generated in the script.

And this snippet

 ##----------------------------------
 #convert to a Spark DataFrame...
 customDF = <DYNAMIC_FRAME_NAME>.toDF()

But I can't seem to find where the <DYNAMIC_FRAME_NAME> can be found. I thought it was customDF = resolvechoice2.toDF(), but it didn't run correctly.

Here's my entire code (with the edited names of the buckets, of course)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "nycitytaxianalysis", table_name = "blog_yellow",    transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "nycitytaxianalysis", table_name = "blog_yellow", transformation_ctx = "datasource0")
## @type: ApplyMapping
## @args: [mapping = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1"]
## @return: applymapping1
## @inputs: [frame = datasource0]
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("vendorid", "long", "vendorid", "long"), ("tpep_pickup_datetime", "string", "pickup_datetime", "timestamp"), ("tpep_dropoff_datetime", "string", "dropoff_datetime", "timestamp"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("pickup_longitude", "double", "pickup_longitude", "double"), ("pickup_latitude", "double", "pickup_latitude", "double"), ("ratecodeid", "long", "ratecodeid", "long"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("dropoff_longitude", "double", "dropoff_longitude", "double"), ("dropoff_latitude", "double", "dropoff_latitude", "double"), ("payment_type", "long", "payment_type", "long"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double")], transformation_ctx = "applymapping1")
## @type: ResolveChoice
## @args: [choice = "make_struct", transformation_ctx = "resolvechoice2"]
## @return: resolvechoice2
## @inputs: [frame = applymapping1]
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")
## @type: DropNullFields
## @args: [transformation_ctx = "dropnullfields3"]
## @return: dropnullfields3
## @inputs: [frame = resolvechoice2]
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
##----------------------------------
#convert to a Spark DataFrame...
customDF = resolvechoice2.toDF()  # <---- HERE'S MY CODE

#add a new column for "type"
customDF = customDF.withColumn("type", lit('yellow'))

# Convert back to a DynamicFrame for further processing.
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")
##----------------------------------
## @type: DataSink
## @args: [connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4"]
## @return: datasink4
## @inputs: [frame = customDynamicFrame]
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://<<s3-bucket>>/glue-blog/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()

Where can I find the <DYNAMIC_FRAME_NAME> ?

Thanks!
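For what it's worth, one plausible reading of the generated script (my interpretation, not a reply from the thread): the frame to convert is the output of the last transform feeding the DataSink, which in the script above is dropnullfields3 rather than resolvechoice2:

```python
# Convert the final transform's output (dropnullfields3) so the
# DropNullFields step isn't bypassed before the write.
customDF = dropnullfields3.toDF()
```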

Updating the schema, and adding new partitions in the Data Catalog using AWS Glue ETL jobs - AWS Glue
The code uses enableUpdateCatalog set to true, and also updateBehavior set to UPDATE_IN_DATABASE, which indicates to overwrite the schema and add new partitions in the Data Catalog during the job run. ... additionalOptions = { "enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"} ...
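The options from the docs snippet above can be sketched as a standalone dict (my sketch; the surrounding write call and names are assumptions, not from the snippet):

```python
# Options from the docs snippet above: update the table schema and add
# new partitions in the Data Catalog during the job run.
additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "UPDATE_IN_DATABASE",
}

# In a Glue job this would be passed roughly as (database/table hypothetical):
#   glueContext.write_dynamic_frame_from_catalog(
#       frame=final_frame, database="my_db", table_name="my_table",
#       additional_options=additionalOptions)
```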
aws glue create_dynamic_frame from data in PostgreSQL with custom bookmark key | AWS re:Post
March 18, 2023 -

cb_ceres = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "url": f"jdbc:postgresql://{ENDPOINT}:5432/{DBNAME}",
        "dbtable": "xxxxx_raw_ceres",
        "user": username,
        "password": password,
    },
    additional_options={"jobBookmarkKeys": ["ceres_mono_index"], "jobBookmarkKeysSortOrder": "asc"},
    transformation_ctx="cb_ceres_bookmark",
)
Overwrite parquet files from dynamic frame in AWS Glue - Javaer101
glueContext.write_dynamic_frame.from_options(
    frame = table,
    connection_type = "s3",
    connection_options = {"path": output_dir, "partitionKeys": ["var1", "var2"]},
    format = "parquet")

Is there anything like "mode": "overwrite" that replaces my parquet files?