Help with AWS Glue and the write_dynamic_frame preactions and postactions options
I am working with a large number of files that hit S3 throughout the day from several sources. They are all in the same format but can have overlapping records; the good news is that when the records do overlap, they are duplicates.
The destination for my ETL is Redshift, and I am very comfortable with the stage / dedupe / merge technique.
To help control costs I want to fire my Glue jobs on a schedule rather than triggering on files arriving. So when the job runs I may have 10-100 files to process, all with the potential for some duplicate records. I typically use bookmarks, and this all works nicely when I do not have the potential for duplicates.
My goal is to use the preactions and postactions options as per https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/, but this only works if the pre- and post-actions are run PER FILE. So is my Glue job issuing a COPY command per file, or is it reading all of the available files into the DynamicFrame and performing a single COPY, with the pre and post commands run one time?
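For context, this is roughly the shape of the read/write I have in mind. The catalog database, table names, connection name, and bucket are placeholders, so treat it as a sketch rather than my exact job:

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Bookmarked read: only files not processed by a previous run are picked up.
    frame = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",          # placeholder
        table_name="my_source_table",      # placeholder
        transformation_ctx="source_ctx",
    )

    # One write to Redshift; preactions/postactions are plain SQL strings.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="my_redshift_connection",   # placeholder
        connection_options={
            "database": "dev",                         # placeholder
            "dbtable": "staging.events",               # placeholder
            "preactions": "TRUNCATE TABLE staging.events;",
            "postactions": """
                BEGIN;
                DELETE FROM public.events USING staging.events
                  WHERE public.events.id = staging.events.id;
                INSERT INTO public.events SELECT * FROM staging.events;
                COMMIT;""",
        },
        redshift_tmp_dir="s3://my-temp-bucket/redshift-temp/",  # placeholder
        transformation_ctx="sink_ctx",
    )

    job.commit()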
Job bookmarks are the key. Just edit the job and enable "Job bookmarks", and it won't process already-processed data. Note that the job has to be rerun once before it will detect that it does not have to reprocess the old data again.
For more info see: http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
The name "bookmark" is a bit far fetched in my opinion. I would have never looked at it if I did not coincidentally stumble upon it during my search.
This was the solution I got from AWS Glue Support:
As you may know, although you can create primary keys, Redshift doesn't enforce uniqueness. Therefore, if you are rerunning Glue jobs then duplicate rows can get inserted. Some of the ways to maintain uniqueness are:
Use a staging table to insert all rows and then perform an upsert/merge [1] into the main table; this has to be done outside of Glue.
Add another column to your Redshift table [2], like an insert timestamp, to allow duplicates but know which one came first or last, and then delete the duplicates afterwards if you need to.
Load the previously inserted data into a dataframe and then compare it against the data to be inserted to avoid inserting duplicates [3] (see the sketch after the references below).
[1] - http://docs.aws.amazon.com/redshift/latest/dg/c_best-practices-upsert.html and http://www.silota.com/blog/amazon-redshift-upsert-support-staging-table-replace-rows/
[2] - https://github.com/databricks/spark-redshift/issues/238
[3] - https://kb.databricks.com/data/join-two-dataframes-duplicated-columns.html
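For option [3], something along these lines works as a minimal sketch; the catalog database, table names, the "id" key column, and the temp dir are all placeholders:

    from awsglue.context import GlueContext
    from awsglue.dynamicframe import DynamicFrame
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # New batch picked up by this run (placeholder catalog database/table).
    new_df = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",
        table_name="my_source_table",
    ).toDF()

    # Rows already loaded into Redshift (placeholder table crawled into the catalog).
    existing_df = glue_context.create_dynamic_frame.from_catalog(
        database="my_catalog_db",
        table_name="public_events",
        redshift_tmp_dir="s3://my-temp-bucket/redshift-temp/",
    ).toDF()

    # Keep only rows whose key is not already present in the target table.
    deduped_df = new_df.join(existing_df.select("id"), on="id", how="left_anti")
    deduped_frame = DynamicFrame.fromDF(deduped_df, glue_context, "deduped_frame")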