aws glue create_dynamic_frame from data in PostgreSQL with custom bookmark key
How to create dynamic dataframe from AWS Glue catalog in local environment?
pyspark - create_dynamic_frame_from_catalog returning zero results - Stack Overflow
dataframe - Create dynamic frame from options (from rds - mysql) providing a custom query with where clause - Stack Overflow
There are several poorly documented features/gotchas in Glue which is sometimes frustrating.
I would suggest to investigate the following configurations of your Glue job:
- Does the S3 bucket name has aws-glue-* prefix?
- Put the files in S3 folder and make sure the crawler table definition is on folder rather than actual file.
I have also written a blog on LinkedIn about other Glue gotchas if that helps.
Do you have subfolders under the path where your Athena table points to? glueContext.create_dynamic_frame.from_catalog does not recursively read the data. Either put the data in the root of where the table is pointing to or add additional_options = {"recurse": True} to your from_catalog call.
credit: https://stackoverflow.com/a/56873939/5112418
Apologies, I would have made a comment but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it did require two changes:
- The "jdbc" format was unrecognized (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load. : java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html") -- I had to use the full name: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
- The query option was not working for me (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"), but fortunately, the "dbtable" option supports passing in either a table or a subquery -- that is using parentheses around a query.
In my solution below I have also added a bit of context around the needed objects and imports.
My solution ended up looking like:
from awsglue.context import GlueContext
from pyspark.context import SparkContext
glue_context = GlueContext(SparkContext.getOrCreate())
tmp_data_frame = glue_context.spark_session.read\
.format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")\
.option("url", jdbc_url)\
.option("user", username)\
.option("password", password)\
.option("dbtable", "(select * from test where id<100)")\
.load()
The way I was able to provide a custom query was by creating a Spark DataFrame and specifying it with options: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options
Then transform that DataFrame into a DynamicFrame using said class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
tmp_data_frame = spark.read.format("jbdc")
.option("url", jdbc_url)
.option("user", username)
.option("password", password)
.option("query", "select * from test where id<100")
.load()
dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glueContext)