Apologies, I would have made a comment but I do not have sufficient reputation. I was able to make the solution that Guillermo AMS provided work within AWS Glue, but it did require two changes:
- The "jdbc" format was unrecognized (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o79.load. : java.lang.ClassNotFoundException: Failed to find data source: jbdc. Please find packages at http://spark.apache.org/third-party-projects.html") -- I had to use the full name: "org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider"
- The "query" option was not working for me (the provided error was: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.load. : java.sql.SQLSyntaxErrorException: ORA-00911: invalid character"), but fortunately, the "dbtable" option supports passing in either a table name or a subquery -- that is, a query wrapped in parentheses.
In my solution below I have also added a bit of context around the needed objects and imports.
My solution ended up looking like:
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
tmp_data_frame = glue_context.spark_session.read \
    .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider") \
    .option("url", jdbc_url) \
    .option("user", username) \
    .option("password", password) \
    .option("dbtable", "(select * from test where id<100)") \
    .load()
Answer from Jon Legendre on Stack Overflow
The way I was able to provide a custom query was by creating a Spark DataFrame and specifying it with options: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#manually-specifying-options
Then transform that DataFrame into a DynamicFrame using the DynamicFrame class: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html
# Assumes an existing SparkSession `spark` and GlueContext `glueContext`.
from awsglue.dynamicframe import DynamicFrame

tmp_data_frame = spark.read.format("jdbc") \
    .option("url", jdbc_url) \
    .option("user", username) \
    .option("password", password) \
    .option("query", "select * from test where id<100") \
    .load()
dynamic_frame = DynamicFrame.fromDF(tmp_data_frame, glueContext, "dynamic_frame")
raw_data_input_path = "s3a://{}/logs/application_id={}/component_id={}/".format(s3BucketName, application_id, component_id)
df = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": [raw_data_input_path],
                        "recurse": True},
    format="json",
    transformation_ctx=dbInstance)

Structure under component_id: the folders below contain JSON files.
But now I've added a watermark.txt file. How do I exclude this particular file from the paths to recurse on inside component_id?
I can't put this file in any other folder; is the only way to accomplish this to put all the ts=.. folders inside one folder called data?
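One possible direction, sketched here as an assumption rather than a tested answer: AWS Glue's S3 connection options document an "exclusions" key that takes a JSON list of Unix-style glob patterns, which should let the file be skipped without moving it. The Glue call is shown commented out (it needs a live Glue environment); the glob pattern itself can be sanity-checked locally with Python's fnmatch, which uses similar (though not identical) matching rules:

```python
# Hypothetical sketch -- assumes Glue's "exclusions" S3 connection option:
# df = glueContext.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options={"paths": [raw_data_input_path],
#                         "recurse": True,
#                         "exclusions": "[\"**/watermark.txt\"]"},
#     format="json",
#     transformation_ctx=dbInstance)

# Local sanity check of the pattern against paths shaped like the
# question's layout (ts=.. folders plus the stray watermark.txt):
from fnmatch import fnmatch

paths = [
    "logs/application_id=1/component_id=2/ts=2020/part-0000.json",
    "logs/application_id=1/component_id=2/watermark.txt",
]
kept = [p for p in paths if not fnmatch(p, "**/watermark.txt")]
print(kept)  # only the JSON file survives
```

Note that fnmatch treats "**" as two ordinary wildcards rather than Glue's cross-directory glob, but the exclusion behaviour for this pattern is the same: only paths ending in /watermark.txt are dropped.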