Found a solution for this problem. It turns out the connection options dictionary accepts more parameters; the one I needed was "recurse", which makes Glue read files in all subfolders. You can also exclude objects matching certain patterns with "exclusions".

Source https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3

dyf = glueContext.create_dynamic_frame.from_options(
    "s3",
    {
        "paths": ["s3://bucket/2017/"],
        "recurse": True
    },
    "json",
    transformation_ctx="dyf")

Answer from Joshua on Stack Overflow
DynamicFrame class - AWS Glue (docs.aws.amazon.com)
To access the dataset that is used in this example, see "Code example: Data preparation using ResolveChoice, Lambda, and ApplyMapping" and follow the instructions in "Step 1: Crawl the data in the Amazon S3 bucket".

# Example: Use filter to create a new DynamicFrame
# with a filtered selection of records
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create GlueContext
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Create DynamicFrame from Glue Data Catalog
medicare = glueContext.create_dynamic_frame.from_options(
    "s3",
    {
        "paths": [
            "s3://awsglu
GlueContext class - AWS Glue (docs.aws.amazon.com)
Method index: __init__; creating: getSource, create_dynamic_frame_from_rdd, create_dynamic_frame_from_catalog, create_dynamic_frame_from_options, create_sample_dynamic_frame_from_catalog, create_sample_dynamic_frame_from_options, add_ingestion_time_columns, create_data_frame_from_catalog, create_data_frame_from_options, forEachBatch; Amazon S3 datasets: purge_table, purge_s3_path, transition_table, transition_s3_path; extracting: extract_jdbc_conf; transactions: start_transaction, commit_transaction, cancel_transaction; writing: getSink, write_dynamic_frame_from_options, write_from_options, write_dynamic_frame_from_catalog, write_data_frame_from_catalog, write_dynamic_frame_from_jdbc_conf, write_from_jdbc_conf
Discussions

python - glue etl jobs - get s3 subfolders using create_dynamic_frame.from_options - Stack Overflow (stackoverflow.com)
I am creating an AWS Glue ETL job, but I'm running into some roadblocks with file retrieval. It seems that the following code only gets the files at the root folder 2017 and not any further. Is th...
json - Create dynamic frame from S3 bucket AWS Glue - Stack Overflow (stackoverflow.com)
Summary: I've got an S3 bucket which contains a list of JSON files. The bucket contains child folders which are created by date. All the files share a similar structure, and files get added on a daily basis.
Support for format options to be pushed down when using `parquet` as the format (github.com, July 25, 2021)
The create_dynamic_frame_from_options method signature accepts a format_options object. aws-glue-libs/awsglue/context.py, lines 143 to 144 at 28805fe: def create_dynamic_frame_from_options(self, c...
amazon web services - How to create dynamic data frame from S3 files in Glue Job in Scala? - Stack Overflow (stackoverflow.com, October 12, 2019)
I'm having problems converting a Python Glue job to a Scala Glue job, namely the create_dynamic_data_frame_options method. In Python the syntax is: dyf = glueContext.create_dynamic_frame_from_options...
DynamicFrameWriter class - AWS Glue (docs.aws.amazon.com)
This example writes the output locally using a connection_type of "s3" with a POSIX path argument in connection_options, which allows writing to local storage.

glueContext.write_dynamic_frame.from_options(
    frame = dyf_splitFields,
    connection_options = {'path': '/home/glue/GlueLocalOutput/'},
    connection_type = 's3',
    format = 'json')
aws-glue-samples/examples/resolve_choice.py at master · aws-samples/aws-glue-samples (github.com)

glueContext.write_dynamic_frame.from_options(frame = medicare_res_cast, connection_type = "s3", connection_options = {"path": medicare_cast}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_project, connection_type = "s3", connection_options = {"path": medicare_project}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_make_cols, connection_type = "s3", connection_options = {"path": medicare_cols}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_res_make_struct, connection_type = "s3", connection_options = {"path": medicare_struct}, format = "json")
glueContext.write_dynamic_frame.from_options(frame = medicare_sql_dyf, connection_type = "s3", connection_options = {"path": medicare_sql}, format = "json")
json - Create dynamic frame from S3 bucket AWS Glue - Stack Overflow (stackoverflow.com/questions/74734233)
Question: I am trying to create a dynamic frame from options where the source is S3 and the type is JSON. I'm using the following code; however, it is not returning any value. Where am I going wrong?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from functools import reduce
from awsglue.dynamicframe import DynamicFrame

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

df = glueContext.create_dynamic_frame.from_options(
    connection_type = 's3',
    connection_options={'paths': ['Location for S3 folder']},
    format='json',
    # formatOptions=$..*
)
print('Total Count:')
df.count()
AWS Glue create dynamic frame - SQL & Hadoop (sqlandhadoop.com)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())

# creating dynamic frame from S3 data
dyn_frame_s3 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options = {"paths": ["s3://<bucket name>/data/sales/"], "inferSchema": "true"},
    format = "csv",
    format_options={"separator": "\t"},
    transformation_ctx="")
print(dyn_frame_s3.count())

# creating dynamic frame from Glue catalog table
dyn_frame_catalog = glueContext.create_dynamic_frame_from_catalog(
    database = "db_readfile",
    table_name = "sales",
    transformation_ctx = "")
print(dyn_frame_catalog.count())
AWS_Glue_3: Glue(DynamicFrame). GlueContext is the entry point for… | by Kundan Singh | Medium (medium.com, February 12, 2025)

# create DynamicFrame from S3 parquet files
datasource0 = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options = {"paths": [S3_location]},
    format="parquet",
    transformation_ctx="datasource0")

# create DynamicFrame from Glue catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "demo",
    table_name = "testtable",
    transformation_ctx = "datasource0")

# convert to Spark DataFrame and back to a Glue DynamicFrame
df1 = datasource0.toDF()
df2 = DynamicFrame.fromDF(df1, glueContext, "df2")
df = dynamic_frame.toDF()
df.show()
print("Dataframe converted")
Support for format options to be pushed down when using `parquet` as the format · Issue #90 · awslabs/aws-glue-libs (github.com, July 25, 2021)
Author: mnoumanshahzad

def create_dynamic_frame_from_options(self, connection_type, connection_options={}, format=None, format_options={}, transformation_ctx = "", push_down_predicate= "", **kwargs):

At first sight, this seems to expose options that can be configured for a given format. With parquet as the format, something like a basePath is very useful when reading partitioned data from S3.

from awsglue.context import GlueContext
from awsglue import DynamicFrame

# initialize SparkContext and SparkSession
spark_context = SparkContext.getOrCreate()
glue_context = GlueContext(spark_context)
spark = glue_context.spark_sessio
AWS Dojo - Workshop - Building AWS Glue Job using PySpark - Part 2 (of 2) (aws-dojo.com)

glueContext.write_dynamic_frame.from_options(
    productlineDF,
    connection_type = "s3",
    connection_options = {"path": "s3://dojo-data-lake/data/productline"},
    format = "json")
aws-glue-samples/examples/join_and_relationalize.py at master · aws-samples/aws-glue-samples (github.com)

glueContext.write_dynamic_frame.from_options(
    frame = l_history,
    connection_type = "s3",
    connection_options = {"path": output_history_dir},
    format = "parquet")

# Write out a single file to directory "legislator_single"
s_history = l_history.toDF().repartition(1)
print("Writing to /legislator_single ...")
s_history.write.parquet(output_lg_single_dir)
AWS Glue DynamicFrame transformations with example code and output | by Swapnil Bhoite | Medium (swapnil-bhoite.medium.com, April 28, 2022)
This is also a good opportunity to showcase how to load a dataset directly from S3:

dyF = glueContext.create_dynamic_frame.from_options(
    's3',
    {'paths': ['s3://awsglue-datasets/examples/medicare/Medicare_Hospital_Provider.csv']},
    'csv',
    {'withHeader': True})
dyF.printSchema()

root
 |-- DRG Definition: string
 |-- Provider Id: string
 |-- Provider Name: string
 |-- Provider Street Address: string
 |-- Provider City: string
 |-- Provider State: string
 |-- Provider Zip Code: string
 |-- Hospital Referral Region Description: string
 |-- Total Discharges: string
 |-- Average Covered Charges: string
 |-- Average Total Payments: string
 |-- Average Medicare Payments: string
Using the JSON format in AWS Glue - AWS Glue (docs.aws.amazon.com)

// Example: Read JSON from S3
// For show, we handle a nested JSON file that we can limit with the JsonPath parameter
// For show, we also handle a JSON where a single entry spans multiple lines
// Consider whether optimizePerformance is right for your workflow.
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"jsonPath": "$.id", "multiline": true, "optimizePerformance": false}"""),
      connectionType="s3",
      format="json",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
Reading input files in larger groups - AWS Glue (docs.aws.amazon.com)
If you are reading from Amazon S3 directly using the create_dynamic_frame.from_options method, add these connection options. For example, the following attempts to group files into 1 MB groups.

df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {'paths': ["s3://s3path/"],
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': '1048576'},
    format="json")
Managing partitions for ETL output in AWS Glue - AWS Glue (docs.aws.amazon.com)
For example, the following Python code writes out a dataset to Amazon S3 in the Parquet format, into directories partitioned by the type field. From there, you can process these partitions using other systems, such as Amazon Athena.

glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
apache spark - AWS Glue Job fails at create_dynamic_frame_from_options when reading from s3 bucket with lot of files - Stack Overflow (stackoverflow.com, April 9, 2020)
The data inside my S3 bucket looks like this: s3://bucketName/prefix/userId/XYZ.gz. There are around 20 million users, and within each user's subfolder there will be 1-10 files. My Glue job starts like this:

datasource0 = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': ["s3://bucketname/prefix/"],
     'useS3ListImplementation': True,
     'recurse': True,
     'groupFiles': 'inPartition',
     'groupSize': 100 * 1024 * 1024},
    format="json",
    transformation_ctx = "datasource0")

There are a bunch of optimizations like groupFiles, groupSize, and useS3ListImplementation I have attempted, as shown above.
Using the Parquet format in AWS Glue - AWS Glue (docs.aws.amazon.com)

// Example: Read Parquet from S3
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="parquet",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}