redshift spectrum cloudformation

Using Redshift Spectrum with Cloud Formation

stackoverflow.com › questions › 46007829 › using-redshift-spectrum-with-cloud-formation

Your template looks OK, but there is one more thing to consider: the IAM role (the IamRoles array) that is needed. The CloudFormation documentation lists this as an additional property.

myCluster: 
  Type: "AWS::Redshift::Cluster"
  Properties:
    DBName: "mydb"
    MasterUsername: "master"
    MasterUserPassword: 
      Ref: "MasterUserPassword"
    NodeType: "dw.hs1.xlarge"
    ClusterType: "single-node"
    IamRoles:
      - "arn:aws:iam::123456789012:role/S3Access"
    Tags:
      - Key: foo
        Value: bar

The IAM role is needed to talk to the Glue / Athena catalog and authenticate your requests against your data in S3.
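
As a rough sketch of how the role and the cluster could be wired together in one template, the Python below assembles the template as a dict and prints it as JSON (CloudFormation accepts JSON as well as YAML). The logical ID SpectrumRole and the choice of the two AWS managed policies are assumptions made for illustration, not part of the answer above; a narrower inline policy would usually be preferable.

```python
import json


def build_template() -> dict:
    """Return a template dict: one IAM role plus a cluster that references it."""
    role = {
        "Type": "AWS::IAM::Role",
        "Properties": {
            # Redshift itself must be allowed to assume the role.
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    "Principal": {"Service": ["redshift.amazonaws.com"]},
                    "Action": ["sts:AssumeRole"],
                }],
            },
            # Assumption: broad managed policies, chosen only to keep the sketch short.
            "ManagedPolicyArns": [
                "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
                "arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess",
            ],
        },
    }
    cluster = {
        "Type": "AWS::Redshift::Cluster",
        "Properties": {
            "DBName": "mydb",
            "MasterUsername": "master",
            "MasterUserPassword": {"Ref": "MasterUserPassword"},
            # Copied from the answer; substitute a node type available to you.
            "NodeType": "dw.hs1.xlarge",
            "ClusterType": "single-node",
            # Reference the role defined above instead of a hard-coded ARN.
            "IamRoles": [{"Fn::GetAtt": ["SpectrumRole", "Arn"]}],
        },
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {"MasterUserPassword": {"Type": "String", "NoEcho": True}},
        "Resources": {"SpectrumRole": role, "myCluster": cluster},
    }


if __name__ == "__main__":
    print(json.dumps(build_template(), indent=2))
```

Using Fn::GetAtt keeps the role and the cluster in the same stack, so the ARN does not have to be copied by hand.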

Answer from grundprinzip on Stack Overflow

AWS

docs.aws.amazon.com › amazon redshift › database developer guide › amazon redshift spectrum › getting started with amazon redshift spectrum

Getting started with Amazon Redshift Spectrum - Amazon Redshift

Prerequisites · CloudFormation · Getting started with Redshift Spectrum step by step · Step 1. Create an IAM role · Step 2: Associate the IAM role with your cluster · Step 3: Create an external schema and an external table · Step 4: Query your data in Amazon S3 · Launch your CloudFormation stack and then query your data

GitHub

github.com › aws-samples › aws-redshift-spectrum-poc

GitHub - aws-samples/aws-redshift-spectrum-poc: Cloudformation and SQL scripts used to replicate a POC environment from the "Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum" post


Starred by 31 users

Forked by 21 users

Languages TSQL 100.0%

Discussions

amazon redshift - Template to create IAM role for spectrum S3 access - Stack Overflow

You can associate an IAM Role with a Redshift cluster via the IamRoles field on the cluster, but this needs to be specified at the time that the cluster is launched. If you are adding a role after the cluster is launched, you can do it via the AWS CLI, but not a CloudFormation template.
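
The CLI route mentioned here corresponds to the Redshift modify-cluster-iam-roles operation. A rough boto3 equivalent might look like the sketch below; the function name attach_role and the cluster identifier used in the usage note are hypothetical, and the client is passed in so the function can be exercised without AWS access.

```python
def attach_role(client, cluster_id: str, role_arn: str) -> list:
    """Attach role_arn to an existing cluster and return the role ARNs now listed on it.

    client is expected to be a boto3 Redshift client, i.e. boto3.client("redshift").
    """
    response = client.modify_cluster_iam_roles(
        ClusterIdentifier=cluster_id,
        AddIamRoles=[role_arn],
    )
    # Each entry carries the ARN plus an ApplyStatus such as "adding" or "in-sync".
    return [role["IamRoleArn"] for role in response["Cluster"]["IamRoles"]]
```

With boto3 installed and credentials configured, a call could look like attach_role(boto3.client("redshift"), "my-cluster", "arn:aws:iam::123456789012:role/S3Access"). The association is applied asynchronously, so the role may not be usable immediately after the call returns.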


Videos

Amazon Redshift Spectrum - YouTube (05:42, August 25, 2020)

What is Redshift Spectrum? - YouTube (11:25, July 19, 2024)

Redshift Spectrum Explained: Querying S3 without loading ... (youtube.com)

Amazon Redshift Spectrum - User Defined Data Handling Demo | Amazon ... (04:52, January 3, 2022)

Amazon Redshift Spectrum Explained (youtube.com)

Stack Overflow

stackoverflow.com › questions › 46007829 › using-redshift-spectrum-with-cloud-formation

aws cloudformation - Using Redshift Spectrum with Cloud Formation - Stack Overflow


2 of 2

1

Amazon Redshift Spectrum is a feature of Amazon Redshift.

Simply launch a normal Amazon Redshift cluster and the features of Amazon Redshift Spectrum are available to you.

From Getting Started with Amazon Redshift Spectrum:

To use Redshift Spectrum, you need an Amazon Redshift cluster and a SQL client that's connected to your cluster so that you can execute SQL commands.
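
The first SQL command usually needed is the one that creates an external schema. A minimal sketch of building that statement is shown below; the helper name external_schema_sql and the example names in the usage note are assumptions, and the statement targets the AWS Glue Data Catalog.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def external_schema_sql(schema: str, catalog_db: str, role_arn: str) -> str:
    """Build a CREATE EXTERNAL SCHEMA statement for a Data Catalog database."""
    # Reject anything that is not a plain identifier to avoid SQL injection.
    for name in (schema, catalog_db):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    if "'" in role_arn:
        raise ValueError("role ARN must not contain quotes")
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_db}' "
        f"IAM_ROLE '{role_arn}' "
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
```

For example, external_schema_sql("spectrum", "spectrumdb", "arn:aws:iam::123456789012:role/S3Access") yields a statement that can be pasted into any SQL client connected to the cluster. The role named in IAM_ROLE has to be one that is already associated with the cluster.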

AWS

aws.amazon.com › blogs › big-data › automate-amazon-redshift-cluster-creation-using-aws-cloudformation

Automate Amazon Redshift cluster creation using AWS CloudFormation | Amazon Web Services

February 9, 2022 - Redshift Spectrum – Allows you to add your existing S3 bucket for Redshift Spectrum access. It creates an IAM role with a policy to grant the minimum permissions required to use Redshift Spectrum to access Amazon S3, CloudWatch Logs, and AWS Glue.

GitHub

github.com › awsdocs › amazon-redshift-developer-guide › blob › master › doc_source › c-getting-started-using-spectrum.md

amazon-redshift-developer-guide/doc_source/c-getting-started-using-spectrum.md at master · awsdocs/amazon-redshift-developer-guide

As an alternative to the following steps, you can access the Redshift Spectrum DataLake AWS CloudFormation template to create a stack with an Amazon S3 bucket that you can query.

Author awsdocs

Stack Overflow

stackoverflow.com › questions › 58816446 › template-to-create-iam-role-for-spectrum-s3-access

amazon redshift - Template to create IAM role for spectrum S3 access - Stack Overflow

Top answer

1 of 1

2

Yes. An AWS CloudFormation template can be used to define an IAM Role.

Here is an example from AWS::IAM::Role - AWS CloudFormation:

AWSTemplateFormatVersion: 2010-09-09
Resources:
  RootRole:
    Type: 'AWS::IAM::Role'
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - ec2.amazonaws.com
            Action:
              - 'sts:AssumeRole'
      Path: /
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: 2012-10-17
            Statement:
              - Effect: Allow
                Action: '*'
                Resource: '*'
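
The documentation example above trusts EC2 and allows every action, which is not what a Spectrum role needs. A more targeted variant might look like the sketch below, written as a Python dict that serializes to a JSON template fragment. The function name spectrum_role, the policy name, and the exact action list are assumptions; the Glue actions shown are read-only, so creating external databases or tables through Redshift would need additional Glue permissions.

```python
import json


def spectrum_role(bucket: str) -> dict:
    """Return an AWS::IAM::Role resource that Redshift can assume to read one bucket."""
    return {
        "Type": "AWS::IAM::Role",
        "Properties": {
            "AssumeRolePolicyDocument": {
                "Version": "2012-10-17",
                "Statement": [{
                    "Effect": "Allow",
                    # Trust Redshift rather than EC2.
                    "Principal": {"Service": ["redshift.amazonaws.com"]},
                    "Action": ["sts:AssumeRole"],
                }],
            },
            "Policies": [{
                "PolicyName": "spectrum-read",  # hypothetical name
                "PolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {   # Read objects in, and list, the one data bucket.
                            "Effect": "Allow",
                            "Action": ["s3:GetObject", "s3:ListBucket",
                                       "s3:GetBucketLocation"],
                            "Resource": [f"arn:aws:s3:::{bucket}",
                                         f"arn:aws:s3:::{bucket}/*"],
                        },
                        {   # Read table metadata from the Glue Data Catalog.
                            "Effect": "Allow",
                            "Action": ["glue:GetDatabase", "glue:GetDatabases",
                                       "glue:GetTable", "glue:GetTables",
                                       "glue:GetPartition", "glue:GetPartitions"],
                            "Resource": "*",
                        },
                    ],
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps({"Resources": {"SpectrumRole": spectrum_role("my-data-bucket")}}, indent=2))
```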

reddit.com › r/aws › can i use cloud formation somehow to run a redshift command (to create an external spectrum table)?

r/aws on Reddit: Can I use cloud formation somehow to run a redshift command (to create an external spectrum table)?

November 12, 2020

Would it be a best practice to use cloud formation to create the redshift cluster, but then to run commands manually in redshift to build out external tables, or should that external table creation command be run via cloud formation somehow? Thanks!!

Top answer

1 of 1

2

You can write custom resources. I did it for simple queries to create IAM users for Aurora users and add “service” users to an Aurora table.
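
The answer describes its custom resource only for Aurora. As a sketch of what the same idea could look like for Redshift, the Lambda handler below runs one SQL statement through the Redshift Data API when the stack creates or updates the resource. The property names read from ResourceProperties (ClusterIdentifier, Database, DbUser, Sql) are hypothetical choices for this example, and the sketch does not wait for the statement to finish, so a statement that fails later would not fail the stack.

```python
import json
import urllib.request


def send_response(event, context, status, reason=""):
    """PUT the result to the pre-signed URL that CloudFormation supplies in the event."""
    body = json.dumps({
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or "See CloudWatch Logs",
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
    }).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"], data=body, method="PUT",
        headers={"Content-Type": ""},
    )
    urllib.request.urlopen(request)


def handler(event, context, client=None, respond=send_response):
    """Lambda entry point; client and respond can be replaced for local testing."""
    try:
        if event["RequestType"] in ("Create", "Update"):
            if client is None:
                import boto3  # provided by the Lambda Python runtime
                client = boto3.client("redshift-data")
            props = event["ResourceProperties"]
            # Submits the statement; the Data API executes it asynchronously.
            client.execute_statement(
                ClusterIdentifier=props["ClusterIdentifier"],
                Database=props["Database"],
                DbUser=props["DbUser"],
                Sql=props["Sql"],
            )
        # Delete is acknowledged without dropping anything (a deliberate simplification).
        respond(event, context, "SUCCESS")
    except Exception as exc:
        respond(event, context, "FAILED", str(exc))
```

Always sending a response, including on failure, matters here: without it the stack operation waits until CloudFormation times out.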


Thorn Technologies

thorntech.com › home › aws › how to create a redshift stack with aws cloudformation

How to create a Redshift stack with AWS CloudFormation - Thorn Technologies

May 24, 2022 - How to incorporate S3, EC2, and IAM in a CloudFormation template · Our third and final template creates an Amazon Redshift stack. Redshift is a data warehousing solution that allows you to run complex data queries on huge data sets within seconds ...


AWS

docs.aws.amazon.com › amazon redshift › database developer guide › amazon redshift spectrum › amazon redshift spectrum overview

Amazon Redshift Spectrum overview - Amazon Redshift

Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. Amazon Redshift pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Thus, Redshift Spectrum queries use much less of your cluster's processing capacity than other queries.

AWS

aws.amazon.com › blogs › big-data › tag › amazon-redshift-spectrum

Amazon Redshift Spectrum | AWS Big Data Blog

Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull data, filter, project, aggregate, group, and sort. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage.

Stack Overflow

stackoverflow.com › questions › 73994206 › when-to-use-redshift-spectrum-for-your-redshift-data-warehouse

amazon web services - When to use Redshift Spectrum for your Redshift data warehouse - Stack Overflow

Top answer

1 of 2

9

This is a broad topic but I'll give a few thoughts.

First off, Spectrum is an (often large) set of compute elements embedded in S3 that can perform some parts of the query plan. These parts center on applying WHERE conditions and performing aggregation (GROUP BY). Other parts of the query plan, such as JOINs and advanced functions such as window functions, cannot be performed in the S3 layer.

The next thing to understand is that while these embedded compute elements are close to S3 in terms of access speed, the S3 service is far away from the Redshift cluster (network distance). If the large amount of data stored in S3 can be pared down to a small set that is shipped to Redshift then Spectrum can be a huge performance improvement. However, if the large amount of data stored in S3 needs to be moved to the Redshift cluster completely to perform the query then there can be a large hit to performance.

Spectrum can be a huge benefit; allowing for a very large amount of data to be filtered down quickly by a fleet of small compute elements. This can result in a big win in performance and in the amount of data that can be addressed.

With these points in mind, the data to keep in Spectrum is data from which your query plan needs only a subset transferred from S3 to Redshift. In general this applies to your fact tables and not to your dim tables. However, if your queries aren't going to apply a WHERE clause to the fact table or aggregate the data down, then you won't see the advantages. Also, for this to work the WHERE clause needs to apply to a column in the fact table: JOINs cannot be done in S3, so filtering on dim columns won't help. Similarly, any GROUP BY needs to be applied only to fact table columns, or it won't reduce the data coming to Redshift from S3.

So fact tables.
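
The point about filtering on fact columns versus dim columns can be illustrated with a toy model. The Python below is not Redshift code; FACT stands in for an external table in S3, DIM for a table local to the cluster, and all names and values are made up for illustration.

```python
from collections import Counter

# (sale_date, product_id, amount): stands in for an external fact table in S3.
FACT = [
    ("2024-01-01", 1, 10.0), ("2024-01-01", 2, 5.0),
    ("2024-01-02", 1, 7.5), ("2024-01-02", 3, 2.5),
]
# product_id -> category: stands in for a dim table stored in the cluster.
DIM = {1: "books", 2: "games", 3: "books"}


def shipped_with_fact_filter(day):
    """Filter and aggregate on fact columns; only the small result leaves S3."""
    totals = Counter()
    for sale_date, product_id, amount in FACT:
        if sale_date == day:
            totals[product_id] += amount
    return list(totals.items())


def shipped_with_dim_filter(category):
    """Filter on a dim column; every fact row is shipped before the join can run."""
    shipped = list(FACT)  # nothing can be discarded in the S3 layer
    result = [row for row in shipped if DIM[row[1]] == category]
    return shipped, result
```

In this model the fact-column filter ships two aggregated rows to the cluster, while the dim-column filter ships all four fact rows even though only three survive the join.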

Data generally gets into Redshift through S3 and this can be done with the COPY command. You can also get data into Redshift from S3 using Spectrum. This can be a useful tool if other tools are also using S3 for this shared data. S3 can seem like a common data store for separate data systems. This can be useful for some data solutions.

You also bring up very large, infrequently used data, such as older historical data that is usually not needed but is sometimes needed. Spectrum can help here: older data can be offloaded from the Redshift cluster, and the access time for this data isn't important because it is used so infrequently. There is a potential issue: the Redshift cluster can only work on a certain size of data given its disk space and memory, so you can clog up your cluster if the amount of historical data is too large. This may mean that looking at the full set of historical data in one query is not possible. Again, if the data is aggregated or filtered in S3, this isn't a problem.

Bottom line - Spectrum is a great tool but isn't the right tool for every problem.

2 of 2

2

In general, put everything into 'normal' Amazon Redshift.

Redshift Spectrum is handy for accessing data stored in Amazon S3 without having to load it into the Redshift cluster, but it will not be as fast as accessing data stored in 'normal' Redshift.

Therefore, it is useful for rarely-accessed data or for one-off queries on a dataset without having to import the data into Redshift.

Do not use Spectrum as part of your normal ETL flow. One exception to this might be if you are receiving 'landing' data via Amazon S3 (eg Seed Files) -- rather than importing the tables into Redshift, they could be referenced via Spectrum. However, normal loading tools such as Fivetran can load the data directly into Redshift, which is preferable to using Spectrum.

AWS

aws.amazon.com › blogs › big-data › amazon-redshift-spectrum-extends-data-warehousing-out-to-exabytes-no-loading-required

Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required | Amazon Web Services

February 15, 2021 - We built Redshift Spectrum to end this “tyranny of OR.” With Redshift Spectrum, Amazon Redshift customers can easily query their data in Amazon S3. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of nodes to pull data, filter, project, aggregate, group, and sort.

TechTarget

techtarget.com › searchaws › definition › Amazon-Redshift-Spectrum

What is Amazon Redshift Spectrum? | Definition from TechTarget

Amazon Athena is similar to Redshift Spectrum, though the two services typically address different needs. An analyst that already works with Redshift will benefit most from Redshift Spectrum because it can quickly access data in the cluster and extend out to infrequently accessed, external tables in S3.

reddit.com › r/aws › should i use spectrum or redshift native?

r/aws on Reddit: Should I use spectrum or redshift native?

November 11, 2020

Planning on using the redshift API to allow access into a table with roughly 5B rows. Currently that data is in S3 so to use the API, I am going to have to move that data to redshift. Should I use spectrum, or should I load the data natively? Which one do you think is cheaper long term if this API is hit multiple times a day? Thanks!

Top answer

1 of 2

2

It all depends on your query patterns. Do you touch ALL of the data in most queries? Only a few? Are there different types of queries for different teams etc? If you’re building a data warehouse that multiple analysts will use multiple times a day, then moving that data to redshift will increase performance. Now, is ALL of that data used by the analysts? It’s expensive to keep all data “hot” in redshift, especially if it’s not touched by the queries at hand. Keeping it in S3 (spectrum) virtually eliminates the cost of storing all that data in your redshift cluster but has a slight impact on performance in terms of latency. In essence, if you’re using ALL the data continuously, multiple times a day by multiple analysts, then redshift will serve you well but it will also cost you. Try to filter out the data you actually need, using AWS glue, to reduce cost. If you query the data rarely, then Athena is a great option as you can query the data directly in S3. Either way, do some ETL with Glue to create the tables you actually need, and query those tables with either Athena or redshift.

2 of 2

1

no opinion on redshift stuff, but I will point out that if your query volume is not huge, it may be simpler and more cost effective to use athena - particularly if your data is well formatted in parquet/orc files as this allows Presto (which powers athena) to do partial reads on the underlying data files using the indexes present in those file formats. maybe it's something you've considered and decided against, but if not, it's worth looking into. basic workflow is crawl the data files with a Glue crawler (to get the files' schemas into the glue data catalog) then athena queries the objects on S3 using the crawled metadata. can run sql queries in the athena dashboard itself, connect to it via JDBC, use tools like tableau, etc.
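
Since both answers turn on cost, a back-of-the-envelope estimate of scan charges may help. The sketch below assumes a pay-per-data-scanned model; the default price of 5.0 per TB is a placeholder assumption, so check current AWS pricing for your region before relying on any number it produces.

```python
def monthly_scan_cost(bytes_scanned_per_query, queries_per_day,
                      price_per_tb=5.0, days=30):
    """Estimate monthly scan charges for queries billed by data scanned.

    price_per_tb is an assumed placeholder, not a quoted AWS price.
    """
    terabytes = bytes_scanned_per_query / 1024 ** 4
    return terabytes * price_per_tb * queries_per_day * days
```

Under these assumptions, one query per day that scans a full 1 TB costs 150.0 per month, while a columnar layout that lets the same query read only a tenth of the bytes brings that to roughly 15.0. This is the effect the answer above attributes to Parquet and ORC files.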

AWS

aws.amazon.com › blogs › big-data › geospatial-data-lakes-with-amazon-redshift

Geospatial data lakes with Amazon Redshift | Amazon Web Services

July 10, 2025 - We use Redshift Serverless and Amazon Redshift Spectrum to access this data from ArcGIS Pro, a GIS mapping software from Esri, an AWS Partner. The following diagram shows the architecture for this solution. The following is a sample schema for this post. In the following sections, we walk through the steps to set up the solution: Deploy the solution infrastructure using AWS CloudFormation...

Hevodata

cdn.hevodata.com › whitepapers › A Complete Guide On Amazon Spectrum.pdf pdf

A Complete Guide on Redshift Spectrum

Here, you don’t have to load data from S3 to Redshift using COPY command or by any other means. The beauty of Spectrum is that it ...

AWS re:Post

repost.aws › questions › QUo3RCiRKgQnCpnTL-UqnRFQ › aws-redshift-serverless-with-redshift-spectrum

AWS Redshift Serverless with Redshift Spectrum | AWS re:Post

Top answer

1 of 2

2

In the serverless workgroup configuration, in the Permissions you need to add the role you use for S3 to the list of Associated IAM roles

2 of 2

1

To further expand on my colleague’s response, you can associate the IAM role with the cluster via the console. Navigate to the cluster you want to update and select Actions -> Manage IAM Roles. Specify the role to associate from the list or by directly adding the ARN. Reference https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html
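
For the serverless case in the first answer, the same association can probably be expressed in a template rather than in the console. The sketch below builds such a template as a Python dict; the namespace and workgroup names are hypothetical, and it assumes the role is attached through the IamRoles property of AWS::RedshiftServerless::Namespace.

```python
import json


def serverless_template(role_arn: str) -> dict:
    """Return a template dict for a serverless namespace with one associated role."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "SpectrumNamespace": {
                "Type": "AWS::RedshiftServerless::Namespace",
                "Properties": {
                    "NamespaceName": "spectrum-ns",  # hypothetical name
                    # Roles listed here become the associated IAM roles.
                    "IamRoles": [role_arn],
                },
            },
            "SpectrumWorkgroup": {
                "Type": "AWS::RedshiftServerless::Workgroup",
                # Create the namespace first, since the workgroup refers to it by name.
                "DependsOn": "SpectrumNamespace",
                "Properties": {
                    "WorkgroupName": "spectrum-wg",  # hypothetical name
                    "NamespaceName": "spectrum-ns",
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(serverless_template("arn:aws:iam::123456789012:role/S3Access"), indent=2))
```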

Amazonaws

aws-quickstart.s3.amazonaws.com › quickstart-amazon-redshift › doc › modular-architecture-for-amazon-redshift.pdf pdf

Modular Architecture for Amazon Redshift


AWS

aws.amazon.com › blogs › big-data › centralize-governance-for-your-data-lake-using-aws-lake-formation-while-enabling-a-modern-data-architecture-with-amazon-redshift-spectrum

Centralize governance for your data lake using AWS Lake Formation while enabling a modern data architecture with Amazon Redshift Spectrum | Amazon Web Services

February 9, 2022 - You can delete the CloudFormation stack by selecting the stack on the AWS CloudFormation console and choosing Delete. This action deletes all the resources it provisioned. If you manually updated a template-provisioned resource, you may see some issues during clean-up, and you need to clean these up manually. In this post, we showed how you can integrate Lake Formation with Amazon Redshift to seamlessly control access to Amazon S3 data lake. We also demonstrated how to query your data lake using Redshift Spectrum and external tables.