This is a broad topic but I'll give a few thoughts.
First off, Spectrum is an (often large) set of compute elements embedded in S3 that can perform some aspects of the query plan. These parts center on applying WHERE conditions and performing aggregation (GROUP BY). There are also aspects of the query plan that cannot be performed in the S3 layer, such as JOINs and advanced functions such as window functions.
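As a concrete illustration (using a hypothetical external table `spectrum.sales`, not from the original answer), the scan, filter, and aggregation below can all happen in the S3 layer, so only the aggregated rows travel to the cluster:

```sql
-- WHERE and GROUP BY can be pushed down to the Spectrum layer;
-- only the aggregated result rows travel to the Redshift cluster.
SELECT saledate, SUM(amount) AS total_amount
FROM spectrum.sales            -- external (S3-backed) table
WHERE saledate >= '2023-01-01' -- filter applied in the S3 layer
GROUP BY saledate;             -- partial aggregation in the S3 layer
```

A JOIN to a local table or a window function over `spectrum.sales` would, by contrast, run on the Redshift cluster itself.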
The next thing to understand is that while these embedded compute elements are close to S3 in terms of access speed, the S3 service is far away from the Redshift cluster (network distance). If the large amount of data stored in S3 can be pared down to a small set that is shipped to Redshift, then Spectrum can be a huge performance improvement. However, if the data stored in S3 needs to be moved to the Redshift cluster in its entirety to perform the query, then there can be a large performance hit.
Spectrum can be a huge benefit, allowing a very large amount of data to be filtered down quickly by a fleet of small compute elements. This can result in a big win in performance and in the amount of data that can be addressed.
With this in mind, you will want to put data in Spectrum where your queries need only a subset transferred from S3 to Redshift. This in general applies to your fact tables and not to your dim tables. However, if your queries aren't going to apply a WHERE clause to the fact table or aggregate the data down, then you won't see the advantages. Also, for this to work the WHERE clause needs to apply to a column in the fact table, as JOINs cannot be done in S3, so filtering on dim columns won't help. Similarly, any GROUP BY needs to apply only to fact table columns or it won't reduce the data coming to Redshift from S3.
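To make the fact-versus-dim point concrete, here is a hedged sketch (table names `spectrum.sales` and `dim_store` are hypothetical) of the two cases:

```sql
-- Good: the filter is on a fact-table column, so it is applied in S3
-- and only matching rows are shipped to the cluster.
SELECT s.store_id, SUM(s.amount) AS total
FROM spectrum.sales s
WHERE s.saledate BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY s.store_id;

-- Less good: the filter is on a dim-table column; the JOIN can only
-- happen in Redshift, so all of spectrum.sales must come from S3 first.
SELECT s.store_id, SUM(s.amount) AS total
FROM spectrum.sales s
JOIN dim_store d ON d.store_id = s.store_id
WHERE d.region = 'EMEA'
GROUP BY s.store_id;
```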
So fact tables.
Data generally gets into Redshift through S3, typically with the COPY command. You can also get data into Redshift from S3 using Spectrum. This can be a useful tool if other tools are also using S3 for this shared data; S3 can serve as a common data store for separate data systems, which can be useful for some data solutions.
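The two loading paths look like this in outline (bucket, role ARN, and table names below are placeholders, not from the original answer):

```sql
-- Path 1: load via COPY; the data becomes local Redshift storage.
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;

-- Path 2: load via Spectrum; query the external table and
-- insert the result into a local table.
INSERT INTO sales
SELECT * FROM spectrum.sales;
```

The Spectrum path is handy when other systems also read the same S3 data, since the files stay in place.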
You also bring up very large, infrequently used data, like older historical data that usually isn't needed but sometimes is. Spectrum can be helpful here in that older data can be offloaded from the Redshift cluster, and the access time for this data isn't important as it is very infrequently used. There is a potential issue, though: the Redshift cluster can only work on a certain size of data given its disk space and memory. So you can clog up your cluster if the amount of historical data is too large. This may mean that looking at the full set of historical data in one query may not be possible. Again, if the data is aggregated or filtered in S3, this issue isn't a problem.
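A common shape for this hot/cold split is a UNION ALL over a local table and an external history table; a hedged sketch (all names hypothetical):

```sql
-- Recent data lives in Redshift; older history is offloaded to S3.
-- The filter on the external table can be pushed down, so only
-- matching historical rows are shipped to the cluster.
SELECT saledate, SUM(amount) AS total
FROM (
    SELECT saledate, amount FROM sales                  -- local, recent
    UNION ALL
    SELECT saledate, amount FROM spectrum.sales_history -- S3, historical
) t
WHERE saledate >= '2015-01-01'
GROUP BY saledate;
```

Without the date filter, the full history would have to move from S3 to the cluster, which is exactly the clogging risk described above.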
Bottom line - Spectrum is a great tool but isn't the right tool for every problem.
Answer from Bill Weiner on Stack Overflow.
In general, put everything into 'normal' Amazon Redshift.
Redshift Spectrum is handy for accessing data stored in Amazon S3 without having to load it into the Redshift cluster, but it will not be as fast as accessing data stored in 'normal' Redshift.
Therefore, it is useful for rarely-accessed data or for one-off queries on a dataset without having to import the data into Redshift.
Do not use Spectrum as part of your normal ETL flow. One exception might be if you are receiving 'landing' data via Amazon S3 (e.g. seed files) -- rather than importing those tables into Redshift, they could be referenced via Spectrum. However, normal loading tools such as Fivetran can load the data directly into Redshift, which is preferable to using Spectrum.
For performance optimization, start by examining the query plan to understand how your query executes.
Right now, you get the best performance if you have multiple files rather than a single CSV file. As a rule of thumb, performance is good when the number of files per query is at least about an order of magnitude larger than the number of nodes in your cluster.
In addition, if you use Parquet files you get the advantage of a columnar format on S3, whereas with CSV the whole file has to be read from S3 -- and Parquet decreases your cost as well.
A script can be used to convert the data to Parquet.
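The script itself isn't included here. As one hedged alternative, Redshift can write Parquet directly with UNLOAD (bucket, role ARN, table, and partition column below are placeholders):

```sql
-- Export an existing table (or any SELECT) as partitioned Parquet on S3.
-- Partitioning by a commonly filtered column lets Spectrum skip files.
UNLOAD ('SELECT * FROM sales')
TO 's3://my-bucket/sales_parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET
PARTITION BY (saledate);
```

UNLOAD also naturally produces multiple output files, which fits the files-per-query guidance above.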
A reply from the AWS forum follows:
I understand that you have the same query running on Redshift & Redshift Spectrum. However, the results are different: one runs in 2 seconds while the other runs in around 15 seconds.
First of all, we must agree that Redshift and Spectrum are different services designed for different purposes. Their internal structures vary a lot from each other: while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.
Spectrum is also designed to deal with petabytes of structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables, while Redshift offers you the ability to store data efficiently and in a highly optimized manner by means of Distribution and Sort Keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements, since the Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This reason alone would be enough to discourage any query performance comparison between these services, as it is not fair to either of them.
Regarding your question about on-the-fly nodes: Spectrum adds them based on the demands of your queries, and Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria that trigger this behavior; however, by following the best practices on how to improve query performance [1] and how to create data files for queries [2], you can potentially improve Spectrum's overall performance.
Lastly, I would like to point to some documentation that clarifies how to achieve better performance. Please see the references at the end!
Planning on using the Redshift API to allow access to a table with roughly 5B rows. Currently that data is in S3, so to use the API I am going to have to move that data to Redshift. Should I use Spectrum, or should I load the data natively? Which one do you think is cheaper long term if this API is hit multiple times a day? Thanks!