This is a broad topic but I'll give a few thoughts.
First off, Spectrum is an (often large) set of compute elements embedded in S3 that can perform some aspects of the query plan. These parts center on applying WHERE conditions and performing aggregation (GROUP BY). There are also aspects of the query plan that cannot be performed in the S3 layer, such as JOINs and advanced functions such as window functions.
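As a concrete illustration (using a hypothetical external table `spectrum.sales`, not from the original answer), the scan, filter, and aggregation below can all happen in the S3 layer, so only the aggregated rows travel to the cluster:

```sql
-- WHERE and GROUP BY can be pushed down to the Spectrum layer;
-- only the aggregated result rows travel to the Redshift cluster.
SELECT saledate, SUM(amount) AS total_amount
FROM spectrum.sales            -- external (S3-backed) table
WHERE saledate >= '2023-01-01' -- filter applied in the S3 layer
GROUP BY saledate;             -- partial aggregation in the S3 layer
```

A JOIN to a local table or a window function over `spectrum.sales` would, by contrast, run on the Redshift cluster itself.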
The next thing to understand is that while these embedded compute elements are close to S3 in terms of access speed, the S3 service is far away from the Redshift cluster (network distance). If the large amount of data stored in S3 can be pared down to a small set that is shipped to Redshift, then Spectrum can be a huge performance improvement. However, if the data stored in S3 needs to be moved to the Redshift cluster in its entirety to perform the query, then there can be a large performance hit.
Spectrum can be a huge benefit, allowing a very large amount of data to be filtered down quickly by a fleet of small compute elements. This can result in a big win in performance and in the amount of data that can be addressed.
With this in mind, you will want to put data in Spectrum where your queries need only a subset transferred from S3 to Redshift. This in general applies to your fact tables and not to your dim tables. However, if your queries aren't going to apply a WHERE clause to the fact table or aggregate the data down, then you won't see the advantages. Also, for this to work the WHERE clause needs to apply to a column in the fact table, as JOINs cannot be done in S3, so filtering on dim columns won't help. Similarly, any GROUP BY needs to apply only to fact table columns or it won't reduce the data coming to Redshift from S3.
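To make the fact-versus-dim point concrete, here is a hedged sketch (table names `spectrum.sales` and `dim_store` are hypothetical) of the two cases:

```sql
-- Good: the filter is on a fact-table column, so it is applied in S3
-- and only matching rows are shipped to the cluster.
SELECT s.store_id, SUM(s.amount) AS total
FROM spectrum.sales s
WHERE s.saledate BETWEEN '2023-01-01' AND '2023-01-31'
GROUP BY s.store_id;

-- Less good: the filter is on a dim-table column; the JOIN can only
-- happen in Redshift, so all of spectrum.sales must come from S3 first.
SELECT s.store_id, SUM(s.amount) AS total
FROM spectrum.sales s
JOIN dim_store d ON d.store_id = s.store_id
WHERE d.region = 'EMEA'
GROUP BY s.store_id;
```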
So fact tables.
Data generally gets into Redshift through S3, typically with the COPY command. You can also get data into Redshift from S3 using Spectrum. This can be a useful tool if other tools are also using S3 for this shared data; S3 can serve as a common data store for separate data systems, which can be useful for some data solutions.
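The two loading paths look like this in outline (bucket, role ARN, and table names below are placeholders, not from the original answer):

```sql
-- Path 1: load via COPY; the data becomes local Redshift storage.
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET;

-- Path 2: load via Spectrum; query the external table and
-- insert the result into a local table.
INSERT INTO sales
SELECT * FROM spectrum.sales;
```

The Spectrum path is handy when other systems also read the same S3 data, since the files stay in place.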
You also bring up very large, infrequently used data, like older historical data that usually isn't needed but sometimes is. Spectrum can be helpful here in that older data can be offloaded from the Redshift cluster, and the access time for this data isn't important as it is very infrequently used. There is a potential issue, though: the Redshift cluster can only work on a certain size of data given its disk space and memory. So you can clog up your cluster if the amount of historical data is too large. This may mean that looking at the full set of historical data in one query may not be possible. Again, if the data is aggregated or filtered in S3, this issue isn't a problem.
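A common shape for this hot/cold split is a UNION ALL over a local table and an external history table; a hedged sketch (all names hypothetical):

```sql
-- Recent data lives in Redshift; older history is offloaded to S3.
-- The filter on the external table can be pushed down, so only
-- matching historical rows are shipped to the cluster.
SELECT saledate, SUM(amount) AS total
FROM (
    SELECT saledate, amount FROM sales                  -- local, recent
    UNION ALL
    SELECT saledate, amount FROM spectrum.sales_history -- S3, historical
) t
WHERE saledate >= '2015-01-01'
GROUP BY saledate;
```

Without the date filter, the full history would have to move from S3 to the cluster, which is exactly the clogging risk described above.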
Bottom line - Spectrum is a great tool but isn't the right tool for every problem.
Answer from Bill Weiner on Stack Overflow.
In general, put everything into 'normal' Amazon Redshift.
Redshift Spectrum is handy for accessing data stored in Amazon S3 without having to load it into the Redshift cluster, but it will not be as fast as accessing data stored in 'normal' Redshift.
Therefore, it is useful for rarely-accessed data or for one-off queries on a dataset without having to import the data into Redshift.
Do not use Spectrum as part of your normal ETL flow. One exception might be if you are receiving 'landing' data via Amazon S3 (e.g. seed files) -- rather than importing those tables into Redshift, they could be referenced via Spectrum. However, normal loading tools such as Fivetran can load the data directly into Redshift, which is preferable to using Spectrum.
For performance optimization, start by examining the query plan to understand how your query executes.
Right now, you get the best performance if you have multiple files rather than a single CSV file. As a rule of thumb, performance is good when the number of files per query is at least about an order of magnitude larger than the number of nodes in your cluster.
In addition, if you use Parquet files you get the advantage of a columnar format on S3, whereas with CSV the whole file has to be read from S3 -- and Parquet decreases your cost as well.
A script can be used to convert the data to Parquet.
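The script itself isn't included here. As one hedged alternative, Redshift can write Parquet directly with UNLOAD (bucket, role ARN, table, and partition column below are placeholders):

```sql
-- Export an existing table (or any SELECT) as partitioned Parquet on S3.
-- Partitioning by a commonly filtered column lets Spectrum skip files.
UNLOAD ('SELECT * FROM sales')
TO 's3://my-bucket/sales_parquet/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
FORMAT AS PARQUET
PARTITION BY (saledate);
```

UNLOAD also naturally produces multiple output files, which fits the files-per-query guidance above.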
A reply from the AWS forum follows:
I understand that you have the same query running on Redshift & Redshift Spectrum. However, the results are different: one runs in 2 seconds while the other runs in around 15 seconds.
First of all, we must agree that Redshift and Spectrum are different services designed for different purposes. Their internal structures vary a lot from each other: while Redshift relies on EBS storage, Spectrum works directly with S3. Redshift Spectrum's queries employ massive parallelism to execute very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3.
Spectrum is also designed to deal with petabytes of structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables, while Redshift offers you the ability to store data efficiently and in a highly optimized manner by means of Distribution and Sort Keys.
AWS does not advertise Spectrum as a faster alternative to Redshift. We offer Amazon Redshift Spectrum as an add-on solution to provide access to data stored in Amazon S3 without having to load it into Redshift (similar to Amazon Athena).
In terms of query performance, unfortunately, we can't guarantee performance improvements, since the Redshift Spectrum layer produces query plans completely different from the ones produced by Redshift's database engine interpreter. This reason alone would be enough to discourage any query performance comparison between these services, as it is not fair to either of them.
Regarding your question about on-the-fly nodes: Spectrum adds them based on the demands of your queries, and Redshift Spectrum can potentially use thousands of instances to take advantage of massively parallel processing. There aren't any specific criteria that trigger this behavior; however, by following the best practices on how to improve query performance [1] and how to create data files for queries [2], you can potentially improve Spectrum's overall performance.
Lastly, I would like to point to some documentation that clarifies how to achieve better performance. Please see the references at the end!
Planning on using the Redshift API to allow access to a table with roughly 5B rows. Currently that data is in S3, so to use the API I am going to have to move that data to Redshift. Should I use Spectrum, or should I load the data natively? Which one do you think is cheaper long term if this API is hit multiple times a day? Thanks!