I'm building a web service via API Gateway that would allow users to run queries on a DB. The data is in S3 and I thought of using Athena and have Lambda run queries against it. Thing is, I see a lot of similar designs but with Redshift instead of Athena. One of our Principal Engineers said Redshift fits better for a web service compared to Athena (but I didn't ask why). Any idea why it's the case?
EDIT: For context, the data in S3 is Parquet and it is partitioned. I'm expecting a moderate number of users of the API.
I have used both across a few different use cases and conclude:
Advantages of Redshift Spectrum:
- Allows creation of Redshift tables
- Able to join Redshift tables with Redshift spectrum tables efficiently
If you do not need those things, then you should consider Athena as well.
Athena differences from Redshift spectrum:
- Billing. This is the major difference, and depending on your use case you may find one much cheaper than the other.
- Performance. I found Athena slightly faster.
- SQL syntax and features. Athena is derived from Presto and is a bit different from Redshift, which has its roots in PostgreSQL.
- Connectivity. It's easy enough to connect to Athena using the API, JDBC or ODBC, but many more products offer "standard out of the box" connections to Redshift.
Also, for either solution, make sure you keep the table metadata in the AWS Glue Data Catalog rather than in Athena's own catalog, as Glue has fewer limitations.
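To make the billing point concrete, here is a rough back-of-the-envelope sketch. The prices are assumptions for illustration only (a flat per-TB scan price for both services and a hypothetical hourly price for one small Redshift node), and it ignores details such as per-query minimums, so check the current AWS pricing for your region before relying on it.

```python
# Illustrative, ASSUMED prices; not taken from the thread. Check current AWS pricing.
PRICE_PER_TB_SCANNED = 5.00    # USD per TB scanned, assumed for both Athena and Spectrum
CLUSTER_PRICE_PER_HOUR = 0.25  # USD per node-hour, hypothetical small Redshift node
HOURS_PER_MONTH = 730          # average number of hours in a month


def athena_monthly_cost(queries_per_month, gb_scanned_per_query):
    """Athena: pay only for the data scanned by the queries."""
    tb_scanned = queries_per_month * gb_scanned_per_query / 1024
    return tb_scanned * PRICE_PER_TB_SCANNED


def spectrum_monthly_cost(queries_per_month, gb_scanned_per_query, nodes=1):
    """Spectrum: the same scan charge, plus a Redshift cluster that runs all month."""
    scan_cost = athena_monthly_cost(queries_per_month, gb_scanned_per_query)
    cluster_cost = nodes * CLUSTER_PRICE_PER_HOUR * HOURS_PER_MONTH
    return scan_cost + cluster_cost


if __name__ == "__main__":
    # 1024 queries per month, each scanning 1 GB of partitioned Parquet
    print("Athena:  ", athena_monthly_cost(1024, 1))
    print("Spectrum:", spectrum_monthly_cost(1024, 1))
```

Under these assumptions the always-on cluster dominates the bill at low query volumes. If you already pay for the cluster for other reasons, the comparison looks quite different.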
This question has been up for quite some time, but I still think I can contribute something to the discussion.
What is Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)
Pretty straightforward, right?
Then comes the question: what is Redshift Spectrum, and why did the Amazon folks build it when Athena was already a solution for querying external tables?
The AWS folks wanted to extend Redshift (which was already popular as a managed columnar data store) and give it the ability to query external tables, typically data in S3. The goal was to make life easier for Redshift users, who are mostly analytics people. At the time, many analytics tools supported Redshift but not Athena. However, running your own Redshift cluster and storing all your data in it was a bottleneck: Redshift does not scale horizontally very well, and adding new machines involves some downtime. If you are a Redshift user, cheaper external storage basically makes your life much easier.
I suggest you use Redshift spectrum in the following cases:
- You are an existing Redshift user and you want to store more data in Redshift.
- You want to move colder data to an external table but still want to join it with Redshift tables in some cases.
- You unload your data with Spark and just want to import it into Pandas or other tools for analysis.
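As a rough illustration of the "cold data in an external table, joined with Redshift tables" case, the sketch below builds the SQL involved. The schema, table, column, database and role names are all hypothetical, and `cursor` is assumed to be a DB-API cursor from whichever PostgreSQL-compatible driver you use to connect to Redshift. The identifiers are interpolated directly into the DDL, so they should only come from trusted configuration.

```python
def external_schema_ddl(schema, glue_database, iam_role_arn):
    """Build DDL that exposes a Glue Data Catalog database as a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{glue_database}' "
        f"IAM_ROLE '{iam_role_arn}';"
    )


# Join a regular Redshift table with an external (Spectrum) table stored in S3.
# All table and column names here are hypothetical.
COLD_JOIN_SQL = """
SELECT c.customer_id, c.name, COUNT(*) AS cold_events
FROM public.customers AS c            -- regular Redshift table
JOIN spectrum.events_archive AS e     -- external table backed by S3
  ON e.customer_id = c.customer_id
WHERE e.event_date < '2020-01-01'     -- ideally a partition column, to limit the scan
GROUP BY c.customer_id, c.name;
"""


def run_statements(cursor, statements):
    """Execute each SQL statement in order on a DB-API cursor."""
    for statement in statements:
        cursor.execute(statement)


# Example usage (assumes an open connection `conn` to the Redshift cluster):
#   ddl = external_schema_ddl(
#       "spectrum", "analytics",
#       "arn:aws:iam::123456789012:role/ExampleSpectrumRole")
#   run_statements(conn.cursor(), [ddl, COLD_JOIN_SQL])
```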
And Athena can be useful when:
- You are a new user and don't have a Redshift cluster. Access to Spectrum requires an active, running Redshift cluster, so Redshift Spectrum is not an option without Redshift.
- You want the more settled tool. Spectrum is still developing, and features such as transactions are still being added to make it more efficient.
- You want an API. Athena comes with a nice REST API, so go for it if you want that.
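For the API route, a minimal sketch of running a query (for example from a Lambda function) might look like the following. It assumes `client` is a `boto3.client("athena")` object. The database name, S3 output location and the query in the usage comment are hypothetical. It reads only the first page of results and assumes the first row of a `SELECT` result is the header row.

```python
import time


def run_athena_query(client, sql, database, output_location,
                     poll_seconds=1.0, max_polls=60):
    """Start an Athena query, wait for it to finish, return first-page rows as dicts."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena runs queries asynchronously, so poll until a terminal state is reached.
    for _ in range(max_polls):
        execution = client.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] == "SUCCEEDED":
            break
        if status["State"] in ("FAILED", "CANCELLED"):
            reason = status.get("StateChangeReason", "no reason given")
            raise RuntimeError(f"Query {query_id} {status['State']}: {reason}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Query {query_id} did not finish in time")

    # First page only; larger results need pagination via NextToken.
    rows = client.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    if not rows:
        return []
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, [cell.get("VarCharValue") for cell in row["Data"]]))
        for row in rows[1:]
    ]


# Example usage (requires boto3 and AWS credentials; all names are hypothetical):
#   import boto3
#   rows = run_athena_query(
#       boto3.client("athena"),
#       "SELECT user_id, COUNT(*) AS n FROM events WHERE dt = '2020-01-01' GROUP BY user_id",
#       database="analytics",
#       output_location="s3://example-athena-results/",
#   )
```

Filtering on a partition column, as in the usage comment, limits how much data the query scans, which matters for both latency and cost when the Parquet data is partitioned.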
All that is to say, Redshift + Redshift Spectrum is indeed powerful and shows a lot of promise, but it still has a long way to go before it is mature.