I'm building a web service via API Gateway that lets users run queries against a database. The data is in S3, and I thought of using Athena and having Lambda run queries against it. Thing is, I see a lot of similar designs but with Redshift instead of Athena. One of our Principal Engineers said Redshift fits a web service better than Athena (but I didn't ask why). Any idea why that's the case?
EDIT: for context, the data in S3 is Parquet and it is partitioned. I'm expecting a moderate number of users on the API.
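For what it's worth, one practical difference for this design is that Athena's API is asynchronous: a Lambda handler has to start the query, then poll until it reaches a terminal state, which adds latency a synchronous Redshift connection wouldn't. A minimal sketch of that flow, assuming a boto3 Athena client (the bucket and database names are placeholders; the client is passed in so it can be stubbed):

```python
import time

def run_athena_query(athena, sql, database, output_location, poll_seconds=1):
    """Start an Athena query and block until it reaches a terminal state.

    `athena` is a boto3 Athena client (or anything with the same interface).
    Athena is asynchronous: start_query_execution returns immediately, and
    the caller must poll get_query_execution until the query finishes.
    """
    qid = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return qid, state
        time.sleep(poll_seconds)
```

In a Lambda behind API Gateway, that polling either eats into the request timeout or forces an async result-callback design, which is one reason a warm Redshift cluster is often suggested for interactive web workloads.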
I have a bunch of small Avro files on S3 and I need to build a data warehouse on top of them. With Redshift, the same queries take about 10x longer than with Athena. What might I be doing wrong?
The final objective is to have this data in a Redshift table.
Hi, in my company we use DynamoDB to store all our data. I implemented an AWS Glue ETL pipeline to export the DynamoDB data to S3 (in Parquet format), and we use Athena to run our ad-hoc aggregate queries; the job is cron-scheduled. We use Athena as the data source in QuickSight to generate reports, and this meets our current requirements. Now my company wants to build an analytics platform on this data for other customers. The platform is UI-driven and could serve more than 1,000 users. What would be the best approach for storing the data: should we continue using Athena, or move to something like Redshift or another service? I am pretty new to this area. Apologies for any premature assumptions. Thanks in advance.
Anyone have any specific use cases or rationale where Redshift would be preferable to S3/Athena (with proper formatting, partitioning, etc.), both with a reporting engine on top?
For large data sets, Redshift clusters are expensive. However, AWS must have built Spectrum for a reason, so I'm looking for some sort of situational breakdown, since it 'seems' more cost-efficient to just use S3/Athena, provided performance isn't dreadful.
It's more difficult to update data with S3/Athena.
If you have static data, then go for it. Or, if it's mixed, put the static stuff (or batch-updated stuff) in S3 and query it with Spectrum, and use regular Redshift for things that need updates.
I don't have hard data to back this up, but I think Redshift's optimizer is quite a bit more mature than Presto's (the engine behind Athena), which means complex queries are much more likely to end up with suboptimal plans on Athena. "Suboptimal" can mean orders-of-magnitude performance hits.
From my experience, Redshift will also be better when you need interactive-speed queries, let's say when you're pulling data for a dashboard. By interactive speed I'm thinking under 6 seconds in this case, ideally 0.5-2 seconds.
Finally, there may be some cases that are less expensive on Redshift - say very frequent scans of large volumes of data?
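The cost side of that comparison is easy to sketch: Athena bills per data scanned (the long-standing list price is $5/TB, with a roughly 10 MB minimum billed per query; check current AWS pricing, as these are assumptions), while a Redshift cluster is a fixed monthly cost regardless of query volume. A back-of-the-envelope model:

```python
# Back-of-the-envelope Athena cost model. Prices are illustrative
# assumptions (Athena's long-standing list price), not a quote.
ATHENA_USD_PER_TB = 5.0
ATHENA_MIN_BYTES = 10 * 1024**2   # ~10 MB minimum billed per query
TB = 1024**4

def athena_monthly_cost(queries_per_month, bytes_scanned_per_query):
    """Monthly Athena spend for a uniform query workload."""
    billed = max(bytes_scanned_per_query, ATHENA_MIN_BYTES)
    return queries_per_month * billed / TB * ATHENA_USD_PER_TB

# e.g. 10,000 queries/month, each scanning 1 GB of Parquet:
cost = athena_monthly_cost(10_000, 1024**3)
# ~48.8 USD/month, far below even a small Redshift cluster; the picture
# flips as per-query scan volume or query frequency grows.
```

This is why "very frequent scans of large volumes" is exactly the regime where a fixed-price cluster starts to win.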
I'm getting ready to implement some very basic warehousing. A few hundred TB of data, accessed infrequently (monthly and yearly report generation for the most part), performance not an issue.
I started looking at Athena a couple weeks ago and it seems like, if I partition my data well, a "data lake" may be all I need. I put that in quotes because I would do some document standardizing before storing; it wouldn't be just raw data. I am considering storing everything in Parquet to keep data-scan costs down, but given the relatively small amount of data (mostly temporal, so partitioning is easy) I might just use JSON for future flexibility.
Has anyone gone this route and found road blocks?
Searching/querying costs money per byte scanned, so using Parquet or a similar columnar format is key.
Your file naming / partitioning is really important too.
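Concretely, "good partitioning" for Athena usually means Hive-style `key=value` path segments, so that a WHERE clause on the partition columns lets the engine skip whole S3 prefixes instead of scanning everything. A sketch of that layout (bucket and table names here are made up):

```python
# Hive-style partition layout: each partition column becomes a key=value
# path segment, and Athena prunes any prefix a query's WHERE clause
# rules out, so you only pay for the data under the matching prefixes.
def partition_prefix(bucket, table, **partitions):
    segments = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"s3://{bucket}/{table}/{segments}/"

# A query filtered to year = 2019 AND month = '03' only scans this prefix:
prefix = partition_prefix("my-data-lake", "events", year=2019, month="03")
# 's3://my-data-lake/events/year=2019/month=03/'
```

Partition on the columns your queries actually filter by (dates are the usual choice for temporal data), and avoid so many partitions that each holds only tiny files.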
Article from this sub that was helpful http://tech.marksblogg.com/billion-nyc-taxi-rides-aws-athena.html
I've done something similar (not at the same scale of data), and it does work well.
Make sure you design your partitions well to be cost effective
Also be aware that you basically can't create objects in Athena, so you're going to be limited to what is there (no new tables).
I don't know why, but I have a hard time wrapping my head around databases. Most of the other services I'm fairly comfortable with.
Edit: Also, I'd appreciate some recommendations for whitepapers on how particular companies shifted their workloads to the cloud. It's cool to see how architects implement things.
Not sure about whitepapers but this video is great: https://www.youtube.com/watch?v=-pb-DkD6cWg
Let me try to put how I see it in plain English:
Relational database in general => probably Amazon Aurora
Specific relational database engine and version => Amazon RDS
Non-relational low-latency high-scale => Amazon DynamoDB
In-memory cache => Amazon ElastiCache
In-memory cache for DynamoDB only => DynamoDB DAX
High-scale analytics / data warehousing => Amazon Redshift
Analytics on top of S3 Data => Amazon Athena
Analytics on top of S3 Data if already using Redshift => Redshift Spectrum
Documents with MongoDB Compatibility => DocumentDB
Search indexing => Amazon Elasticsearch Service
Immutable and cryptographically verifiable ledger => Amazon QLDB
Time series database => Timestream (preview)
Also, this video goes in depth on how Robinhood migrated their data warehouse to AWS.
As data engineers, choosing between Amazon Redshift and Athena often comes down to tradeoffs in performance, cost, and maintenance.
I recently published a technical case study diving into:
🔹 Query Performance: Redshift’s optimized columnar storage vs. Athena’s serverless scatter-gather
🔹 Cost Efficiency: When Redshift’s reserved instances beat Athena’s pay-per-query model (and vice versa)
🔹 Operational Overhead: Managing clusters (Redshift) vs. zero-infra (Athena)
🔹 Use Case Fit: ETL pipelines, ad-hoc analytics, and concurrency limits
Spoiler: Athena’s cold starts can be brutal for sub-second queries, while Redshift’s vacuum/analyze cycles add hidden ops work.
Full analysis here:
👉 Amazon Redshift & Athena as Data Warehousing Solutions
Discussion:
- How do you architect around these tools' limitations?
- Any war stories tuning Redshift WLM or optimizing Athena's Glue catalog?
- For greenfield projects in 2025, would you still pick Redshift, or go Athena/Lakehouse?
I have a use case where I have to collect 1k events/sec. They have to be queryable, but read query volume is not too high. For example, I have an event like this:
{
  "id": "1",
  "type": "start",
  "publisher": " ",
  "company": " "
}
I want to be able to query over publisher, company, type, id, etc. I don't need full-text search or anything like that. Essentially I am using Elasticsearch as a NoSQL database, but instead of just querying by the key, I want to query by a variety of columns. I was wondering how Redshift would compare for this. The total data size would be 3-4 TB. Given that the events won't change and I need to query them on type, publisher, company, etc., how effective would Elasticsearch be as a NoSQL database?
I have used both across a few different use cases and conclude:
Advantages of Redshift Spectrum:
- Allows creation of Redshift tables
- Able to join Redshift tables with Redshift spectrum tables efficiently
If you do not need those things then you should consider Athena as well
Athena differences from Redshift spectrum:
- Billing. This is the major difference and depending on your use case you may find one much cheaper than the other
- Performance. I found Athena slightly faster.
- SQL syntax and features. Athena is derived from Presto and is a bit different from Redshift, which has its roots in Postgres.
- Connectivity. It's easy enough to connect to Athena using the API, JDBC, or ODBC, but many more products offer "standard out of the box" connections to Redshift.
Also, for either solution, make sure you use the AWS Glue Data Catalog rather than Athena's internal catalog, as there are fewer limitations.
This question has been up for quite a while, but I still think I can contribute something to the discussion.
What is Athena?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. (From the Doc)
Pretty straightforward, right?
Then comes the question: what is Redshift Spectrum, and why did the Amazon folks make it when Athena was already pretty much a solution for external table queries?
So, the AWS folks wanted to create an extension to Redshift (which was already popular as a managed columnar datastore) and give it the capability to talk to external tables (typically on S3). But they wanted to make life easier for existing Redshift users, mostly analytics people. Many analytics tools don't support Athena but do support Redshift. Meanwhile, growing your Redshift cluster to store everything was a bottleneck: Redshift isn't that horizontally scalable, and adding new machines takes some downtime. If you are a Redshift user, making your storage cheaper basically makes your life so much easier.
I suggest you use Redshift Spectrum in the following cases:
- You are an existing Redshift user and you want to store more data in Redshift.
- You want to move colder data to an external table but still want to join it with Redshift tables in some cases.
- You unload data with Spark, or you just want to import data into Pandas or other tools for analysis.
And Athena can be useful when:
- You are a new user and don't have a Redshift cluster. Access to Spectrum requires an active, running Redshift cluster, so Redshift Spectrum is not an option without Redshift.
- Spectrum is still a developing tool; features like transactions are still being added to make it more efficient.
- BTW, Athena comes with a nice REST API, so go for it if you want that.
All in all, Redshift + Redshift Spectrum is indeed powerful, with lots of promise, but it still has a long way to go to mature.
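To make the "existing Redshift user" path concrete: Spectrum is enabled by mapping a Glue (or Athena) catalog database into Redshift as an external schema, after which the same S3 tables are queryable from both engines. A sketch that just builds the DDL string (the schema name, database name, and IAM role ARN below are all placeholders):

```python
# Redshift Spectrum setup: an external schema maps a data-catalog database
# into Redshift, so catalog tables become joinable with local tables.
# Names and the role ARN are placeholders, not real resources.
def create_external_schema_sql(schema, catalog_database, iam_role_arn):
    return (
        f"CREATE EXTERNAL SCHEMA {schema} "
        f"FROM DATA CATALOG DATABASE '{catalog_database}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )

sql = create_external_schema_sql(
    "spectrum", "analytics",
    "arn:aws:iam::123456789012:role/RedshiftSpectrumRole")
```

After running that statement on the cluster, `spectrum.some_table` can be joined directly against regular Redshift tables, which is the main advantage Spectrum has over a standalone Athena setup.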