database - Optimize large IN condition for Redshift query - Stack Overflow
Working with a massive 14 billion row dataset in Redshift for sales analytics reporting: I've managed to optimize query times using sort keys and distribution keys, but as the dataset is continuously growing and currently spans three years of data, what are other effective strategies or methods you would recommend for further optimizing read performance on such a large and expanding dataset?
You can try joining against a temporary table or a subquery instead of using a large IN list:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (
SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
UNION ALL
SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
UNION ALL
SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
-- UNION ALL ... (repeat for the remaining hashes)
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
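The temporary-table variant of the same idea might look like the sketch below. The names `wanted_hashes` and `my_table` are hypothetical (`my_table` stands in for the `table` placeholder above), and `CHAR(32)` is an assumption about the type of `phash`; adjust both to the real schema.

```sql
-- Hypothetical staging table holding the hashes to look up.
-- CHAR(32) is an assumption; match the real type of phash.
CREATE TEMP TABLE wanted_hashes (
    phash CHAR(32) NOT NULL
);

-- Load the lookup values with one multi-row INSERT.
INSERT INTO wanted_hashes (phash) VALUES
    ('5c8615fa967576019f846b55f11b6e41'),
    ('8719c8caa9740bec10f914fc2434ccfd'),
    ('9b657c9f6bf7c5bbd04b5baf94e61dae');

-- Join against the staging table instead of using a long IN list.
SELECT DISTINCT t.ret_field
FROM my_table AS t
JOIN wanted_hashes AS w
  ON t.phash = w.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
```

For a very long list of hashes, loading the staging table with COPY is likely to be faster than INSERT statements.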
Alternatively, search in chunks and combine the results with UNION:
SELECT ret_field
FROM table
WHERE phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae')
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( /* more hashes */ )
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
-- ...
If the query optimizer merges the chunks back into a single query, you can try storing the intermediate results in a temporary table instead.
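A minimal sketch of the chunked search with an auxiliary table follows. The names `chunk_results` and `my_table` are hypothetical, and the split of hashes between chunks is arbitrary.

```sql
-- First chunk: create the auxiliary table from the query result.
CREATE TEMP TABLE chunk_results AS
SELECT ret_field
FROM my_table
WHERE phash IN (
    '5c8615fa967576019f846b55f11b6e41',
    '8719c8caa9740bec10f914fc2434ccfd')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

-- Each further chunk appends its rows as a separate statement,
-- so the planner cannot fold the chunks into one large IN list.
INSERT INTO chunk_results
SELECT ret_field
FROM my_table
WHERE phash IN ('9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

-- UNION removes duplicates, so deduplicate here to get the same result.
SELECT DISTINCT ret_field
FROM chunk_results;
```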
EDIT:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
FROM ...
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
It's worth trying a sort key of (last_seen, phash), with last_seen as the leading column.
The slowness may be because the leading sort key column is phash, whose values look like random strings. With a random leading column, rows from any given date range end up scattered across the whole table, so the last_seen filter cannot skip many blocks.
As the AWS Redshift developer docs say, a timestamp column should be the leading sort key column when it is used in WHERE conditions:
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift
With this sort key order, rows are sorted by last_seen first, then by phash. (What does it mean to have multiple sortkey columns?)
Note that you have to recreate the table to change its sort key (newer Redshift releases also accept ALTER TABLE ... ALTER SORTKEY). The linked guide will help you do that.
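Recreating the table with the new sort key could be sketched as a deep copy like this. The names `my_table`, `my_table_sorted`, and `my_table_old` are hypothetical, and this assumes the table can be copied in one pass. Note that CREATE TABLE AS does not carry over constraints, defaults, or column encodings, so check those before dropping the original.

```sql
-- Build a copy of the table with last_seen as the leading sort key column.
CREATE TABLE my_table_sorted
COMPOUND SORTKEY (last_seen, phash)
AS
SELECT *
FROM my_table;

-- Swap the new table into place, keeping the old one until verified.
ALTER TABLE my_table RENAME TO my_table_old;
ALTER TABLE my_table_sorted RENAME TO my_table;

-- Only after checking row counts and query results:
DROP TABLE my_table_old;
```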