c# - Improving performance reading from large Redshift table - Stack Overflow
Amazon Redshift extends Automatic Table Optimization to support Column Compression Encoding
I may be wrong, and I've not examined the behaviour of this functionality, but the choices Redshift has so far made with regard to column encoding are, in my view, extremely poor, and somewhat politically rather than purely technically influenced: there is a strong preference for AWS's proprietary encoding method even when it is not an appropriate choice at all. Indeed, that method is even used with interleaved tables, where it is a catastrophically incorrect choice.
Moreover, any automated method is fundamentally limited in the information it can process to make its choice: it is not a human. It does not understand the overall design, or what might come in the future. It may end up repeatedly switching between different choices as the query load varies over time.
In short, it is likely better than a human who doesn't know what they are doing - assuming the actual choices made are sound, which is not an assumption I would make - and worse than a human who does know what they are doing.
My great fear is that it will not be possible, either now or in the future, to disable this functionality, and so it will actively harm the clusters of people who do know what they're doing.
Need suggestion on query optimization
We also tried Spectrum, which is advertised as cheap and fast. In reality, we had the same experience as you. I'd suggest really reviewing partitioning, and also indexes in Glue, which help a lot.
There are a number of things you can do (depending on what you are trying to do which you haven't explained):
- Don't read all the columns (I expect you have thought of this).
- Make sure the data is compressed (encoded).
- Ensure your data isn't badly skewed (i.e. most of your data is on one slice).
- Allocate more memory to the query reading all this data. I expect that there is quite a bit of spill to disk, reducing this could have a big impact.
- Increase the number / size of nodes in your cluster. The disk bandwidth is directly proportional to the number of nodes.
- Use Redshift Spectrum to do the initial paring down of data. If you are doing group by / aggregation of the data then Spectrum can greatly increase the bandwidth for performing these initial actions of your query. This is only a win if you are not moving all the data to the Redshift cluster.
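The memory-allocation point above can be sketched in a session. This is a hedged example, assuming a psql client and a Redshift endpoint; the host, database, table, and column names are placeholders. wlm_query_slot_count is a real Redshift session parameter that lets one query claim multiple WLM slots (and their memory), which reduces spill to disk:

```shell
# Sketch: claim extra WLM slots for a heavy read, then run the query.
# Host/db/user/table names below are hypothetical placeholders.
psql -h my-cluster.example.redshift.amazonaws.com -p 5439 -U admin -d mydb <<'SQL'
SET wlm_query_slot_count TO 4;   -- this session's queries get 4 slots' worth of memory
SELECT order_id, order_date, total
FROM big_table
WHERE order_date >= '2023-01-01';
SQL
```

Note the slot count only applies for the current session, and claiming too many slots starves concurrent queries, so reset it (or close the session) when the big read is done.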
With all that said, I am doubtful that you are really having issues with disk reads for only 100M rows. This is peanuts for Redshift. Unless you have 1000 columns and a tiny cluster, this won't take 2 hours. Did you do a SELECT * with the result landing on your computer? If so, the 2 hours was spent moving the data to you over the network, not reading it from disk.
I hope the suggestions above help but if my guess is correct and there is something wrong with your measurements you will need to provide more information. How large in GB is the table? How big is the cluster? What queries are you running? Table info like skew and compression. Query actual execution timing. Something seems amiss.
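The skew and compression numbers asked for above can be pulled from the SVV_TABLE_INFO system view. A minimal sketch, again assuming psql and placeholder connection details; svv_table_info and its columns (size in 1 MB blocks, tbl_rows, encoded, skew_rows, unsorted) are standard Redshift:

```shell
# Sketch: report size, row count, encoding status, and skew for one table.
# Replace the connection details and table name with your own.
psql -h my-cluster.example.redshift.amazonaws.com -p 5439 -U admin -d mydb <<'SQL'
SELECT "table",
       size      AS size_1mb_blocks,
       tbl_rows,
       encoded,      -- are columns compression-encoded?
       skew_rows,    -- ratio of rows on the largest vs smallest slice
       unsorted      -- percent of rows not in sort-key order
FROM svv_table_info
WHERE "table" = 'big_table';
SQL
```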
I now understand that the speed in question is pulling the data down to an EC2 instance. There are ways to speed this up as well.
The issue you are running into is that you are moving all the data through a single network connection. A single network connection has a lot of handshake overhead, and since Redshift requires a fairly small network MTU (packet size), there is a lot of handshaking. In addition, the data is sent uncompressed over the JDBC connection, which takes more bandwidth than compressed data. So even though you are bringing the data to a single computer (EC2), there is significant speed-up that can be achieved.
So if the question is how to speed up the data coming from Redshift over the JDBC connection, I'm sorry, you can't do much (a higher-network-speed EC2 instance?). If instead you want to get the data onto the EC2 instance as fast as possible, there are improvements that can be made.
Believe it or not, the fastest way is a 2-step approach. First, unload the data to S3, making sure it is compressed and that "parallel" is on. This will cause Redshift to start a data transfer from each slice to S3 - in your case, 4 parallel connections. (If you had a bigger cluster, the parallelism would be even higher.) You will then have at least 4 files in S3.
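The unload step might look like this. A hedged sketch, assuming psql, a placeholder cluster, bucket, table, and IAM role ARN; GZIP and PARALLEL ON are real UNLOAD options:

```shell
# Sketch: unload compressed, in parallel (one or more files per slice).
# All names and the role ARN below are hypothetical placeholders.
psql -h my-cluster.example.redshift.amazonaws.com -p 5439 -U admin -d mydb <<'SQL'
UNLOAD ('SELECT * FROM big_table')
TO 's3://my-bucket/exports/big_table_part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
GZIP
PARALLEL ON;
SQL
```

PARALLEL ON is the default, but stating it makes the intent explicit; the prefix `big_table_part_` becomes the stem of the numbered output files.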
Next, you start parallel gets of these files from the EC2 instance. You want around 4 parallel gets, so this could work simply in your case. A bash script can be used to automate the process of keeping 4 parallel AWS CLI gets running at all times (if you have more than 4 files). As each file is downloaded you want to uncompress it, and this can be done on the fly - "aws s3 cp s3://bucket/key - | gunzip -c > file". The last step is to cat these files together (if you need to) and read them into whatever tool needs the data.
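The download side can be sketched with xargs, which keeps a fixed number of transfers running at once. Bucket and prefix are placeholder names; the aws s3 ls / cp commands and gunzip are used as in the one-liner above:

```shell
# Sketch: download unloaded parts 4 at a time, decompressing on the fly.
# "my-bucket/exports/" is a hypothetical placeholder prefix.
aws s3 ls s3://my-bucket/exports/ | awk '{print $4}' \
  | xargs -n 1 -P 4 -I {} \
      sh -c 'aws s3 cp "s3://my-bucket/exports/$1" - | gunzip -c > "${1%.gz}"' _ {}

# Then stitch the parts back together if a single file is needed:
cat big_table_part_* > big_table.csv
```

The `_` fills sh's `$0` so the key lands in `$1`, and `${1%.gz}` strips the .gz suffix for the output filename; -P 4 is what keeps 4 gets in flight at all times.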
Because there is a lot of overhead in TCP connections, the reads from S3 overlap, and the files are compressed, this 2-step process can be significantly faster than the 1-step JDBC route for pulling large amounts of data from Redshift. The limiting step is likely the single network card of the EC2 instance, but this process maximizes the performance of that limited resource.