Redshift vs RDS MySQL benchmarking
sql - Redshift design or configuration issue? - My Redshift datawarehouse seems much slower than my mysql database - Stack Overflow
amazon web services - Is AWS Redshift to PostgreSQL the same as AWS Aurora to MySQL? - Stack Overflow
performance - select * vs select column in Redshift and MySql - Stack Overflow
I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
The next question is: why is the cluster this size for a table of only 11M rows? I'd guess that there are other, much larger data sets in the database and that this table isn't what sets the cluster size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
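As a rough sketch of the queue-time check, the helper below builds a query against STL_WLM_QUERY for a single query id. The function name and the example id are hypothetical. The total_queue_time and total_exec_time columns are reported in microseconds, so the SQL divides by 1,000,000 to show seconds.

```python
def wlm_wait_sql(query_id: int) -> str:
    """Build SQL that compares queue time with execution time for one query."""
    # int() rejects non-numeric input before it is placed in the SQL text.
    qid = int(query_id)
    return (
        "SELECT query, service_class, "
        "total_queue_time / 1000000.0 AS queue_seconds, "
        "total_exec_time / 1000000.0 AS exec_seconds "
        "FROM stl_wlm_query "
        f"WHERE query = {qid};"
    )


if __name__ == "__main__":
    # 12345 is a placeholder; use the query id shown in the Redshift console.
    print(wlm_wait_sql(12345))
```

If queue_seconds accounts for most of the elapsed time, the delay is in WLM queueing rather than in the scan itself.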
It could also be compile time but given the simplicity of the query this seems unlikely.
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load: is the cluster busy when this query runs? WLM is one place where things can interfere, but disk IO bandwidth is a shared resource, and if some other queries are abusing the disks this will make every query's access to disk slow. (The same is true of network bandwidth and leader node workload, but these don't seem to be central to your issue at the moment.)
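The "spread too thin" suspicion can be put into rough numbers. The sketch below is back-of-the-envelope arithmetic only: it assumes Redshift's 1 MB block size, an even spread of rows across slices, and an uncompressed fixed-width column. The slice count and bytes per value are made-up inputs, not figures from the question.

```python
BLOCK_BYTES = 1024 * 1024  # Redshift stores column data in 1 MB blocks


def block_fill_fraction(rows: int, slices: int, bytes_per_value: int) -> float:
    """Estimate how full one column's blocks are on each slice.

    Assumes rows are spread evenly and values are uncompressed;
    real tables will differ (compression, skew, unsorted regions).
    """
    rows_per_slice = rows / slices
    column_bytes_per_slice = rows_per_slice * bytes_per_value
    return column_bytes_per_slice / BLOCK_BYTES


if __name__ == "__main__":
    # Hypothetical cluster: 128 slices, a 4-byte integer column, 11M rows.
    fraction = block_fill_fraction(11_000_000, 128, 4)
    print(f"Each slice fills about {fraction:.0%} of one block for this column")
```

A fraction well below 1 means each slice holds a partly empty block for that column, and a select * has to touch at least one such block per column on the slice that holds the row.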
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004 as a worst-case (with SSDs and `PRIMARY KEY(id)`).
(If leads is a VIEW, then let's see the tables. 0.4s may be reasonable!)
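To illustrate why a primary-key lookup should be nearly instant in a row-oriented database, here is a minimal sketch that uses Python's built-in sqlite3 as a stand-in (SQLite, not MySQL, so plan output and timings will differ). The table contents are invented. It compares the plan for a lookup by id with the plan for a filter on an unindexed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO leads (id, name) VALUES (?, ?)",
    ((i, f"lead-{i}") for i in range(1, 10001)),
)


def plan(sql: str, params: tuple = ()) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Lookup by primary key: the engine seeks directly to the row.
pk_plan = plan("SELECT * FROM leads WHERE id = ?", (5000,))
# Filter on an unindexed column: the engine must scan the table.
scan_plan = plan("SELECT * FROM leads WHERE name = ?", ("lead-5000",))

print(pk_plan)
print(scan_plan)
```

In MySQL the equivalent check is EXPLAIN on the query plus SHOW CREATE TABLE to confirm that id really is the primary key.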
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
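A minimal sketch of the summary-table idea, again using sqlite3 as a stand-in engine. The fact_sales and sales_daily_summary tables and their columns are hypothetical. Reports read the small pre-aggregated table instead of scanning the fact table each time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_day TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [
        ("2024-01-01", "east", 10.0),
        ("2024-01-01", "east", 15.0),
        ("2024-01-01", "west", 7.5),
        ("2024-01-02", "east", 20.0),
    ],
)

# Build the summary once (or refresh it on a schedule) from the fact table.
conn.execute(
    """
    CREATE TABLE sales_daily_summary AS
    SELECT sale_day, region, COUNT(*) AS sale_count, SUM(amount) AS total_amount
    FROM fact_sales
    GROUP BY sale_day, region
    """
)

# Reporting queries now touch one row per day and region.
summary = conn.execute(
    "SELECT sale_day, region, sale_count, total_amount "
    "FROM sales_daily_summary ORDER BY sale_day, region"
).fetchall()
print(summary)
```

In practice the summary would be maintained incrementally as new fact rows arrive rather than rebuilt from scratch.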
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.
Redshift is not PostgreSQL. It is a column store engine that uses a very heavily modified part of a very old PostgreSQL version as its front-end. Under the hood it's powered by ParAccel, a very heavily modified fork of PostgreSQL 8.0.2.
Imagine someone took MySQL 4.1 or something from that era, deleted InnoDB and MyISAM, added their own hardwired storage engine, removed a whole bunch of features and added a bunch of different ones - changing the supported SQL dialect in the process. That gives you some idea.
It's a dramatically different product for different needs. It's heavily optimised for OLAP workloads and pays a heavy price for OLTP workloads.
In general you should use PostgreSQL (on AWS RDS, or elsewhere) for your day to day transaction processing. If you want data warehousing and analytics and have outgrown PostgreSQL for that then you might consider Redshift as one of your options... though it's likely you haven't really outgrown PostgreSQL, just AWS RDS.
Maybe you're looking for something more like Postgres-XL ?
The other answer is accurate regarding Redshift not being the PostgreSQL equivalent of Aurora. Generally you'd use Redshift when you needed to run some very heavy queries on a large dataset (the stuff that might take hours or more to finish running). Redshift is a columnar datastore that essentially auto-normalizes every piece of data that comes in and can execute queries that would otherwise take days in seconds. When you're done, you delete it and then repeat the process when you need it again.
In terms of getting an Aurora equivalent for PostgreSQL, I don't know how far off that is but I'm pretty sure an enterprising person could build their own with AWS EFS (https://aws.amazon.com/efs/). I'm fairly certain that's a big part of the Aurora formula.
Because Redshift is a columnar database, consider a query such as:
Select Column1, Column2 from table_a where some_criteria
Queries that select particular columns like this will be very fast, because Redshift needs to scan/read only those columns.
Select * will be much slower, because Redshift needs to scan and read all the columns.
In MySQL as well, select col1, col2 from table_a will help (less memory/IO), but the gain is not as large as in Redshift.
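The difference can be sketched with a toy model that counts how many stored values each layout has to touch. This is an illustration only: it ignores indexes, compression, block sizes, and everything else a real engine does, and the table contents are invented.

```python
# One logical table with 1000 rows and 4 columns, stored two ways.
rows = [
    {"id": i, "name": f"n{i}", "email": f"e{i}@example.com", "notes": "x" * 50}
    for i in range(1000)
]
# Column store: one list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}


def values_read_row_store(table_rows, wanted):
    """A row store reads whole rows, whichever columns are wanted."""
    return sum(len(row) for row in table_rows)


def values_read_column_store(table_columns, wanted):
    """A column store reads only the columns named in the query."""
    return sum(len(table_columns[name]) for name in wanted)


two_cols = ["id", "name"]
all_cols = list(columns)

print(values_read_row_store(rows, two_cols))
print(values_read_column_store(columns, two_cols))
print(values_read_column_store(columns, all_cols))
```

Selecting two of four columns halves the work in the column layout, while select * removes that advantage entirely.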
I would recommend reading some good documentation on columnar databases like Redshift, including its distribution key and encoding concepts, which also impact performance greatly.
https://www.youtube.com/watch?v=iuQgZDs-W7A
For MySQL, there may be a small or large difference:
- Large: If `*` has `TEXT` or `BLOB` columns that are not present in the 2/3. This is because, in some cases, such fields require an extra disk hit to fetch.
- Small otherwise. (More to parse, more to allocate memory for, etc.)
Has anyone used both Amazon Redshift and at least one other major RDBMS like PostgreSQL, MySQL, Microsoft SQL Server, or Oracle SQL? I just joined a new company and the first project that we've been tasked with is building out an ODS for our Marketing, Sales, Finance, and Product data. I'm arriving just in time to help them decide on the RDBMS and one of the suggestions from another team member is Amazon Redshift.
I've had plenty of experience working within the four db's mentioned above and I'm very comfortable with what they're capable of. I've worked inside a large ODS built in Microsoft SQL Server that supported multiple databases with tables that had tens of millions of records. The schemas were well designed and the database rarely suffered any performance issues or hiccups. More recently, I architected a smaller marketing/sales database in MySQL, hooked it up to Zapier for data inputs and Chartio (BI tool) for reads, and it worked like a charm. I'm confident that with the data we're looking to capture and report on, a "traditional" RDBMS would work just fine.
That said, I want to be open to Redshift and give it a fair shot. What can Redshift bring to the table that db's like PGSQL and MySQL cannot? What would we sacrifice by choosing Redshift? How easy will it be for me to become comfortable designing and working within Redshift if I already know PostgreSQL? I could go on asking questions, but generally I'm just looking to understand whether Redshift has any distinct advantages over the db's I've worked with before.