I've worked with clients on this type of issue many times and I'm happy to help but this may take some back and forth to narrow in on what is happening.
First I'm assuming that "leads" is a normal table, not a view and not an external table. Please correct if this assumption isn't right.
Next I'm assuming that this table isn't very wide and that "select *" isn't contributing greatly to the speed concern. Yes?
Next question is why this size of cluster for a table of only 11M rows? I'd guess that there are other, much larger data sets on the database and that this table isn't what sets the cluster size.
The first step of narrowing this down is to go onto the AWS console for Redshift and find the query in question. Look at the actual execution statistics and see where the query is spending its time. I'd guess it will be in loading (scanning) the table but you never know.
You also should look at STL_WLM_QUERY for the query in question and see how much wait time there was with the running of this query. Queueing can take time and if you have interactive queries that need faster response times then some WLM configuration may be needed.
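A query along these lines would show how the time splits between queueing and execution. It is only a sketch: the query ID 123456 is a placeholder for the ID you find in the console, and the time columns in STL_WLM_QUERY are reported in microseconds, hence the division.

```sql
-- 123456 is a placeholder; substitute the query ID shown in the console.
SELECT query,
       service_class,
       total_queue_time / 1000000.0 AS queue_seconds,
       total_exec_time  / 1000000.0 AS exec_seconds
FROM stl_wlm_query
WHERE query = 123456;
```

If queue_seconds is a large share of the total, the WLM configuration is the place to look before touching the table itself.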
It could also be compile time but given the simplicity of the query this seems unlikely.
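If you want to rule compilation out rather than assume, something like the following against SVL_COMPILE should do it. Again 123456 is a placeholder query ID, not a value from your cluster.

```sql
-- One row per query segment; compile = 1 means the segment was compiled
-- for this run, compile = 0 means a cached compiled segment was reused.
SELECT segment,
       compile,
       DATEDIFF(ms, starttime, endtime) AS compile_ms
FROM svl_compile
WHERE query = 123456   -- placeholder query ID
ORDER BY segment;
```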
My suspicion is that the table is spread too thin around the cluster and there are lots of mostly empty blocks being read, but this is just based on assumptions. Is "id" the distkey or sortkey for this table? Other factors likely in play are cluster load: is the cluster busy when this query runs? WLM is one place that things can interfere, but disk IO bandwidth is a shared resource, and if some other queries are abusing the disks this will make every query's access to disk slow. (Same is true of network bandwidth and leader node workload but these don't seem to be central to your issue at the moment.)
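To answer the distkey/sortkey question and get a feel for how thinly the table is spread, a lookup like this against SVV_TABLE_INFO is a reasonable starting point. It assumes the table is named "leads" as in your query; the size column is counted in 1 MB blocks.

```sql
-- "table" must be quoted because it is a reserved word.
SELECT "table",
       diststyle,            -- shows the distribution style and key, if any
       sortkey1,             -- first sort key column, if any
       tbl_rows,
       size AS size_in_1mb_blocks,
       skew_rows,            -- ratio of rows on the fullest slice to the emptiest
       unsorted              -- percent of rows not in sort order
FROM svv_table_info
WHERE "table" = 'leads';
```

A block count that looks large for 11M rows would support the "mostly empty blocks" theory.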
As I mentioned resolving this will likely take some back and forth so leave comments if you have additional information.
(I am speaking from a knowledge of MySQL, not Redshift.)
SELECT * FROM leads WHERE id = 10162064
If id is indexed, especially if it is a Unique (or Primary) key, 0.4 sec sounds like a long network delay. I would expect 0.004 as a worst-case (with SSDs and `PRIMARY KEY(id)`).
(If leads is a VIEW, then let's see the tables. 0.4s may be reasonable!)
That query works well for a RDBMS, but not for a columnar database. Face it.
I can understand using a columnar database to handle random queries on various columns. See also MariaDB's implementation of "Columnstore" -- that would give you both RDBMS and Columnar in a single package. Still, they are separate enough that you can't really intermix the two technologies.
If you are getting 100% CPU in MySQL, show us the query, its EXPLAIN, and SHOW CREATE TABLE. Often, a better index and/or query formulation can solve that.
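For the query quoted above, the information being asked for would come from statements like these (MySQL syntax; the table name is taken from that query):

```sql
-- How MySQL plans to execute the lookup (which index, how many rows examined).
EXPLAIN SELECT * FROM leads WHERE id = 10162064;

-- Full table definition, including indexes and storage engine.
SHOW CREATE TABLE leads;
```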
For "real time reporting" in a Data Warehouse, building and maintaining Summary Tables is often the answer.
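As a rough illustration of what a Summary Table can look like in MySQL, here is a minimal sketch. The summary table name and the columns created_at and source are hypothetical; nothing in the question says the leads table has them, so adapt them to the real schema.

```sql
-- Hypothetical summary table: one row per day and lead source.
CREATE TABLE leads_daily_summary (
    day        DATE         NOT NULL,
    source     VARCHAR(50)  NOT NULL,
    lead_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (day, source)
);

-- Aggregate yesterday's rows. Rerunning is safe: existing rows are
-- overwritten with the freshly computed count instead of duplicated.
INSERT INTO leads_daily_summary (day, source, lead_count)
SELECT DATE(created_at), source, COUNT(*)
FROM leads
WHERE created_at >= CURDATE() - INTERVAL 1 DAY
  AND created_at <  CURDATE()
GROUP BY DATE(created_at), source
ON DUPLICATE KEY UPDATE lead_count = VALUES(lead_count);
```

Reports then read from the small summary table instead of scanning the fact table.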
Tell us more about the "exact copy" of the DW data. In some situations, the Summary tables can supplant one copy of the Fact table data.
Since Redshift is a columnar database, a query that selects specific columns, such as
SELECT Column1, Column2 FROM table_a WHERE some_criteria
will be very fast, because Redshift only needs to scan and read those particular columns.
SELECT * will be much slower, as Redshift needs to scan and read all the columns.
In MySQL as well, SELECT col1, col2 FROM table_a will be OK (less memory and I/O), but not as good as in Redshift.
I would recommend reading some good documentation about columnar databases like Redshift, and about its distribution key and encoding concepts, which also affect performance greatly.
https://www.youtube.com/watch?v=iuQgZDs-W7A
For MySQL, there may be a small or large difference:
- Large: If `*` has `TEXT` or `BLOB` columns that are not present in the 2/3. This is because, in some cases, such fields require an extra disk hit to fetch.
- Small otherwise. (More to parse, more to allocate memory for, etc.)