DynamoDB: GSI & Sort Key
Why are DynamoDB Global Secondary Indexes still eventually consistent?
indexing - Global Vs Local Secondary Indexes in DynamoDB - Stack Overflow
DynamoDB - Global Secondary Index on set items - Stack Overflow
Videos
I’ve recently been dipping my toes into the AWS ressources, specifically DynamoDB. Ive been reading up on the necessary documentation, however I don’t understand the main distinction between sort key and gsi. To query using the sort key, wouldn’t I need to create a gsi nonetheless? Otherwise, wouldn’t not be able to use it to query?
I've been using DynamoDB for many years, and I've occasionally run into situations where the lack of consistent reads on global secondary indexes is a pain. Since Global Secondary Indexes are effectively separate tables, which have writes replicated to them from the main table, it makes sense that they cannot support consistent reads.
In November 2018, DynamoDB introduced transactions. This means that instead of using a GSI, I could create my own GSI-like structure for a table by adding a new 'secondary' table with a different hash and sort key. I could update both of these tables in a single transaction, and I have now created a GSI which supports consistent reads.
So if I can manually implement a 'secondary index table' that supports consistent reads, why do we still not consistent reads for GSIs offered by Amazon?
Is there something I am missing about TransactWriteItems and read consistency on a table?
Every item in Dynamo must have a unique primary key. The primary key is the base table index. A primary key must have a partition key and can optionally have a range key (also called a sort key). Within a partition, items are ordered by range key. Accessing items using a partition key is fast.
Secondary indexes allow you to query the table using an alternative key. A Local Secondary Index (LSI) has the same partition key as the primary key (index), but a different range key. The way to think about an LSI is that its the same data as the primary index (key), just ordered by a different attribute.
A Global Secondary Index (GSI) has a different partition key to the primary key and is therefore a different set of data.
One of the important differences between an LSI and GSI is that an LSI takes its throughput capacity from the base table, whereas you have to purchase GSI throughput capacity separately. Put another way, an LSI costs you nothing, and a GSI incurs extra costs over your base table.
Let's have a look at the Music table example. Let's say the base table has this schema;
Artist: (Primary Key) Partition Key
SongTitle: (Primary Key) Range Key
AlbumTitle:
DateOfRelease:
This table is a list of songs. I can access all the songs for an artist really efficiently (i.e. query by Artist using the partition key). When I do this query, the songs will be ordered by SongTitle. I can also access songs by Artist and SongTitle very efficiently using the unique primary key.
Now lets say I want to get all songs by an Artist but ordered by DateOfRelease. In the current schema I would need to get all the songs and then order them in my application. A good alternative would be to create a new index, with a partition key of Artist and a range key DateOfRelease. This will be a LSI because the partition key of the index (Artist) is the same as the partition key of the primary key. I do not need to purchase additional throughput capacity as this index will provision itself from the base table capacity.
Now lets say I want to access the songs by AlbumTitle, ordered by SongTitle, i.e. create lists of Albums. To do this efficiently I create a new index with partition key AlbumTitle and range key SongTitle. This is a GSI because the partition key is different to the primary key. This GSI must be provisioned separately to the base table and therefore costs extra.
In answer to your question, GenreAlbumTitle is a GSI because it has a different partition key to Music.
There is some misconceptions about the costs of using LSI, so let me clarify here.
Using LSI is not free of charge. Just like GSI, dynamoDB needs to create and maintain additional partial copies of the table in order to quickly get the results. This maintenance of additional copies will incur additional read, write, and storage costs identical to that of GSI. (Additional cost will be written in bold). The only difference is that instead of allocating a separate pay plan, you use the same pay plan as the main table.
Before discussing the additional cost, let me again summarize what kind of information is stored in the partial copy table. The partial table copy (LSI) contains the partition key (same as original table), the sort key (a different one than the original table), and any additional projected attributes.
Original Table
| Artist (Primary Key) | Song title (sort key) | Album Title | Date Of Release |
|---|---|---|---|
| Michael Jackson | Beat It | Thriller | December 1, 1982 |
| Weeknd | The Hills | Beauty Behind the Madness | May 27, 2015, |
LSI
| Artist (Primary Key) | Album Title (sort key) | Date Of Release |
|---|---|---|
| Michael Jackson | Thriller | December 1, 1982 |
| Weeknd | Beauty Behind the Madness | May 27, 2015, |
Projected attributes are the additional information we want to query from the LSI. I could say, "show me all of release dates of the albums by the Weeknd, ordered by the album name". As you can see, we don't care about the song title here, and is not included in our LSI projections.
Additional cost for reads:
You are charged for 1 Read Unit for queries that can be satisfied by using LSI table alone. Example: "Show me all of release dates of the songs by the Weeknd, ordered by the album name."
You are charged 1 additional Read Unit for queries that LSI doesn't know how to serve on its own, forcing it to go to the main table for help. This will cost a total of 2 Read Units. Example: "show me all of release dates AND the song titles of the songs by the Weeknd, ordered by the album name."
Additional cost for writes:
(Writes are done to the main table, with its own write costs, and changes are later propagated to LSI's)
- If the update to the main table results in creation of a new row in the LSI => 1 additional Write Unit
- If the update to the main table results in the key attribute of an existing row in the LSI to be updated => cost for deletion (1 unit) + creation (1 unit) = 2 additional Write Unit
- If the update to the main table results in the non-key attribute of an existing row in the LSI to be updated => 1 additional Write Unit
- If the update to the main table results in the deletion of an existing attribute of an existing row in the LSI => 1 additional Write Unit
- If the update to the main table does not change any rows of LSI => 0 additional Write Unit
Additional cost for storage
- You pay additional cost for: (size of index keys + size of projected attributes + overhead) x number of rows
As you can see, if we are not careful with LSI, extra costs can become overbearing. To minimize cost, you must:
- Carefully consider your typical queries. Which types of information do you need?
- There is tradeoff between read cost and storage cost. If you project every attribute to the LSI, than you will incur no extra read cost, but your storage cost will double. If you project only the key attributes, and you often fetch additional information other than the key attributes, there will be a lot of extra read costs from having to go back to the main table for help.
- For tables that are write-heavy, expect to incur a huge uptick in the write cost. Remember, if the update to the main table updates the key attribute of an item in the LSI, you pay 2 additional write units, and for non-key attributes, 1 additional write unit.
- DynamoDB is not designed to optimize indexing on set values. Below is a copy of the amazon's relevant documentation (from Improving Data Access with Secondary Indexes in DynamoDB).
The key schema for the index. Every attribute in the index key schema must be a top-level attribute of type String, Number, or Binary. Nested attributes and multi-valued sets are not allowed. Other requirements for the key schema depend on the type of index: For a global secondary index, the hash attribute can be any scalar table attribute. A range attribute is optional, and it too can be any scalar table attribute. For a local secondary index, the hash attribute must be the same as the table's hash attribute, and the range attribute must be a non-key table attribute.
- Amazon recommends creating a separate one-to-many table for these kind of problems. More info here : Use one to many tables
This is a really old post, sorry to revive it, but I'd take a look at "Single Table Design"
Basically, stop thinking about your data as structured data - embrace denormalization
id (Number - primary key ) title (String) created_at (Number - long) tags (StringSet - contains a set of tags say android, ios, etc.,)
Instead of a nosql table with a "header" of this:
id|title|created_at|tags
think of it like this:
pk|sk |data....
id|id |{title, created_at}
id|id+tag|{id, tag} <- create one record per tag
You can still return everything by querying for pk=id & sk begins with id and join the tags to the id records in your app logic
and you can use a GSI to project id|id+tag into tag|id which will still require you to write two queries against your data to get items of a given tag (get the ids then get the items), but you won't have to duplicate your data, you wont have to scan and you'll still be able to get your items in one query when your access pattern doesn't rely on tags.
FWIW I'd start by thinking about all of your access patterns, and from there think about how you can structure composite keys and/or GSIs
cheers
Dear r/aws,
Did everyone know that the DynamoDB can have 20 global secondary indexes now? Since when? I was checking the DynamoDB documentation today then I was like o_O??? OMG 20?!
The last time when I checked it was 5. Seriously, when did this happen?