AWS outages analysis: Debunking 3 myths and revealing the least reliable region
Does anybody know the incidence of AWS region-wide failures?
Make some assumptions.
- Region-wide outages are unusual.
- Because they're unusual, they will take AWS a long time to recover from. Assume 12 hours.
- Add the time it will take your operations team to validate that everything is working again. Depending on your level of automation, that may be negligible.
- Work out the business impact of your site being down for X hours once a year. (Note that this is NOT how much revenue your site makes in an hour, because some (but not all) of those customers will come back. The exception is a large company, where you can use the numbers the business teams throw around during an outage in your business case.) For the worst case, assume it's all during peak times, but depending on your business and how global vs. local it is, you may be able to bring that down.
- Work out how much time and effort it would take to go multi-region. Spread the one-off costs over, say, 3 years (your company may have a policy).
- Don't forget to include the ongoing costs: extra testing, extra AWS costs, operational complexity, and most importantly the extra outages you will have due to more complex edge cases. These costs recur every year.
- Work out how many regional outages per year you would need for this to possibly be worth it, and then see if you think that's likely. Don't forget that if you depend on other systems in your local DC, large-scale weather outages will affect those too.
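If it helps, the arithmetic in the steps above can be roughed out in a few lines of Python. This is only a sketch: the class name, field names, and every dollar figure are made-up placeholders, and it assumes (optimistically) that being multi-region avoids the whole outage loss. Only the 12-hour recovery time and the 3-year spread come from the list above.

```python
from dataclasses import dataclass


@dataclass
class MultiRegionCase:
    """Hypothetical inputs for a multi-region business case."""
    outage_hours: float           # AWS recovery time per regional outage (12 assumed above)
    validation_hours: float       # time for ops to confirm everything works again
    loss_per_hour: float          # net business impact per hour down, NOT gross revenue
    one_off_cost: float           # time and effort to become multi-region
    amortization_years: float     # spread of the one-off cost (3 suggested above)
    ongoing_cost_per_year: float  # extra testing, AWS spend, complexity, extra outages

    def cost_per_outage(self) -> float:
        # Total downtime is AWS recovery plus your own validation time.
        return (self.outage_hours + self.validation_hours) * self.loss_per_hour

    def annual_multi_region_cost(self) -> float:
        # One-off cost spread over the amortization period, plus yearly costs.
        return self.one_off_cost / self.amortization_years + self.ongoing_cost_per_year

    def break_even_outages_per_year(self) -> float:
        # Regional outages per year needed before multi-region pays for itself.
        return self.annual_multi_region_cost() / self.cost_per_outage()


# Placeholder numbers purely for illustration.
example = MultiRegionCase(
    outage_hours=12,
    validation_hours=1,
    loss_per_hour=10_000,
    one_off_cost=300_000,
    amortization_years=3,
    ongoing_cost_per_year=160_000,
)
print(example.break_even_outages_per_year())  # 2.0
```

With these placeholder inputs you would need two region-wide outages every year just to break even, which is the number to compare against how often you think they really happen.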
Also, work out if you could do better with the money. Maybe instead of cross-region HA, just do DR? Keep backups/replicas/puppet masters/etc. in the other region. To test the DR, every 6 months change which region is active and keep it that way for the next 6 months...