Today our API and web UI were down for close to 15 minutes.
The root cause was a long-running database update that was required as part of an upgrade. Our estimate of how long the update would take was off by an order of magnitude: it took close to 15 minutes instead of a little over a minute, and that translated into downtime for the API and UI.
Database updates of this sort are something we do rarely. Nevertheless, we have identified ways to test similar updates more extensively in the future. This would allow us to plan alternative, incremental ways to update the database and schedule them at off-peak times.