DB outage

Discussion in 'General Discussion' started by tphilipp, Sep 15, 2021.

Thread Status:
Not open for further replies.
  1. tphilipp

    tphilipp SotA Dev Moderator SOTA Developer

    So after yesterday's DDoS, we later also suffered a database outage, "thanks" to Amazon and not because of the attack. Great timing...
    I'll write a more detailed report once the dust settles a bit. In short:

    - there are a few hours of data loss, so if a forum post of yours is missing, please post it again
    - we will recover any missing purchases; that info is not lost, please be patient
    - the data loss does *not* include anything done in-game
    - contact support if you think we missed something

    Sorry for the problems; more will be shared on this soon. Thanks for your understanding and patience.
     
  2. tphilipp

    tphilipp SotA Dev Moderator SOTA Developer

    Alright, now that the dust has settled a bit, here's the more detailed report, as announced:

    Overview

    A server instance we ran at AWS disappeared without a trace at precisely 2AM UTC (to the second) on Sep 15. We don't know why: there are zero events logged on their side, not even a note about the incident. The only "log" from Amazon in which the outage is clearly visible is, basically, the invoice we receive for their services.
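
    For the technically curious, here is a minimal sketch of one way to look for such events: querying CloudTrail via boto3 for any API activity that touched the instance around that time. This is an illustration rather than our exact tooling, and the instance id and region are placeholders.

        # Minimal sketch: look for CloudTrail events touching the vanished instance
        # around 2AM UTC on Sep 15 (instance id and region are placeholders).
        from datetime import datetime, timezone
        import boto3

        INSTANCE_ID = "i-0123456789abcdef0"  # placeholder, not the real instance

        cloudtrail = boto3.client("cloudtrail", region_name="eu-central-1")
        resp = cloudtrail.lookup_events(
            LookupAttributes=[
                {"AttributeKey": "ResourceName", "AttributeValue": INSTANCE_ID},
            ],
            StartTime=datetime(2021, 9, 15, 0, 0, tzinfo=timezone.utc),
            EndTime=datetime(2021, 9, 15, 4, 0, tzinfo=timezone.utc),
        )

        for event in resp["Events"]:
            print(event["EventTime"], event["EventName"], event.get("Username", "-"))

        # An empty result means there is no API-level record of anyone (or anything)
        # stopping or terminating the instance in that window.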

    According to my notes, this is the 6th time over the years of SotA development that an instance has simply disappeared like this. Sometimes we got an incident report from AWS, sometimes not; in this case we didn't. When it happened in the past, it affected either development instances, parts of our cluster that weren't holding any state, or machines that were redundant. Just for reference, the same thing happened on July 18, when a load balancer simply disappeared, which also caused an outage, though with less collateral damage. In that case Amazon did communicate an incident report afterwards.

    Given the exact time of 2AM UTC and the fact that the instance is part of a cluster hosted in the EU, this hints at some nightly maintenance at AWS gone awry, as that kind of work is usually done at night when traffic is low.

    Unfortunately, some internal confusion about how and whom to notify, the fact that we are a small team, and the fact that I, for example, was sound asleep (I'm in Europe) led to a longer downtime than necessary. This in turn led us to revise our protocol for handling emergency situations like this one, so that we are better prepared in the future.
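
    To give an idea of what that boils down to in practice, here is a rough sketch of the kind of simple external watchdog such a protocol relies on: poll the site from outside and page an on-call channel via a webhook when it stops answering. The webhook URL below is a placeholder, not our actual setup.

        # Rough sketch of an external watchdog: poll the website and notify an
        # on-call webhook when it stops answering. The webhook URL is a placeholder.
        import time
        import requests

        SITE_URL = "https://www.shroudoftheavatar.com/"
        ALERT_WEBHOOK = "https://example.com/oncall-webhook"  # placeholder

        def site_is_up() -> bool:
            try:
                return requests.get(SITE_URL, timeout=10).status_code < 500
            except requests.RequestException:
                return False

        while True:
            if not site_is_up():
                # Most chat or paging services accept a simple JSON payload like
                # this on an incoming webhook.
                requests.post(ALERT_WEBHOOK, json={"text": "Website health check failed"}, timeout=10)
            time.sleep(60)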

    Anyway, the site was finally put into maintenance mode at 7:45AM UTC and data recovery was started. Access was restored at 9:00AM UTC, after the data had been recovered and a series of health checks had confirmed that everything was alright, so we had a total of 7h of service interruption.
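
    As a rough illustration of what those health checks look like, a simplified version is sketched below; the database driver, host, credentials, and table names are assumptions made for the example, not our actual schema.

        # Simplified post-recovery health check (driver, credentials, and table
        # names are assumptions for the example, not the real setup).
        import sys
        import pymysql
        import requests

        problems = []

        # 1. The restored database answers and the core tables are non-empty.
        conn = pymysql.connect(host="db.internal", user="healthcheck",
                               password="...", database="website")
        with conn.cursor() as cur:
            for table in ("accounts", "forum_posts", "transactions"):
                cur.execute(f"SELECT COUNT(*) FROM {table}")
                if cur.fetchone()[0] == 0:
                    problems.append(f"table {table} is empty after restore")

        # 2. The site itself responds without server errors.
        if requests.get("https://www.shroudoftheavatar.com/", timeout=10).status_code >= 500:
            problems.append("site returned a server error")

        if problems:
            print("NOT OK:", "; ".join(problems))
            sys.exit(1)
        print("all checks passed")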

    Note: this did not affect the game itself, which continued to run fine; however, it did affect new game logins. People who were already playing were able to continue.

    What was lost (or delayed)?

    • up to 24h of account- and website-related data was lost; this includes:
      • forum posts (unsure how many)
      • comments (few or none)
      • media uploads for 3 users (e.g. avatar changes)
      • one profile edit for one user
      • some map edits
    so if you made any edits or posts shortly before the outage, please do those again; sorry for any inconvenience caused by this
    • no transaction data or purchases were lost; however, in some cases recovery took a while, and some purchases might not have shown up until up to 18h after service was restored; also, subscription payments around that time were delayed by a day

    What was gained?

    Well, ironically there was also a gain: people who purchased items in the few hours before the outage, and also claimed them successfully in-game before the outage, might now actually see those items delivered again. Enjoy!

    Going forward

    Probably the most important point: please contact support at support@portalarium.com if you think we missed something, if something doesn't work the way it should, etc. We will get it sorted out for you.

    About AWS: given that these issues with AWS are not new, that we have instances disappear nearly once a year, and that, in our experience, support requests from a small client like us are usually met with blanket responses or none at all, we are certainly thinking about moving to a more predictable environment. In other words, we are a bit fed up with AWS. Of course, any such move needs careful planning first, and it only makes sense if we can be confident it would improve things: ideally it would also cut costs and give us more flexibility, more direct control over instances, and better reachability of support.

    Again, sorry for the inconvenience caused by this, and thank you for your understanding and patience.
     