1. Here you will find official announcements and updates. These announcements are also linked in the Official SotA Discord server.
    We encourage comments from the community! To keep the announcements official, we ask that comment threads be created in the General forums for player input.

    Thanks!

Austin DataCenter outage (May 3 2024)

Discussion in 'Announcements' started by Ravalox, May 3, 2024.

  1. Ravalox

    Ravalox Chief Cook and Bottle Washer Moderator SOTA Developer

    Messages:
    1,757
    Likes Received:
    5,064
    Trophy Points:
    125
    Gender:
    Male
    Location:
    Dallas, TX
    We lost connectivity to the Austin data center at about 1:38 AM.

    The data center said they didn't see an issue, so I spoke with Cogent Communications (the backbone provider for our internet), and they advised that the router circuit to the data center is down. I called the data center back; they are checking with their NOC and will call Cogent back to pursue a case. (I can't open a case with Cogent since we are not their customer.) [I was using my backdoors to get the facts.]

    We now have a case with the data center and I am awaiting a call back with an update.
     
  2. Ravalox

    I reached out to the data center. Cogent has dispatched a technician to the site; we do not have an ETA at this time. Will update again as soon as we get more info. Atos (Chris) and Undone (Bobby) are on standby to go to the data center.
     
  3. Ravalox

    Service was restored at 6:54 AM CT. I have spoken with the Austin data center and requested an official RCA (Root Cause Analysis) for the outage, along with advice on what can/will be done to increase the resiliency of the Cogent PoP (Point of Presence) and the monitoring of the handoff.

    I expect to get an update from the data center's executive by the end of the day. Once I have details, I will share them here.
     
  4. Ravalox

    I have a meeting scheduled with the data center execs for Monday afternoon. I will be able to post the RCA information after that.
     
  5. Ravalox

    Update:

    Had a meeting with the data center folks Monday afternoon. We discussed a number of topics, including what options they have for us to reduce exposure to this kind of outage.

    The RCA offered by the ISP indicates that a third party was responsible for taking the circuit down (between Houston and Austin). This does not explain how a Tier 1 ISP could have a complete failure at the routing level in their network. Since I am in Dallas, my trace (during the outage) showed that I left my local ISP, entered their network in Dallas, and then successfully connected to Houston. From there, the circuit died. They have routers in Dallas, Houston and Austin, so my outstanding question to them is why full mesh routing was not invoked. (Meaning that if the router in Houston reported back to Dallas that there was no route available, then the Dallas router should have chosen the next route in its list to try to connect to Austin.)

    I find it highly unlikely that this Tier 1 ISP (backbone level ISP) is serially connecting from Dallas to Houston and then to Austin. The route from Dallas to Austin may be more expensive (meaning more router connections in the middle - which would slow traffic down), but when there is no path, the router should have chosen the next available link instead of just allowing the traffic to die.
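The failover behavior described above can be sketched in a few lines of Python. This is only an illustration under assumed values: the `Route` class, the `pick_route` function, and the cost numbers are hypothetical, and nothing here reflects Cogent's actual topology or routing protocol. It simply shows a router preferring the cheapest path and falling back to the next one when a link on the preferred path is down.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Route:
    """A candidate path between routers (hypothetical model, not Cogent's)."""
    hops: Tuple[str, ...]  # ordered router names, e.g. ("Dallas", "Austin")
    cost: int              # lower is preferred; values here are made up


def links(route: Route) -> list:
    """Return the links a route traverses as unordered router pairs."""
    return [frozenset(pair) for pair in zip(route.hops, route.hops[1:])]


def pick_route(routes: Iterable[Route], down_links: Set[frozenset]) -> Optional[Route]:
    """Pick the cheapest route that avoids every link known to be down.

    Returns None when no usable route remains, which is the only case
    in which traffic should actually be dropped.
    """
    usable = [r for r in routes if not any(l in down_links for l in links(r))]
    return min(usable, key=lambda r: r.cost, default=None)


# Assumed candidate routes from Dallas to Austin (illustrative costs).
ROUTES = [
    Route(hops=("Dallas", "Houston", "Austin"), cost=10),  # preferred path
    Route(hops=("Dallas", "Austin"), cost=20),             # more expensive alternative
]

# The Houston-Austin circuit is reported down, as in the outage.
down = {frozenset(("Houston", "Austin"))}
print(pick_route(ROUTES, down).hops)  # ('Dallas', 'Austin')
```

In this model traffic only dies when `pick_route` returns `None`, that is, when every candidate path crosses a failed link.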

    1. I have asked the data center to go back to Cogent and ask about this.
    2. I have asked the data center to quote pricing to change our configuration and use a set of blended ISPs.

    Since we are not directly a client of Cogent, I can only push so far to get an answer on the above scenario. I will follow up with the data center; however, I feel that we will likely change ISPs over this issue. I will update everyone if/when a change will take place, as it will require downtime to adjust our routers and physically change the fiber connections.