Resolved and monitoring System Issues

Discussion in 'Outages and maintenance' started by Martin Ortega, Jul 15, 2020.

Thread Status:
Not open for further replies.
  1. Martin Ortega

    Martin Ortega Everleap staff

    We're currently having issues with multiple servers within our network.

    Our system administrators are currently looking into the problem. Please monitor this forum post for updates.
     
  2. Martin Ortega

    Martin Ortega Everleap staff

    Our system administrators are still looking into the problem with our system. We'll reply once we have a new update.
     
  3. Martin Ortega

    Martin Ortega Everleap staff

    At this moment I haven't received new information, but our system admins are still working on the problem. We'll provide an update as soon as we receive one.
     
  4. Ray Huang

    Ray Huang Everleap staff

    We just sent out an email regarding the outage. I believe most of the servers are online now, but it may take a couple more hours to bring the rest back online.
     
  5. Martin Ortega

    Martin Ortega Everleap staff

    Hello.

    Today, July 15, 2020, we experienced a global outage at around 7 pm PST. The majority of servers were back online around 9 pm, and the rest came back online in a staggered fashion over the next 1.5 hours.

    At around 6:30 pm our team was notified about some servers having stability issues. Over the next hour, the hosting system servers experienced a cascading failure.

    The issue appears to stem from bugs within the Windows Azure Pack system, but we cannot pinpoint the exact root cause at this time. We are reaching out to Microsoft for help determining the root cause. Once we learn more, we will post updates here in our forum.

    We understand that outages affect your business and we apologize. Our team will continue to work to determine the root cause of the outage and we will work to prevent such outages from happening again.
     
  6. Martin Ortega

    Martin Ortega Everleap staff

    All of our shared, Medium, and Large reserve servers are up and running.

    We're still waiting for our Small reserve to come online.
     
  7. Martin Ortega

    Martin Ortega Everleap staff

    Our control panel should now be up and running.
     
  8. Martin Ortega

    Martin Ortega Everleap staff

    Small reserve servers should also be back online.
     
  9. Martin Ortega

    Martin Ortega Everleap staff

    All sites should now be up and running. We will continue to monitor our systems.
     
  10. dmitri

    dmitri Everleap staff

    All servers / sites should be operational now. We are continuing to monitor the system to make sure the issue is permanently fixed.
     
  11. Takeshi

    Takeshi Everleap staff

    July 15, 2020 Outage Postmortem

    On July 15, 2020, at around 7:00 pm PST, one of the Everleap Shared Cloud VM Host Machines began experiencing latency. The Shared Cloud system’s controller service started marking the virtual machines on the host as unhealthy and started repairing all the servers on the host. As a result, the host machine's performance degraded and the system went into a repair cycle that never completed.

    Over time, the controller service became overloaded and started marking other, healthy nodes as unhealthy and issuing repair requests for them. This led to a cascade of server latency and failure.

    It took us some time to identify the problem, and we had to manually repair each node, so servers and sites came back online in a staggered fashion. The system was fully back online at around 10 pm.

    We consulted with Microsoft regarding this issue. We were informed that this was a unique situation because the failing web servers were not actually down but just slow to repair. Microsoft support provided guidance on how to mitigate this if it happens again in the future.
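
    For readers curious about the failure pattern described above (a controller that keeps issuing repairs until the repairs themselves become the load), here is a minimal Python sketch of one generic safeguard: capping how many nodes may be under automatic repair at once. This is only an illustration. It is not Windows Azure Pack code and it is not the guidance Microsoft provided; the `RepairGuard` class, its method names, and the 25% cap are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class RepairGuard:
    """Hypothetical guard that caps concurrent automatic repairs.

    Illustrative only: the names and the default cap are assumptions,
    not part of any real controller service.
    """
    total_nodes: int
    max_repair_fraction: float = 0.25  # assumed cap, chosen arbitrarily
    repairing: set = field(default_factory=set)

    def limit(self) -> int:
        # Always allow at least one repair, even on very small clusters.
        return max(1, int(self.total_nodes * self.max_repair_fraction))

    def request_repair(self, node: str) -> str:
        """Decide what to do with a repair request for `node`."""
        if node in self.repairing:
            # Do not stack a second repair on a node that is still repairing.
            return "in_progress"
        if len(self.repairing) >= self.limit():
            # Too many repairs running: stop and let an operator look first.
            return "deferred"
        self.repairing.add(node)
        return "started"

    def repair_finished(self, node: str) -> None:
        """Free up a repair slot once a node has finished repairing."""
        self.repairing.discard(node)
```

    The idea is that when many nodes look unhealthy at the same moment, the monitoring itself may be what is wrong, so deferring further repairs keeps a slow host from turning into a cluster-wide repair storm.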
     