Resolved and monitoring Everleap Outage

Discussion in 'Outages and maintenance' started by Ray Huang, Oct 12, 2019 at 4:12 PM.

Thread Status:
Not open for further replies.
  1. Ray Huang

    Ray Huang Everleap staff

    We are aware that a number of our customers' sites are down, and we are working on a resolution. More information will be posted when it becomes available.
  2. What news have you? It's been 30 minutes.
  3. This is an ongoing issue with them. Every month they have issues and we are losing business. What is the time frame for a fix?
    andestransit likes this.
  4. It’s been almost 1 hr. No news?
  5. "I'm afraid I don't have any further updates. We are still working on it." That's what I got from a support ticket. Not sure why they can't communicate that here.
  6. Ray Huang

    Ray Huang Everleap staff

    We apologize for the lack of updates, but there is no new information at this time. We are still working on a resolution.
  7. It’s crazy. We will bill you for the business we lost in 1 hr, and these are your ongoing issues.
  8. Ray Huang

    Ray Huang Everleap staff

    We sincerely apologize for the unexpected downtime incurred, but it will be at least a couple of hours before we can reach a resolution. We understand how important your site is to you and are working as fast as we can to rectify the situation.

  9. seems to be down now. Your problem seems to be spreading? My own website just went down a few minutes ago.
  10. Wow, this is a long outage. :( Will be interesting to see the postmortem on this one.
  11. 3 hours and counting. Need an ETA.
  12. Ray Huang

    Ray Huang Everleap staff

    We are moving closer to a resolution now. Some of the sites on the shared servers have come back online, and we are working on the Reserved Servers now. Access to your sites may still be intermittent, but I will post again when everything is back up.
  13. Do we have any idea of a cause yet? I'm extremely tired of the lack of feedback when these things happen (more than once over the last couple of months).
    Shakil likes this.
  14. Ray Huang

    Ray Huang Everleap staff

    Some sites on the Reserved Servers are now coming back online, but it will be at least 1 hour before access to all of them is restored. The Control Panel is still offline but should be restored shortly after.
  15. Reserved sites are still down and have been for far too long now. Why can't there be an independent backup system where traffic could be redirected in such situations? This doesn't bode well for us considering there was just recently a significant outage.
    Shakil likes this.
  16. 5 hours and counting. This should not be happening. 503 response from the server. Definitely planning our move away from your service. Too slow, poor redundancy, and too high a cost for continued outages.
    Shakil and Kelly Strouse like this.
  17. Ray Huang

    Ray Huang Everleap staff

    Sorry that it's taking longer than expected. I've checked, and some sites are back online. But it will take some time for the rest to be reactivated. I don't have another ETA at the moment.
  18. What’s going on? It’s been 5 hrs. Your service is so bad, nobody replies to emails. I need my sites working now now now.

    Don’t know what you are doing for such a long time.
    The last couple of months you have had the same shit problem.
    Shakil likes this.
  19. Ray Huang

    Ray Huang Everleap staff

    The new ETA I have is about one hour. The servers are being brought online, but it does take some time for this to complete.
  20. Ray Huang

    Ray Huang Everleap staff

    Most (if not all) sites should be back online now. If your site is still experiencing problems, please open a support ticket.
  21. Eagerly looking forward to a report of what happened this time and what steps have been taken to prevent this in the future.
    Shakil likes this.
  22. Yep, we would like to know what happened and what measures will be taken to prevent similar incidents in the future.
  23. Ray Huang

    Ray Huang Everleap staff

    We want to thank all our customers for their patience and understanding during this outage. We won't be able to provide a post mortem right away, but we will provide one next week.
  24. Takeshi

    Takeshi Everleap staff

    Following up on what happened on October 13, 2019.

    A little before 4pm PDT on Oct 13, 2019, the Everleap platform went down. The hosting system was fully functional again at around 9pm.

    After our monitoring systems alerted us to an outage, it took some time to trace the issue. The issue originated in the storage system and cascaded into a global outage. (Note that this is a different storage issue than what occurred a month and a half ago.)

    In normal operation, the storage array has spare disks which are used to handle automated failover when the system detects a production drive’s performance starting to deteriorate. Such “predictive failures” are detected early on, and system recovery typically happens over a span of weeks. We’ve had such events happen several times without incident and with no customer impact.

    On Oct 13th, a “predictive failure” was detected and the system started copying files from the problem production drive to the spare disk system as normal. However, there was an abnormal hiccup in the middle of the file transfer that stopped the transfer process. This resulted in a cascade of failures that took down the service. This was the first time we experienced this issue.

    Once the problem was identified, we took the bad drives out and recovered the complete files from the RAID system onto the spare disks. Then we restarted the Azure Pack system, which took several hours to rebuild. Customer sites came back online in a staggered fashion, and all sites were back online at around 9pm.

    First, we are working with the storage vendor to get a better understanding of what may have caused the original hiccup that stopped the initial transfer process. Second, we are working with Microsoft to see if there are ways to reduce the time it takes to rebuild the Azure Pack system, as we have identified some areas of inefficiency within the rebuild process.
    Shakil likes this.