Everleap Outage

Discussion in 'Questions about Everleap' started by Al Rony, Oct 14, 2019.

  1. If all of the features below are properly implemented, how did the system fail for 5 hours? Each cloud server VM should be on a different physical (hardware) server. The "abnormal hiccup in the middle of a file transfer" you mention makes no sense to me.

    How Everleap works
    How multiple Cloud Servers are utilized
    The load balancers constantly adjust traffic to balance it between all of the Cloud Servers available for your site. If a Cloud Server is ever down for any reason, the load balancer immediately routes all traffic to healthy Cloud Servers. This prevents most causes of web server downtime.

    Every site on Everleap has access to two Cloud Servers
    You get the benefits of multiple Cloud Servers for every site you host with us at no extra charge. If you run a site that would benefit from more than two Cloud Servers, they're available to add instantly.

    Even if your site only runs on a single Cloud Server, it's still redundant
    Your site can be instantly replicated on another server at any time. So even if you only run a single Cloud Server, in the event that anything goes wrong with that server, your site spins up on a new server and traffic is routed to the healthy server. All of that happens quickly and automatically.
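The health-check failover described in the quoted copy can be sketched roughly as follows. This is a simplified illustration only - the class names and round-robin health-check behavior are hypothetical, not Everleap's actual load balancer:

```python
# Simplified sketch of health-check-based failover routing.
# Not Everleap's real implementation; names and checks are hypothetical.

class CloudServer:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def health_check(self):
        return self.healthy


class LoadBalancer:
    def __init__(self, servers):
        self.servers = servers
        self._next = 0

    def route(self):
        """Round-robin across servers that currently pass a health check."""
        healthy = [s for s in self.servers if s.health_check()]
        if not healthy:
            raise RuntimeError("no healthy Cloud Servers available")
        server = healthy[self._next % len(healthy)]
        self._next += 1
        return server


servers = [CloudServer("cs1"), CloudServer("cs2")]
lb = LoadBalancer(servers)

servers[0].healthy = False          # cs1 goes down...
assert lb.route().name == "cs2"     # ...all traffic is routed to cs2
```

The point of the sketch is that the router consults health state on every request, so an unhealthy server drops out of rotation immediately rather than after some batch reconfiguration.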
  2. Takeshi

    Takeshi Everleap staff

    VM stands for virtual machine and doesn't mean dedicated server. You can have multiple VMs on one physical server. We host many VMs on a cluster of web servers. The VMs are load balanced and redundant as described above. This system was not the origin of the issue we experienced.

    In order to make this redundant web server system work, your content cannot permanently reside on the VM itself. If your content lived on the VM's physical server, we could not automatically spin up your site on another VM, or on multiple VMs simultaneously. Theoretically, the only way that would be possible would be to copy all of your content to every VM, but it is not practical to copy all of our customers' content to all VMs. On top of that, the copies would still fall out of sync as new content is written to the site, since you would somehow have to propagate the new content to every other VM in the cluster.

    Therefore, all customer content is stored in a storage system behind the web servers. That way, if one web server fails, another web server can call up your site content from the storage system - and multiple web servers can call up your content and serve it simultaneously.
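The design Takeshi describes - content in shared storage rather than on each VM - can be sketched like this. Purely illustrative; the names are made up:

```python
# Sketch: content lives in a shared backing store, not on the web server VM,
# so any VM can serve any site with no per-VM copy to keep in sync.
# Illustrative only; names are hypothetical.

shared_storage = {"example.com/index.html": "<h1>Hello</h1>"}


class WebServer:
    def __init__(self, name, storage):
        self.name = name
        self.storage = storage   # every VM points at the same backing store

    def serve(self, path):
        return self.storage[path]


vm1 = WebServer("vm1", shared_storage)
vm2 = WebServer("vm2", shared_storage)

# Both VMs serve identical content; a write lands once, in shared storage.
assert vm1.serve("example.com/index.html") == vm2.serve("example.com/index.html")
```

This is why a webserver VM can fail or be replaced without losing anything - but it also makes the shared storage layer the critical dependency the rest of the thread is about.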

    The storage system itself is also resilient. As with any hardware, hard drives do fail sometimes, so the storage system we use has resiliency built in. The storage array has spare disks that are used for automated failover when the system detects a production drive's performance starting to deteriorate. Such "predictive failures" are detected early, because a hard drive typically doesn't die all of a sudden - you see its performance degrade over a span of weeks before it fails, which leaves plenty of warning and time to replace the bad drive. We've had such predictive failure events happen many times in our storage system without incident and with no customer impact.

    During this particular incident there was a normal predictive failure response by the storage system. Again, when the storage system sees a hard drive start to deteriorate, it triggers an automated copy to a spare disk. It was during this copy that the transfer was interrupted - the first time we've seen that happen. That hiccup led to the cascade that took the system down.
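The failure mode described here - a rebuild onto a hot spare interrupted partway through - can be sketched as follows. This is a toy model, not the storage vendor's actual firmware, and the function and parameter names are hypothetical:

```python
# Toy sketch of a predictive-failure rebuild onto a hot spare.
# The incident above was this copy step being interrupted mid-transfer.
# Illustrative only; not the actual storage system's logic.

def rebuild_to_spare(degraded_blocks, spare, interrupt_at=None):
    """Copy blocks from a degrading drive to a spare disk.

    Returns True when the spare is a full replica, False if the
    copy was interrupted partway (leaving the array degraded).
    """
    for i, block in enumerate(degraded_blocks):
        if interrupt_at is not None and i == interrupt_at:
            return False          # copy interrupted mid-transfer
        spare.append(block)
    return True                   # spare now holds a complete copy

spare = []
ok = rebuild_to_spare(["b0", "b1", "b2", "b3"], spare, interrupt_at=2)
assert not ok and spare == ["b0", "b1"]   # partial copy: array left degraded
```

In the normal case the rebuild completes long before the degrading drive dies; the unusual event was the copy itself failing, which left neither the old drive nor the spare in a fully trusted state.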
  3. This is the second storage device failure in about as many months. Is there something inherent in the architecture that does not permit the storage device to be duplicated in real-time, allowing the backup storage device to become the primary in such situations?

    What if the drives were SSD rather than spinning platters?

    The time to recover from a storage device failure is simply devastating for our clients.

    Everleap simply cannot rely on the storage device itself--there needs to be a hot backup.
  4. Takeshi

    Takeshi Everleap staff

    We understand that outages are bad for all of our customers and their clients - just as it is bad for our own business which is entirely online as well. We do our best to keep the hosting system up and running. And we continually work on improvements. We try to learn from every outage.

    Yes, we've had two different issues hit the storage system within a two-month period, which sucks. The storage system itself is resilient, and it has recovered from degraded hard drives several times in the past without any customer impact. In this particular situation, a hiccup interrupted a normal recovery process.

    The vast majority of sites we host are dynamic, and we are also adding new customer sites and their content to our system all the time. So our system addresses the nature of what we are hosting - content that is constantly changing - in the most cost-effective way.

    We agree that recovering from a storage issue takes a long time, and we are (and have been) working on improving this. A few aspects take time: identifying the issue at the onset of an outage, getting the storage system back up, and rebuilding the Azure system. We are working on all of these.

    If the shared cloud hosting platform is not working for your mission-critical sites, we also offer Managed Services, where you would be on your own custom Private Cloud that is not shared with other customers. We can help design and set up a custom failover system for your sites. If you are interested in this, contact our Technical Support team and they will direct your inquiry to staff who can discuss your needs.
  5. I actually chose Everleap because it is a premium partner of nopCommerce, and I presented it to my board members accordingly as the host for our eCommerce site over competing hosting providers. Now I look really stupid to the board because of the 5-hour downtime :(. I would appreciate it if these sorts of issues could be addressed more quickly, and if the failure modes could somehow be addressed in the architecture. Anyway, thanks for your clear response.