Best answer by JGiffard
As Senior Product Manager for the Business Endpoint Product, I wanted to connect with the community about the service incident that took place on Tuesday December 12, 2018 and offer my apologies to anyone who had to manage through this short window of disruption.
Before going into some of the details, I want to quickly outline what happens behind the scenes when Webroot experiences an issue.
Our entire team takes any service impacts to customers very seriously. When an incident occurs, we immediately work through a pre-determined plan designed to bring the incident to the swiftest close, assess and understand any customer impact, and quickly communicate next steps. We also schedule a post-mortem review session, where we investigate root causes and incorporate lessons learned.
As the Product Manager for our Business Endpoint, I am held accountable for providing updates to customers and the team, improving future releases, and ensuring that these issues won’t happen again. Each internal team has a dedicated lead who is responsible for completing post-recovery actions. Additionally, all issues have executive visibility.
So, what happened in this case?
While we continue to investigate the unexpected infrastructure degradation, we know the following:
- At 15:06 UTC (approx.) on Tuesday December 12th, our infrastructure monitoring team noticed an unexpected volume of traffic.
- At 15:11 UTC, the infrastructure received a volume of ‘requests’ that exceeded safe levels and began rejecting these requests. At this time, some customers may have experienced service degradation, as devices running Webroot SecureAnywhere may not have been able to connect to the Webroot infrastructure. While protection was not compromised during this time, some machines might have seen excessive CPU usage.
- At 15:43 UTC, automated network service scaling activated, and at 16:00 UTC additional scaling was required. At 17:00 UTC the service had recovered to normal levels.
So why did it happen?
Our initial investigation indicates that the release of our latest Endpoint version, Win v18.104.22.168, combined with a recent Microsoft Update, may have resulted in this issue. A large number of customers simultaneously upgrading to our newest version, combined with many new files from the Microsoft Update, could have caused an excessive volume of calls to our cloud. We are continuing our investigation into other corollary causes.
What happens next?
We’re taking a number of preventative measures to avoid a future occurrence, including increasing infrastructure ‘headroom’ and adding steps/tranches in the software release process to reduce the number of machines updating at the same time.
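For readers who want a picture of what releasing in tranches can look like, here is a minimal Python sketch. It is purely illustrative and is not Webroot's actual release tooling: the function names `tranche_for` and `may_update`, the device ID format, and the choice of ten tranches are all assumptions made for the example. The idea is to hash each device ID into a stable bucket and only let a device update once its bucket has been opened.

```python
import hashlib


def tranche_for(device_id: str, num_tranches: int) -> int:
    """Map a device ID to a stable tranche index in [0, num_tranches).

    Hypothetical helper: hashing keeps the assignment deterministic,
    so a device always lands in the same tranche.
    """
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_tranches


def may_update(device_id: str, open_tranches: int, num_tranches: int = 10) -> bool:
    """Return True once the rollout has opened this device's tranche.

    open_tranches is how many tranches (counted from 0) are currently
    allowed to update; raising it step by step widens the rollout.
    """
    return tranche_for(device_id, num_tranches) < open_tranches


if __name__ == "__main__":
    # Illustrative device IDs only.
    devices = [f"device-{i}" for i in range(1000)]
    for opened in (1, 5, 10):
        eligible = sum(may_update(d, opened) for d in devices)
        print(f"{opened} tranche(s) open: {eligible} of {len(devices)} devices may update")
```

Opening tranches one at a time, and watching infrastructure load between steps, spreads the update traffic over a longer period instead of letting every machine call the cloud at once.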
As an active member of this community and forum, I encourage you to contact me by post or private message if you have concerns or if I can offer any further reassurance; I can also arrange to call you directly. Once again, I apologize for any inconvenience caused by this incident and hope that the information provided above outlines our commitment to you and your customers.
Jon.giffard – Senior PM, Business Endpoint