Solved

Webroot Business Endpoint Protection - Intermittent performance issues

  • 12 December 2018
  • 17 replies
  • 1025 views

Userlevel 7
Badge +36
Webroot is aware of an issue that may have impacted Webroot Business Endpoint Protection users for a short duration of time on Wednesday, December 12, 2018. As of 10am MST, Webroot took action to address the issue. We recommend you rescan the affected endpoints to resume normal performance. If you need additional assistance, please contact Webroot Business Support at https://www.webroot.com/us/en/business/support/contact.

Best answer by JGiffard 20 December 2018, 00:32

Hello,
 
As Senior Product Manager for the Business Endpoint Product, I wanted to connect with the community about the service incident that took place on Wednesday, December 12, 2018 and offer my apologies to anyone who had to manage through this short window of disruption.
 
Before going into some of the details, I want to quickly outline what happens behind the scenes when Webroot experiences an issue.
 
Our entire team takes any service impacts to customers very seriously. When an incident occurs, we immediately work through a pre-determined plan designed to bring the incident to the swiftest close, assess and understand any customer impact, and quickly communicate next steps. We also schedule a post-mortem review session, where we investigate root causes and incorporate lessons learned.
 
As the Product Manager for our Business Endpoint, I am held accountable for providing updates to customers and the team, improving future releases, and ensuring that these issues won’t happen again. Each internal team has a dedicated lead who is responsible for completing post-recovery actions. Additionally, all issues have executive visibility.

So, what happened in this case? 
While we continue to investigate the unexpected infrastructure degradation, we know the following:
  • At 15:06 UTC (approx.) on Wednesday, December 12th, our infrastructure monitoring team noticed an unexpected volume of traffic.
  • At 15:11 UTC, the infrastructure received a volume of ‘requests’ that exceeded safe levels and began rejecting these requests. At this time, some customers may have experienced service degradation, as their devices running Webroot SecureAnywhere may not have been able to connect to the Webroot infrastructure. While protection was not compromised during this time, some machines might have seen excessive CPU usage.
  • At 15:43 UTC, automated network service scaling activated, and at 16:00 UTC additional scaling was required. At 17:00 UTC the service had recovered to normal levels.
Once an impacted device running Webroot SecureAnywhere (WSA) was able to regain communication with the Webroot Cloud, either automatically or after a reboot, local monitoring (on the device) would have stopped, and device performance would have returned to normal levels. In a limited number of cases, a re-install of WSA would have been required. Throughout this time the Webroot support team was offering a workaround/solution to the incident.
 
So why did it happen?
Our initial investigation indicates that the release of our latest Endpoint version, Win v9.0.24.37, combined with a recent Microsoft Update, may have resulted in this issue. A large number of customers simultaneously upgrading to our newest version, combined with many new files from the Microsoft Update, could have caused an excessive volume of calls to our cloud. We are continuing our investigation into other corollary causes.
 
What happens next?
We’re taking a number of preventative measures to avoid a future occurrence, including increasing infrastructure ‘headroom’ and adding steps/tranches in the software release process to reduce the number of machines updating at the same time.
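
As an illustration only, and not a description of Webroot's actual release mechanism, the idea of release tranches can be sketched in a few lines of Python. Every name here (`tranche_for`, `eligible_for_update`, the `device-N` IDs, the choice of four tranches) is hypothetical. Each device is mapped to a stable tranche by hashing its ID, and a device may update only once its tranche has been released, so the population of updating machines grows in steps rather than all at once.

```python
import hashlib


def tranche_for(device_id: str, tranche_count: int) -> int:
    """Map a device ID to a stable tranche index in [0, tranche_count)."""
    digest = hashlib.sha256(device_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then reduce modulo the tranche count.
    return int.from_bytes(digest[:8], "big") % tranche_count


def eligible_for_update(device_id: str, tranche_count: int, tranches_released: int) -> bool:
    """A device may update only once its tranche has been released."""
    return tranche_for(device_id, tranche_count) < tranches_released


if __name__ == "__main__":
    # Hypothetical fleet of 1000 devices split into 4 tranches.
    devices = [f"device-{n}" for n in range(1000)]
    for released in range(5):
        count = sum(eligible_for_update(d, 4, released) for d in devices)
        print(f"tranches released: {released}, devices eligible: {count}")
```

Because the assignment depends only on the device ID, a device stays in the same tranche between check-ins, and releasing one more tranche only ever adds devices to the eligible set.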
 
As an active member of this community and forum, I am happy to be contacted if you have concerns or if I can offer any further reassurance, whether by post or private message, or I can arrange to call you directly. Once again, I apologize for any inconvenience caused by this incident and hope that the information provided above outlines our commitment to you and your customers.

Kind Regards
Jon.giffard – Senior PM, Business Endpoint

17 replies

Userlevel 4
Badge +16
Thanks for the information Jon, that helped allay our concerns significantly!
Userlevel 4
Badge +16
Some additional thoughts:
 
  • We aren't consumers, we're sysadmins. I'm not going to scream at you over the phone if something breaks. I just need to know as much as you can tell me as soon as possible.
      - What would really impress me is if you'd emailed GSM users (or at least admin users) with a brief heads-up as soon as you learned of the problem. Even if all you know is a symptom and that Webroot's causing it, that at least would let me notify our helpdesk team so they stop wasting time looking for other causes. Mistakes are a part of life: when a company screws up and quickly releases a transparent statement about what happened and the steps being taken to fix it, that improves my opinion of them. It gives me a warm fuzzy feeling about that company. My advice is, approach the next problem as an opportunity to demonstrate good response and communication practices.
  • I understand antivirus is hard - you're injecting hooks into thousands of processes you've never even heard of and trying to interpret their behavior based on heuristic analysis without negatively impacting them.  That's impossible.  Modern antivirus is going to cause some problems.  I expect it to cause some problems.  I'd be concerned if it didn't.
  • Shane Cooper's doing a great job in LabtechGeek Slack, and he's a large part of why I still support Webroot internally. But he shouldn't be your single point of contact with the outside world; that's a crazy way to run a business. Also, I shouldn't have to check LTG Slack hourly to find out about emerging issues.
  • Finally, what drives me nuts is, you've got a great product - we're thrilled with our infection stats since moving to Webroot - but it often feels like the company as a whole doesn't understand what an important part of our tech stack this software is and how serious it is when it causes problems. All the individuals I interact with are great, but the overall attitude towards various issues we've reported over the last 18 months has been distinctly cavalier. Congratulations on your continued growth, which is truly impressive, but I think you could do even better if you approached development, incident response, and communication with enterprise customers in a more operationally mature manner. Good luck, I'm rooting for you - I really don't want to deploy another antivirus solution right now! - and get us some real information about what happened as soon as you can.
Userlevel 1
Badge +2
I have to agree with Adam and Todd. More information is needed. Lucky for me, I only had two machines get the update. Unfortunately, one of those machines was a laptop I was using for a seminar, and I had to quickly reinstall Windows to get it functional again and get back to the seminar.
I still have clients asking me for an explanation of the events that took place last week. We're also experiencing additional issues with scripts and such across the endpoints that updated to this version. There are no issues for clients whose endpoints are on the previous version. It's causing a lot of unnecessary work for my employees, not to mention the client fallout from an incident over which we had no control.
 
The most unsettling part of this incident is that we just transitioned back to Webroot after being on another product for about 18 months. This isn't what we expected to be dealing with upon our return.
Userlevel 4
Badge +16
I'm aware our clients were protected.  Many of them also couldn't work.  This "slowdown" caused 100% work-stoppage for many of our users and cost us a large amount of troubleshooting manhours, and possibly even client goodwill.  

Please don't take this personally, but that kind of response doesn't inspire confidence.  From our perspective it feels like Webroot doesn't take this seriously and that our concerns are interpreted as an overreaction.  We're not overreacting.  It's our business and reputation, and we need to know what happened and what's being done to ensure it never happens again.  

No joke, this could be a dealbreaker.  You might lose our business over this.

You can make mistakes; it's how you handle them that matters to us.  When one of our vendors causes massive user-impact for our customers, we expect a more prompt and transparent response.
Userlevel 7
Badge +36
@ wrote:
When should we expect the full incident report?  My boss is very interested.  This issue flooded our helpdesk last Wednesday and wreaked havoc for many of our clients.
Everyone was protected during the slow down and is still protected currently. We’ve paused the rollout of the update and will resume once we’ve determined the root of the issue.
Userlevel 4
Badge +16
When should we expect the full incident report?  My boss is very interested.  This issue flooded our helpdesk last Wednesday and wreaked havoc for many of our clients.
Userlevel 7
Badge +55
@ wrote:
At this time, we have determined that improvements within our recent release of Webroot Business Endpoint Protection designed to enhance communication with our cloud infrastructure caused a high volume of traffic within a short period of time.  Our Engineering teams have adjusted capacity within our cloud infrastructure and feel further performance impact is unlikely to resurface. We’re continuing to monitor and will do so throughout the weekend.
So, from the sound of it, this was a back-end issue and not an issue with the updated client, correct?
 
Thanks,
Userlevel 7
Badge +36
At this time, we have determined that improvements within our recent release of Webroot Business Endpoint Protection designed to enhance communication with our cloud infrastructure caused a high volume of traffic within a short period of time.  Our Engineering teams have adjusted capacity within our cloud infrastructure and feel further performance impact is unlikely to resurface. We’re continuing to monitor and will do so throughout the weekend.
Userlevel 7
Badge +36
@ wrote:
@ wrote:
@ wrote:
@
 
What was the issue and the root cause? What was done to fix the problem so it won't happen going forward?
Our product team plans to deliver a full Incident Report by tomorrow, so I will be sure to update this thread at that time.
Has the report been released yet?
Not yet, I'll post it in this thread once it is.
Userlevel 7
Badge +55
@ wrote:
@ wrote:
@
 
What was the issue and the root cause? What was done to fix the problem so it won't happen going forward?
Our product team plans to deliver a full Incident Report by tomorrow, so I will be sure to update this thread at that time.
Has the report been released yet?
Userlevel 4
Badge +16
Ah. That rings a (non-seasonal) bell.
In this case, 'Intermittent performance issues' = brought Windows to a screeching halt. A number of my workstations experienced this yesterday.
Userlevel 4
Badge +16
Can we at least have some description of the symptoms and the start time? This is so vague it looks like it was written by legal covering their backsides - not particularly reassuring.
Userlevel 7
Badge +36
@ wrote:
@
 
What was the issue and the root cause? What was done to fix the problem so it won't happen going forward?
Our product team plans to deliver a full Incident Report by tomorrow, so I will be sure to update this thread at that time.
Userlevel 6
Badge +28
@
 
What was the issue and the root cause? What was done to fix the problem so it won't happen going forward?
