One would think that with all the investments in developing cloud-native
applications and microservices, all the steps IT has taken to modernize
applications for the cloud, and all the monitoring tools we’ve deployed
across our hybrid clouds, IT Ops and incident managers would see fewer
Priority 1 (P1) incidents and shorter resolution times.
Unfortunately, that’s not the case. In StarCIO’s recently published AIOps
Benchmark Report on how AIOps is the operating platform for digital
transformation, 25 percent of respondents said their P1 incidents usually
take over six hours to resolve.
Newsflash – Businesses Expect Higher Reliability and Performance
IT and business leaders should not accept P1 resolutions that take six hours
or longer.
A six-plus-hour P1 means downtime or poor performance in a customer or
employee experience. Long-running P1s can result in lost revenue, added
costs, lower customer satisfaction, frustrated employees, and burned-out IT
staffers. Facebook’s recent outage, for example, is estimated to have cost
the company over $100 million.
Today, business leaders have little tolerance for outages lasting six hours
or longer. Many IT systems run in the cloud, so there’s an expectation that
the infrastructure is always available and that IT deploys robust
applications that won’t go down. So one reason we have more P1s today is
that businesses have raised the bar to require better performance,
especially during peak business periods.
Let’s consider some business impacts and reasons to avoid incidents, poor
system performance, and outages:
- Retailers can’t have lengthy e-commerce system outages during holiday
  shopping periods
- Manufacturers can’t slow down and are under pressure to deliver more
  products faster
- Financial institutions are under significant pressure to improve customer
  experiences
- SaaS technology companies have customers that expect near-zero downtime
- Airlines and hospitality businesses can’t afford poor performance in
  mission-critical systems
- Online gaming companies will lose loyal fans if games are slow or have
  repeating failures
- All companies driving digital transformations can’t afford outages that
  disrupt innovation teams
Long P1s? AIOps Has the Answers
Respondents highlighted several factors contributing to longer resolutions
and explained why implementing an AIOps platform is key to their strategy
for reducing P1 incident mean time to resolution (MTTR). In our research, 93
percent of respondents have implemented AIOps or plan to soon, and MTTR was
one of the top KPIs identified for measurement and improvement.
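For teams that want to track this KPI themselves, here is a minimal Python sketch of how MTTR and the share of P1s exceeding six hours might be computed. The incident records and the function names (`mttr`, `share_over`) are made up for illustration and are not taken from the report or from any AIOps product.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolution across (opened, resolved) datetime pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def share_over(incidents, threshold):
    """Fraction of incidents whose resolution took longer than threshold."""
    slow = sum(1 for opened, resolved in incidents if resolved - opened > threshold)
    return slow / len(incidents)


# Hypothetical P1 records: (opened, resolved)
p1_incidents = [
    (datetime(2021, 11, 1, 9, 0), datetime(2021, 11, 1, 11, 0)),   # 2 hours
    (datetime(2021, 11, 3, 14, 0), datetime(2021, 11, 3, 21, 0)),  # 7 hours
    (datetime(2021, 11, 8, 2, 0), datetime(2021, 11, 8, 5, 0)),    # 3 hours
    (datetime(2021, 11, 9, 16, 0), datetime(2021, 11, 9, 20, 0)),  # 4 hours
]

print("MTTR:", mttr(p1_incidents))
print("Share over six hours:", share_over(p1_incidents, timedelta(hours=6)))
```

Tracking both numbers matters because an average can hide the long-running outliers that hurt the business most.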
Respondents shared many reasons why resolving P1s is harder today.
- Complexities in Supporting Hybrid Architectures require IT Ops teams to
  retain skills, tools, and procedures to support public cloud, data center,
  and edge computing infrastructures. Also, applications often include
  cloud-native architectures such as serverless and microservices, legacy
  enterprise systems, SaaS, low-code platforms, and the integrations
  connecting them. A P1 usually requires diagnosing performance across
  multiple systems and monitoring tools, which takes more people and more
  time to identify root causes. AIOps addresses these complexities by
  centralizing visibility across enterprise hybrid stacks.
- Fewer Skilled People to Resolve Major Incidents is the top concern of
  incident management and IT Ops teams, reported by over 50 percent of
  respondents. One problem is ensuring the knowledge transfer required to
  support legacy systems, then finding the more advanced cloud ops and SRE
  skill sets to support cloud-native architectures. AIOps addresses this gap
  by enabling remote IT operations and consolidating the IT Ops, NOC, and
  DevOps views needed to resolve incidents.
- Increased DevOps-Driven Deployment Frequencies help deliver new
  capabilities and defect fixes to end customers faster, but also increase
  the risk of introducing performance, reliability, and security issues.
  Respondents are automating CI/CD and infrastructure as code (IaC) to make
  changes more reliable, but they see investments in AIOps as the guardrails
  for digital transformation initiatives.
- Complexities in Resolving P1s with Hybrid Working Teams are a factor
  because of the added time needed to get everyone on bridge calls, Zoom,
  Microsoft Teams, or other collaboration tools. Then, outside the NOCs and
  war rooms, incident management teams need more time to discuss findings,
  agree on root causes, and define action plans. AIOps platforms that
  provide an open integration hub connect workflows across tools and promote
  the information sharing that hybrid working teams need.
- More Monitoring Tools and Events to Review increase the number of people
  involved in P1s and lengthen the time to review all the alerts. Incident
  management teams seek an AIOps platform with event correlation, a machine
  learning algorithm that connects events into manageable incidents, and a
  single pane of glass where everyone can review the time-sequenced
  monitoring and observability events.
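To make the idea of event correlation concrete, the Python sketch below groups alerts from different monitoring tools into incidents when they concern the same service and arrive close together in time. This is a simple rule-based stand-in under stated assumptions, not the machine learning that AIOps platforms apply; the `Alert` fields, the `correlate` function, the five-minute window, and the sample alerts are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    time: datetime
    service: str
    source: str   # monitoring tool that raised the alert
    message: str


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same service into time-ordered incidents.

    An alert joins the most recent incident for its service when it arrives
    within `window` of that incident's latest alert; otherwise it starts a
    new incident.
    """
    incidents = []
    latest = {}  # service -> index of its most recent incident
    for alert in sorted(alerts, key=lambda a: a.time):
        idx = latest.get(alert.service)
        if idx is not None and alert.time - incidents[idx][-1].time <= window:
            incidents[idx].append(alert)
        else:
            incidents.append([alert])
            latest[alert.service] = len(incidents) - 1
    return incidents


# Hypothetical alerts from several monitoring tools
t0 = datetime(2021, 11, 1, 9, 0)
alerts = [
    Alert(t0, "checkout", "apm", "latency above threshold"),
    Alert(t0 + timedelta(minutes=1), "checkout", "logs", "timeout errors"),
    Alert(t0 + timedelta(minutes=2), "search", "infra", "node CPU at 95%"),
    Alert(t0 + timedelta(minutes=3), "checkout", "synthetic", "probe failed"),
    Alert(t0 + timedelta(minutes=30), "checkout", "apm", "latency above threshold"),
]

for incident in correlate(alerts):
    print(incident[0].service, [a.source for a in incident])
```

In this sample, five alerts collapse into three incidents, and each incident keeps its alerts in time order, which is the kind of reduced, sequenced view a single pane of glass is meant to offer.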
The research identifies three primary AIOps capabilities that directly
address P1 incident detection, triage, and resolution time.
While automation is important, machine learning capabilities in event
correlation, enrichment, and triage are key for helping incident management
and IT Ops teams. These capabilities help Ops prioritize incidents, simplify
root cause analysis, reduce the number of people responding to P1s, and
provide the tools to resolve incidents quickly and accurately. In other
words, AIOps helps reduce the number and severity of P1s.
So while data processing, analytics, applications, automations, customer
experiences, and employee workflows are increasingly important to every
business, AIOps is the primary investment to ensure that performance and
reliability don’t fall behind business needs.
Read the AIOps Benchmark Report for more details!
This post is brought to you by BigPanda.
The views and opinions expressed herein are those of the author and do
not necessarily represent the views and opinions of BigPanda.