As it continues efforts to recover from a weeklong cloud outage for hundreds of customers, Atlassian has been roundly criticized by IT pros online for its response so far.
Atlassian CTO Sri Viswanath published a detailed analysis Tuesday of the incident to date, more than a week after the downtime began for 400 customers. The outage has affected nearly all of the company’s core cloud services, including Jira Software issue tracking, Jira Service Management ITSM, Jira Work Management, Confluence documentation, Opsgenie incident response and the Access single sign-on tool.
Viswanath’s post revealed the outage actually began Monday, April 4, at 8:12 p.m. Coordinated Universal Time (UTC), though the company’s status page was first publicly updated April 5 at 9:03 a.m. UTC, saying the company was investigating the issue.
The incident began after the company integrated what had been a standalone asset management product into Jira Software and Jira Service Management, then began deactivating the standalone app.
“There was a communication gap between the team that requested the deactivation and the team that ran the deactivation,” Viswanath wrote. “Instead of providing the IDs of the intended app being marked for deactivation, the team provided the IDs of the entire cloud site where the apps were to be deactivated.”
Worse, the script used for deactivation was faulty and resulted in the permanent deletion of data for 400 customers. That data must now be extracted and restored manually from backups.
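The failure mode Viswanath describes, a batch of site IDs passed where app IDs were expected and then permanently deleted, is the kind of mistake a typed ID check and a recoverable "soft" delete are meant to catch. The sketch below is purely illustrative and is not Atlassian's script: `ResourceId`, `Kind`, `Registry` and `deactivate_apps` are hypothetical names, and the in-memory sets stand in for whatever store actually tracks apps and sites.

```python
from dataclasses import dataclass, field
from enum import Enum


class Kind(Enum):
    APP = "app"
    SITE = "site"


@dataclass(frozen=True)
class ResourceId:
    # Hypothetical typed ID: the kind travels with the value,
    # so a site ID cannot silently pass as an app ID.
    kind: Kind
    value: str


@dataclass
class Registry:
    # Hypothetical in-memory stand-in for the real resource store.
    active: set = field(default_factory=set)
    soft_deleted: set = field(default_factory=set)

    def deactivate_apps(self, ids, dry_run=True):
        """Soft-delete app IDs; reject the whole batch if any ID is not an app."""
        wrong = [i for i in ids if i.kind is not Kind.APP]
        if wrong:
            raise ValueError(f"refusing batch: {len(wrong)} non-app ID(s) supplied")
        targets = [i for i in ids if i in self.active]
        if not dry_run:
            for i in targets:
                self.active.remove(i)
                self.soft_deleted.add(i)  # recoverable, not permanently deleted
        return targets
```

With `dry_run=True` as the default, the function only reports what it would deactivate, so the requesting team and the team running the script can compare the target list before anything changes.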
Atlassian’s public statements, including updates to its status page, have apologized for the length and severity of the outage. Viswanath’s post said the company’s response time was “not up to our standard.” But for some prospective Atlassian cloud customers, the post comes as too little, too late.
“Of course, there’s a balance; you don’t want to share too much, but a week into a major outage is ridiculous,” said Brent Checketts, head of platform engineering for a stealth startup currently evaluating DevOps products, including services from Atlassian. Checketts said the handling of this incident has his company leaning toward competitors such as the Linear issue tracking tool and Shortcut project management.
“Until [Viswanath] shared the update last night, they were holding back info,” Checketts added. “You hold back info when you think you’re so secure in your market position that you don’t have to be transparent — I think that’s arrogance, or a lack of humility.”
Affected customers will not get a bill, per online Q&A
Viswanath’s post and others from the company acknowledged that public updates about the outage haven’t been up to snuff, but “until now, we were entirely focused on reaching our impacted customers directly,” he wrote.
However, one affected customer said this week that hadn’t been the case at his organization.
“I would have liked more information sooner, and more transparency about what happened and exactly what’s being done to restore service,” the customer said. “The repetitive and very detail-light updates haven’t been encouraging.”
The customer said he had not yet heard from Atlassian about compensation for violating its customer service-level agreement (SLA). In response to a SearchITOperations inquiry into SLA payouts, Atlassian head of product communications Arseny Tseytlin pointed to a post in a Q&A thread on the company’s feedback forum. There, Leslie Lee, senior director of customer engagement at Atlassian, made the following statement about SLAs:
Our top priority right now is recovering our customers’ sites to full functionality and as we do so, affected customers will not receive a bill from us in the short term. Following our efforts to restore, we will reach out to each of our affected customers to discuss how we can make things right in the long term. Post incident, we will also be conducting a detailed review of all of our processes in a complete post incident review (PIR) with an overview made available to all affected customers. This will help ensure we’re delivering the service and standard you’d expect from us.
That update didn’t receive any direct responses, but other customers and community leaders didn’t hold back criticisms of how the company has handled the incident otherwise.
“Thank you to Atlassian for finally releasing clear, specific information on the causes of this problem, and a realistic estimate of when this will all be behind us,” wrote one poster under the name Shane Doerksen. “If this had been done in the first 36 hours or so, when it became apparent that this was going to be a prolonged outage, it would have done wonders to reduce the rage and frustration that has built up in the community over the last week.”
Another poster identified as a community leader said Atlassian’s poor communication put him and other community leaders in a difficult position in the early days of the outage.
“As the Community is the only support for free tiers, and often the first or second place paying customers go for support, you could and should have properly redirected several irate customers to the Statuspage,” wrote Darryl Lee, whose Atlassian profile identifies him as senior systems engineer at TV streaming device maker Roku, on the Q&A thread (emphasis in the original). “Instead, Community Leaders and your own staff were stuck running PR for you. This is not right.”
We’re one of the affected customers. It’s an absolute disaster for us! We depend on Jira for all our customer and partner communication. Now we’re back to MS Outlook and employee-colors in mailboxes to handle this
— Sjoerd Bakker (@nlsjoerd)
April 12, 2022
Atlassian cloud migration strategy under scrutiny
Atlassian has been unequivocal about its plan to push users toward its cloud services — it incurred the ire of many users of the on-premises Server and Data Center editions of its software when it announced in 2020 that Server licenses would be discontinued and Data Center licenses subject to price hikes. Many of the on-demand customer sessions at its Team 22 conference last week focused on the company’s push toward cloud migration.
Before it switched from self-managed SaaS infrastructure to microservices on AWS in 2018, Atlassian cloud services had a less-than-stellar reputation for reliability, but the company made improvements, including offering SLAs to enterprise customers.
Still, the severity of this week’s outage has called Atlassian’s emphasis on its cloud services into question.
“Atlassian is staking its future on being a cloud provider,” wrote a group of Forrester Research analysts in a post April 13. “This week’s outage puts intense scrutiny on its abilities to execute, win, and maintain customer trust.”
The outage coincided with the company’s annual conference, where it announced multiple cloud services updates, including a new 99.95% enterprise SLA. It also comes “at a particularly contentious time” for Atlassian’s customer base as customers adjust to the cloud migration push, the Forrester post said.
“Before the outage broke, analyst and market reception to Atlassian’s business strategy was mixed. Forrester received customer complaints about being forced to shift to the cloud, as well,” the post said. “It seems likely that Atlassian’s cloud migration timelines will be adjusted.”
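For a sense of scale, the 99.95% figure in the newly announced enterprise SLA can be turned into a downtime allowance with simple arithmetic. The sketch below assumes a 30-day measurement period, which is an assumption for illustration only; the article does not state how Atlassian measures the SLA or which customers it covers. The function name `downtime_budget` is likewise hypothetical.

```python
from datetime import timedelta


def downtime_budget(sla_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over the period."""
    return period * ((100 - sla_percent) / 100)


# Assumed 30-day period; real SLA terms may define the window differently.
budget = downtime_budget(99.95, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

Under that assumption the allowance works out to about 21.6 minutes per period, a small fraction of an outage that has lasted more than a week.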
Atlassian’s Tseytlin said another post-incident review will be published when the outage is fully resolved, “which will include our steps to assure customers about improvements in our cloud services.”
Beth Pariseau, senior news writer at TechTarget, is an award-winning veteran of IT journalism. She can be reached on Twitter @PariseauTT.