Digital Edition

SYS-CON.TV
Application Failures in Production | @DevOpsSummit [DevOps]
The wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message

How to Approach Application Failures in Production

In my recent article, "Software Quality Metrics for your Continuous Delivery Pipeline - Part III - Logging," I wrote about the good parts and the not-so-good parts of logging and concluded that logging usually fails to deliver what it is so often mistakenly used for: as a mechanism for analyzing application failures in production. In response to the heated debates on reddit.com/r/devops and reddit.com/r/programing, I want to demonstrate the wealth of out-of-the-box insights you could obtain from a single urgent, albeit unspecific log message if you only are equipped with the magic ingredient; full transaction context:

Examples of insights you could obtain from full transaction context on a single log message

Bear with me until I get to explain what this actually means and how it helps you get almost immediate answers to the most urgent questions when your users are struck by an application failure:

  • "How many users are affected and who are they?"
  • "Which tiers are affected by which errors and what is the root cause?"

Operator: I'm here because you broke something. (courtesy of ThinkGeek.com)

When All You Have Is a Lousy Log Message
Does this story sound familiar to you? It's a Friday afternoon and you just received the release artifacts from the development team belatedly, which need to be released by Monday morning. After spending the night and another day in operations to get this release out into production timely, you notice the Monday after that everything you have achieved in the end was some lousy log message:

08:55:26 SEVERE com.company.product.login.LoginLogic - LoginException occurred when processing Login transaction

While this scenario hopefully does not reflect a common case for you, it still shows an important aspect in the life of development and operations: working as an operator involves monitoring the production environment and providing assistance in troubleshooting application failures mainly with the help of log messages - things that developers have baked into their code. While certainly not all log messages need to be as poor as this one, getting down to the bottom of a production failure is often a tedious endeavor (see this comment on reddit by RecklessKelly who sometimes needs weeks to get his "Eureka moment") - if at all possible.

Why There Is No Such Thing as a 100% Error-Free Code
Production failures can become a major pain for your business with long-term effects: they will not only make your visitors buy elsewhere, but depending on the level of frustrations, your customers may choose to stay at your competition instead of giving you another chance.

As we all know, we just cannot get rid of application failures in production entirely. Agile methodologies, such as Extreme Programming or Scrum, aim to build quality into our processes; however, there is still no such thing as a 100% error-free application. "We need to write more tests!" you may argue and I would agree: disciplines such as TDD and ATDD should be an integral part of your software development process since they, if applied correctly, help you produce better code and fewer bugs. Still, it is simply impossible to test each and every corner of your application for all possible combinations of input parameters and application state. Essentially, we can run only a limited subset of all possible test scenarios. The common goal of developers and test automation engineers, hence, must be to implement a testing strategy, which allows them to deliver code of sufficient quality. Consequently, there is always a chance that something can go wrong, and, as a serious business, you will want to be prepared for the unpredictable and, additionally, have as much control over it as possible:

Why you cannot get rid of application failures in production: remaining failure probability

Without further ado, let's examine some precious out-of-the-box insights you could obtain if you are equipped with full transaction context and are able to capture all transactions.

Why this is important? Because it enables you to see the contributions of input parameters, processes, infrastructure and users at all times whenever a failure occurred, solve problems faster, and additionally use the presented information such as unexpected input parameters to further improve your testing strategy.

Initial Situation: Aggregated Log Messages
Instead of crawling a bunch of possibly distributed log files to determine the count of particular log messages, we may, first of all, want to have this done automatically for us just as they happen. This gives a good overview on the respective message frequencies and facilitates prioritization:

Aggregated log events: severity, logger name, message and count

What we see here (analysis view based on our PurePath technology) is that there have been 104 occurrences of the same log message in the application. We could also observe other captured event data, such as the severity level and the name of the logger instance (usually the name of the class that created the logger).

Question #1: How many users are affected and who are they?

Failed Business Transactions: "Logins" and "Logins by Username"

Having the full transactional context and not just the log message allows us to figure out which critical Business Transactions of our application are impacted. From the dashboard above we can observe that "Logins" and "Logins by Username" have failed: we see that 61 users attempted the 104 logins and who these users were by their username.

For questions 2 and 3, and for further insight, click here for the full article.

About Martin Etmajer
Leveraging his outstanding technical skills as a lead software engineer, Martin Etmajer has been a key contributor to a number of large-scale systems across a range of industries. He is as passionate about great software as he is about applying Lean Startup principles to the development of products that customers love.

Martin is a life-long learner who frequently speaks at international conferences and meet-ups. When not spending time with family, he enjoys swimming and Yoga. He holds a master's degree in Computer Engineering from the Vienna University of Technology, Austria, with a focus on dependable distributed real-time systems.

In order to post a comment you need to be registered and logged in.

Register | Sign-in

Reader Feedback: Page 1 of 1



ADS BY GOOGLE
Subscribe to the World's Most Powerful Newsletters

ADS BY GOOGLE

As DevOps methodologies expand their reach across the enterprise, organizations face the daunting ch...
As Marc Andreessen says software is eating the world. Everything is rapidly moving toward being soft...
You know you need the cloud, but you’re hesitant to simply dump everything at Amazon since you know ...
ChatOps is an emerging topic that has led to the wide availability of integrations between group cha...
The need for greater agility and scalability necessitated the digital transformation in the form of ...
In his keynote at 18th Cloud Expo, Andrew Keys, Co-Founder of ConsenSys Enterprise, provided an over...
The cloud era has reached the stage where it is no longer a question of whether a company should mig...
Coca-Cola’s Google powered digital signage system lays the groundwork for a more valuable connection...
In his session at 21st Cloud Expo, Raju Shreewastava, founder of Big Data Trunk, provided a fun and ...
While some developers care passionately about how data centers and clouds are architected, for most,...
"Since we launched LinuxONE we learned a lot from our customers. More than anything what they respon...
Is advanced scheduling in Kubernetes achievable?Yes, however, how do you properly accommodate every ...
DevOps is under attack because developers don’t want to mess with infrastructure. They will happily ...
"As we've gone out into the public cloud we've seen that over time we may have lost a few things - w...
In his session at 21st Cloud Expo, Michael Burley, a Senior Business Development Executive in IT Ser...
Sanjeev Sharma Joins June 5-7, 2018 @DevOpsSummit at @Cloud Expo New York Faculty. Sanjeev Sharma is...
We are given a desktop platform with Java 8 or Java 9 installed and seek to find a way to deploy hig...
"I focus on what we are calling CAST Highlight, which is our SaaS application portfolio analysis too...
"Cloud4U builds software services that help people build DevOps platforms for cloud-based software a...
The question before companies today is not whether to become intelligent, it’s a question of how and...