The DevOps Way to Solve JVM Memory Issues
Foster collaboration and be proactive
Oct. 17, 2013 08:00 AM
The killer in any IT operation is unplanned work. Unplanned work goes by many names: firefighting, war rooms, Sev 1 incidents. The bottom line is that Operations must stop whatever planned work it was doing to manage the drill, which means little or no normal work gets accomplished.

It is a scenario most of you will be familiar with: your application servers are humming along happily until suddenly, without an obvious reason, memory usage starts to increase, soon followed by longer garbage collection suspensions that finally force you to restart the application. The operations team is typically unaware of the actual impact on end users (other than a service being down), and it lacks the data and time to investigate the issue further. Because communication between the traditional silos of operations, testing and development is often less than ideal, a scheduled restart in a "low impact" timeframe is often the easiest solution and, over time, turns into something resembling a "Production Best Practice." This adds to the workload of an operations team, because unplanned work becomes unnecessary preventive work - and the restart ritual becomes a suspect every time there is a problem with the application.

Wouldn't it be better to actually fix the issues instead of just working around them? Shouldn't there be a general understanding across all teams responsible for an application that problems get fixed as fast as possible and prevented in the future?
In this blog we walk you through a case study in which a memory leak in a third-party plugin impacted end-user performance. Instead of hiding the problem with preventive JVM restarts, we applied DevOps best practices that fostered collaboration between Ops, Test and Dev.
The Rise of Third-Party Plugins
From an application owner's perspective, the biggest benefit of a plugin-based architecture is the increase in flexibility - you can meet changing needs by adding new plugins, instead of worrying about upgrading a much larger system. But by using plugins, you grant a (more or less well-known) third party access to your data and systems, which frequently raises privacy concerns as well as security issues. An example of this is the Java browser plugin's gaping security holes. While best practices such as sandboxing help with these risks, these discussions typically focus on the client side; the performance impact of plugins on your application's server-side is often missed. We have covered the possible effects of client-side plugins before, and will focus on the server side in this blog post.
The Trigger for Operations
Fast forward a couple of months. Seemingly out of the blue, we began to see performance issues with JIRA. Our production monitoring alerted Ops about decreased end-user performance with some users aborting actions due to very long response times. Nobody had called in yet - but the early warning system indicated that users would soon complain.
Ops and Dev Working Together
Ops started to investigate and worked with our performance engineering team to establish causality between the start of the issues and other changes, but came up blank. No new plugins had been installed recently, and no updates to the underlying operating system, the Java runtime, or JIRA had been applied within the last weeks.
Due to the steadily increasing memory usage, an OutOfMemoryError (OOM) was unavoidable. Because it was still business hours, a "controlled" restart was not a good option either, and the OOM eventually occurred. In this case our monitoring solution automatically triggered a full memory dump that allowed us to view the heap's contents at the time of the error. When analyzing the dump, we noticed a number of large object instances, as shown in the following screenshot:
Automatically triggered Memory Dump shows objects that consumed most of the heap space
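Our monitoring solution wrote the dump for us, but you can get similar behavior from a stock HotSpot JVM: the `-XX:+HeapDumpOnOutOfMemoryError` flag writes a dump automatically at OOM time, and the HotSpot diagnostic MXBean can write one on demand. A minimal sketch (the class name `HeapDumpOnDemand` is ours, not part of our actual setup):

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumpOnDemand {

    // The JVM can also write a dump automatically when an OOM occurs:
    //   java -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/dumps ...
    public static void dumpHeap(Path target) throws Exception {
        HotSpotDiagnosticMXBean diag =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live=true dumps only reachable objects, which keeps the file smaller
        diag.dumpHeap(target.toString(), true);
    }

    public static void main(String[] args) throws Exception {
        // dumpHeap refuses to overwrite an existing file, so pick a fresh
        // path; recent JDKs also require the ".hprof" extension
        Path dump = Files.createTempDirectory("dumps").resolve("heap.hprof");
        dumpHeap(dump);
        System.out.println("wrote " + Files.size(dump) + " bytes to " + dump);
    }
}
```

The resulting .hprof file can then be opened in any heap analyzer to see which object types dominate the heap, just as in the screenshot above.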
Looking at the class names of these instances, we were able to identify the actual culprit: the Salesforce synchronization plugin. The plugin had been in use for over half a year without any problems. It maintains a cache for the tickets being synchronized, and as the number of tickets grew over time, this cache grew with it. Unfortunately, the cache was unbounded, and when we finally reached a critical number of tickets and attachments, it caused JIRA to run out of memory.
The very high number of HashMap and HashMap Entry objects filled up JIRA's heap.
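The fix on the plugin side was, in essence, to bound the cache. As an illustration of the general pattern (not the vendor's actual code - `TicketCache` and the size limit are hypothetical), a `LinkedHashMap` with `removeEldestEntry` overridden gives a simple LRU bound:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: an unbounded HashMap keyed by ticket id grows with the
// data set until the heap is exhausted; an LRU cap keeps memory predictable.
public class TicketCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public TicketCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true -> least-recently-used evicted first
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict the oldest entry whenever the cap is exceeded
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        TicketCache<Integer, String> cache = new TicketCache<>(100);
        for (int id = 0; id < 10_000; id++) {
            cache.put(id, "ticket-" + id);
        }
        System.out.println("entries retained: " + cache.size()); // stays at 100
    }
}
```

A fixed-size LRU is the simplest possible bound; a production cache would more likely add time-based expiry or use a dedicated caching library, but the point is the same: without a limit, cache growth tracks data growth.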
With this information, we were able to pinpoint the root cause and reach out to the third-party developers of the plugin. The detailed data we had available - both the memory dumps and the measured impact on end-user response times - headed off the collaboration and communication problems you typically face: no finger pointing, no going back and forth multiple times to provide more detailed log files. Within days (before another OOM could occur), we had a fixed version, first deployed in our staging environment and tested by our performance team, then deployed in production, giving Ops the confidence that it would solve the problem.
Don't Do It the "Easy Way": Preventive JVM Reboots
Without that data, reaching out to Atlassian and supplying the team with log files would have been the next step, adding turnaround time - time our users would have spent dealing with sporadic outages. Even if Atlassian had been able to point us in the right direction (to the plugin vendor), we would have lost more time there while the usual process of reproducing the problem on their systems ran its course.
The most common solution, as discussed in the introduction, is to schedule JVM restarts during low-traffic hours to limit the impact, continuing until the problem is finally fixed - or simply doing it forever. This is not "proactive." It is just the easy way to do damage control.
Do It the "DevOps Way": Foster Collaboration and Be Proactive
It requires a performance culture within the organization to put the right people, processes and priorities in place - supported by tooling that makes collaboration and root-cause analysis easy. Having data readily available overcomes the typical collaboration and communication problems between those who are impacted by the problem (Ops) and those who have to fix it (Dev). With that you can ensure higher availability of your systems, resulting not only in happier users but also in Ops resources freed from troubleshooting.