The Real Costs of Sloppy Infrastructure Logging

photo credit: wolfnowl

Cleanliness is indeed next to godlinessJohn Wesley

Summary

If I took a photo of your infrastructure server logs, would it look like the picture above?

The state of your logs is one of the most neglected aspects of a healthy infrastructure. Yet, they are critical in order to resolve incidents when they occur. If the server logs contain large amounts of superfluous and misleading data, then critical errors can be missed. It’s like searching the haystack for a needle. If the logs churn out ‘ignorable’ errors, then you suffer delays during incident resolution as debugging may be caught up chasing ghosts.

A clean production log allow incidents to resolve faster and make problem identification easier. It also allows better proactive monitoring which will result in discovering problems before they turn into larger incidents. This translates to a cheaper and more productive environment.

1. Introduction
2. What is a Sloppy System Log’?
3. The Costs of Sloppy Logs
3.1 Increased Incident Resolution Time
3.2 Decreased Proactivity
3.3 Additional Operations Team Stress
4. Recommendations
5. Conclusion

Introduction

In most environments, the exact output written to the log files by servers is normally ignored during the development process. If the system works, then the project is happy and promotes the solution into production. Any objection regarding logging is met with indifference. A project manager may say that “It will be fixed in the next version”. This never happens. What is considered a low priority item in the first release remains a low priority item in the future releases.

It is the intent of this discussion to shine light on the hidden costs of treating log cleanliness as a low priority item.

What is a Sloppy Log?

Before we can discuss the impact of sloppy logging, we have to define the situation we are talking about. A sloppy log is considered a system or application log that contains:

Excessive verbose messages
Constant error messages occurring that are considered benign.
A mix of log categories of data in a single file (example, transaction latency logging and platform messages/errors interspersed).

These three points can be boiled down into the following statement: If a log message is being written when it shouldn’t, then this is an example of sloppiness.

A large volume of messages per transaction does not necessarily point to a poor situation as long as they are all known, business relevant messages.

A simple test to tell if you are experiencing this problem is as follows. If your logs are set to roll over at a certain size (say 100MB) and between the time it takes to find and open the latest file, it’s already rolled over into a new file.

Many modern infrastructures have a central log management system such as Splunk. In these situations you can tell if you have a sloppy log situation if you query for the word ‘error’ and then need to sift through pages and pages of results.

The Costs of Sloppy Logs

During development, the costs of sloppy logging are treated as if they are an abstract concept that can’t be tied down to spending of real dollars. This gives projects a get-out-of-jail-free card and absolves them of fixing the problem.

Sloppy logs have real associated costs:

Production incidents take longer to resolve. Every minute of downtime costs money.
The longer an incident goes unresolved, there are more people with higher seniority that need to be brought in. More people equals more salary equals more money.
The integration phase of projects in development is longer than it should, as time is spent searching haystacks for needles. Time is money.
The stress placed on the operations team increases. This leads to burnout and high turnover in critical support roles. Turnover is money.
The inability to identify, categorize and address production errors before they become incidents. The earlier the problem is identified, the cheaper its resolution will be. Not being proactive costs money.

The following sections flesh out these costs.

Increased Incident Resolution Time

A sloppy log situation increases the time spent resolving incidents. The log is the truth of the system. Like footprints in the sand, it documents the path that the server has taken. If that truth is smothered in lies, then it becomes difficult to comprehend. This is even worse when under business pressure to immediately resolve the issue.

The documentation handed over to the operations team is normally fundamentally lacking. Most of the time it’s the last deliverable of an already time-crunched-over-budget project. It fails to capture the reality of the system and focuses on outdated design decisions made months before development started. Operations teams know this, so they rely on the truth of the log from the system and will only use documentation to corroborate what they see. If the log is messy, the deck is stacked against them making a quick identification and resolution of the issue.

To make resolution more difficult, the person assigned to the incident ticket may not have a strong history working with the technology. They’re going to take the logs at face value and may not know to ignore superfluous errors or meaningless debugging statements. They will waste time chasing ghosts, unable to see past the red herrings to the real issue.

In the development phase of a project, the largest loss of time is the repeated searching of the log files that projects demand. Due to the distributed nature of architectures, tracking a transaction as it passes between all the dependent systems in a failure condition becomes difficult. End to end transaction tracking systems either don’t exist or are in their infancy. This means that servers are asked to “Check their Logs” to see where in the process the transaction failed. Larger bloated logs take more time to search and the request to search occurs often.

Decreased Proactivity

In a clean log scenario, imagine that you are looking at your logs. Unrelated to the reason why you are looking at the logs, you notice an exception you haven’t seen before. You make a note of it and reach out to the developer at the next opportunity. The developer acknowledges the error and fixes it in the next release of the software or maybe it’s urgent and a process for a production fix is started. Either way, the problem made its way back to development before the issue escalated.

In a sloppy log scenario, the above work never happened. The logs were so full of errors, messages and junk that you found the single line of data you were looking for and left the mess as fast as possible. Two months later, the business took a 2 hour outage over an exception that was present in the log that you couldn’t find.

Here, you can see the power of a clean log. An exception being written to a log was so rare that upon seeing one you didn’t recognize, you questioned development for more information. In a sloppy log, you are so overwhelmed with useless data that you can only focus on the task you are assigned, missing the opportunity to proactively fix the environment.

Operations Stress

I think operations have one of the toughest jobs in all of IT. They’re generalists in that they are responsible for all of the platforms in the enterprise, but they also need to be on-demand specialists to drive incident resolution. They have to have strong technical foundations and an innate ability to understand systems at a glance in stressful situations.

The handover between development and operations also has issues. A handover document usually consists of 10-20 chapters of information requested. The desire is to accurately capture the state of the project being installed into production. The problem is that the operations team has little knowledge of the project to be able to ask probing questions of the draft document. This results in the documentation that barely (ever) meets its core requirement: It’s supposed to be able to stand on its own an educate an operations person on how the system behaves.

This breakdown in handover manifests itself during production incident support. The documentation is assumed to be incomplete. They will look at the logs for the truth of the system. This means that the cleaner the logs, the easier you are making life for your operations team.

Operations is inherently a stressful job. You have the weight of the world on your shoulders to resolve issues as fast as possible for 40+ hours a week. Clean logs can help the team quickly identify the issue and engage the proper resolution process for the technology.

Recommendations

The recommendations for clean logging are:

Acknowledge that a commitment to clean logging is required for all projects that participate in the platform. It only takes one malfunctioning project to ruin the log.
During all phases of the project, the logs must be monitored and any poor log messages must be removed from the system. Removing a poor message is substantially cheaper than educating every administrator about the superfluous message forever.
The nature of messages being logged need to be kept separate. A latency log documenting service response times and a system log documenting errors should never mingle in the same physical file.

Conclusion

Sloppy logs are a common problem in most infrastructures. The desire to get projects in production is outweighing the concerns of the environment management teams. This article has outlined the tradeoffs are when clean logs are a low priority in the development process. In the long run, it will cost you extra time when time is of the most essence.

There is constant pressure to accelerate in all aspects of IT. This includes problem resolution. It is difficult to move forward when chained in place by complex, verbose log files.

About the Author

Dan Zrobok

Dan is the owner of Orange Specs Consulting, with over 14 years of experience working in enterprise systems integration. He is an advocate of the IBM DataPower Gateway platform and looks to improve environments that have embraced it. He also occasionally fights dragons with his three year old daughter Ruby, and newborn Clementine.