Here is another little summary of a Liferay Unconference openspace discussion held in Amsterdam on Wednesday October 4th. This one was organised around the topic of which log analysis software people are using and what experiences they have with it. Since quite a few people who did not make it to this discussion asked me on the following days what was discussed, I felt obligated to write a somewhat verbose summary in this blog post.
My aim with this discussion was to learn which tools other people are using and what their experiences with these are, but also to discuss which types of information and data the fellow participants hope to extract from their logging.
As with the previous discussion I led, I started this one off with a little introduction round. The participants came from a varied set of sectors like governmental, research, education, medical and commercial, among others. Surprisingly, pretty much every participant had one thing in common. Nobody had a good stack running for log analysis; people were at most experimenting with ELK (Elasticsearch, Logstash plus Kibana) or something similar. This was also the case for ourselves. We do the bulk of our log analysis using a simple Logcheck setup and we’re only really experimenting with an ELK stack for more advanced log analysis.
So instead of sharing experiences with advanced setups, we discussed what everybody’s needs were. There appeared to be three areas of interest for log analysis in particular:
- As a way of detecting incidents impacting the provided service
- As a tool for analysing bugs to help developers fix them
- As an auditing tool either for regulatory purposes or internal needs
The first need is more or less the perceived holy grail of log analysis: the logging stack analyses the logs, either in real time or periodically, determines whether exceptional events are going on and reports these to the users. Usually, detection of these events is based on a change in how often a particular event occurs in a certain time period, or on some numerical metric extracted from logged events. This type of anomaly detection is still mostly based on manually built algorithms. Research using machine learning is seemingly making big strides in this area, towards making anomaly detection mostly or fully automatic.
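To make the frequency-based approach a bit more tangible, here is a minimal sketch of such a manually built check. It is not something presented in the discussion; the class name `EventRateMonitor`, the window count and the threshold factor are all made up for illustration. It simply compares the number of events in the latest time window against the average of the preceding windows.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical helper: flags a time window whose event count far exceeds the recent average. */
class EventRateMonitor {
    private final Deque<Integer> history = new ArrayDeque<>();
    private final int baselineWindows; // how many previous windows form the baseline (assumption)
    private final double factor;       // how many times the baseline counts as a spike (assumption)

    EventRateMonitor(int baselineWindows, double factor) {
        this.baselineWindows = baselineWindows;
        this.factor = factor;
    }

    /**
     * Records the event count of a completed window.
     * Returns true if the count exceeds factor times the mean of the previous windows.
     */
    boolean recordWindow(int count) {
        boolean anomalous = false;
        // Only judge once a full baseline has been collected.
        if (history.size() == baselineWindows) {
            double mean = history.stream().mapToInt(Integer::intValue).average().orElse(0.0);
            // Treat a baseline below 1 as 1 so a single event after silence is not flagged.
            anomalous = count > Math.max(mean, 1.0) * factor;
        }
        history.addLast(count);
        if (history.size() > baselineWindows) {
            history.removeFirst(); // keep only the most recent windows
        }
        return anomalous;
    }
}
```

You would feed this one count per window, for example the number of ERROR lines per minute. Note that in this naive version a spike also ends up in the baseline for the following windows, which a real setup would probably want to handle differently.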
It is quite often the case that severe performance degradation or stalls are accompanied by heavy garbage collection activity in Java. One of the participants suggested enabling verbose garbage collection logging. An increase in the frequency of either minor GCs or full GCs would then be the thing to look for as an indication of trouble. In another discussion earlier that day, GCeasy was mentioned as a useful tool to analyse these logs in more detail.
The second need focuses more on helping developers solve application-specific bugs than on giving operations an understanding of the general well-being of the application. It is about creating a rich context for developers, so they have lots of clues and evidence to track down the root cause of an issue.
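Verbose GC logging itself is switched on with a JVM option (`-verbose:gc`, or `-Xlog:gc` on Java 9 and later). As a related but different technique that was not part of the discussion, the same kind of counters can also be read from inside the JVM through the standard `java.lang.management` API. The sketch below, with the made-up class name `GcSnapshot`, sums the collection counts and times over all collectors; sampling these periodically and logging the difference would give a rough GC frequency.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper: reads cumulative GC counters of the running JVM. */
class GcSnapshot {

    /** Total number of collections so far, summed over all collectors. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 means the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds, summed over all collectors. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 means the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }
}
```

This does not distinguish minor from full GCs; for that you would have to look at the individual collector names, which differ per garbage collector.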
One concrete tip discussed was to customise the error handling of your web application. When an end user encounters a fatal error on the frontend of your application, they could be presented with a unique error code in the error message. This error code would also be saved in the logfiles of the application. The error message would also invite the user to contact the site administration with that code and a description of what they were doing. With that unique code, a developer can quickly find the correct place and time in the logfiles for further analysis. In fact, the error page itself could contain a webform to submit this information.
If the cause of the problem remains elusive, but it does happen somewhat frequently, then you could add additional logging to custom code sections executed around the problem area. Useful information to log could be internal state or session information of an active user. You could also log a stacktrace for more information. Beyond this, saving heapdumps and threaddumps also generates lots of information which can be investigated further. Spotify’s threaddump analyser was mentioned as a nice tool for analysing those dumps.
Furthermore, one could even go as far as to pause the JVM and drop into an interactive debugger, waiting for a developer to inspect the state at the moment of the critical exception. This last method will cause that JVM to stop handling requests, so it may be disruptive to end users unless you have a decent load-balanced or failover infrastructure.
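The error code tip could look roughly like the sketch below. The class `ErrorReference`, its method names and the message text are my own invention for illustration, and it uses plain `java.util.logging` rather than whatever your application uses. The idea is only that the same short code ends up both in the logfile and on the error page.

```java
import java.util.UUID;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper: ties an error shown to the user to the matching log entry. */
class ErrorReference {
    private static final Logger LOG = Logger.getLogger(ErrorReference.class.getName());

    /** Logs the error together with a fresh reference code and returns that code. */
    static String logWithReference(Throwable error) {
        // The first 8 hex characters of a random UUID: short enough to read out over the
        // phone, not guaranteed unique, but collisions are unlikely at modest volumes.
        String code = UUID.randomUUID().toString().substring(0, 8).toUpperCase();
        LOG.log(Level.SEVERE, "Unhandled error, reference " + code, error);
        return code;
    }

    /** Builds the text for the error page (the wording is just an example). */
    static String userMessage(String code) {
        return "Something went wrong. Please contact the site administration, mention reference "
                + code + " and describe what you were doing.";
    }
}
```

A developer can then search the logfiles for the reference the user reported and land directly at the stacktrace.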
Having a log with an audit trail is the third need identified at the beginning of the discussion. In general, these logs have a longer retention period and may have higher security requirements. In fact, if personally identifiable data is logged by the application, you will likely need to take the GDPR into account. This means, among other things, that you may not be allowed to share the content of the logs willy-nilly without anonymising it. You may also need to be able to selectively remove information about a specific person from the logs, which can be quite challenging. Of course, having good access control on your logging is also a must. Only correctly authenticated and authorised people should be able to gain access to the logs.
A more general topic that was discussed was the performance impact of logging. Logging should only have a noticeable impact on the performance of your application when the amount of logging is very high. In that case it might be useful to not log locally at all, but to ship the log messages directly to a dedicated service located on a different server. This different server should be in the same location, otherwise the network bandwidth might turn into a bottleneck instead. If you’re running load tests or stress tests, a tip would be to set your logging to debug in the test so you can determine the impact under a worst-case scenario.
While nobody really had a clear-cut example of a perfect setup to analyse their application logs, I do feel that the range of topics we discussed was a good exploration of the field and may have given participants inspiration for their own new experiments.
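As a small illustration of what anonymising log content could mean in practice, the sketch below masks e-mail addresses in a log line before it is shared. This is my own example, not something shown in the discussion: the class name `LogAnonymiser` and the deliberately simple regular expression are assumptions, and e-mail addresses are only one of many kinds of personal data, so this alone is nowhere near enough for GDPR purposes.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: masks e-mail addresses in log lines before they are shared. */
class LogAnonymiser {
    // Simplified pattern for illustration; it does not cover every valid e-mail address.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Replaces every e-mail address found in the line with a fixed placeholder. */
    static String maskEmails(String line) {
        return EMAIL.matcher(line).replaceAll("<email>");
    }
}
```

For example, `maskEmails("Login failed for jan@example.org from 10.0.0.1")` returns `Login failed for <email> from 10.0.0.1`. The IP address is left in place here, even though it may count as personal data too.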