Monitoring 101

If you can’t measure it, you can’t fix it.

Definition

Nothing will ever go wrong with our code. Everything will always operate perfectly. And I’ll win the lottery. Riiigghhht!

The reality is that something will go wrong, and at the worst possible time. But what if we had something watching everything, all the time, with the ability to let us know exactly when things are out of sorts? Yes. That is a ‘monitoring’ system. And no matter what company, no matter what industry, no matter what planet, YOU will need monitoring.

Currently, most people still see ‘monitoring’ as only an IT resource, but it is so much more because, in principle, it can be applied to every department. ‘Monitoring’ should be non-compartmentalized, dynamic, explorative, and intelligent. It is, paradoxically, AI: with a backfeeding component into engineering, the system can become turnkey AI. Go get a cup of coffee and ponder that.

Brief

Identified need, evaluated technologies, performed feasibility analysis, and created an open source enterprise monitoring system to service SLAs for a global FinTech mobile app.  Key metrics: 200+ remote servers, 100+ remote sites, live regional simulation (SMS, URL, and REST), flexible decoupled architecture, secure push/pull channels, single code base, self monitoring, billions of metrics annually, efficient data modeling, scalable assessment, 100s of custom modules, 6 month ROI, zero licensing costs, very low administrative costs and a comprehensive monthly statistical report.

I was hired as a Support Engineer. After just a few weeks, I noticed a glaring philosophical omission in the support paradigm. For each support case opened, there was limited information to provide the software developers beyond the logs and the end-user’s complaint. And while the logs were very helpful, I quickly ascertained that a monitoring system could help on several fronts: it could provide near real-time discovery and scope, raise awareness, and help assign severity levels, all in service of meeting the SLAs. I gained management support for pursuing a solution.

Research

Of course, I evaluated the commercial offerings. But around every corner, I was uncomfortable with the ROI, the contractual reliability, and the lack of adaptability. So I was compelled to create one using ‘open source’ technology. Nothing comes remotely close to the cost effectiveness of a well-developed, well-managed open-source solution. For this particular solution, the highest architectural priorities were: flexibility, scalability, a single code base, multiple-channel alert delivery, decoupled architecture, short-term ROI, comprehensive reporting/graphing ability, pre-emptive discovery, and an in-line response methodology. I even lobbied for a ‘premium’ support package add-on to the contract, which made the ROI effectively immediate.

What required monitoring?

The product to be monitored was a FinTech mobile banking solution. That product included a mobile client application coupled with an in-house server that was uniquely tied to the financial institution’s core processing equipment, along with SMS and web-based banking functionality. As you can imagine, each deployed solution was very complex in nature. At monitoring’s peak, we were watching over 100 different solutions comprising thousands of metrics. Were we a Splunk? No. Splunk is infinitely better at log monitoring, but that is where its advantage ended; its cost structure, its true ‘monitoring’ functionality, its flexibility, and its required support infrastructure were all a much worse fit for our needs.

Architecture

I conceptualized the system as plug-and-play. The most essential piece of this puzzle was a set of dynamically adjustable configuration files. There were multiple levels, to accommodate everything from ‘common’ metrics shared by every customer all the way down to highly specific ones. I also wanted to distribute the load of responsibility, so I went with ‘local’ monitoring agents. This ‘local’ concept is akin to having a nurse in every patient’s home. The nurse makes fundamental observations; if a symptom emerges that is outside the norm, the connection to an expert is expedited and the patient is placed under elevated observation. If the patient continues to decline, they are put in the ICU. The monitoring was conceived very similarly. Tweak a config file here or there and you expand the detail, assign a level of severity, add a new metric, change a channel of delivery, increase the logging level, or simply provide information on how to proceed.
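
To make the layered-configuration idea concrete, here is a minimal sketch in Bash. The paths, file names, and variables (common.conf, customer_<name>.conf, DISK_WARN, and so on) are hypothetical illustrations rather than the actual production layout; the point is simply that a global defaults file loads first and a customer-specific file, when present, overrides it.

```bash
#!/bin/bash
# Minimal sketch of layered configuration (hypothetical paths/variables).
CONF_DIR="${CONF_DIR:-/opt/monitoring/conf}"

# Hard-coded fallbacks so the script still runs with no config at all.
DISK_WARN=80   # % used that triggers a WARNING
DISK_CRIT=90   # % used that triggers a CRITICAL
CYCLE_MIN=20   # analysis cycle length in minutes

# Level 1: settings common to every customer, if the file exists.
[ -r "$CONF_DIR/common.conf" ] && source "$CONF_DIR/common.conf"

# Level 2: highly specific per-customer overrides take precedence.
CUSTOMER="${CUSTOMER:-default}"
[ -r "$CONF_DIR/customer_${CUSTOMER}.conf" ] && source "$CONF_DIR/customer_${CUSTOMER}.conf"

echo "Effective thresholds: warn=${DISK_WARN}% crit=${DISK_CRIT}% cycle=${CYCLE_MIN}m"
```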

Of course, all this information needed to be aggregated, assessed, and acted upon. I went with the Nagios concept. At the time it was the largest open source monitoring solution out there. What I didn’t like was the rather archaic dashboard concept. I needed a system that a level 1 support engineer could benefit from. I decided to split the system into two parts: the agents and the core.

The agents were a series of Bash, Python, Java, and Perl programs, though I leaned heavily on Bash as the first option because of its portability and broad familiarity. Where it wasn’t up to the task, I opted for other languages; for example, the main SQL module was written in Java.
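
For flavor, here is a simplified sketch of what one of these Bash agents might look like, written in the standard Nagios plugin format: a one-line status message, optional performance data after a pipe, and an exit code of 0/1/2 for OK/WARNING/CRITICAL. It is illustrative only, not the actual production agent.

```bash
#!/bin/bash
# Simplified disk-usage agent emitting standard Nagios plugin output.
# (Illustrative sketch; thresholds and filesystem are arguments.)

WARN="${1:-80}"   # warning threshold, % used
CRIT="${2:-90}"   # critical threshold, % used
FS="${3:-/}"      # filesystem to check

USED=$(df -P "$FS" | awk 'NR==2 {gsub(/%/,""); print $5}')

# Nagios format: "STATUS - message | perfdata", exit code 0/1/2.
PERF="used=${USED}%;${WARN};${CRIT};0;100"
if   [ "$USED" -ge "$CRIT" ]; then echo "CRITICAL - ${FS} at ${USED}% used | ${PERF}"; exit 2
elif [ "$USED" -ge "$WARN" ]; then echo "WARNING - ${FS} at ${USED}% used | ${PERF}";  exit 1
else                               echo "OK - ${FS} at ${USED}% used | ${PERF}";       exit 0
fi
```

Because every agent speaks this same format, the results port to Centreon, plain Nagios, or any other Nagios-compatible console without modification.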

I chose Centreon as the core. It was open source, very easy to understand, and came with a comprehensive alerting platform and highly configurable graphs.

Benefits

  • Open source – This entire solution uses open-source tools. It is therefore modifiable and expandable, with the only cost being man-hours. Additionally, there are no licenses, support contracts, or recurring costs whatsoever for the resources used to build this solution. Keep in mind that employee time is required regardless of which monitoring solution is deployed; with that in mind, it is difficult to find an alternative with a better ROI.
  • Autonomous – The primary objective was a solution that was autonomous from the product it monitored. This effectively eliminates the potential conflicts of an integrated system.
  • Single code base – Due to the rapid deployment mechanisms and design, there is only one code base. It is always the ‘production’ code and is always up to date at every customer site. This exposes the code to a wide range of real-life situations, which may reveal the need for a feature enhancement or a bug fix.
  • Flexible data formatting – The ‘results’ generated by the agents are all in Nagios format (see the agent sketch above). This allows the agents to port to any monitoring solution. This solution used Centreon as the aggregator and management console, but could just as easily use any out-of-the-box Nagios-compatible solution. Because of that construct, any Nagios-compatible monitoring system can absorb the same data.
  • Community of help – Nagios is the third-largest monitoring solution in the world and the largest open-source one. As an open-source product of such scope, the pool of qualified, available talent is large.
  • Multiple Channel Alert Delivery – Alerting has multiple delivery methods (e.g. email, voicemail, SMS), which increases the probability of the proper people being informed. There is an SMS testing platform with the dual purpose of sending outgoing SMS alerts where a customer has requested that feature. Email, of course, offers customized distribution lists depending on the type of alert.
  • Dynamic configuration – The decoupled operation allows for configuration on the fly, thus enabling features and fixes in minutes … anywhere.
  • Decoupled Architecture – The system is operationally modular: the agents, delivery mechanism, core engine, update engine, connection, and reporting are all separate modules. Since the agents emit standard Nagios-format results, new agents can be added rather easily and the Centreon dashboard can be changed with little effort. The agents operate in a store-and-forward manner, much like most email systems; this was by design, to allow different channel delivery mechanisms if so desired. A custom end-of-month reporting program was written to provide management statistics on product performance and operation.
  • Store and forward model – Much like postfix, we opted for a store-and-forward concept in the design of this product. This detached process allows updates, improvements, and fixes to be rapidly deployed. Agents extract metrics, analyzers assess and package the data, channel mechanisms deliver it, the monitoring core ‘presents’ the results, the email engine enhances the messages, and the additional-information library enables a more consistent support response.
    • ‘Agents’ – These are all bash scripts called using a hierarchy of cron scheduling.  The ‘results’ are stored as text files which are later analyzed for further processing.
    • ‘Analyzer’ – This script assesses all the data created by the agents every ‘cycle’ (default is 20 minutes). The analyzer evaluates metrics against thresholds, rolls up similar metrics into combined graphs (thus reducing the number of RRD files generated), and groups metrics by category into fewer emails (this dramatically reduces the volume of alert mail).
    • ‘Channel’ delivery – The analyzed results are either ‘pulled’ or ‘pushed’ (depending on the customer’s security model) to a core DMZ site. There are multiple options for delivering the data, which adhere more readily to each customer’s security choices. From there, results are automatically security-vetted and relayed to an internal destination for ‘presentation’ to the team. A minimal sketch of this analyze-and-forward cycle appears after this list.
    • ‘Presentation’ – The core monitoring component ‘presents’ the data.  This means it graphs, reports, and where applicable submits an alert to the email engine.
    • ‘Email engine’ – This process augments the message with color coding, adds content-specific instructions and validation links, and formats the message for multiple delivery mechanisms (e.g. email, SMS, etc.).
    • ‘Library’ – The links embedded in email alerts redirect the team to the ‘next’ steps. These are called ‘click-throughs’. By having these with each alert, the learning curve for onboarding personnel is substantially reduced. It further reduces errors in assumptions and/or diagnostic methods, as this library of instructions is constantly maintained by the support team with the optimal and/or alternate procedures unique to each customer.
  • Self-governing – This solution monitors itself too. There are several maintenance scripts and monitoring agents designed to make sure monitoring does not impact production. Again, the modular architecture further reduces this risk.
  • Unique agents – Though much of monitoring can be replicated in a commercial product, some subsets are unique and unmatched.
    • Log monitoring – The log monitors available in the wider community were, for the most part, inadequate. A custom script was created allowing custom thresholds, custom strings, and rapid deployment of new strings to watch (a sketch of the idea appears after this list).
    • Sqlclient – This is a custom Java SQL client for ‘read-only’ DB queries. Since it covers MSSQL, Oracle, MySQL, and PostgreSQL, it is transparent to the monitoring scripts using it.
    • ‘Hive’ and ‘Bee’ – The ‘bees’ are Android applications designed to perform round-trip SMS testing against the product. Their results are delivered to the ‘hive’ in the DMZ and are, in turn, retrieved by the core.
    • Stats – Using sqlclient, this script extracts the monthly billing statistics used by the Monitise adoption team.
  • Utilities – Over the years, several tools have been implemented to facilitate the services monitoring provides.
    • mSync – Synchronizes the code to any single VM, any group (e.g. ‘cloud’), or all customer VMs.
    • mGetReports – Retrieves any/all reports from any/all customer VMs.
    • Report repository – A quick-and-dirty AJAX page to display the local directory of reports.
    • m – A shortcut to display VM information.
    • mZipAgents / mUnzipAgents – Zips/unzips the entire monitoring code base for customers who prefer to deploy updates themselves.
    • Backbase – Facilitates the ‘sql’ (repository of SQL commands), ‘servers’ (table of customer VMs), and ‘cheat’ (shortcuts to helpful Linux commands) tables.
    • mArchive – This RRR-approved script zips and archives files matching selected filters, maintaining disk integrity by moving and/or removing aging log files.
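
As referenced in the ‘Analyzer’ and ‘Channel’ delivery items above, here is a minimal sketch of one analyze-and-forward cycle. The paths, file layout, and rsync push are hypothetical; the real system also supported a pull model, depending on the customer’s security posture.

```bash
#!/bin/bash
# Hypothetical sketch of one analyzer/forwarder cycle.
# Paths and file layout are illustrative, not the production layout.

RESULT_DIR="${RESULT_DIR:-/opt/monitoring/results}"            # text files written by the agents
SPOOL_DIR="${SPOOL_DIR:-/opt/monitoring/spool}"                # store-and-forward outbox
DMZ_TARGET="${DMZ_TARGET:-monitor@dmz.example.com:/incoming}"  # hypothetical DMZ drop point

mkdir -p "$SPOOL_DIR"
BATCH="$SPOOL_DIR/batch_$(date +%Y%m%d%H%M).txt"

# 1. Assess: keep only results that crossed a threshold (each line is
#    already in "STATUS - message | perfdata" form, so a status filter works).
grep -h -E '^(WARNING|CRITICAL)' "$RESULT_DIR"/*.txt 2>/dev/null | sort -u > "$BATCH"

# 2. Forward: push the batch to the DMZ relay. If the push fails, the batch
#    stays spooled and is retried next cycle (store and forward).
if [ -s "$BATCH" ]; then
    rsync -az --remove-source-files "$BATCH" "$DMZ_TARGET" || \
        echo "push failed; $BATCH kept for the next cycle"
else
    rm -f "$BATCH"   # nothing crossed a threshold this cycle
fi
```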
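
And as referenced in the ‘Log monitoring’ item, here is a sketch of the custom string-and-threshold idea. The log path, strings, and thresholds are illustrative placeholders; the real script allowed new strings to be deployed rapidly rather than being hard-coded like this.

```bash
#!/bin/bash
# Hypothetical sketch of threshold-based log-string monitoring.
# The log path and the string/threshold table are placeholders.

LOG="${1:-/var/log/app/server.log}"

# "string|warn|crit" triples; the real system let these be added on the fly.
PATTERNS=(
  "OutOfMemoryError|1|1"
  "connection refused|5|20"
  "SQLException|3|10"
)

STATUS=0; MSG=""
for entry in "${PATTERNS[@]}"; do
  IFS='|' read -r pattern warn crit <<< "$entry"
  count=$(grep -c -i -- "$pattern" "$LOG" 2>/dev/null)
  count=${count:-0}
  MSG+="${pattern}=${count} "
  if   [ "$count" -ge "$crit" ]; then STATUS=2
  elif [ "$count" -ge "$warn" ] && [ "$STATUS" -lt 2 ]; then STATUS=1
  fi
done

# Same Nagios output convention as every other agent.
case $STATUS in
  0) echo "OK - $MSG";       exit 0 ;;
  1) echo "WARNING - $MSG";  exit 1 ;;
  2) echo "CRITICAL - $MSG"; exit 2 ;;
esac
```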

Short Term ROI

There are probably better monitoring systems out there, but none that come near the ROI of this one. I served as the sole architect, developer, administrator, deployment engineer, support engineer, and reports developer, and the system saved millions of dollars in SLA penalties alone, so it is safe to say this was the most cost-effective and best overall solution for this highly unique environment. The investment was essentially my salary. Though the system took a few years to reach its maximum capability, a comparable commercial system would have cost not only the licensing but also the personnel to perform all the other tasks of operation. The licensing alone for every other system that met our requirements was much higher than my salary.

Comprehensive Reporting/Graphics

I envisioned other employees eventually taking over. Their first task, of course, would be to constantly monitor a live dashboard.

Pre-emptive Discovery

A major part of the architectural concept was the implementation of product farms constantly running dummy scenarios. This was used to test the SMS, email, and internet connectivity. Furthermore, it tested the customer’s round-robin application. If any single point (e.g. database) in the system had problems, this would be just another way of catching it and averting downtime.
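
To illustrate the idea, a probe along these lines can exercise each regional URL/REST endpoint with a lightweight request and flag any region that fails or responds too slowly. The endpoint URLs and the five-second budget below are hypothetical placeholders, not the real regional farm.

```bash
#!/bin/bash
# Hypothetical regional round-trip probe (URLs and budget are placeholders).

ENDPOINTS=(
  "https://emea.example.com/health"
  "https://apac.example.com/health"
  "https://amer.example.com/health"
)
MAX_SECS=5

STATUS=0; MSG=""
for url in "${ENDPOINTS[@]}"; do
  # -o /dev/null discards the body; -w reports the HTTP code and total time.
  result=$(curl -s -o /dev/null -m "$MAX_SECS" -w '%{http_code} %{time_total}' "$url")
  code=${result%% *}
  secs=${result#* }
  if [ "$code" != "200" ]; then
    STATUS=2; MSG+="$url FAILED(code=$code) "
  else
    MSG+="$url ok(${secs}s) "
  fi
done

if [ $STATUS -eq 0 ]; then echo "OK - $MSG"; else echo "CRITICAL - $MSG"; fi
exit $STATUS
```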

In-line Response Methodology

Huh? Basically, this means every alert had embedded links to assist operators with the best courses of action. A comprehensive web site was constructed to provide this information. This dramatically lowered the onboarding cost of support personnel, enabled the support team to become more monitoring-aware, and helped management understand the complexities and nuances of the system in a way competing products could never match.

Vindication

Everyone likes to be vindicated for taking a risk on a project of this scope. My vindication came when our monitoring system caught Bank of America’s first nationwide outage, many hours before they did. During the intervening hours we, of course, informed BofA that there was a growing malfunction within their infrastructure. Eventually, it became public.
