An emergency notification system in InvariMatch

After we optimized it and added several new features, InvariMatch could run on a cluster of machines and processed search requests much faster than before. It worked fine for the most part, but we still had to deal with occasional system failures.

Errors and downtime can be extremely harmful in highly loaded systems. In order for us to fix them as fast as possible, we needed to know about the problem as soon as it happened. Because of this, we thought about creating an emergency notification system that would check the key parameters of InvariMatch and notify us about any errors or issues.

On July 18, 2017, we started working on an algorithm that would automatically run a system diagnostic. It would check the number of working search cores and whether there were any problems storing core data, and monitor the data volume in each core and how that data was being processed on the nodes. The algorithm would also report a missing task queue or a disconnected system node, track available disk space, and notify us about system configuration failures and errors during data processing.
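To make the idea of a periodic diagnostic more concrete, here is a minimal sketch in Python. It is not the InvariMatch implementation: the names `Check`, `check_disk_space`, `check_working_cores` and `run_diagnostic`, the 10% free-space threshold, and the alert format are all assumptions made for illustration. Each check returns a short problem description or `None`, and the runner collects the problems into alert lines.

```python
import shutil
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One diagnostic probe (hypothetical structure, for illustration only)."""
    name: str
    probe: Callable[[], Optional[str]]  # returns a problem description, or None if healthy


def check_disk_space(path: str = "/", min_free_ratio: float = 0.10) -> Optional[str]:
    """Report a problem when the free share of the disk drops below the threshold."""
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total
    if free_ratio < min_free_ratio:
        return f"only {free_ratio:.0%} of disk space is free on {path}"
    return None


def check_working_cores(working: int, expected: int) -> Optional[str]:
    """Report a problem when fewer search cores are running than expected."""
    if working < expected:
        return f"{working} of {expected} search cores are working"
    return None


def run_diagnostic(checks: List[Check]) -> List[str]:
    """Run every check and return one alert line per detected problem."""
    alerts = []
    for check in checks:
        try:
            problem = check.probe()
        except Exception as exc:
            # A check that crashes is itself worth reporting.
            problem = f"check itself failed: {exc}"
        if problem:
            alerts.append(f"[!]> {check.name}: – {problem}")
    return alerts
```

In a setup like this, the list of checks would be run on a schedule, and any non-empty result would be passed on to whatever channel delivers the notifications.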

The development of the algorithm took a little over a month. We decided that the most convenient way to receive critical fault reports would be via Telegram, so we made a Telegram bot that would send us notifications whenever something went wrong. A message from this bot usually looks something like this:

[!]> Fault checking: – there are too many faults (about 100%) in video processing for an hour

[!]> Scan checking: – nothing has been scanning for three hours
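As an illustration of how such a message could be delivered, the sketch below posts a text to the `sendMessage` method of the Telegram Bot API using only the Python standard library. The function names `build_send_message_request` and `send_alert` are hypothetical, and the token and chat ID are placeholders; our own bot may be built differently.

```python
import json
import urllib.parse
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def build_send_message_request(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Bot API sendMessage method."""
    url = f"{TELEGRAM_API}/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=data, method="POST")


def send_alert(token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send one alert and return True if Telegram reports success."""
    request = build_send_message_request(token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # The Bot API wraps every response in a JSON object with an "ok" flag.
    return bool(payload.get("ok"))
```

Calling `send_alert(token, chat_id, "[!]> Scan checking: – nothing has been scanning for three hours")` with a real bot token and chat ID would deliver a message like the ones above.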

After receiving an error message, we check whether the failure is real or a false alarm and take the measures necessary to resolve it.

Adding the monitoring and emergency notification system allowed us to resolve minor faults in a timely manner and reduce the number of global failures.