5 Mins Read
Prometheus | Monitoring and Alerting
Prometheus, an open-source system and service monitoring tool originally built at SoundCloud in 2012, has been adopted by many organizations by virtue of its rich, user-friendly feature set. Its popularity has fostered a highly active developer and user community, and it is now a member project of the Cloud Native Computing Foundation.
As mentioned above, Prometheus offers varied capabilities, and it pairs well with Grafana for better visualization, Alertmanager for managing alerts, and the Node Exporter for collecting host metrics. Kristal.AI uses all of these together to derive maximum benefit.
The following sections walk through each of these components.
This diagram illustrates the architecture of Prometheus and some of its ecosystem components:
The Node Exporter enables Prometheus to collect essential metrics from a target server: it exposes the hardware and OS metrics reported by the kernel, which Prometheus scrapes and turns into a clear visual representation of otherwise complex information.
The component is written in Go, and to collect metrics effectively it should be installed on every node you want to retrieve metrics from.
Here is the latest version of the Node Exporter:
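A typical installation might look like the following; the version number is illustrative, so check the Prometheus downloads page for the current release:

```shell
# Version shown is illustrative; substitute the current release.
VERSION=1.7.0
TARBALL="node_exporter-${VERSION}.linux-amd64.tar.gz"

# Download, unpack, and start the exporter in the background.
wget "https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/${TARBALL}"
tar xzf "${TARBALL}"
cd "node_exporter-${VERSION}.linux-amd64"
./node_exporter &    # serves metrics on :9100 by default
```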
After successfully downloading and installing it, you can verify that it is working by curling the metrics endpoint:
curl -X GET http://localhost:9100/metrics
The metrics should look something like this:
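For reference, a few typical lines from the Node Exporter's output look like the following (the metric names are real Node Exporter metrics; the values are illustrative):

```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 312.4
node_cpu_seconds_total{cpu="0",mode="user"} 41.7
# HELP node_memory_MemAvailable_bytes Memory information field MemAvailable_bytes.
# TYPE node_memory_MemAvailable_bytes gauge
node_memory_MemAvailable_bytes 8.2e+09
```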
Prometheus stores its time-series data on the local disk; the default storage path is within the Prometheus setup directory.
Data is initially grouped into blocks of two hours.
Each block or directory contains one or more chunk files that contain time-series samples for that specific period of time, as well as a metadata file and an index file that indexes metric names and labels in relation to the time series.
These initial two-hour blocks are later compacted in the background into blocks spanning longer time periods, usually one block per day.
Additionally, there is a wal directory inside the data directory which helps in making Prometheus crash tolerant.
Incoming data is first held in memory rather than written straight to blocks; this is fast but volatile, so it cannot ensure persistence and there is a risk of data loss in the case of a crash. However, the wal directory contains write-ahead log files that can be replayed when the Prometheus server restarts after a crash. These files contain raw data that has not yet been compacted, so they are significantly larger than regular block files.
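Putting this together, the on-disk layout looks roughly like the following (the block directory names are ULIDs; these particular names and file counts are illustrative):

```
data/
├── 01BKGV7JBM69T2G1BGBGM6KB12    # one two-hour block
│   ├── chunks/
│   │   └── 000001                # time-series samples for this period
│   ├── index                     # indexes metric names and labels
│   ├── meta.json                 # block metadata
│   └── tombstones
├── 01BKGTZQ1SYQJTR4PB43C8PD98    # another block
│   └── ...
└── wal/                          # write-ahead log for crash recovery
    ├── 00000002
    └── checkpoint.00000001/
```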
The default storage path is the data directory inside the Prometheus directory, but Prometheus makes this easy to change: you can configure it with the --storage.tsdb.path flag. Another important configurable flag, --storage.tsdb.retention.time, determines when old data is removed; the default is 15 days.
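For example, a startup command overriding both flags might look like this (the path and retention period are illustrative):

```shell
./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/data \
  --storage.tsdb.retention.time=30d
```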
The local storage supported by Prometheus has certain limitations:
- It is not clustered or replicated
- It works as a single-node database, so it is not durable across node or disk outages
- It is only vertically scalable, which beyond a point becomes expensive
Here is the latest version of Prometheus:
In the configuration file, you specify the target servers' IP addresses and the ports where the Node Exporter is running so Prometheus can scrape metrics. You can also give each target a name using a label, so you do not have to memorize the IPs, and club multiple targets into one job, which helps when filtering metrics by name.
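A minimal scrape configuration along these lines might look like the following; the job name, label names, and target addresses are examples:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node'            # clubs all Node Exporter targets into one job
    static_configs:
      - targets: ['10.0.0.11:9100']
        labels:
          name: Prod1           # friendly name so we need not memorize IPs
      - targets: ['10.0.0.12:9100']
        labels:
          name: Prod2
```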
To help increase familiarity, we have captured how the Prometheus user interface looks:
Numerous metrics can now be collected by Prometheus, but it is up to us to make sense of them and take corrective action before a threshold is breached.
This is where the Alertmanager comes to your rescue. It handles alerts sent by client applications such as the Prometheus server and can process hundreds of alerts in one go.
Alertmanager provides built-in grouping of alerts: hundreds of alerts may arrive at any one time, but as a user you usually want a single notification while still being able to see exactly which service instances were affected. Alertmanager can therefore be configured to group alerts by cluster and alert name and send one compact notification. It also makes it easy to mute alerts for a certain period, avoiding unnecessary notifications during maintenance activities or when we are already well aware of the alert scenario.
Now, some handy tips. To enable notifications, you define the rules or thresholds in the Prometheus configuration. When a threshold is breached, Prometheus notifies the Alertmanager, which in turn alerts users via multiple channels such as email, Slack, and PagerDuty.
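The rule below is a sketch of this idea; the name label Prod1 and the use of the node_memory_MemAvailable_bytes metric are assumptions for illustration:

```yaml
groups:
  - name: memory-alerts
    rules:
      - alert: HostLowMemory
        # Fires when available memory stays below ~1GB for 5 minutes.
        expr: node_memory_MemAvailable_bytes{name="Prod1"} < 1e9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Available memory on {{ $labels.name }} has been below 1GB for 5 minutes"
```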
This simple rule means: if the available memory of the Prod1 instance stays below 1GB for 5 minutes, send an alert to the Alertmanager with severity critical.
We can define as many rules like this as we need.
In the Alertmanager configuration, we add our channel settings, such as Slack webhooks or email SMTP credentials, and we can also route alerts by severity level. For example, a route can specify that if the notification sent by Prometheus has severity WARNING, only a Slack alert is sent, while if it is CRITICAL, both Slack and email alerts are sent and the notification is repeated every 15 minutes.
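A route and receiver section implementing this policy might look like the following; the webhook URL, email address, and receiver names are placeholders:

```yaml
route:
  receiver: slack-alerts            # default receiver
  group_by: ['alertname', 'cluster']
  routes:
    - match:
        severity: WARNING
      receiver: slack-alerts        # Slack only
    - match:
        severity: CRITICAL
      receiver: slack-and-email     # Slack plus email
      repeat_interval: 15m          # re-notify every 15 minutes

receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'   # placeholder webhook
        channel: '#alerts'
  - name: slack-and-email
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/...'   # placeholder webhook
        channel: '#alerts'
    email_configs:
      - to: 'ops@example.com'       # SMTP credentials go in the global section
```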
To help increase familiarity, we have captured how Slack and email alerts look:
And that is how we at Kristal.AI always stay on track. We hope you had a good read.