Metrics¶
Building and testing an application (or microservice) is merely the first step in its life cycle. Once you enter production and start deploying your software, you constantly need to monitor it. Is it still running? How many actors do we have? How many requests can our system handle? Where are potential bottlenecks? Do we have resources to spare or do we need to allocate more? Are we keeping our SLAs?
In order to answer such high-level questions, powerful tools like Prometheus have emerged. However, such monitoring systems are only as good as the data you feed them.
The metrics API in CAF enables you to instrument your code for generating performance data. The API is vendor-neutral, but borrows many concepts as well as terminology from Prometheus. Currently, CAF can only export metrics to Prometheus. However, the API allows users to collect the metrics manually for writing custom integrations.
Note
All classes for instrumenting code live in the namespace caf::telemetry.
Metric Names and Labels¶
Each metric is uniquely identified by:
A prefix. This acts as a namespace for grouping metrics together. All metrics that CAF collects by itself use the prefix caf.
A name. This identifies the metric within the prefix. By convention, these names are all-lowercase and hyphenated. For example, running-actors.
Any number of label dimensions. Labels are key-value pairs that divide a metric into useful categories. For example, a metric that counts HTTP requests could split into method=get, method=put, method=post, etc. Aggregating all metrics by method would then yield the total amount.
Metrics that share prefix, name and label names form a metric family. This is also directly reflected in the API: the class metric_family bundles all shared attributes and stores all instances as children.
A metric family without labels always contains exactly one child. Hence, CAF calls such a metric a singleton in its API.
Note
CAF identifies metrics by prefix and name. Hence, families with the same prefix and name but different label names are prohibited.
Metric Types¶
CAF knows these types of metrics:
Counters. A counter represents a monotonically increasing value. For example, the total number of messages received by all actors, the total number of errors since starting the system, etc.
Gauges. A gauge represents a numerical value that can arbitrarily increase or decrease. For example, the current number of messages in all mailboxes, the number of running actors, etc.
Histograms. A histogram observes numerical values and counts them in (configurable) buckets. For example, sampling the processing time of messages t with buckets for 0ms ≤ t ≤ 1ms, 1ms < t ≤ 10ms, 10ms < t ≤ 100ms, and so on gives information on the usual response time and outliers. Histograms internally consist of counters and provide a relatively lightweight sampling mechanism. However, providing the right boundaries for the buckets can require some experimentation or experience.
Further, CAF provides two implementations for each metric type: one using int64_t as internal representation and one using double. Both implementations use atomic operations, but the former is usually more efficient on platforms such as x86. In user code, we recommend only using these type definitions:
dbl_counter for monotonically increasing floating point numbers
int_counter for monotonically increasing 64-bit integers
dbl_gauge for arbitrary floating point numbers
int_gauge for arbitrary 64-bit integers
dbl_histogram for sampling floating point numbers
int_histogram for sampling 64-bit integers
The associated headers are:
caf/telemetry/counter.hpp
caf/telemetry/gauge.hpp
caf/telemetry/histogram.hpp
Counters¶
Counters wrap an atomic count but only allow incrementing it. The class provides the following member functions:
/// Increments the counter by 1.
void inc() noexcept;
/// Increments the counter by `amount`.
/// @pre `amount > 0`
void inc(value_type amount) noexcept;
/// Returns the current value of the counter.
value_type value() const noexcept;
/// Increments the counter by 1.
/// @note only available if value_type == int64_t
value_type operator++() noexcept;
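To make the contract above concrete, here is a minimal standalone sketch of an integer counter with the same member functions. It is not CAF's implementation: the class name sketch_counter is hypothetical, and the sketch assumes that operator++ returns the new value, as documented for gauges.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an integer counter; not CAF's actual class.
class sketch_counter {
public:
  using value_type = std::int64_t;

  // Increments the counter by 1.
  void inc() noexcept {
    value_.fetch_add(1);
  }

  // Increments the counter by `amount`. The precondition keeps the value
  // monotonically increasing.
  void inc(value_type amount) noexcept {
    assert(amount > 0);
    value_.fetch_add(amount);
  }

  // Increments the counter by 1 and returns the new value (assumption).
  value_type operator++() noexcept {
    return value_.fetch_add(1) + 1;
  }

  // Returns the current value of the counter.
  value_type value() const noexcept {
    return value_.load();
  }

private:
  std::atomic<value_type> value_{0};
};
```

Because the interface offers no way to decrement or reset, any reader of the counter can rely on the value never going down.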
Gauges¶
Like counters, gauges also wrap an atomic count. However, gauges are more permissive and allow decrementing as well as setting the value directly.
/// Increments the gauge by 1.
void inc() noexcept;
/// Increments the gauge by `amount`.
void inc(value_type amount) noexcept;
/// Decrements the gauge by 1.
void dec() noexcept;
/// Decrements the gauge by `amount`.
void dec(value_type amount) noexcept;
/// Sets the gauge to `x`.
void value(value_type x) noexcept;
/// Increments the gauge by 1.
/// @returns The new value of the gauge.
/// @note only available if value_type == int64_t
value_type operator++() noexcept;
/// Decrements the gauge by 1.
/// @returns The new value of the gauge.
/// @note only available if value_type == int64_t
value_type operator--() noexcept;
/// Returns the current value of the gauge.
value_type value() const noexcept;
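The following sketch models a floating point gauge with the member functions listed above. The class name sketch_dbl_gauge is hypothetical and the code is not taken from CAF. It uses a compare-and-swap loop for the addition because std::atomic<double>::fetch_add only exists since C++20. Whether CAF uses such a loop internally is not stated here, but the extra work illustrates one plausible reason why integer metrics tend to be cheaper.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a floating point gauge; not CAF's actual class.
class sketch_dbl_gauge {
public:
  using value_type = double;

  // Increments the gauge by `amount` (1 by default).
  void inc(value_type amount = 1.0) noexcept {
    add(amount);
  }

  // Decrements the gauge by `amount` (1 by default).
  void dec(value_type amount = 1.0) noexcept {
    add(-amount);
  }

  // Sets the gauge to `x`.
  void value(value_type x) noexcept {
    value_.store(x);
  }

  // Returns the current value of the gauge.
  value_type value() const noexcept {
    return value_.load();
  }

private:
  // Portable atomic addition: retry until no other thread changed the value
  // between our read and our write.
  void add(value_type amount) noexcept {
    auto old = value_.load();
    while (!value_.compare_exchange_weak(old, old + amount)) {
      // On failure, `old` holds the latest value and the loop tries again.
    }
  }

  std::atomic<value_type> value_{0.0};
};
```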
Histogram¶
Histograms consist of one counter per bucket as well as a gauge for the sum of all observed values (values may be negative).
/// Increments the bucket where the observed value falls into and increments
/// the sum of all observed values.
void observe(value_type value);
/// Returns the sum of all observed values.
value_type sum() const noexcept;
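As an illustration of how observe can map a value to a bucket, here is a small standalone model of an integer histogram. All names (sketch_histogram, count_in_bucket, num_buckets) are hypothetical and the code is not CAF's implementation. The sketch assumes ascending, inclusive upper bounds plus one catch-all bucket for values above the last bound.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical stand-in for an integer histogram; not CAF's actual class.
class sketch_histogram {
public:
  using value_type = std::int64_t;

  // `upper_bounds` must be sorted in ascending order.
  explicit sketch_histogram(std::vector<value_type> upper_bounds)
    : bounds_(std::move(upper_bounds)), counts_(bounds_.size() + 1) {
    assert(std::is_sorted(bounds_.begin(), bounds_.end()));
    // Extra bucket that catches everything above the last upper bound.
    bounds_.push_back(std::numeric_limits<value_type>::max());
  }

  // Increments the bucket for `value` and adds `value` to the sum.
  void observe(value_type value) {
    // Find the first bucket whose upper bound is >= value.
    auto pos = std::lower_bound(bounds_.begin(), bounds_.end(), value);
    auto index = static_cast<std::size_t>(pos - bounds_.begin());
    counts_[index].fetch_add(1);
    sum_.fetch_add(value);
  }

  // Returns the sum of all observed values.
  value_type sum() const noexcept {
    return sum_.load();
  }

  // Returns how many observations fell into the bucket at `index`.
  value_type count_in_bucket(std::size_t index) const {
    return counts_[index].load();
  }

  // Returns the number of buckets, including the catch-all bucket.
  std::size_t num_buckets() const noexcept {
    return bounds_.size();
  }

private:
  std::vector<value_type> bounds_;
  std::vector<std::atomic<value_type>> counts_;
  std::atomic<value_type> sum_{0};
};
```

Each call to observe performs a binary search over the bounds followed by two atomic additions, so the cost per sample stays small even with many buckets.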
Metric Units and Flags¶
All metric types store numerical values, either as double or as int64_t. For giving this number additional semantics, CAF allows assigning units (of measurement) to metrics. The default unit is 1, which denotes dimensionless counts such as the number of messages in a mailbox.
The unit can be any string, but we recommend using only base units such as seconds or bytes to make processing of these metrics with monitoring systems easier.
Each metric also carries one flag: is-sum. Setting this to true (the default is false) indicates that this metric adds something up to a total where only the total value is of interest. For example, the total number of HTTP requests. CAF itself does not care about the flag, but it can give extra information to collectors or exporters. For example, the Prometheus exporter will add a _total suffix to the exported metric name.
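To illustrate how an exporter could use the unit and the is-sum flag, the sketch below assembles a Prometheus-style metric name. The function exported_name is hypothetical, and the exact rules of CAF's Prometheus exporter may differ. Only the _total suffix comes from the text above. Replacing separators with underscores reflects that Prometheus metric names may not contain hyphens or dots, and appending the unit is an assumption based on common Prometheus naming practice.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper; the exact rules of CAF's Prometheus exporter may differ.
std::string exported_name(const std::string& prefix, const std::string& name,
                          const std::string& unit, bool is_sum) {
  std::string result = prefix + "_" + name;
  // Prometheus metric names may not contain '-' or '.'.
  for (auto& ch : result)
    if (ch == '-' || ch == '.')
      ch = '_';
  // Assumption: append the unit unless the metric is dimensionless.
  if (unit != "1")
    result += "_" + unit;
  // Metrics with the is-sum flag receive a _total suffix.
  if (is_sum)
    result += "_total";
  return result;
}
```

Under these assumptions, a family with prefix http, name request-duration and unit seconds would appear as http_request_duration_seconds.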
Timers¶
When instrumenting code, timers offer a convenient way for measuring the duration of individual operations.
The metrics API in CAF includes histograms for sampling observations over time. For example, how long it takes to handle incoming requests or to perform some expensive operations.
Sampling time manually is quite tedious, though, as illustrated by this snippet:
caf::telemetry::dbl_histogram* my_histogram = nullptr;
// ... some place later ...
auto t0 = std::chrono::steady_clock::now();
// ... expensive operation ...
auto delta = std::chrono::steady_clock::now() - t0;
// ... convert delta to fractional seconds and pass to my_histogram ...
To automate this process, CAF includes timers. They simply store the current time when created and pass the elapsed time since construction to a histogram when destroyed. Hence, we can replace the verbose version from before by putting a timer into the scope of the expensive operation and taking advantage of RAII:
caf::telemetry::dbl_histogram* my_histogram = nullptr;
// ... some place later ...
{
  auto t = caf::telemetry::timer{my_histogram};
  // ... expensive operation ...
}
The constructor of timer also accepts a nullptr. This accounts for the fact that some metrics may be disabled by default.
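The mechanics of such a timer fit into a few lines. The sketch below is a standalone model, not CAF's class: sketch_timer and sketch_sink are hypothetical names, and sketch_sink merely stands in for a dbl_histogram so that the example compiles without CAF.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical sink that stands in for a dbl_histogram in this sketch.
struct sketch_sink {
  double last = 0.0;
  int samples = 0;

  void observe(double seconds) {
    last = seconds;
    ++samples;
  }
};

// Hypothetical RAII timer; not CAF's actual class.
class sketch_timer {
public:
  using clock_type = std::chrono::steady_clock;

  explicit sketch_timer(sketch_sink* sink) : sink_(sink) {
    // Skip reading the clock if the metric is disabled.
    if (sink_)
      start_ = clock_type::now();
  }

  ~sketch_timer() {
    if (sink_) {
      // duration<double> converts the elapsed ticks to fractional seconds.
      std::chrono::duration<double> elapsed = clock_type::now() - start_;
      sink_->observe(elapsed.count());
    }
  }

  sketch_timer(const sketch_timer&) = delete;
  sketch_timer& operator=(const sketch_timer&) = delete;

private:
  sketch_sink* sink_;
  clock_type::time_point start_;
};
```

The null check in both the constructor and the destructor is what makes a disabled metric cheap: no clock is read and no sample is recorded.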
The Metric Registry¶
All metrics of an actor system are managed by a single registry to make sure only one metric instance exists per prefix and name combination. Keeping all metrics in one place also allows collectors to iterate over them.
A minimal custom collector class requires providing operator() overloads as shown below:
class my_collector {
public:
  void operator()(const metric_family* family, const metric* instance,
                  const dbl_counter* impl);
  void operator()(const metric_family* family, const metric* instance,
                  const int_counter* impl);
  void operator()(const metric_family* family, const metric* instance,
                  const dbl_gauge* impl);
  void operator()(const metric_family* family, const metric* instance,
                  const int_gauge* impl);
  void operator()(const metric_family* family, const metric* instance,
                  const dbl_histogram* impl);
  void operator()(const metric_family* family, const metric* instance,
                  const int_histogram* impl);
};
Applying the collector to the registry looks as follows (with sys being a reference to an actor_system):
my_collector f;
sys.metrics().collect(f);
The associated header is caf/telemetry/metric_registry.hpp.
Accessing Metrics¶
Accessing a metric is a three-step process:
Get the metric_registry from the actor system.
Get the metric_family from the registry.
Call get_or_add on the family to get a pointer to the counter, gauge, or histogram.
The pointer remains valid until the actor system gets destroyed. Hence, holding on to the pointer in an actor is always safe.
The registry creates metrics lazily (to be more precise, it creates families lazily that in turn create metric instances lazily). Since this requires synchronization via mutexes, we recommend accessing the registry only once per metric and then storing the pointer.
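The recommendation to look up a metric once and keep the pointer can be modeled without CAF. In the sketch below, sketch_family and sketch_worker are hypothetical names and the code is not CAF's implementation. It only assumes what the text states: lookups take a lock, children are created on first access, and pointers stay valid afterwards.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Hypothetical stand-in for a counter family; not CAF's actual class.
class sketch_family {
public:
  using counter = std::atomic<std::int64_t>;

  // Returns the child for `label_value`, creating it on first access.
  counter* get_or_add(const std::string& label_value) {
    std::lock_guard<std::mutex> guard{mtx_};
    auto& slot = children_[label_value];
    if (!slot)
      slot = std::make_unique<counter>(0);
    return slot.get();
  }

private:
  std::mutex mtx_;
  // unique_ptr keeps each child at a fixed address while the map grows.
  std::map<std::string, std::unique_ptr<counter>> children_;
};

// Caches the pointer once, then updates the metric without any locking.
class sketch_worker {
public:
  explicit sketch_worker(sketch_family& family)
    : put_requests_(family.get_or_add("put")) {
    // nop
  }

  void on_put_request() {
    put_requests_->fetch_add(1);
  }

private:
  sketch_family::counter* put_requests_;
};
```

The lock is taken once per worker in the constructor. Every later update is a single atomic operation on the cached pointer.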
Accessing Counters and Gauges¶
Counters and gauges are very similar in their API. Hence, all functions that work on gauges only require replacing gauge with counter to work with counters instead.
Gauges are owned (and created) by a gauge family object. We can either get the family object explicitly by calling gauge_family, or we can use one of the two shortcut functions gauge_instance or gauge_singleton. The C++ prototypes for the registry member functions look as follows:
template <class ValueType = int64_t>
auto* gauge_family(string_view prefix, string_view name,
                   span<const string_view> labels, string_view helptext,
                   string_view unit = "1", bool is_sum = false);
template <class ValueType = int64_t>
auto* gauge_instance(string_view prefix, string_view name,
                     span<const label_view> labels, string_view helptext,
                     string_view unit = "1", bool is_sum = false);
template <class ValueType = int64_t>
auto* gauge_singleton(string_view prefix, string_view name,
                      string_view helptext, string_view unit = "1",
                      bool is_sum = false);
Note
All functions that take a span also provide an overload that accepts a std::initializer_list instead to make working with constants easier.
The function gauge_family returns a type-specific metric family object, while the other two functions return the gauge directly.
The family objects only have a single noteworthy member function, get_or_add:
// Requests are a dimensionless count, hence the unit "1".
auto fptr = registry.counter_family("http", "requests", {"method"},
                                    "Number of HTTP requests.", "1", true);
auto count = fptr->get_or_add({{"method", "put"}});
If we only get a single counter from the family, we can use counter_instance instead:
auto count = registry.counter_instance("http", "requests",
                                       {{"method", "put"}},
                                       "Number of HTTP requests.", "1", true);
Accessing Histograms¶
The member functions for accessing histogram families and histograms follow the same pattern as the member functions for counters and gauges.
template <class ValueType = int64_t>
auto* histogram_family(string_view prefix, string_view name,
                       span<const string_view> label_names,
                       span<const ValueType> default_upper_bounds,
                       string_view helptext, string_view unit = "1",
                       bool is_sum = false);
template <class ValueType = int64_t>
auto* histogram_instance(string_view prefix, string_view name,
                         span<const label_view> label_names,
                         span<const ValueType> default_upper_bounds,
                         string_view helptext, string_view unit = "1",
                         bool is_sum = false);
template <class ValueType = int64_t>
auto* histogram_singleton(string_view prefix, string_view name,
                          span<const ValueType> default_upper_bounds,
                          string_view helptext, string_view unit = "1",
                          bool is_sum = false);
Compared to the member functions for counters and gauges, histograms require one additional argument for the default bucket upper bounds.
Warning
The default_upper_bounds parameter must be sorted!
CAF automatically adds one additional bucket for observing all values between the last upper bound and infinity (double) or INT64_MAX (int64_t). For example, passing [10, 100, 1000] as upper bounds creates four buckets in total. The first bucket captures all values with x ≤ 10. The second bucket captures all values with 10 < x ≤ 100. The third bucket captures all values with 100 < x ≤ 1000. Finally, the fourth bucket (added automatically) captures all values with 1000 < x ≤ INT64_MAX.
Configuration Parameters¶
Histograms use the actor system configuration to enable users to override hard-coded default bucket settings. On construction, the histogram family checks whether a key caf.metrics.${prefix}.${name}.buckets exists. Further, the metric instance also checks on construction whether a more specific bucket setting for one of its label dimensions exists.
For example, suppose we add a histogram family with prefix http, name request-duration, and label dimension method to the registry. The family first tries to read caf.metrics.http.request-duration.buckets from the configuration and otherwise falls back to the hard-coded defaults. When creating a histogram instance from the family with the label method=put, the constructor first tries to read caf.metrics.http.request-duration.method=put.buckets from the configuration and otherwise uses the default for the family.
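The lookup order described above can be sketched as a small function. The names resolve_buckets and sketch_config are hypothetical, and the flat string-keyed map is only a stand-in for the actor system configuration. CAF's actual lookup code may be structured differently.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

using bucket_list = std::vector<double>;

// Hypothetical flat view of a configuration: full key -> bucket settings.
using sketch_config = std::map<std::string, bucket_list>;

// Resolves the buckets for one instance: the label-specific key wins, then
// the family-wide key, then the hard-coded defaults.
bucket_list resolve_buckets(const sketch_config& cfg, const std::string& prefix,
                            const std::string& name, const std::string& label,
                            const bucket_list& hard_coded_defaults) {
  const std::string family_key = "caf.metrics." + prefix + "." + name;
  auto i = cfg.find(family_key + "." + label + ".buckets");
  if (i != cfg.end())
    return i->second;
  i = cfg.find(family_key + ".buckets");
  if (i != cfg.end())
    return i->second;
  return hard_coded_defaults;
}
```

With a label-specific entry for method=put and a family-wide entry, an instance labeled method=get would receive the family-wide buckets, while a family without any configuration entry keeps its hard-coded defaults.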
In a configuration file, users may provide bucket settings like this:
caf {
  metrics {
    http {
      # measures the duration per HTTP request in seconds
      request-duration {
        buckets = [
          0.001, # ≤ 1ms
          0.01, # ≤ 10ms
          0.05, # ≤ 50ms
          0.1, # ≤ 100ms
          0.25, # ≤ 250ms
          0.5, # ≤ 500ms
          0.75, # ≤ 750ms
        ]
        # use different settings for put requests
        "method=put" {
          buckets = [
            0.007, # ≤ 7ms
            0.012, # ≤ 12ms
            0.025, # ≤ 25ms
            0.05, # ≤ 50ms
            0.1, # ≤ 100ms
          ]
        }
      }
    }
  }
}
Note
Ambiguous settings for metrics with multiple label dimensions will result in CAF picking the first match from an unspecified order. Hence, prefer using only one label dimension for configuring buckets or otherwise make sure there is always exactly one match for instance labels.
Performance Considerations¶
Instrumenting code should affect the performance as little as possible. Keep in mind that each member function on the registry has to acquire a lock. Ideally, applications call functions such as gauge_family once during setup and then store the family pointer to create metric instances later.
Ideally, there is a single occurrence in the code for getting the family object from the registry and a single occurrence in the code for getting the gauge/counter/histogram object from the family (get_or_add also has to acquire a lock).
All operations on gauges, counters and histograms use atomic operations. Depending on the type, CAF internally uses std::atomic<int64_t> or std::atomic<double>. Adding a sample to a histogram requires two atomic operations: one for the bucket and one for the sum.
Atomic operations are reasonably fast, but we still recommend avoiding them in tight loops.
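One common way to keep atomics out of a tight loop is to accumulate into a local variable and publish the result once. The sketch below uses a plain std::atomic as a hypothetical stand-in for an int_counter. With a real counter, the final step would be a single call to inc with the accumulated amount.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an int_counter.
using sketch_counter = std::atomic<std::int64_t>;

// Counts positive inputs with one atomic operation per call instead of one
// per element.
void count_positives(const std::vector<int>& inputs, sketch_counter& counter) {
  std::int64_t local = 0;
  for (auto x : inputs)
    if (x > 0)
      ++local; // plain increment, no synchronization
  // Publish the batch once. The check mirrors the `amount > 0` precondition
  // of a counter's inc member function.
  if (local > 0)
    counter.fetch_add(local);
}
```

The trade-off is that observers only see the update after the loop finishes, which is usually acceptable for monitoring data.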
Builtin Metrics¶
CAF collects a set of builtin metrics in order to provide insights into the actor system and its modules. Some are always collected while others require configuration by the user.
Base Metrics¶
The actor system collects this set of metrics always by default (note that all caf.middleman metrics only appear when loading the I/O module).
- caf.system.running-actors
  Tracks the current number of running actors in the system.
  Type: int_gauge
  Label dimensions: none.
- caf.system.processed-messages
  Counts the total number of processed messages.
  Type: int_counter
  Label dimensions: none.
- caf.system.rejected-messages
  Counts the number of messages that were rejected because the target mailbox was closed or did not exist.
  Type: int_counter
  Label dimensions: none.
- caf.middleman.inbound-messages-size
  Samples the size of inbound messages before deserializing them.
  Type: int_histogram
  Unit: bytes
  Label dimensions: none.
- caf.middleman.outbound-messages-size
  Samples the size of outbound messages after serializing them.
  Type: int_histogram
  Unit: bytes
  Label dimensions: none.
- caf.middleman.deserialization-time
  Samples how long the middleman needs to deserialize inbound messages.
  Type: dbl_histogram
  Unit: seconds
  Label dimensions: none.
- caf.middleman.serialization-time
  Samples how long the middleman needs to serialize outbound messages.
  Type: dbl_histogram
  Unit: seconds
  Label dimensions: none.
Actor Metrics and Filters¶
Unlike the base metrics, actor metrics are off by default. Applications can spawn thousands of actors, with many only existing for a brief time. Hence, blindly collecting data from all actors in the system can impact the performance and also produce a lot of irrelevant noise.
To make sure CAF only collects actor metrics that are relevant to the user, the actor system configuration provides two lists: caf.metrics-filters.actors.includes and caf.metrics-filters.actors.excludes. CAF collects metrics for all actors that have names that are selected by the includes list and are not selected by the excludes list. Entries in the list can use glob-style syntax, in particular *-wildcards. For example:
caf {
  metrics-filters {
    actors {
      includes = [ "foo.*" ]
      excludes = [ "foo.bar" ]
    }
  }
}
The configuration above would select all actors with names that start with foo. except for actors named foo.bar.
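The selection rule can be expressed as a short predicate. The sketch below is a standalone model with hypothetical names (wildcard_match, any_match, selected). It only supports the * wildcard and is not the glob engine that CAF uses, which may recognize additional syntax.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Matches `text` against `pattern`, where '*' matches any (possibly empty)
// sequence of characters. Simplified: no other glob syntax is supported.
bool wildcard_match(const std::string& pattern, const std::string& text) {
  std::size_t p = 0;
  std::size_t t = 0;
  std::size_t star = std::string::npos;
  std::size_t mark = 0;
  while (t < text.size()) {
    if (p < pattern.size() && pattern[p] == '*') {
      star = p++; // remember the wildcard position
      mark = t;   // and where it started matching
    } else if (p < pattern.size() && pattern[p] == text[t]) {
      ++p;
      ++t;
    } else if (star != std::string::npos) {
      p = star + 1; // backtrack: let the last '*' absorb one more character
      t = ++mark;
    } else {
      return false;
    }
  }
  // Trailing wildcards may match the empty string.
  while (p < pattern.size() && pattern[p] == '*')
    ++p;
  return p == pattern.size();
}

// Returns whether at least one pattern matches `name`.
bool any_match(const std::vector<std::string>& patterns,
               const std::string& name) {
  for (const auto& pattern : patterns)
    if (wildcard_match(pattern, name))
      return true;
  return false;
}

// An actor is selected if some include pattern matches its name and no
// exclude pattern does.
bool selected(const std::vector<std::string>& includes,
              const std::vector<std::string>& excludes,
              const std::string& name) {
  return any_match(includes, name) && !any_match(excludes, name);
}
```

Note that an empty includes list selects nothing in this model, which matches the statement that actor metrics are off by default.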
Note
Names belong to actor types. CAF assigns names such as user.scheduled-actor by default. To provide a custom name, either override the member function const char* name() const when implementing class-based actors or add a static member variable static inline const char* name = "..." to your state class when using stateful actors.
CAF uses a hierarchical, hyphenated naming scheme with . as the separator and all-lowercase name components. For example, caf.system.spawn-server. Users may follow this naming scheme for consistency, but CAF does not enforce any structure on the names. However, we do recommend avoiding whitespace and special characters that the glob engine recognizes, such as *, /, etc.
For all actors that are selected by the user-defined filters, CAF collects this set of metrics:
- caf.actor.processing-time
  Samples how long the actor needs to process messages.
  Type: dbl_histogram
  Unit: seconds
  Label dimensions: name.
- caf.actor.mailbox-time
  Samples how long messages wait in the mailbox before being processed.
  Type: dbl_histogram
  Unit: seconds
  Label dimensions: name.
- caf.actor.mailbox-size
  Counts how many messages are currently waiting in the mailbox.
  Type: int_gauge
  Label dimensions: name.
- caf.actor.stream.processed-elements
  Counts the total number of processed stream elements from upstream.
  Type: int_counter
  Label dimensions: name, type.
- caf.actor.stream.input-buffer-size
  Tracks how many stream elements from upstream are currently buffered.
  Type: int_gauge
  Label dimensions: name, type.
- caf.stream.pushed-elements
  Counts the total number of elements that have been pushed downstream.
  Type: int_counter
  Label dimensions: name, type.
- caf.stream.output-buffer-size
  Tracks how many stream elements are currently waiting in the output buffer.
  Type: int_gauge
  Label dimensions: name, type.
Exporting Metrics to Prometheus¶
The network module in CAF comes with builtin support for exporting metrics to Prometheus via HTTP. However, this feature is off by default since CAF generally avoids opening ports without explicit user input.
During startup, the middleman enables the export of metrics when the configuration provides a valid value (0 to 65535) for caf.middleman.prometheus-http.port as shown in the example config file below.
caf {
  middleman {
    prometheus-http {
      # listen for incoming HTTP requests on port 8080 (required parameter)
      port = 8080
      # the bind address (optional parameter; default is 0.0.0.0)
      address = "0.0.0.0"
      # optionally enable TLS for the prometheus server. Disabled by default.
      tls {
        key-file = "/path/to/key.pem"
        cert-file = "/path/to/cert.pem"
      }
    }
  }
}