Metrics

Building and testing an application (or microservice) is merely the first step in its lifetime cycle. Once you enter production and start deploying your software, you constantly need to monitor it. Is it still running? How many actors do we have? How much requests can our system handle? Where are potential bottlenecks? Do we have resources to spare or do we need to allocate more? Are we keeping our SLAs?

In order to answer such high-level questions, powerful tools like Prometheus have emerged. However, such monitoring systems are only as good as the data you feed it.

The metrics API in CAF enables you to instrument your code for generating performance data. The API is vendor-neutral, but borrows many concepts as well as terminology from Prometheus. Currently, CAF can only export metrics to Prometheus. However, the API allows users to collect the metrics manually for writing custom integrations.

Note

All classes for instrumenting code live in the namespace caf::telemetry.

Metric Names and Labels

Each metric is uniquely identified by:

  • A prefix. This acts as a namespace for grouping metrics together. All metrics that CAF collects by itself use the prefix caf.
  • A name. This identifies the metric within the prefix. By convention, these names are all-lowercase and hyphenated. For example, running-actors.
  • Any number of label dimensions. Labels are key-value pairs that divide a metric into useful categories. For example, a metric that counts HTTP requests could split into method=get, method=put, method=post, etc. Aggregating all metrics by method would then yield the total amount.

Metrics that share prefix, name and label names form a metric family. This is also directly reflected in the API: the class metric_family bundles all shared attributes and stores all instances as children.

A metric family without labels always contains exactly one child. Hence, CAF calls this metric singleton in its API.

Note

CAF identifies metrics by prefix and name. Hence, families with the same prefix and name but different label names are prohibited.

Metric Types

CAF knows these types of metrics:

  1. Counters. A counter represents a monotonically increasing value. For example, the total number of messages received by all actors, the total number of errors since starting the system, etc.
  2. Gauges. A gauge represents a numerical value that can arbitrarily increase or decrease. For example, the current number of messages in all mailboxes, the number of running actors, etc.
  3. Histograms. A histogram observes numerical values and counts them in (configurable) buckets. For example, sampling the processing time of messages t with buckets for 0ms t 1ms, 1ms < t 10ms, 10ms < t 100ms, and so on gives information on the usual response time and outliers. Histograms internally consist of counters and provide a relatively lightweight sampling mechanism. However, providing the right boundaries for the buckets can require some experimentation or experience.

Further, CAF provides two implementations for each metric type: one using int64_t as internal representation and one using double. Both implementations use atomic operations, but the former is usually more efficient on platforms such as x86. In user code, we recommend only using these type definitions:

  • dbl_counter for monotonically increasing floating point numbers
  • int_counter for monotonically increasing 64-bit integers
  • dbl_gauge for arbitrary floating point numbers
  • int_gauge for arbitrary 64-bit integers
  • dbl_histogram for sampling floating point numbers
  • int_histogram for sampling 64-bit integers

The associated headers are:

  • caf/telemetry/counter.hpp
  • caf/telemetry/gauge.hpp
  • caf/telemetry/histogram.hpp

Counters

Counters wrap an atomic count but only allows incrementing it. The class provides the following member functions:

/// Increments the counter by 1.
void inc() noexcept;

/// Increments the counter by `amount`.
/// @pre `amount > 0`
void inc(value_type amount) noexcept;

/// Returns the current value of the counter.
value_type value() const noexcept;

/// Increments the counter by 1.
/// @note only available if value_type == int64_t
value_type operator++() noexcept;

Gauges

Like counters, gauges also wrap an atomic count. However, gauges are less permissive and allow decrementing as well.

/// Increments the gauge by 1.
void inc() noexcept;

/// Increments the gauge by `amount`.
void inc(value_type amount) noexcept;

/// Decrements the gauge by 1.
void dec() noexcept;

/// Decrements the gauge by `amount`.
void dec(value_type amount) noexcept;

/// Sets the gauge to `x`.
void value(value_type x) noexcept;

/// Increments the gauge by 1.
/// @returns The new value of the gauge.
/// @note only available if value_type == int64_t
value_type operator++() noexcept;

/// Decrements the gauge by 1.
/// @returns The new value of the gauge.
/// @note only available if value_type == int64_t
value_type operator--() noexcept;

/// Returns the current value of the gauge.
value_type value() const noexcept;

Histogram

Histograms consist of one counter per bucket as well as a gauge for the sum of all observed values (values may be negative).

/// Increments the bucket where the observed value falls into and increments
/// the sum of all observed values.
void observe(value_type value);

/// Returns the sum of all observed values.
value_type sum() const noexcept;

Metric Units and Flags

All metric types store numerical values, either as double or as int64_t. For giving this number additional semantics, CAF allows assigning units (of measurement) to metrics. The default unit is 1, which denotes dimensionless counts such as the number of messages in a mailbox.

The unit can be any string, but we recommend using only base units such as seconds or bytes to make processing of these metrics with monitoring systems easier.

Each metric also carries one flag: is-sum. Setting this to true (the default is false) indicates that this metric adds something up to a total where only the total value is of interest. For example, the total number of HTTP requests. CAF itself does not care about the flag, but it can give extra information to collectors or exporters. For example, the Prometheus exporter will add a _total suffix to the exported metric name.

The Metric Registry

All metrics of an actor system are managed by a single registry to make sure only one metric instance exists per prefix and name combination. Further, the registry stores all metrics in a single place to allow collectors to iterate over all metrics in a single place.

A minimal custom collector class requires providing operator() overloads as shown below:

class my_collector {
public:
  void operator()(const metric_family* family, const metric* instance,
                  const dbl_counter* impl);

  void operator()(const metric_family* family, const metric* instance,
                  const int_counter* impl);

  void operator()(const metric_family* family, const metric* instance,
                  const dbl_gauge* impl);

  void operator()(const metric_family* family, const metric* instance,
                  const int_gauge* impl);

  void operator()(const metric_family* family, const metric* instance,
                  const dbl_histogram* impl);

  void operator()(const metric_family* family, const metric* instance,
                  const int_histogram* impl);
};

Applying the collector to the registry looks as follows (with sys being a reference to an actor_system):

my_collector f;
sys.metrics().collect(f);

The associated headers is caf/telemetry/metric_registry.hpp.

Accessing Metrics

Accessing a metric is a three-step process:

  1. Get the metric_registry from the actor system.
  2. Get the metric_family from the registry.
  3. Call get_or_add on the family to get a pointer to the counter, gauge, or histogram.

The pointer remains valid until the actor system gets destroyed. Hence, holding on to the pointer in an actor is always safe.

The registry creates metrics lazily (to be more precise, it creates families lazily that in turn create metric instances lazily). Since this requires synchronization via mutexes, we recommend to only access the registry once per metric and then store the pointer.

Accessing Counters and Gauges

Counters and gauges are very similar in their API. Hence, all functions that work on gauges only require replacing gauge with counter to work with counters instead.

Gauges are owned (and created) by a gauge family object. We can either get the family object explicitly by calling gauge_family, or we can use one of the two shortcut functions gauge_intance or gauge_singleton. The C++ prototypes for the registry member functions look as follows:

template <class ValueType = int64_t>
auto* gauge_family(string_view prefix, string_view name,
                   span<const string_view> labels, string_view helptext,
                   string_view unit = "1", bool is_sum = false);

template <class ValueType = int64_t>
auto* gauge_instance(string_view prefix, string_view name,
                     span<const label_view> labels, string_view helptext,
                     string_view unit = "1", bool is_sum = false);

template <class ValueType = int64_t>
auto* gauge_singleton(string_view prefix, string_view name,
                      string_view helptext, string_view unit = "1",
                      bool is_sum = false);

Note

All functions that take a span also provide an overload that accepts a std::initializer_list instead to make working with constants easier.

The function gauge_family returns a type-specific metric family object, while the other two functions return the gauge directly.

The family objects only have a single noteworthy member function, get_or_add:

auto fptr = registry.counter_family("http", "requests", {"method"},
                                    "Number of HTTP requests.", "seconds",
                                    true);
auto count = fptr->get_or_add({{"method", "put"}});

If we only get a single counter from the family, we can use counter_instance instead:

auto count = registry.counter_instance("http", "requests",
                                       {{"method", "put"}},
                                       "Number of HTTP requests.",
                                       "seconds", true);

Accessing Histograms

The member functions for accessing histogram families and histograms follow the same pattern as the member functions for counters and gauges.

template <class ValueType = int64_t>
auto* histogram_family(string_view prefix, string_view name,
                       span<const string_view> label_names,
                       span<const ValueType> default_upper_bounds,
                       string_view helptext, string_view unit = "1",
                       bool is_sum = false);

template <class ValueType = int64_t>
auto* histogram_instance(string_view prefix, string_view name,
                         span<const label_view> label_names,
                         span<const ValueType> default_upper_bounds,
                         string_view helptext, string_view unit = "1",
                         bool is_sum = false);

template <class ValueType = int64_t>
auto* histogram_singleton(string_view prefix, string_view name,
                          span<const ValueType> default_upper_bounds,
                          string_view helptext, string_view unit = "1",
                          bool is_sum = false);

Compared to the member functions for counters and guages, histograms require one addition argument for the default bucket upper bounds.

Warning

The default_upper_bounds parameter must be sorted!

CAF automatically adds one additional bucket for observing all values between the last upper bound and infinity (double) or INT_MAX (int64_t). For example, passing [10, 100, 1000] as upper bounds creates four buckets in total. The first bucket captues all values with x 10. The second bucket captues all values with 10 < x 100. The third bucket captures all values with 100 < x 1000. Finally, the fourth bucket (added automatically) captures all values with 1000 < x INT_MAX.

Configuration Parameters

Histograms use the actor system configuration to enable users to override hard-coded default bucket settings. On construction, the histogram family check whether a key caf.metrics.${prefix}.${name}.buckets exists. Further, the metric instance also checks on construction whether a more specific bucket setting for one of its label dimensions exist.

For example, consider we add a histogram family with prefix http, name request-duration, and label dimension method to the registry. The family first tries to read caf.metrics.http.request-duration.buckets from the configuration and otherwise falls back to the hard-coded defaults. When creating a histogram instance from the family with the label method=put, the construct first tries to read caf.metrics.http.request-duration.method=put.buckets from the configuration and otherwise uses the default for the family.

In a configuration file, users may provide bucket settings like this:

caf {
  metrics {
    http {
      # measures the duration per HTTP request in seconds
      request-duration {
        buckets = [
          0.001, # ≤   1ms
          0.01,  # ≤  10ms
          0.05,  # ≤  50ms
          0.1,   # ≤ 100ms
          0.25,  # ≤ 250ms
          0.5,   # ≤ 500ms
          0.75,  # ≤ 750ms
        ]
        # use different settings for get requests
        "method=put" {
          buckets = [
            0.007, # ≤   7ms
            0.012, # ≤  12ms
            0.025, # ≤  25ms
            0.05,  # ≤  50ms
            0.1,   # ≤ 100ms
          ]
        }
      }
    }
  }
}

Note

Ambiguous settings for metrics with multiple label dimensions will result in CAF picking the first match from an unspecified order. Hence, prefer using only one label dimension for configuring buckets or otherwise make sure there is always exactly one match for instance labels.

Performance Considerations

Instrumenting code should affect the performance as little as possible. Keep in mind that each member function on the registry has to acquire a lock. Ideally, applications call functions such as gauge_family once during setup and then store the family pointer to create metric instances later.

Ideally, there is a single occurrence in the code for getting the family object from the registry and a single occurrence in the code for getting the gauge/counter/histogram object from the family (get_or_add also has to acquire a lock).

All operations on gauges, counters and histograms use atomic operations. Depending on the type, CAF internally uses std::atomic<int64_t> or std::atomic<double>. Adding a sample to a histogram requires two atomic operations: one for the bucket and one for the sum.

Atomic operations are reasonably fast, but we still recommend to avoid them in tight loops.

Builtin Metrics

CAF collects a set of builtin metrics in order to provide insights into the actor system and its modules. Some are always collect while others require configuration by the user.

Base Metrics

The actor system collects this set of metrics always by default (note that all caf.middleman metrics only appear when loading the I/O module).

caf.system.running-actors
  • Tracks the current number of running actors in the system.
  • Type: int_gauge
  • Label dimensions: none.
caf.system.processed-messages
  • Counts the total number of processed messages.
  • Type: int_counter
  • Label dimensions: none.
caf.system.rejected-messages
  • Counts the number of messages that where rejected because the target mailbox was closed or did not exist.
  • Type: int_counter
  • Label dimensions: none.
caf.middleman.inbound-messages-size
  • Samples the size of inbound messages before deserializing them.
  • Type: int_histogram
  • Unit: bytes
  • Label dimensions: none.
caf.middleman.outbound-messages-size
  • Samples the size of outbound messages after serializing them.
  • Type: int_histogram
  • Unit: bytes
  • Label dimensions: none.
caf.middleman.deserialization-time
  • Samples how long the middleman needs to deserialize inbound messages.
  • Type: dbl_histogram
  • Unit: seconds
  • Label dimensions: none.
caf.middleman.serialization-time
  • Samples how long the middleman needs to serialize outbound messages.
  • Type: dbl_histogram
  • Unit: seconds
  • Label dimensions: none.

Actor Metrics and Filters

Unlike the base metrics, actor metrics are off by default. Applications can spawn thousands of actors, with many only existing for a brief time. Hence, blindly collecting data from all actors in the system can impact the performance and also produce a lot of irrelevant noise.

To make sure CAF only collects actor metrics that are relevant to the user, the actor system configuration provides two lists: caf.metrics-filters.actors.includes and caf.metrics-filters.actors.excludes. CAF collects metrics for all actors that have names that are selected by the includes list and are not selected by the excludes list. Entries in the list can use glob-style syntax, in particular *-wildcards. For example:

caf {
  metrics-filters {
    actors {
      includes = [ "foo.*" ]
      excludes = [ "foo.bar" ]
    }
  }
}

The configuration above would select all actors with names that start with foo. except for actors named foo.bar.

Note

Names belong to actor types. CAF assigns default names such as user.scheduled-actor by default. To provide a custom name, either override the member function const char* name() const when implementing class-based actors or add a static member variable static inline const char* name = "..." to your state class when using stateful actors.

CAF uses a hierarchical, hyphenated naming scheme with . as the separator and all-lowercase name components. For example, caf.system.spawn-server. Users may follow this naming scheme for consistency, but CAF does not enforce any structure on the names. However, we do recommend to avoid whitespaces and special characters that the glob engine recognizes, such as *, /, etc.

For all actors that are selected by the user-defined filters, CAF collects this set of metrics:

caf.actor.processing-time
  • Samples how long the actor needs to process messages.
  • Type: dbl_histogram
  • Unit: seconds
  • Label dimensions: name.
caf.actor.mailbox-time
  • Samples how long messages wait in the mailbox before being processed.
  • Type: dbl_histogram
  • Unit: seconds
  • Label dimensions: name.
caf.actor.mailbox-size
  • Counts how many messages are currently waiting in the mailbox.
  • Type: int_gauge
  • Label dimensions: name.
caf.actor.stream.processed-elements
  • Counts the total number of processed stream elements from upstream.
  • Type: int_counter
  • Label dimensions: name, type.
caf.actor.stream.input-buffer-size
  • Tracks how many stream elements from upstream are currently buffered.
  • Type: int_gauge
  • Label dimensions: name, type.
caf.stream.pushed-elements
  • Counts the total number of elements that have been pushed downstream.
  • Type: int_counter
  • Label dimensions: name, type.
caf.stream.output-buffer-size
  • Tracks how many stream elements are currently waiting in the output buffer.
  • Type: int_gauge
  • Label dimensions: name, type.

Exporting Metrics to Prometheus

The network module in CAF comes with builtin support for exporting metrics to Prometheus via HTTP. However, this feature is off by default since CAF generally avoids opening ports without explicit user input.

During startup, the middleman enables the export of metrics when the configuration provides a valid value (0 to 65536) for caf.middleman.prometheus-http.port as shown in the example config file below.

caf {
  middleman {
    prometheus-http {
      # listen for incoming HTTP requests on port 8080 (required parameter)
      port = 8080
      # the bind address (optional parameter; default is 0.0.0.0)
      address = "0.0.0.0"
    }
  }
}