
Crushing Cardinality: Optimizing High-Volume Time Series Ingestion in Prometheus and VictoriaMetrics

Observability is the backbone of modern distributed systems, but it carries a significant cost: cardinality. As we add labels to metrics to enable granular querying, we often create combinatorial explosions in our time series data without noticing. This "cardinality explosion" can lead to out-of-memory errors, disk I/O bottlenecks, and high infrastructure costs. In this post, we will explore strategies for handling high-cardinality time series ingestion, comparing the local TSDB of Prometheus with VictoriaMetrics used as a remote storage backend.

Understanding the Cardinality Trap

Cardinality is the number of unique time series produced by the combinations of a metric name and its label name/value pairs. Label values multiply: a metric with 5 methods, 10 status codes, and 50 endpoints can produce up to 2,500 series. That is why exposing a metric such as http_requests_total with a user_id label is dangerous. With 1 million active users, you create at least 1 million unique time series for that one metric. Prometheus keeps every active series in its in-memory head block and stores data locally on disk in a format optimized for time-range queries, not for massive label combinations. When ingestion rates spike or the number of unique label values grows too large, the TSDB (time series database) struggles to manage its WAL (write-ahead log) segments and memory-mapped files.
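One way to notice this growth early is to alert on the head-series gauge that Prometheus exposes about itself. The rule file below is a minimal sketch, not a recommended policy: prometheus_tsdb_head_series is a real Prometheus self-metric, while the group name, the alert name, the 2,000,000 threshold, and the 15m hold time are assumptions chosen for illustration. It also assumes that Prometheus scrapes its own /metrics endpoint and that this file is listed under rule_files.

```yaml
groups:
  - name: cardinality_watch            # hypothetical group name
    rules:
      - alert: ActiveSeriesHigh        # hypothetical alert name
        # Number of series currently held in the in-memory head block.
        expr: prometheus_tsdb_head_series > 2000000   # illustrative threshold
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Active series on {{ $labels.instance }} exceed the illustrative limit'
```

The threshold should be sized against the memory actually available to the server; the number here is only a placeholder.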

Strategy 1: Pre-emptive Label Filtering in Prometheus

The first line of defense is preventing high-cardinality labels from ever entering the system. In Prometheus, you can use metric_relabel_configs in your scraping configuration to drop metrics or labels that violate your cardinality rules.

For instance, if you accidentally scrape an endpoint that exposes per-session identifiers such as session IDs, which may also count as PII (Personally Identifiable Information), you must keep those series out of storage. Here is a rule in prometheus.yml that drops every series carrying a session_id label:

scrape_configs:
  - job_name: 'web_app'
    metrics_path: '/metrics'
    # relabel_configs runs on targets before the scrape, where __name__ and
    # metric labels do not exist yet, so metric filtering belongs in
    # metric_relabel_configs, which runs after the scrape and before ingestion.
    metric_relabel_configs:
      # Drop every series that carries a non-empty session_id label
      - source_labels: [session_id]
        regex: '.+'
        action: drop

Because metric relabeling runs before ingestion, the database never allocates memory for these series. However, this is a "lossy" approach: the drop action removes whole series, not just the label, so you lose the ability to query them entirely. Only drop series that your monitoring does not need.
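If the series are still useful and only the label is the problem, a less lossy variant is to remove the label and keep the series. The sketch below assumes the same web_app job and session_id label as the rule above; the sample_limit value is an illustrative number, not a recommendation. Note that labeldrop matches its regex against label names rather than values, and that the remaining labels must still identify each series uniquely once the label is gone.

```yaml
scrape_configs:
  - job_name: 'web_app'
    metrics_path: '/metrics'
    # Backstop: fail the whole scrape if it still yields more samples
    # than this after metric relabeling has been applied.
    sample_limit: 50000                # illustrative value
    metric_relabel_configs:
      # Remove only the session_id label; the series themselves are kept.
      - regex: 'session_id'
        action: labeldrop
```

This only works cleanly when session_id is redundant. If two series differ only by session_id, removing it leaves them with identical label sets, so in that case dropping the series or aggregating in the application is the safer choice.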

Strategy 2: Leveraging VictoriaMetrics for High-Scale Ingestion

While Prometheus is excellent for short-term, high-resolution monitoring, it was not designed for long-term storage of high-cardinality data. This is where VictoriaMetrics shines. Built specifically to address the limitations of Prometheus at scale, VictoriaMetrics uses a more efficient storage engine and compression algorithms.

VictoriaMetrics ships vmalert for alerting and accepts the Prometheus remote write protocol. One of its key advantages is that the single-node version scales vertically, so high ingestion rates do not call for sharding or cluster management until one machine is no longer enough. The project positions the single-node version for workloads with millions of unique time series, which reduces operational complexity.

To start sending data, add a remote_write section to your Prometheus configuration that points at VictoriaMetrics. This forwards newly scraped samples; data already in the local Prometheus TSDB is not copied over.

global:
  scrape_interval: 15s

remote_write:
  - url: 'http://victoriametrics:8428/api/v1/write'
    queue_config:
      max_samples_per_send: 5000
      capacity: 25000
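Remote write does not have to forward everything. As a sketch, write_relabel_configs applies relabeling rules to samples just before they are sent to the remote endpoint, leaving local storage untouched. The URL is the one used above; the debug_ metric name prefix is a hypothetical example of something not worth keeping long term.

```yaml
remote_write:
  - url: 'http://victoriametrics:8428/api/v1/write'
    write_relabel_configs:
      # Do not forward series whose metric name starts with debug_
      # (hypothetical prefix, for illustration only).
      - source_labels: [__name__]
        regex: 'debug_.*'
        action: drop
```

Filtering here rather than in metric_relabel_configs keeps the series queryable in Prometheus for short-term debugging while sparing the long-term store.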

Strategy 3: Downsampling and Data Retention Policies

Regardless of the backend, storing every raw sample for months is often unnecessary. Prometheus itself has no built-in downsampling; the usual approximations are recording rules that pre-aggregate data, or a remote backend that downsamples on its behalf. VictoriaMetrics provides downsampling as an Enterprise feature, allowing aggressive policies where older data is automatically reduced to 1-minute or 1-hour resolutions.

This significantly reduces the disk footprint. By configuring retention periods that balance cost and observability needs, you can keep high-resolution data for a few weeks (sufficient for debugging recent incidents) and low-resolution data for years (useful for trend analysis).
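On the Prometheus side, a recording rule can approximate this by writing a pre-aggregated, lower-cardinality series alongside the raw data. The following is a minimal sketch: the group name, the recorded series name, and the 5m evaluation interval are assumptions, and http_requests_total is the example metric from earlier in this post.

```yaml
groups:
  - name: downsample_http              # hypothetical group name
    interval: 5m                       # evaluate the rule every 5 minutes
    rules:
      # One series per job instead of one per label combination.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```

Recording rules do not delete or thin the raw samples. They only add a cheaper series that a long-retention backend can keep after the raw data has expired.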

Conclusion

Optimizing high-cardinality time series ingestion is not just about hardware scaling; it requires architectural discipline. Start by auditing your metrics for "bad" labels like user IDs, IP addresses, or dynamic request IDs. Use Prometheus relabeling rules to strip these labels at the source. For long-term storage and historical analysis, consider offloading data to VictoriaMetrics, which offers superior compression and ingestion performance. By combining label hygiene with the right storage backend, you can maintain a robust, cost-effective observability stack that scales with your application's growth.
