Prometheus Federation Article

What is Federation?

The term federation refers to a system in which multiple smaller entities (such as states, regions, or organizations) come together to form a larger, unified entity while retaining some level of individual autonomy.

In a political context, it typically involves a shared central government with member states under it; but each member still maintains its own independence in certain areas. (like policies, rules and state taxes etc) In an organizational context, federation can refer to a group of entities that join together to pursue common goal while maintaining their separate identities. For example, in sports, a federation might be an umbrella organization that coordinates the activities of various member teams or leagues.

Federation in Prometheus

In context of the Prometheus monitoring system, federation refers to the ability to scale monitoring by collecting and aggregating metrics from multiple Prometheus servers. This enables hierarchical or multi-level monitoring by allowing one Prometheus server (often called a federation server or mother-ship) to scrape selected time series metrics from other Prometheus servers.

This is particularly useful in large-scale deployments where you have multiple independent Prometheus instances monitoring different parts of an infrastructure. With federation you can aggregate and centralize those metrics for broader analysis.

Benefits of federation in Prometheus:

Aggregation: A federating Prometheus server can pull metrics from multiple child servers, providing a global view of metrics across multiple regions, sub-systems, environments or accounts.
Selective Scraping: With federation you can select which metrics to scrape from the child servers. This is useful for saving network bandwidth and reducing data overhead. Usually, you want to collect only high-level or aggregated metrics, while detailed, granular data remains on the child servers.
Hierarchical Monitoring: Federation enables a multi-tier setup where lower-level Prometheus instances handle detailed, local monitoring, and higher-level servers handle aggregated metrics, providing better scalability.
Reduced Load: By scraping only aggregated or specific metrics at the federation level, you can reduce the amount of data transferred, making the system more efficient.

For example, in a large organization, each data center or environment might have its own Prometheus instance. A central Prometheus server can federate metrics from each of these child instances to provide an aggregated overview of the entire system's health and performance, without needing to scrape every individual metric.

Federation is configured by adding scrape_config entries in the Prometheus configuration, specifying which metrics to scrape from the target or remote servers.

Example configuration:

										
scrape_configs:
  - job_name: 'federation'
      scrape_interval: 5m
	  honor_labels: true
	  metrics_path: '/federate'
	  params:
		'match[]':
		  - '{job="node"}'
		  - 'up'
	  static_configs:
		- targets:
		  - 'prometheus-server-1:9090'
	  	  - ‘prometheus-server-2:9090'

In this example, the federation server only scrapes the up and {job="node"} metrics from two child Prometheus servers. (prometheus-server-1 and prometheus-server-2)

A Real-World Example

Consider a large-scale global e-commerce platform like eBay, which likely has multiple data centers spread across different regions around the world. Its platform might consists of thousands of microservices, and other system components deployed across these data centers. The system would likely involve many geographically distributed backend services for handling:

Web applications (e.g., product pages, cart, checkout, sellers, reviews, product suggestions etc)
Databases (SQL, NoSQL)
Messaging systems (e.g., Kafka, RabbitMQ)
Storage services (e.g., Amazon S3)
Cacheing systems (e.g., memcached, redis)
Payment gateways
Third-party API integrations
CDNs and load balancers to manage traffic

Given the criticality of the platform, the size and complexity of its infrastructure, monitoring each component becomes crucial for ensuring smooth operation, optimum performance, and uptime. This is where Prometheus federation becomes a necessity for scaling the monitoring system efficiently and effectively.

This global e-commerce platform will have hundreds of microservices, application servers, database servers, and network devices all spread across the world. Given the scale we’re dealing with, having a single Prometheus instance collecting metrics from every application, server and network device across the world becomes impractical due to:

High Latency: Scraping all global metrics from a single central Prometheus server would lead to high network latencies, especially between geographically distant data centers.
Scalability: A single Prometheus server will struggle with the load of collecting and storing millions of time-series metrics.
Fault Tolerance: the entire global monitoring system will go blind if this single Prometheus server fails. This will lead to missed alerts and degraded operational insights. Troubleshooting for issues could become impossible in this situation.

To overcome these shortcomings, Prometheus federation can be used to build a hierarchical monitoring architecture. Let’s go through the design and architecture of such a monitoring system.

Federation Architecture

1. Regional Prometheus Instances (Child Servers):

Each region (and the resp. Infrastructure) will have its own Prometheus instance dedicated to monitoring the local services.
These regional prometheus instances also scrape granular system metrics and other KPIs (e.g., CPU usage, memory, disk I/O, response times, request rates, user traffic, new signups per day etc) from infrastructure in their respective regions and store them locally.

2. Global Prometheus Instance (Federation Server):

A central Prometheus instance is set up to federate high-level, aggregated metrics from the regional Prometheus instances. This instance is responsible for global monitoring, incident response, and alerting.
It doesn’t scrape the detailed time series data but rather focuses on metrics like:

Service availability (e.g., is the checkout microservice up in all regions?)
Response time summaries (e.g., p99 latency of the API gateways across the globe
Error rates (e.g., 5xx error rates in the North American region)
Aggregated resource utilization (e.g., overall CPU and memory usage trends in each region)

This global instance can be used by platform engineers and SREs who need a high-level overview of the system's health worldwide.

The system and infrastructure we need to monitor:

Lets dive deeper into what this monitoring system needs to monitor and how it will accomplish this for our global e-commerce platform:

North America Region:

Services: Checkout (microservice), Payment Gateway, Regional Shipping Service etc.
Infrastructure: 500 servers (both physical and virtual), databases, caching layers etc.
The Prometheus server in North America scrapes granular system and service metrics for all these servers and applications, and stores detailed time-series data locally.

Europe Region:

Services: Checkout, European Payment Gateway, Regional Shipping etc.
Infrastructure: 300 servers, databases, storage services etc.
The Europe Prometheus instance scrapes and stores similar detailed metrics, but only for the applications and infrastructure in Europe.

Asia Region:

Services: Checkout, Analytics, Payment Processing.
Infrastructure: 1000 servers, NoSQL databases, queuing systems.
The Asia Prometheus instance focuses on monitoring microservices, queuing systems and databases within the Asian data centers.

The Federation Server

Our central federation prometheus server is configured to scrape aggregated metrics from each of these regional Prometheus instances.
For example, the global Prometheus instance scrapes metrics from each region by querying only the high-level metrics it cares about. It might pull:

The up metric for all services to check their availability.
Aggregated error rates. (http_requests_total where status=5xx).
High-level latency metrics. (e.g., http_request_duration_seconds).
Saturation or resource usage like node_cpu_seconds_total aggregated across all nodes in a region.

										
scrape_configs:
  - job_name: 'federation'
	scrape_interval: 5m
	honor_labels: true
	metrics_path: '/federate'
	params:
	  'match[]':
	    - '{job="checkout"}'
	    - 'up'
	    - 'http_requests_total{status="5xx"}'
	    - 'http_request_duration_seconds'
	static_configs:
	  - targets:
	    - 'prometheus-northamerica.example.com:9090'
	    - 'prometheus-europe.example.com:9090'
	    - 'prometheus-asia.example.com:9090'

Benefits of Federation in this scenario:

Scalability: Each regional Prometheus server handles its own local metrics load. The global Prometheus only scrapes aggregated data, reducing the load on both the network and the Global Prometheus instance itself.
Low Latency: Each region’s metrics are collected locally, avoiding cross-region network delays. More granular monitoring and alerting can happen at the local level. The global Prometheus only fetches summary data from each region, which is much smaller in volume.
Fault Isolation: If a regional Prometheus server goes down, only the monitoring for that region is affected. The global monitoring system still functions and alerts about the regional failure. We can also add a backup global prometheus server for redundancy.
Global Visibility: The federation server provides a high-level global view, enabling stakeholders to see the overall health and trends of the system across different regions. This is crucial for executives, SREs, and global incident response teams.
Efficiency: Only the essential, aggregated metrics are sent to the global instance. This avoids data overload and makes it easier to manage the large-scale monitoring infrastructure.

Summary

Prometheus federation is a powerful technique for monitoring large-scale, globally distributed infrastructures like that of a global e-commerce platform. By leveraging federation, the monitoring system becomes highly scalable, efficient, and fault-tolerant, enabling real-time insights into the platform's health across different geographic regions without overwhelming the system with unnecessary granular data.