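Google's product quality metrics framework consists of SLIs (service level indicators), SLOs (service level objectives), and SLAs (service level agreements). SLAs sit on the product side and are aimed mainly at external users, while SLIs and SLOs are internal measures of quality. The principles for defining SLIs are described as follows: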
You shouldn’t use every metric you can track in your monitoring system as an SLI; an
understanding of what your users want from the system will inform the judicious
selection of a few indicators. Choosing too many indicators makes it hard to pay the
right level of attention to the indicators that matter, while choosing too few may leave
significant behaviors of your system unexamined. We typically find that a handful of
representative indicators are enough to evaluate and reason about a system’s health.
Services tend to fall into a few broad categories in terms of the SLIs they find relevant:
- User-facing serving systems, such as the Shakespeare search frontends, generally
care about availability, latency, and throughput. In other words: Could we
respond to the request? How long did it take to respond? How many requests
could be handled?
- Storage systems often emphasize latency, availability, and durability. In other
words: How long does it take to read or write data? Can we access the data on
demand? Is the data still there when we need it? See Chapter 26 for an extended
discussion of these issues.
- Big data systems, such as data processing pipelines, tend to care about throughput
and end-to-end latency. In other words: How much data is being processed? How
long does it take the data to progress from ingestion to completion? (Some pipelines
may also have targets for latency on individual processing stages.)
- All systems should care about correctness: was the right answer returned, the
right data retrieved, the right analysis done? Correctness is important to track as
an indicator of system health, even though it’s often a property of the data in the
system rather than the infrastructure per se, and so usually not an SRE responsibility
to meet.
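
To make the three indicators for a user-facing serving system concrete, here is a minimal Python sketch that computes availability, latency, and throughput from a batch of request records. The `Request` record, its field names, the nearest-rank percentile method, and the sample data are all assumptions made for illustration; none of them come from the quoted text, and a real monitoring system would normally derive these values from its own metrics pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    latency_ms: float  # time taken to respond, in milliseconds
    ok: bool           # True if the request was answered successfully


def availability(requests):
    """Fraction of requests answered successfully ("Could we respond?")."""
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency ("How long did it take?")."""
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(pct / 100 * len(latencies))
    return latencies[max(rank, 1) - 1]


def throughput(requests, window_seconds):
    """Requests handled per second over the observation window."""
    return len(requests) / window_seconds


# Made-up sample: five requests observed over a 10-second window.
SAMPLE = [
    Request(120, True),
    Request(80, True),
    Request(450, False),
    Request(95, True),
    Request(200, True),
]

if __name__ == "__main__":
    print("availability:", availability(SAMPLE))
    print("p50 latency (ms):", latency_percentile(SAMPLE, 50))
    print("p99 latency (ms):", latency_percentile(SAMPLE, 99))
    print("throughput (req/s):", throughput(SAMPLE, window_seconds=10))
```

This sketch includes failed requests in the latency calculation. Whether to do so is a design choice that depends on what users of the service actually experience, which is the same judgment the quoted passage asks for when selecting indicators.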