Observability

A measure of how well the internal state of a service or application can be inferred from the outside, using only the data it emits.

Typical flow of data and actions:

Application <- Instrumentation -> Telemetry <- Observability <- Analysis -> Actions

3 Pillars of Observability

Pillar   Format            Purpose       Description
Metrics  Machine readable  Detect        Do I have a problem?
Tracing  Machine readable  Troubleshoot  Where is the problem?
Logging  Human readable    Pinpoint      What is the problem?

Notes:

  • Do not use print and do not reinvent the wheel. There is an existing library already.
  • Think about the audience and the use case of the data to collect.
  • Only collect data that brings value and is manageable.
  • Choose the right format and be consistent.

Data formats

  • Human readable: plaintext
  • Machine readable: structured text (JSON) or structured binary data (byte streams, protobuf, binlogs, pflog)
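
To make the difference concrete, here is a minimal sketch that renders the same event once as plaintext and once as JSON. The function names, the field names and the sample values are made up for illustration, and the JSON is built with printf only, so it assumes the inputs contain no characters that need escaping.

```shell
# Hypothetical helpers: render one event in a human readable and a machine readable form.
# Usage: format_plain TIMESTAMP LEVEL DISK_PERCENT
format_plain() {
  printf '%s %s disk usage at %s%%\n' "$1" "$2" "$3"
}

# Usage: format_json TIMESTAMP LEVEL DISK_PERCENT
# Assumption: no quotes or backslashes in the inputs (no JSON escaping is done).
format_json() {
  printf '{"ts":"%s","level":"%s","disk_percent":%s}\n' "$1" "$2" "$3"
}

format_plain 2024-01-01T12:00:00Z WARN 91
format_json 2024-01-01T12:00:00Z WARN 91
```

The plaintext line is easier to read at a glance, while the JSON line can be parsed by a collector without guessing where one field ends and the next begins.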

Logging

Time series of log events written as log messages to a logbook (stdout, database, collector).

Human readable format.

Event

Describes some state at a distinct point in time.

  • Immutable
  • Timestamped
  • Categorized
  • Discrete
  • Record
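
As a rough sketch of these properties, the helper below writes one timestamped, categorized event per line to stdout. The name log_event and the line layout are assumptions for illustration. In a real application an existing logging library is the better choice, as noted earlier.

```shell
# Hypothetical helper: emit one log event as a single line on stdout.
# Usage: log_event LEVEL MESSAGE...
log_event() {
  local level="$1"
  shift
  # Timestamped (UTC, ISO 8601 style) and categorized (by level).
  # Each call produces exactly one discrete record.
  printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log_event INFO "service started"
log_event ERROR "connection to database lost"
```

Because each line is only ever appended and never rewritten, the resulting logbook stays immutable.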

Metrics

Numeric values of measured data at a given time. Recorded at a fixed interval and used for historical visualization and alerting.

The machine readable format does not change and usually consists of:

  • Metric name
  • Timestamp
  • Labels with measured data
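
A minimal sketch of such a sample follows. The helper emit_metric and its "name{labels} value timestamp" layout are assumptions chosen for illustration, not the format of any particular monitoring system.

```shell
# Hypothetical helper: print one metric sample as "name{labels} value timestamp".
# Usage: emit_metric NAME LABELS VALUE [TIMESTAMP]
emit_metric() {
  local name="$1"
  local labels="$2"
  local value="$3"
  local ts="${4:-$(date +%s)}" # Default to the current Unix time in seconds
  printf '%s{%s} %s %s\n' "$name" "$labels" "$value" "$ts"
}

emit_metric cpu_usage_percent 'host="web1",core="all"' 42
```

Calling it at a fixed interval, for example from a loop with sleep, yields the time series that dashboards and alerts are built on.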

Traces

Scoped series of (distributed) events and their duration.
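
A single scope of a trace (often called a span) can be sketched as a wrapper that records when a command started and how long it ran. The helper trace_span and its output fields are made up for illustration. It only has whole-second resolution, and a real tracer would also attach a shared trace ID so that spans from different services can be joined.

```shell
# Hypothetical helper: run a command as a named span and report its duration.
# Usage: trace_span NAME COMMAND [ARGS...]
trace_span() {
  local name="$1"
  shift
  local start end status
  start=$(date +%s) # Start of the scope, Unix time in seconds
  "$@"              # The work being traced
  status=$?
  end=$(date +%s)
  printf 'span=%s start=%s duration_s=%s status=%s\n' \
    "$name" "$start" "$((end - start))" "$status"
  return "$status"
}

trace_span fetch_config sleep 1
```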

Analysis

  • Graphical
  • Automated alerting
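
Automated alerting boils down to comparing a measured value against a limit. The sketch below assumes integer values, and the helper name check_threshold is hypothetical.

```shell
# Hypothetical check: compare a measured integer value against a limit.
# Usage: check_threshold NAME VALUE LIMIT
check_threshold() {
  local name="$1"
  local value="$2"
  local limit="$3"
  if [ "$value" -gt "$limit" ]; then
    echo "ALERT: $name is $value (limit $limit)"
    return 1 # Non-zero exit status so callers can react to the alert
  fi
  echo "OK: $name is $value (limit $limit)"
}

check_threshold cpu_usage 93 90 || true
```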

CPU Load

On Linux systems one can read kernel metrics via:

cat /proc/stat

The values are measured in units of USER_HZ, which can be obtained by running getconf CLK_TCK and is typically 100. Each value is therefore a counter of 1/100ths of a second accumulated since boot. The boot time itself is reported in the btime line as a Unix epoch timestamp.

#!/bin/bash 

while :; do
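  cpu_now=($(head -n1 /proc/stat)) # Get the "cpu" line, which is the total of all cores

  # Sum columns 1-8 only: guest and guest_nice are already included in user and nice
  cpu_sum="${cpu_now[*]:1:8}" # Skip the first column (the "cpu" label)
  cpu_sum=$((${cpu_sum// /+})) # Add the columns to get the total

  cpu_sum_last="${cpu_last[*]:1:8}" # Empty on the first pass, so the first value is the average since boot
  cpu_sum_last=$((${cpu_sum_last// /+}))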
  
  # Calculate the delta between two reads for each column
  cpu_delta=$((cpu_sum - cpu_sum_last)) 
  user_delta=$((cpu_now[1] - cpu_last[1])) # Time spent in user mode (CPU bound)
  nice_delta=$((cpu_now[2] - cpu_last[2])) # Time spent in user mode with low priority (CPU bound)
  system_delta=$((cpu_now[3] - cpu_last[3])) # Time spent in system mode (CPU bound)
  idle_delta=$((cpu_now[4] - cpu_last[4])) # Time spent in the idle task (idling)
  iowait_delta=$((cpu_now[5] - cpu_last[5])) # Time waiting for I/O to complete (I/O bound, mostly disk)
  irq_delta=$((cpu_now[6] - cpu_last[6])) # Time serving hardware interrupts
  softirq_delta=$((cpu_now[7] - cpu_last[7])) # Time serving software interrupts
  steal_delta=$((cpu_now[8] - cpu_last[8])) # Time stolen by the hypervisor while running as a guest VM
  guest_delta=$((cpu_now[9] - cpu_last[9])) # Time spent running a virtual CPU for a guest VM
  guest_niced_delta=$((cpu_now[10] - cpu_last[10])) # Time spent running a virtual CPU for a low-priority (niced) guest

  cpu_used=$((cpu_delta - idle_delta)) # Total time spent in doing something
  cpu_usage=$((100 * cpu_used / cpu_delta)) # Calculate the percentage
  
  # Keep for delta
  cpu_last=("${cpu_now[@]}") 

  echo "CPU usage at $cpu_usage%" 
  sleep 1 
done
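
Since the counters in /proc/stat are ticks, turning one of them into seconds means dividing it by USER_HZ. The sketch below shows that conversion together with reading btime. The helper name ticks_to_seconds is hypothetical, and the /proc/stat part only runs where that file exists.

```shell
# Hypothetical helper: convert a tick counter into whole seconds.
# Usage: ticks_to_seconds TICKS [USER_HZ]
ticks_to_seconds() {
  local ticks="$1"
  local hz="${2:-$(getconf CLK_TCK)}" # Ask the system unless a rate is given
  echo "$((ticks / hz))"             # Integer division, fractions are dropped
}

ticks_to_seconds 4500 100 # 4500 ticks at 100 ticks per second, prints 45

# On Linux, the btime line holds the boot time as a Unix timestamp.
if [ -r /proc/stat ]; then
  awk '/^btime/ { print "booted at (epoch seconds):", $2 }' /proc/stat
fi
```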

File sizes

Amount  Name       Equals to        Size (in bytes)
1       Bit        1 Bit            1/8
1       Nibble     4 Bits           1/2
1       Byte       8 Bits           1
1       Kilobyte   1024 Bytes       1024
1       Megabyte   1024 Kilobytes   1048576
1       Gigabyte   1024 Megabytes   1073741824
1       Terabyte   1024 Gigabytes   1099511627776
1       Petabyte   1024 Terabytes   1125899906842624
1       Exabyte    1024 Petabytes   1152921504606846976
1       Zettabyte  1024 Exabytes    1180591620717411303424
1       Yottabyte  1024 Zettabytes  1208925819614629174706176
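
The table can be applied in reverse to express a raw byte count in the largest unit that fits. The following is a minimal sketch with a hypothetical helper human_size. It uses integer arithmetic only, so results are truncated, and it stops at Exabyte because larger values do not fit into the shell's 64-bit integers.

```shell
# Hypothetical helper: express a byte count in the largest unit that fits.
# Usage: human_size BYTES
human_size() {
  local value="$1"
  local unit="Byte"
  local next
  for next in Kilobyte Megabyte Gigabyte Terabyte Petabyte Exabyte; do
    [ "$value" -lt 1024 ] && break # Current unit is already the largest that fits
    value=$((value / 1024))        # Move up one unit (truncating division)
    unit="$next"
  done
  echo "$value $unit"
}

human_size 1073741824 # prints "1 Gigabyte"
human_size 1536       # prints "1 Kilobyte" (the remainder is truncated)
```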