Observability
A measure of how well the internal state of a service or application can be inferred from the outside, using only the data it emits.
Typical flow of data and actions:
Application <- Instrumentation -> Telemetry <- Observability <- Analysis -> Actions
3 Pillars of Observability
| Pillar | Format | Purpose | Description |
|---|---|---|---|
| Metrics | Machine readable | Detect | Do I have a problem? |
| Tracing | Machine readable | Troubleshoot | Where is the problem? |
| Logging | Human readable | Pinpoint | What is the problem? |
Notes:
- Do not use print and do not reinvent the wheel. There is an existing library already.
- Think about the audience and the use case of the data to collect.
- Only collect data that brings value and is manageable.
- Choose the right format and be consistent.
Data formats
- Human readable: plaintext
- Machine readable: structured text (JSON) or structured binary data (byte streams, protobuf, binlogs, pflog)
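To make the difference concrete, the following sketch renders the same event once as plaintext and once as JSON. The helper names `format_plain` and `format_json` and the field names `ts`, `level`, and `msg` are made up for illustration, and no escaping of special characters is attempted.

```bash
#!/bin/bash
# Hypothetical helpers: render one event in a human-readable and a machine-readable form.

# Plaintext: easy to read, but parsing it relies on fragile pattern matching.
format_plain() {  # usage: format_plain TIMESTAMP LEVEL MESSAGE
  printf '%s [%s] %s\n' "$1" "$2" "$3"
}

# JSON: every field has a name, so tools can parse it without guessing.
# Assumption: the values contain no quotes or backslashes (nothing is escaped here).
format_json() {  # usage: format_json TIMESTAMP LEVEL MESSAGE
  printf '{"ts":"%s","level":"%s","msg":"%s"}\n' "$1" "$2" "$3"
}

format_plain "2024-01-01T12:00:00Z" "ERROR" "disk almost full"
format_json  "2024-01-01T12:00:00Z" "ERROR" "disk almost full"
```

The plaintext line suits a person tailing a terminal, while the JSON line suits a collector that filters on `level`.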
Logging
Time series of log events written as log messages to a logbook (stdout, database, collector).
Human readable format.
Event
Describes some state at a distinct point in time.
- Immutable
- Timestamped
- Categorized
- Discrete
- Record
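The properties above can be sketched with a small shell helper. `log_event` and the `LOGBOOK` variable are hypothetical names chosen for this example. In real use an existing tool such as the standard `logger` utility or a logging library is usually the better choice.

```bash
#!/bin/bash
# Hypothetical log_event helper; LOGBOOK is an assumed variable naming the destination.
LOGBOOK="${LOGBOOK:-/dev/stdout}"

log_event() {  # usage: log_event LEVEL MESSAGE
  # Timestamped: UTC time of the event in ISO 8601 form.
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  # Categorized: the level. Discrete record: exactly one line per event.
  # Immutable: ">>" only appends, existing lines are never rewritten.
  printf '%s %s %s\n' "$ts" "$1" "$2" >> "$LOGBOOK"
}

# Example call (commented out so that sourcing this file has no side effects):
# log_event INFO "service started"
```

Each call appends one line of the form `2024-01-01T12:00:00Z INFO service started` to the logbook.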
Metrics
Numeric values of measured data at a given time. Recorded at a fixed interval and used for historical visualization and alerting.
The machine-readable format does not change and usually consists of:
- Metric name
- Timestamp
- Labels with measured data
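As an illustration, a single sample with these three parts could be emitted as one line of text. The layout below is loosely modeled on the Prometheus text exposition format, which is an assumption of this sketch and not something these notes prescribe. The helper name `emit_metric` and the metric and label names are invented, and the timestamp is in seconds, whereas Prometheus itself uses milliseconds.

```bash
#!/bin/bash
# Hypothetical emit_metric helper: prints NAME{LABELS} VALUE TIMESTAMP on one line.
emit_metric() {  # usage: emit_metric NAME LABELS VALUE [TIMESTAMP]
  # Default to the current Unix time in seconds if no timestamp is given.
  local ts="${4:-$(date +%s)}"
  printf '%s{%s} %s %s\n' "$1" "$2" "$3" "$ts"
}

emit_metric cpu_usage_percent 'host="web-1",core="all"' 42 1700000000
# prints: cpu_usage_percent{host="web-1",core="all"} 42 1700000000
```

Because the shape of the line never changes, a collector can parse every sample the same way and only the values vary over time.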
Traces
Scoped series of (distributed) events and their duration.
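A minimal sketch of a single span: a wrapper that runs a command and reports how long it took under a shared trace id. `trace_span` and its output fields are hypothetical, and the timing assumes GNU `date`, which supports `%N` for nanoseconds.

```bash
#!/bin/bash
# Hypothetical trace_span helper: runs a command and reports its duration as a span.
# Assumption: GNU date (for %N, nanoseconds).
trace_span() {  # usage: trace_span TRACE_ID SPAN_NAME COMMAND [ARGS...]
  local trace_id="$1" span="$2" start end status
  shift 2
  start=$(date +%s%N)
  "$@"                      # run the wrapped command
  status=$?
  end=$(date +%s%N)
  # Report on stderr so the command's own stdout stays untouched.
  printf 'trace=%s span=%s duration_ms=%s status=%s\n' \
    "$trace_id" "$span" "$(( (end - start) / 1000000 ))" "$status" >&2
  return "$status"
}

# Example: two spans that belong to the same trace.
trace_span req-1 load_config true
trace_span req-1 handle_request sleep 0.1
```

Spans that share a trace id can later be stitched together, across processes or hosts, into one timeline for the request.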
Analysis
- Graphical
- Automated alerting
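Automated alerting boils down to comparing a measured value against a rule. The sketch below shows the simplest such rule, a fixed threshold. The helper name `check_threshold` and the message wording are assumptions for illustration, and only integer values are handled.

```bash
#!/bin/bash
# Hypothetical check_threshold helper: the core of a threshold-based alert rule.
check_threshold() {  # usage: check_threshold NAME VALUE LIMIT (integers only)
  if [ "$2" -gt "$3" ]; then
    printf 'ALERT %s=%s exceeds limit %s\n' "$1" "$2" "$3"
    return 1
  fi
  printf 'OK %s=%s within limit %s\n' "$1" "$2" "$3"
}

check_threshold cpu_usage_percent 42 90
# prints: OK cpu_usage_percent=42 within limit 90
```

The non-zero return code on an alert lets a caller chain a notification, for example `check_threshold ... || notify`, where `notify` stands for whatever command delivers the alert.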
CPU Load
On Linux systems one can read kernel metrics via:
cat /proc/stat
The values are measured in units of USER_HZ, which can be obtained by running getconf CLK_TCK and is typically 100.
Each value is therefore a counter of 1/100ths of a second accumulated since boot. The boot time itself is reported in the btime line as a Unix epoch timestamp.
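The unit conversion can be sketched as follows. `ticks_to_seconds` is a hypothetical helper that divides a counter by USER_HZ, and the fallback of 100 is an assumption used only when `getconf` is unavailable.

```bash
#!/bin/bash
# Hypothetical ticks_to_seconds helper: converts a /proc/stat counter to whole seconds.
ticks_to_seconds() {  # usage: ticks_to_seconds TICKS [USER_HZ]
  # Ask the system for USER_HZ unless the caller supplies it; fall back to 100.
  local hz="${2:-$(getconf CLK_TCK 2>/dev/null || echo 100)}"
  echo $(( $1 / hz ))
}

ticks_to_seconds 4705 100   # 4705 ticks at 100 Hz, prints 47

# Boot time as recorded by the kernel (seconds since the Unix epoch).
if [ -r /proc/stat ]; then
  awk '/^btime/ { print "booted at epoch", $2 }' /proc/stat
fi
```

The division is integer division, so fractions of a second are dropped.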
#!/bin/bash
while :; do
  cpu_now=($(head -n1 /proc/stat))  # Get the "cpu" line, which is the total of all cores
  # Sum user..steal (columns 1-8) to get the total. The guest columns are left
  # out because the kernel already counts guest time in user and nice.
  cpu_sum="${cpu_now[*]:1:8}"
  cpu_sum=$((${cpu_sum// /+}))
  # On the first pass cpu_last is empty, so the sum is 0 and the first value
  # printed is the average since boot.
  cpu_sum_last="${cpu_last[*]:1:8}"
  cpu_sum_last=$((${cpu_sum_last// /+}))
  # Calculate the delta between two reads for each column
  cpu_delta=$((cpu_sum - cpu_sum_last))
  user_delta=$((cpu_now[1] - cpu_last[1]))           # Time spent in user mode (CPU bound)
  nice_delta=$((cpu_now[2] - cpu_last[2]))           # Time spent in user mode with low priority (CPU bound)
  system_delta=$((cpu_now[3] - cpu_last[3]))         # Time spent in kernel mode (CPU bound)
  idle_delta=$((cpu_now[4] - cpu_last[4]))           # Time spent in the idle task (idling)
  iowait_delta=$((cpu_now[5] - cpu_last[5]))         # Time waiting for I/O to complete (network/disk bound)
  irq_delta=$((cpu_now[6] - cpu_last[6]))            # Time serving hardware interrupts
  softirq_delta=$((cpu_now[7] - cpu_last[7]))        # Time serving software interrupts
  steal_delta=$((cpu_now[8] - cpu_last[8]))          # Time stolen by the hypervisor for other guests (when running in a VM)
  guest_delta=$((cpu_now[9] - cpu_last[9]))          # Time spent running a virtual CPU for a guest VM
  guest_niced_delta=$((cpu_now[10] - cpu_last[10]))  # Time spent running a low-priority virtual CPU for a guest VM
  cpu_used=$((cpu_delta - idle_delta))               # Total time spent doing something
  cpu_usage=$((100 * cpu_used / cpu_delta))          # Calculate the percentage
  # Keep for the next delta
  cpu_last=("${cpu_now[@]}")
  echo "CPU usage at $cpu_usage%"
  sleep 1
done
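To check the arithmetic without waiting on a live system, the same calculation can be applied to two fixed samples. The helper names `idle_and_total` and `cpu_usage_between` and the sample numbers are invented for this sketch. It sums only the first eight counters (user through steal), since guest time is already included in user and nice.

```bash
#!/bin/bash
# Hypothetical helpers that compute CPU usage between two "cpu" lines of /proc/stat.

# Prints "IDLE TOTAL" for one line; expects at least the 8 counters user..steal.
idle_and_total() {  # usage: idle_and_total "cpu U N S I W Q SQ ST ..."
  set -- $1   # split the line into words (intentionally unquoted)
  shift       # drop the "cpu" label, so $1=user ... $4=idle ... $8=steal
  echo "$4 $(( $1 + $2 + $3 + $4 + $5 + $6 + $7 + $8 ))"
}

cpu_usage_between() {  # usage: cpu_usage_between PREVIOUS_LINE CURRENT_LINE
  local prev curr idle_delta total_delta
  prev=$(idle_and_total "$1")
  curr=$(idle_and_total "$2")
  idle_delta=$(( ${curr% *} - ${prev% *} ))    # first word: idle
  total_delta=$(( ${curr#* } - ${prev#* } ))   # second word: total
  echo $(( 100 * (total_delta - idle_delta) / total_delta ))
}

# Made-up samples: total grows by 200 ticks, idle by 100, so usage is 50%.
cpu_usage_between "cpu 100 0 50 800 50 0 0 0 0 0" "cpu 160 0 70 900 70 0 0 0 0 0"
# prints: 50
```

Keeping the calculation in a function that takes both samples as arguments makes it easy to test with known inputs.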
File sizes
The table uses the binary convention, where each unit is 1024 times the previous one. IEC names these units kibibyte (KiB), mebibyte (MiB), and so on, while the SI prefixes strictly denote powers of 1000.
| Amount | Name | Equals To | Size (in bytes) |
|---|---|---|---|
| 1 | Bit | 1 Bit | 1/8 |
| 1 | Nibble | 4 Bits | 1/2 |
| 1 | Byte | 8 Bits | 1 |
| 1 | Kilobyte | 1024 Bytes | 1024 |
| 1 | Megabyte | 1024 Kilobytes | 1048576 |
| 1 | Gigabyte | 1024 Megabytes | 1073741824 |
| 1 | Terabyte | 1024 Gigabytes | 1099511627776 |
| 1 | Petabyte | 1024 Terabytes | 1125899906842624 |
| 1 | Exabyte | 1024 Petabytes | 1152921504606846976 |
| 1 | Zettabyte | 1024 Exabytes | 1180591620717411303424 |
| 1 | Yottabyte | 1024 Zettabytes | 1208925819614629174706176 |
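The table can be applied in reverse to turn a raw byte count into a readable unit by dividing by 1024 until the value drops below 1024. `human_size` is a hypothetical helper that uses integer arithmetic, so results are rounded down, and it stops at exabytes because shell integers are typically 64 bits wide. Where GNU coreutils is available, `numfmt --to=iec` is an existing tool for the same job.

```bash
#!/bin/bash
# Hypothetical human_size helper: converts a byte count to the largest fitting unit.
human_size() {  # usage: human_size BYTES
  local size="$1" unit
  for unit in B KB MB GB TB PB; do
    if [ "$size" -lt 1024 ]; then
      echo "$size $unit"
      return
    fi
    size=$(( size / 1024 ))   # move up one unit (integer division, rounds down)
  done
  echo "$size EB"
}

human_size 512          # prints: 512 B
human_size 1048576      # prints: 1 MB
human_size 1610612736   # 1.5 GB, prints: 1 GB because of the rounding
```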