Observability
cpe-sim exposes Prometheus metrics and a small admin introspection surface on a single HTTP listener controlled by --metrics-bind-addr (env CPE_SIM_METRICS_BIND_ADDR, YAML metricsBindAddr). Both are off by default; setting the bind address enables them and keeps the process running in daemon mode even if no other long-lived feature is configured.
cpe-sim \
  --profile ./profiles/example-tr181-gateway \
  --acs-url http://acs.local:7547/cwmp \
  --metrics-bind-addr 127.0.0.1:9100
The v0 server has no authentication. Bind it to localhost in dev and only expose it externally inside a trust boundary.
Endpoints
| Method | Path | Returns |
|---|---|---|
| GET | /metrics | Prometheus scrape format. |
| GET | /admin/cpes | JSON list of every CPE: id, serial, lifecycle_state, last_event, last_event_at. |
| GET | /admin/cpes/{id} | JSON detail: above plus 32-entry recent-event tail, tree-leaf sample, USP Subscription summary. |
| POST | /admin/cpes/{id}/cr | Triggers a synthetic Connection-Request session for the CPE. Returns 202 with {cr_url, invoked_at}. |
| POST | /admin/cpes/{id}/fault/{code} | Queues a CWMP fault code for the next session. Returns 202. |
The force-CR and inject-fault endpoints (the two POST routes) return 503 Service Unavailable when the underlying hook is not yet wired: Phase 7 lands the metrics surface, and the CR and fault hooks bind to existing internals in follow-up issues.
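A minimal client sketch for the admin routes above, using only the Python standard library. The base URL assumes the --metrics-bind-addr value from the earlier example, and the helper names (admin_url, describe_status, force_cr) are hypothetical rather than part of cpe-sim. It only relies on what the table documents: the paths, the 202 response, and the 503 returned while a hook is unwired.

```python
import json
import urllib.error
import urllib.request

# Assumption: matches the --metrics-bind-addr used in the example invocation.
BASE_URL = "http://127.0.0.1:9100"


def admin_url(base_url: str, cpe_id: str = "", action: str = "") -> str:
    """Build an admin endpoint URL from the documented paths."""
    parts = [base_url.rstrip("/"), "admin", "cpes"]
    if cpe_id:
        parts.append(cpe_id)
    if action:
        parts.append(action)  # e.g. "cr" or "fault/9002"
    return "/".join(parts)


def describe_status(status: int) -> str:
    """Map the documented status codes to a short outcome string."""
    if status == 202:
        return "accepted"
    if status == 503:
        return "hook-not-wired"
    return f"unexpected-{status}"


def force_cr(base_url: str, cpe_id: str):
    """POST /admin/cpes/{id}/cr and return (outcome, decoded JSON body or {})."""
    req = urllib.request.Request(
        admin_url(base_url, cpe_id, "cr"), data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses, so the 503 case arrives here.
        status, raw = err.code, err.read()
    try:
        body = json.loads(raw) if raw else {}
    except json.JSONDecodeError:
        body = {}
    return describe_status(status), body
```

With a made-up CPE id, force_cr(BASE_URL, "cpe-0001") would yield "accepted" as the first element once the hook is wired, and "hook-not-wired" while the route still answers 503. Treating 503 as a distinct outcome rather than an exception lets a test script skip the CR step cleanly on builds where the hook has not landed.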
Metrics catalog
All metrics are namespaced cpe_sim_*. Per-CPE identity is intentionally not a label. Aggregating across the fleet is the design choice that lets a single time-series scale to thousands of CPEs without Prometheus cardinality issues; use the admin endpoint when you need per-CPE introspection.
Process
- cpe_sim_process_heap_alloc_bytes (gauge): Go heap currently allocated.
- cpe_sim_process_goroutines (gauge): live goroutine count.
- cpe_sim_process_fd_count{os} (gauge): open file descriptors. os="linux" reads /proc/self/fd; other platforms emit 0 with os="unsupported".
- cpe_sim_process_uptime_seconds_total (counter): wall-clock uptime since startup.
- cpe_sim_process_gc_pause_seconds (histogram): per-cycle GC pause.
Fleet
- cpe_sim_fleet_cpes_by_state{lifecycle_state} (gauge): count of CPEs in each lifecycle state. States: new, bootstrapping, ready, failed. Polled every 5s.
- cpe_sim_fleet_bootstraps_total (counter): successful first-contact bootstrap sessions.
- cpe_sim_fleet_failed_sessions_total (counter): bootstrap or periodic sessions that returned an error.
- cpe_sim_fleet_periodic_ticks_total (counter): periodic-Inform timer ticks fired.
- cpe_sim_fleet_scheduled_events_fired_total (counter): deferred Reboot / FactoryReset / boot events fired.
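Because lifecycle_state is the only label, the fleet gauge exposes four series no matter how many CPEs are running. The sketch below parses a scrape body to illustrate this. The sample text and its values are made up rather than captured from a real run, and the parser assumes the gauge lines carry no extra labels or timestamps.

```python
import re

# Illustrative scrape excerpt; the numbers are invented for the example.
SAMPLE = """\
# TYPE cpe_sim_fleet_cpes_by_state gauge
cpe_sim_fleet_cpes_by_state{lifecycle_state="new"} 0
cpe_sim_fleet_cpes_by_state{lifecycle_state="bootstrapping"} 3
cpe_sim_fleet_cpes_by_state{lifecycle_state="ready"} 996
cpe_sim_fleet_cpes_by_state{lifecycle_state="failed"} 1
cpe_sim_fleet_bootstraps_total 996
"""

_STATE_LINE = re.compile(
    r'^cpe_sim_fleet_cpes_by_state\{lifecycle_state="([^"]+)"\}\s+(\S+)$'
)


def cpes_by_state(exposition: str) -> dict:
    """Return {lifecycle_state: count} from a Prometheus text scrape body."""
    counts = {}
    for line in exposition.splitlines():
        match = _STATE_LINE.match(line.strip())
        if match:
            counts[match.group(1)] = float(match.group(2))
    return counts


if __name__ == "__main__":
    counts = cpes_by_state(SAMPLE)
    # One series per state, independent of the fleet total.
    print(len(counts), "series for", int(sum(counts.values())), "CPEs")
```

A script like this is handy for a quick fleet-level check; for anything per-CPE, the admin endpoint remains the intended surface.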
CWMP
- cpe_sim_cwmp_informs_total{event_code, result} (counter): Inform sessions, broken out per event code and result (ok/error).
USP
- cpe_sim_usp_requests_total{msg_type, result} (counter): USP requests handled.
- cpe_sim_usp_request_rtt_seconds{msg_type} (histogram): handler RTT (decode through send).
- cpe_sim_usp_autonomous_notifies_total{notify_kind, gated} (counter): autonomous Notifies emitted by the Subscription evaluator. gated="true" is the post-Bundle-2 path (every emission gated on a matching Subscription).
- cpe_sim_usp_mqtt_connection_state{broker, state} (gauge): MQTT broker connection state. States: connected, connecting, disconnected. Cardinality stays at O(brokers) regardless of fleet size.
RPC faults
- cpe_sim_rpc_faults_total{protocol, code} (counter): protocol fault responses. protocol="cwmp" today; USP faults add protocol="usp" in a later iteration.
Scrape config
scrape_configs:
  - job_name: cpe-sim
    static_configs:
      - targets: ['127.0.0.1:9100']
    scrape_interval: 15s
Alert recipes
These five alerts cover the failure modes the simulator's been instrumented for. Tune thresholds to your fleet size.
- Failed-bootstrap rate spike. A sudden jump in cpe_sim_fleet_failed_sessions_total is almost always an ACS misconfig, network change, or credentials drift.
rate(cpe_sim_fleet_failed_sessions_total[5m]) > 1
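- Autonomous-notify backlog. If cpe_sim_usp_autonomous_notifies_total flatlines while cpe_sim_fleet_cpes_by_state{lifecycle_state="ready"} stays high, the Subscription evaluator may be stuck. Both sides are aggregated per instance because the and operator only matches series whose label sets are identical.
sum by (instance) (rate(cpe_sim_usp_autonomous_notifies_total[5m])) == 0
  and sum by (instance) (cpe_sim_fleet_cpes_by_state{lifecycle_state="ready"}) > 0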
- MQTT disconnected fleet. CPEs that stay in state="disconnected" for more than a minute indicate a broker or auth issue. The expression itself has no duration; pair it with for: 1m in the alerting rule.
sum by (broker) (cpe_sim_usp_mqtt_connection_state{state="disconnected"}) > 0
- Fault-rate spike. A jump in cpe_sim_rpc_faults_total typically means the ACS sent malformed RPCs or the data model drifted.
rate(cpe_sim_rpc_faults_total[5m]) > 0.1
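- Heap growth without CPE growth. If cpe_sim_process_heap_alloc_bytes climbs but cpe_sim_fleet_cpes_by_state is steady, you probably have a leak; bisect against the bench harness below. The fleet side uses a subquery ([10m:]) because deriv needs a range vector and sum returns an instant vector, and on (instance) lets the two sides match despite their different labels.
deriv(cpe_sim_process_heap_alloc_bytes[10m]) > 0
  and on (instance) deriv(sum by (instance) (cpe_sim_fleet_cpes_by_state)[10m:]) == 0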
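To make the rate thresholds above concrete: rate() yields a per-second average over the window, so "> 1" over 5m means more than 300 failed sessions in five minutes, and "> 0.1" means more than 30 faults. The function below is a simplified two-sample approximation for reasoning about thresholds, not Prometheus's actual algorithm (which uses every sample in the window and extrapolates to its edges); the name per_second_rate is hypothetical.

```python
def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Approximate a counter's per-second rate from two samples.

    Simplification: only the first and last samples are used. A decrease is
    treated as a counter reset, so the increase is counted from zero.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    increase = last_value - first_value
    if increase < 0:
        # Counter restarted from zero inside the window.
        increase = last_value
    return increase / window_seconds


if __name__ == "__main__":
    window = 5 * 60  # the [5m] range used in the recipes
    # 300 failures in 5 minutes sits exactly on the "> 1" threshold.
    print(per_second_rate(100, 400, window))
    # 30 faults in 5 minutes sits on the "> 0.1" threshold.
    print(per_second_rate(0, 30, window))
```

When tuning for fleet size, it is often easier to pick the tolerable count per window first and divide by the window length to get the threshold.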
Benchmark harness
The per-CPE cost claim is verified by a go test -bench harness in cmd/cpe-sim:
go test -bench=. -run=^$ -benchtime=1x ./cmd/cpe-sim/
Output:
BenchmarkBootstrapN_100-16 1 32731652 ns/op 9632 bytes_per_cpe 0.06 goroutines_per_cpe 12 fds_open
BenchmarkBootstrapN_1000-16 1 139353111 ns/op 3533 bytes_per_cpe 0.006 goroutines_per_cpe 12 fds_open
Read it as:
- ns/op: total wall time for one iteration of the named scale (N CPEs bootstrapped).
- bytes_per_cpe: heap delta over baseline, divided by N. Approximate; GC can drop allocations between samples.
- goroutines_per_cpe: goroutines retained after the iteration, divided by N.
- fds_open: total file descriptors open at the end of the iteration (process-level, not per-CPE).
Paste the most-recent output into the PR description for any change that touches per-CPE state. A future PR may add a CI check that fails when bytes_per_cpe regresses past a threshold.
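One way such a check could look, as a sketch only: parse the benchmark lines and flag any whose bytes_per_cpe exceeds a limit. The helper names and the limit values are hypothetical; no threshold has been chosen in this document, and the sample input is the output quoted above.

```python
import re

# The two result lines quoted in the Output section.
SAMPLE_OUTPUT = """\
BenchmarkBootstrapN_100-16 1 32731652 ns/op 9632 bytes_per_cpe 0.06 goroutines_per_cpe 12 fds_open
BenchmarkBootstrapN_1000-16 1 139353111 ns/op 3533 bytes_per_cpe 0.006 goroutines_per_cpe 12 fds_open
"""

# name, optional -GOMAXPROCS suffix, iteration count, then "value unit" pairs.
_BENCH = re.compile(r"^(Benchmark\S+?)(?:-\d+)?\s+(\d+)\s+(.+)$")


def parse_bench_line(line: str):
    """Return (name, iterations, {unit: value}) or None for non-result lines."""
    match = _BENCH.match(line.strip())
    if match is None:
        return None
    tokens = match.group(3).split()
    metrics = {unit: float(value) for value, unit in zip(tokens[0::2], tokens[1::2])}
    return match.group(1), int(match.group(2)), metrics


def over_limit(output: str, limit_bytes_per_cpe: float) -> list:
    """Names of benchmarks whose bytes_per_cpe exceeds the given limit."""
    offenders = []
    for line in output.splitlines():
        parsed = parse_bench_line(line)
        if parsed is None:
            continue
        name, _, metrics = parsed
        if metrics.get("bytes_per_cpe", 0.0) > limit_bytes_per_cpe:
            offenders.append(name)
    return offenders


if __name__ == "__main__":
    # Hypothetical limit of 5000 bytes per CPE, purely for illustration.
    print(over_limit(SAMPLE_OUTPUT, 5000))
```

Since bytes_per_cpe is approximate and shrinks as fixed overhead is amortized over a larger N, a real check would likely need a separate limit per scale and some tolerance for GC noise.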
The USP RTT benchmark (BenchmarkUSPRequestResponseRTTN_*) is gated behind USP_BENCH=1 because it spins up the embedded MQTT broker; it reports rtt_p50_us / rtt_p95_us / rtt_p99_us when run.