Building a Custom SIEM, Part 2: Implementation and Gotchas

Part 2 of a 2-part series on replacing the Mission Control Dashboard's SQLite-only event store with a Vector + ClickHouse log-lake. This post walks the 14 implementation phases, the cutover gate that caught 19 column-name drifts, the launchd watchdog bug that almost shipped a 7-second outage on every fsevent, and the persona-rule track record from 137 commits.

Chris Johnson · 18 min read

137 commits. 14 phases. 68 catches. One PR.

Part 1 was the design: why the Mission Control Dashboard outgrew SQLite as an event store, why I went custom instead of adopting Wazuh or Splunk, and how Vector plus ClickHouse on a Mac mini M4 hangs together. This post is what it took to ship that design. The phase order. The cutover gate that caught 19 column-name drifts before they could hit production. The launchd watchdog that fired three orders of magnitude more often than the event it was supposed to detect. The persona-rule track record from 11 phases of work, including the three drifts where the captain agent was the one who got the framing wrong.

SIEM Log-Lake Technical Retrospective infographic. 137 commits, 14 phases, 68 catches across 11 phases of work. Vector plus ClickHouse cutover from SQLite-only event store. 984 backend tests passing at merge, 31,587 lines added, 7,747 removed across 221 files.

Series Context

This is part 2 of the Custom SIEM series. Part 1, "Building a Custom SIEM, Part 1: Why and the Architecture," covers the architecture and is the prerequisite read.

This post stands on its own as a debug-and-ship narrative, but the architecture context lives in Part 1. Diagrams of the substrate (logical architecture, ingestion patterns, schema engines) are over there.

The Phase Model#

14-phase SIEM rollout: 137 commits across 221 files, from prerequisites through documentation soak. Phases are grouped into five stages: Setup (cyan), Wire-up (emerald), Cutover (amber), Operationalize (fuchsia), and Cleanup plus soak (rose).

Fourteen phases shipped on a single feature branch. Numbering runs Phase 0 through Phase 13, with Phase 4.5 and Phase 5.5 inserted as gates between engineering phases. Each phase is TDD-shaped: a pre-verify step that fails, an implementation step, a post-verify step that passes. The shape of the rollout breaks into four bands.

Setup (0 through 2). Prerequisites that have to be true before any code changes: the Mac mini's DHCP reservation pinned to 10.0.0.187, external SSD sleep disabled (pmset -a disksleep 0), rsyslog imfile config installed on the Pi-hole host, and live tcpdump captures of UDM Pro and Pi-hole syslog committed as VRL test fixtures. Then the docker-compose scaffolding and the 11 ClickHouse migrations.

Wire-up (3 through 4.5). The Vector pipeline (sources, VRL transforms, sinks), the backend ClickHouse module (clickhouse-connect client wrapper, migration runner, query helpers, disk-cap poller), and an operator-side smoke gate to verify dual-write parity before flipping any read paths.

Cutover (5 through 7). Endpoint rewrites behind a SIEM_DUAL_WRITE_ENABLED flag, with a 19-drift fold-back commit at the end of Phase 5 catching every column-name mismatch the writer agents produced. Detection retarget to ClickHouse (10 security signals + 6 threat-intel heuristics). The unified security_finding table becomes the write target for both detection paths.

Operationalize plus cleanup (8 through 13). Notification dispatcher (macOS push for HIGH-severity, Gmail digest at 08:00 UTC). Health chips and scripts/doctor.sh. SQLite event-table cleanup (drops 23 tables, leaving 18 reference-and-ack tables behind). End-to-end soak with deliberate SSD unmount/remount. Documentation pass.

The single PR shape is deliberate. Mid-cutover read paths are fragile because half the surface reads SQLite and half reads ClickHouse, and the dual-write window is the only safe time to fix divergence. Shipping the intermediate states as separate PRs would have meant living in that window for weeks. One PR, 137 commits, merged with the cutover gate green.

The Five-Agent Team#

Five Claude Code subagents owned the work. Each had two modes: spec-review (read the existing code, check the design assumption) and implementation-ownership (write the code).

| Agent | Model | Surface |
| --- | --- | --- |
| siem-captain | opus | Wave dispatch, pre-implementation audit, cross-phase sequencing rulings |
| siem-security | sonnet | Threat model the listener, the auth boundary, residual risks; detection rule retarget |
| siem-network | sonnet | Syslog source config, VRL transforms, vector.yaml, rsyslog setup |
| siem-database | sonnet | ClickHouse schema, engine choices, migrations, query helpers, writers |
| siem-homeoperator | sonnet | Runbook drafting, scripts/doctor.sh, parity-gate scripting, operator-side perspective |

The working agreement that did most of the work was the verify-against-existing-clients persona rule. Phrased in one sentence: before writing a new client (a Vector source, a Python writer, a query helper), read the existing client for the same upstream system. Don't assume the auth shape, the URL shape, the header shape, or the failure modes. Read the file.

This sounds obvious until you watch a coding agent skip it. The captain self-audit checklist that landed at the end of Phase 10 says, in part:

The captain self-audit checklist

  1. Re-grep the affected surface against the full migrating-model list, not the agent's reported list.
  2. When a previous captain's framing references a library's behavior, read the library source before ratifying.
  3. When a destructive operation deletes registry entries, verify whether ANY caller zips against the registry. If yes, deletion is forbidden and replacement-with-stub is required.
  4. Soak heuristics are heuristics, not invariants. When a bug shape is exercised by load not by elapsed time, compressed-load-soak is functionally equivalent to wall-clock soak.

Each item is a rule that got formalized after a drift caught by the rule's absence. The rule applied to the agents first. Then it applied to captain rulings. By the end of Phase 11 it applied to the captain self-auditing prior captain rulings, which is when this project got interesting.

Phase 0: tcpdump as a Design Tool#

The first thing this project shipped was 60 seconds of tcpdump.

bash
sudo tcpdump -i any -n udp port 5140 and src host 10.0.0.1 -w /tmp/udmpro-live.pcap -c 500

That capture became vector/tests/fixtures/udmpro-samples.txt. Same pattern for the Pi-hole at 10.0.0.227. The captures gave the VRL transforms in Phase 3 an actual ground truth to test against, instead of guessed-at log shapes from vendor docs.

The captures also surfaced two architectural facts the design spec had wrong.

First: the U7 Pro APs and the USW-Flex switches were already forwarding syslog directly to the LAN, not just relaying through the UDM Pro. Phase 0 expected one syslog source on 10.0.0.1. Phase 0 found four classes: UDM Pro, Pi-hole (via rsyslog relay on 10.0.0.227), three U7 Pro APs, and one USW-Flex switch. The plan grew a parse_unifi_device transform to handle five additional shapes (hostapd 802.11 events, hostapd WPA, kernel wlan events, stahtd JSON STA-tracker events, and PSE flap events from the Flex).

Second: that USW-Flex was also flapping its PSE-2 PoE port repeatedly during the capture window. Hardware finding, surfaced before any detection rule was written, just by reading the syslog stream the way the future SIEM was going to read it.

The pattern

If you are building a system that ingests data from somewhere, capture an hour of the data first and read it. The shape will not match the docs. The capture is the design doc.

Phase 3: Vector 0.40 Sink Config Gotchas#

Phase 3 was three syslog parsers (parse_udmpro, parse_pihole, parse_unifi_device), one route transform that dispatches by source IP and hostname, and two ClickHouse sinks. The hard part was not the VRL. The hard part was getting Vector 0.40 to talk to ClickHouse without crashing.

The sink config that finally worked has four annotations on it that each represent something that broke in production:

yaml
# Vector 0.40 ClickHouse sink encoding notes:
# - encoding.codec is INVALID; only only_fields/except_fields/timestamp_format accepted
# - Use timestamp_format: unix_ms; ClickHouse JSONEachRow rejects ISO8601 Z suffix
# - Disk buffer minimum is 268435488 bytes (256 MiB + 32 bytes); smaller values crash Vector
# - ch_raw_syslog needs only_fields: Vector syslog source emits ~12 metadata fields
#   (appname, msgid, procid, source_ip, version, etc.) not in raw_syslog schema;
#   ClickHouse rejects rows with unknown JSON keys by default
sinks:
  ch_raw_syslog:
    type: clickhouse
    inputs: [parse_udmpro, parse_pihole, parse_unifi_device]
    endpoint: "http://${CLICKHOUSE_HOST}:8123"
    database: "homenet"
    table: "raw_syslog"
    auth:
      strategy: basic
      user: "${CLICKHOUSE_USER}"
      password: "${CLICKHOUSE_PASSWORD}"
    encoding:
      only_fields: [ts, time_received, source, client_mac, facility, severity, hostname, program, message, raw, fields]
      timestamp_format: unix_ms
    compression: zstd
    buffer:
      type: disk
      max_size: 2147483648  # 2 GiB (2 * 1024^3 bytes)
      when_full: block

Walking the four gotchas:

encoding.codec is invalid. Vector 0.40's ClickHouse sink only accepts only_fields, except_fields, and timestamp_format under encoding. Setting encoding.codec: json is a config error, not a no-op. The error message points at the wrong line.

timestamp_format: unix_ms is required. The default is RFC3339 with a Z suffix. ClickHouse's JSONEachRow parser rejects the Z and the row insert fails silently from Vector's perspective (the events accumulate in the buffer; nothing arrives at ClickHouse).
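To make the difference between the two encodings concrete, here is a minimal Python sketch of the conversion that timestamp_format: unix_ms implies. The helper name to_unix_ms is hypothetical; it is not part of Vector or of this project, and it only illustrates the shape of the value that ends up on the wire.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_unix_ms(rfc3339: str) -> int:
    """Convert an RFC 3339 timestamp to integer milliseconds since the Unix epoch."""
    # datetime.fromisoformat() only accepts a trailing "Z" on Python 3.11+,
    # so normalize it to an explicit UTC offset for older interpreters.
    if rfc3339.endswith("Z"):
        rfc3339 = rfc3339[:-1] + "+00:00"
    parsed = datetime.fromisoformat(rfc3339)
    if parsed.tzinfo is None:
        # Assumption for this sketch: naive timestamps are already UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    # Floor-dividing timedeltas keeps the arithmetic in integers,
    # which avoids float rounding at millisecond precision.
    return (parsed - EPOCH) // timedelta(milliseconds=1)
```

For example, to_unix_ms("1970-01-01T00:00:01Z") returns 1000: a plain integer with no Z suffix left for the parser to reject.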

Disk buffer minimum is exactly 268435488 bytes. That is 256 MiB plus 32 bytes. Setting it to 268435456 (a clean 256 MiB) crashes Vector at startup with a numeric-bounds error. Setting it any lower also crashes. The + 32 bytes is the buffer header overhead; Vector won't let you allocate space that doesn't fit at least one empty record.
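The arithmetic is easy to get wrong by 32 bytes, so a pre-flight check is cheap insurance. This is a sketch only: the constant and function names are hypothetical, and the 32-byte overhead is the figure observed in this post, not something read out of Vector's source.

```python
MIB = 1024 * 1024
BUFFER_HEADER_BYTES = 32  # overhead figure observed in this post (assumption, not from Vector's source)
MIN_DISK_BUFFER_BYTES = 256 * MIB + BUFFER_HEADER_BYTES


def check_disk_buffer_size(max_size: int) -> None:
    """Raise ValueError if max_size is below the minimum described above."""
    if max_size < MIN_DISK_BUFFER_BYTES:
        raise ValueError(
            f"disk buffer max_size {max_size} is below the minimum "
            f"{MIN_DISK_BUFFER_BYTES} (256 MiB + {BUFFER_HEADER_BYTES} bytes)"
        )


# The 2 GiB value used in the sink config passes; a clean 256 MiB would not.
check_disk_buffer_size(2 * 1024 * MIB)
```

Running a check like this in CI against the rendered vector.yaml would turn a startup crash into a failed lint step.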

only_fields is mandatory for ch_raw_syslog. Vector's syslog source decorates every event with about 12 metadata fields (appname, msgid, procid, source_ip, version, and friends) that aren't in the raw_syslog schema. By default, ClickHouse rejects JSON rows that contain unknown keys. The only_fields allowlist limits the wire payload to columns the table actually has.

That last one took the longest to find. The symptom was "events flow through Vector cleanly, the heartbeat sink writes successfully, the raw_syslog sink reports successful writes, but SELECT count() FROM raw_syslog is zero." The fix was a one-line only_fields list literal in YAML.
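A quick way to see what the allowlist does is to project a sample event onto the table's columns and list what falls outside. The sketch below is illustrative Python, not Vector's implementation; the column tuple is copied from the only_fields list in the sink config, and project_event is a hypothetical helper.

```python
from typing import Any, Dict, List, Tuple

# Columns copied from the only_fields allowlist in the sink config.
RAW_SYSLOG_COLUMNS = (
    "ts", "time_received", "source", "client_mac", "facility", "severity",
    "hostname", "program", "message", "raw", "fields",
)


def project_event(event: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Split an event into (keys the table has, sorted names of keys it lacks)."""
    kept = {key: value for key, value in event.items() if key in RAW_SYSLOG_COLUMNS}
    dropped = sorted(key for key in event if key not in RAW_SYSLOG_COLUMNS)
    return kept, dropped


# A made-up event carrying two of the syslog-source metadata fields.
sample = {"hostname": "udmpro", "message": "link up", "appname": "kernel", "procid": "42"}
kept, dropped = project_event(sample)
```

Here kept holds only hostname and message, and dropped is ["appname", "procid"]: exactly the keys that would have made ClickHouse reject the row.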

Vector 0.40 sink discovery shape

None of the four issues above produced a useful Vector log line. Each surfaced as "rows in the buffer, zero rows in ClickHouse." The debug path that worked was: enable the ClickHouse query log, watch what HTTP body actually hits the server, compare to the schema. Treat the ClickHouse query log as the canonical debug surface for any "but Vector says it sent" mystery.

Phase 5: The Cutover Gate and 19 Fold-Back Drifts#

Phase 5 was the read-path cutover. Each routers/*.py file got rewritten from select(Client) against SQLAlchemy to queries.clients_latest(ch_client) against ClickHouse. The new code shipped behind a flag (SIEM_DUAL_WRITE_ENABLED) so the operator could turn ClickHouse off without losing the SQLite write path during the transition.

The dual-write window was the place where divergence had to be caught, because it was the only place the system could tolerate divergence. Phase 4.5 (the operator-action smoke gate) inserted a scripts/check_dual_write_parity.sh that compares row counts SQLite-vs-ClickHouse for client, device, network, wlan, and pihole_stats after at least one tick of every default-cadence poller. Grading is cadence-aware: >= 0.5% divergence is RED.
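The real gate is a shell script, but the grading rule itself is small enough to sketch in Python. Everything here is an assumption except the 0.5% threshold: the function name is hypothetical, the larger of the two counts is used as the baseline, and the cadence-aware part (waiting for one tick of every poller) is not modeled.

```python
RED_THRESHOLD_PER_MILLE = 5  # 0.5%, the RED threshold quoted above


def grade_parity(sqlite_rows: int, clickhouse_rows: int) -> str:
    """Return "RED" when the two row counts diverge by 0.5% or more, else "GREEN"."""
    baseline = max(sqlite_rows, clickhouse_rows)
    if baseline == 0:
        # Both stores are empty: nothing to diverge on.
        return "GREEN"
    diff = abs(sqlite_rows - clickhouse_rows)
    # Integer comparison (diff / baseline >= 5 / 1000) avoids float edge cases
    # right at the threshold.
    if diff * 1000 >= RED_THRESHOLD_PER_MILLE * baseline:
        return "RED"
    return "GREEN"
```

With 1000 rows in SQLite, 995 rows in ClickHouse grades RED (exactly 0.5%) and 996 grades GREEN.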

That gate caught 19 column-name drifts between the SQLite SQLModel definitions and the ClickHouse migrations. A representative sample:

| SQLite column | ClickHouse column | Caught at |
| --- | --- | --- |
| last_seen | last_seen_ts | Phase 5 close-out |
| node_type | type | Phase 5 close-out |
| pihole_top_permitted | pihole_top_allowed | Phase 5 close-out |
| dpi_active | dpi_snapshot | Phase 5 close-out |
| event_type / message | title / detail | Phase 5 close-out |

Each one would have shipped a silently-empty endpoint. The reason they got caught is that the captain dispatched siem-database to verify-against-actual-schema (read the migration files, then read the SQLModel files, then list every column-name diff) before each rewrite wave dispatched. The 19 drifts went into a single fold-back commit with each drift named, the symptom described, and the canonical naming pinned. That fold-back commit closed Phase 5.
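The "list every column-name diff" step reduces to a set difference once both column lists are in hand. A minimal sketch, assuming the column names have already been extracted from the SQLModel and migration files (that extraction is not shown, and column_drift is a hypothetical helper):

```python
from typing import Dict, Iterable, List


def column_drift(sqlite_cols: Iterable[str], clickhouse_cols: Iterable[str]) -> Dict[str, List[str]]:
    """Report column names that exist on only one side of the cutover."""
    sqlite_set, clickhouse_set = set(sqlite_cols), set(clickhouse_cols)
    return {
        "sqlite_only": sorted(sqlite_set - clickhouse_set),
        "clickhouse_only": sorted(clickhouse_set - sqlite_set),
    }


# Two of the drifts from the table above, plus a shared "mac" column
# invented for the example.
drift = column_drift(
    ["mac", "last_seen", "node_type"],
    ["mac", "last_seen_ts", "type"],
)
```

The result pairs nothing automatically: it surfaces last_seen and node_type on one side and last_seen_ts and type on the other, and a human (or an agent) still has to decide which names are renames of each other.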

The flag-gated branch pattern (no in-handler try/except fallback) was the operator decision. The alternative considered was a try/except wrapping the ClickHouse call that would silently fall back to SQLite on failure. The operator killed that idea: silent fallback would mask ClickHouse degradation. If ClickHouse goes unhealthy, the operator wants to see broken endpoints and know about it, not see slowly-stale SQLite-backed data and not know about it. The flag is binary, the failure is loud, the recovery is SIEM_DUAL_WRITE_ENABLED=false and a backend restart.
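In handler terms the pattern looks roughly like the sketch below. The flag name comes from the project; the function names, the accepted truthy spellings, and the default of "off" when the variable is unset are assumptions made for illustration.

```python
import os
from typing import Callable, List, Mapping, Optional


def dual_write_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read SIEM_DUAL_WRITE_ENABLED; the accepted truthy spellings are an assumption."""
    source = os.environ if env is None else env
    return source.get("SIEM_DUAL_WRITE_ENABLED", "false").strip().lower() in {"1", "true", "yes"}


def read_clients(
    read_clickhouse: Callable[[], List[str]],
    read_sqlite: Callable[[], List[str]],
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Pick exactly one store based on the flag, with no fallback between them."""
    if dual_write_enabled(env):
        # Deliberately no try/except: a ClickHouse failure propagates to the
        # caller and shows up as a broken endpoint, not as stale SQLite data.
        return read_clickhouse()
    return read_sqlite()
```

The important part is what is absent: there is no except clause that reaches for read_sqlite when read_clickhouse raises.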

Read-path cutover discipline

For every cutover where two stores temporarily coexist: pick a parity script with cadence-aware grading, name divergence as a hard stop, and refuse silent fallback patterns in handler code. Loud failure and a runbook beat silent degradation every time.

Phase 12: The launchd Watchdog Bug That Almost Shipped#

This is the one that almost shipped a regression worse than the bug it was meant to fix.

The Mac mini external SSD lives at /Volumes/MacExternal. macOS occasionally remounts that volume on its own (drive sleep wake, power blip, Docker Desktop restart). When the volume drops, the ClickHouse container is reading from a dead mount and the data directory is gone. The fix is to bounce the docker-compose stack as soon as the mount comes back.

The original design was a launchd WatchPaths agent at launchd/com.homenet.siem-remount.plist that watched /Volumes/MacExternal and, on any fsevent, ran scripts/siem_remount_recover.sh to bounce the stack. The script worked correctly when triggered. Phase 1 shipped this and Phase 12 was supposed to soak-test it under realistic remount conditions.

Phase 12 Task 12.2 (the compressed-soak window) caught what Phase 1 had actually shipped: launchd's WatchPaths over-fires on bind mounts by approximately three orders of magnitude. Every fsevent at the mount root, every Docker bind-mount write, every container layer commit, and every ClickHouse part-merge that touched the volume's path metadata fired the watchdog. Because the script had no stale-state guard, each of those fsevents triggered a 7-second outage on a stack that was perfectly healthy.

The fix was a stale-state guard pattern in scripts/siem_remount_recover.sh. The guard probes the running state before doing anything destructive:

  1. Running-state probe. Are the ClickHouse and Vector containers up and healthy? If yes, the fsevent was noise; exit 0.
  2. SQL probe. Can the ClickHouse container answer SELECT 1? If yes, the volume is alive and the stack is fine; exit 0.
  3. Stage-1 recovery. Stop and start the docker-compose services. Re-probe.
  4. Stage-2 recovery. Stop, sleep 5, start. Re-probe.
  5. Stage-3 escalation. Full Docker Desktop daemon restart via osascript. This is for the worst case where the Docker daemon itself has gotten confused by the remount. Re-probe.

Each stage waits and verifies before escalating. Recovery from a forced unmount-remount is approximately 16 seconds end-to-end. Recovery from a noise fsevent is approximately 50 milliseconds (the running-state probe alone).
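The control flow of the guard can be sketched in a few lines of Python. The real implementation is the shell script named above; here a single is_healthy callable stands in for both the running-state and SQL probes, and the stage actions are injected, so every name in the sketch is hypothetical.

```python
from typing import Callable, Sequence, Tuple

Stage = Tuple[str, Callable[[], None]]


def recover(is_healthy: Callable[[], bool], stages: Sequence[Stage]) -> str:
    """Probe first, then escalate one stage at a time, re-probing after each.

    Returns "noop" if nothing needed fixing, the name of the stage that
    restored health, or "failed" if every stage ran without success.
    """
    if is_healthy():
        # The callback was noise: do nothing destructive.
        return "noop"
    for name, action in stages:
        action()
        if is_healthy():
            return name
    return "failed"
```

A caller would pass the stages in escalation order, for example [("stage-1", restart_services), ("stage-2", stop_sleep_start), ("stage-3", restart_docker_daemon)], where those three callables are placeholders for the real actions. On a healthy stack the function returns "noop" without invoking any of them.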

launchd WatchPaths is not an event channel

WatchPaths is a fsevent firehose, not a state-change channel. Treat every callback as advisory and verify state before acting. The pattern that works is: probe the system you are about to "fix" first; the fix is a no-op if the system is healthy.

The catch attribution for this one: it was the operator (me, doing Phase 12 Task 12.2 by hand) who noticed the docker-compose containers cycling repeatedly during the soak window, before any production data was at risk. The drift list in the plan doc files this as drift #67, captain framing miss sub-class "verify-against-actual-bug-shape, not against ceremony." The original design's confidence in WatchPaths came from the docs. The actual behavior under the deployment shape only surfaced under load.

Persona-Rule Track Record#

The verify-against-existing-clients rule, plus the captain self-audit checklist that grew alongside it, caught 68 issues + 2 architectural escalations + 3 special attribution columns across 11 phases of work, in 12 distinct sub-classes. Captain framing misses on this project: 3 (drifts #45, #52, and #65).

Two of those captain-side misses are worth narrating.

Drift #45: the lru_cache framing miss. A previous captain handed off a bug as "the ClickHouse client pool got collapsed to an lru_cache singleton at some point; rebuild the pool, est. 1 to 2 hours." The next captain's first instinct was to dispatch the rebuild. Instead, they read backend/src/homenet_dashboard/clickhouse/client.py end-to-end first. The pool spec was there, correctly implemented (pool_mgr=get_pool_manager(maxsize=cfg.dashboard_pool_size)). The actual root cause was different: clickhouse-connect 0.15.1's sync get_client() defaults autogenerate_session_id=True, which injects a single per-Client session_id. That session_id is the collision key for httpclient.py:535-544's _active_session check, which serializes concurrent queries on a single Client across the entire FastAPI worker pool.

Reading the vendored library source was load-bearing. The fix shrank from "1-to-2-hour pool rebuild" to "20-minute one-line autogenerate_session_id=False flip plus two regression tests." The pre-existing pool was correct; the per-Client session was wrong. Caught pre-fix by captain succession.

Drift #52: the "dead code under flag" miss. Phase 10's SQLite cleanup framed eight evaluator helpers in evaluator.py as "dead code under the SIEM_DUAL_WRITE_ENABLED flag, safe to delete." The captain ratified that framing and dispatched the deletion. Two ClickHouse-flavored evaluators (evaluator_ch.py) imported the supposedly-dead helpers as fallback delegates. The deletion broke the import chain.

The bug surfaced post-deploy, when Signal 5 (firmware drift) crashed every cycle in production with a NameError. The captain's miss was framing-level: "dead code under flag" was a partial truth that hid a back-edge dependency. The fix was a stub-replacement pattern: every removed helper got replaced with a one-line stub returning SignalResult(status="unknown", metadata={"reason": "SQLite path retired post-Phase-10"}). The SIGNAL_REGISTRY ordering invariant (10 entries, positionally zipped against security_rules.yml) is preserved. The dead SQL bodies are gone. Sub-class added to the rule list: "verify-against-delegation-back-edges."
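A small Python sketch shows why the stub matters for a positionally zipped registry. SignalResult and SIGNAL_REGISTRY are names from the project, but the definitions below are simplified stand-ins: the two-entry registry, the rule names, and the evaluator bodies are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SignalResult:
    # Simplified stand-in for the project's class; only these two fields are modeled.
    status: str
    metadata: Dict[str, str] = field(default_factory=dict)


def retired_stub() -> SignalResult:
    """One-line replacement for a removed SQLite-backed helper."""
    return SignalResult(status="unknown", metadata={"reason": "SQLite path retired post-Phase-10"})


def firmware_drift() -> SignalResult:
    """Hypothetical live evaluator."""
    return SignalResult(status="ok")


# Hypothetical stand-in for the ordered rule list in security_rules.yml.
RULE_NAMES: List[str] = ["legacy_signal", "firmware_drift"]
SIGNAL_REGISTRY: List[Callable[[], SignalResult]] = [retired_stub, firmware_drift]


def evaluate_all() -> Dict[str, SignalResult]:
    """Zip rules to evaluators by position, refusing to run if the lengths differ."""
    if len(RULE_NAMES) != len(SIGNAL_REGISTRY):
        raise RuntimeError("registry and rule list are out of step")
    return {name: evaluate() for name, evaluate in zip(RULE_NAMES, SIGNAL_REGISTRY)}
```

If the retired entry were deleted instead of stubbed, zip(RULE_NAMES, SIGNAL_REGISTRY) would silently pair legacy_signal with firmware_drift. The stub keeps every position occupied, and the length check turns any future misalignment into a loud failure.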

The lesson generalizes: persona rules apply to captain rulings too, not just agent dispatches. When the captain frames a piece of work, the next captain should treat that framing the same way an agent would treat a spec: read the actual code first, verify the framing matches reality, then ratify or revise.

Gotchas Worth Knowing#

Six pieces of this stack have a "you will hit this" quality. Naming them up front saves the next person a debugging cycle.

Docker for Mac IPv6. ClickHouse 24.x crashes at startup binding [::]:9009 because there is no IPv6 in the Docker for Mac container. Fix: <interserver_http_port remove="remove"/> in the server config XML. Single-node deployments don't need replication, and removing the inter-server port removes the bind that fails.

Alpine localhost resolves IPv6 first. The ClickHouse healthcheck inside an alpine-based image resolves localhost to ::1 before 127.0.0.1. If the server is listening IPv4-only (because of the IPv6 fix above), the healthcheck fails. Use 127.0.0.1 explicitly in the healthcheck command, not localhost.

testcontainers on macOS needs DOCKER_HOST. Docker Desktop on Mac puts the socket at ~/.docker/run/docker.sock, not the conventional /var/run/docker.sock. testcontainers-python doesn't auto-detect this. Four-line conftest.py shim:

python
import os
import pathlib
sock = pathlib.Path.home() / ".docker" / "run" / "docker.sock"
os.environ.setdefault("DOCKER_HOST", f"unix://{sock}")

Pi-hole v6 session auth is stateful. The flow is POST /api/auth with the admin password to get a SID, send the SID on every request, then DELETE /api/auth to free the seat (Pi-hole has a small concurrent-session cap). Vector's http_client source is one-shot per interval and cannot do that dance. This is why REST polling stayed in Python in the architecture in Part 1: the existing clients/pihole.py already handles seats_exhausted (429) and auth_expired (401/403) correctly. The "who initiates the network call" boundary fell out of this naturally.
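The acquire-use-release shape maps naturally onto a context manager. The sketch below is not the project's clients/pihole.py: the transport callable is injected so the example stays self-contained, and the reply shape reply["session"]["sid"] is my assumption about the auth response, so check it against your Pi-hole version.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional

# transport(method, path, sid, json_body) -> decoded JSON reply.
# How the SID is attached to a request (header or parameter) is left to the transport.
Transport = Callable[[str, str, Optional[str], Optional[dict]], dict]


@contextmanager
def pihole_session(transport: Transport, password: str) -> Iterator[str]:
    """Log in, yield the SID, and always log out so the seat is freed."""
    reply = transport("POST", "/api/auth", None, {"password": password})
    sid = reply["session"]["sid"]  # assumption: shape of the auth reply
    try:
        yield sid
    finally:
        # Runs even if the body raised, so a crash mid-poll does not
        # leave a session occupying one of the limited seats.
        transport("DELETE", "/api/auth", sid, None)
```

A poller would wrap each cycle in with pihole_session(transport, password) as sid: and pass sid to every request inside the block. A one-shot HTTP source has nowhere to put the finally clause, which is the whole reason this stayed in Python.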

launchd WatchPaths over-fires on bind mounts. Approximately three orders of magnitude more often than the underlying mount-state change. The stale-state guard pattern is canonical: probe running state and SQL state before any destructive action, escalate in stages, treat every callback as advisory.

rsyslog imfile format matters. Pi-hole's FTL.log is not native syslog: it is a plain text file. The Pi runs rsyslog with the imfile module watching /var/log/pihole/FTL.log and forwarding to UDP 5140 with RSYSLOG_ForwardFormat. The default RSYSLOG_TraditionalFileFormat strips the priority marker (<NN>), which makes Vector's syslog source reject the line as malformed. The diff is one line in the rsyslog template name and an hour of "why does Vector see UDM Pro lines but not Pi-hole lines."
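When a fixture line is being rejected, checking for the priority marker by hand takes seconds. A minimal sketch: the PRI encoding (facility times 8 plus severity, maximum 191) is standard syslog, while decode_pri is a hypothetical helper and not how Vector parses.

```python
import re
from typing import Optional, Tuple

PRI_PATTERN = re.compile(r"^<(\d{1,3})>")


def decode_pri(line: str) -> Optional[Tuple[int, int]]:
    """Return (facility, severity) from a leading <NN> marker, or None if absent or out of range."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:
        # 23 * 8 + 7 is the largest valid PRI value.
        return None
    # PRI = facility * 8 + severity
    return divmod(pri, 8)
```

decode_pri("<134>Jan  1 00:00:00 pihole FTL: query") returns (16, 6), which is local0 at informational severity. The same line without the leading <134> returns None, which is the shape a parser that requires the marker would refuse.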

Each gotcha was a lost afternoon

None of these are documented in a single place. Each one is a "the docs imply X, the system does Y, the gap is one config keystroke" story. Capture them when you find them; the next person on the same stack hits the same six.

What's Next#

The substrate is live. The daily Gmail digest fires at 08:00 UTC. The macOS push fires within five minutes of a HIGH-severity finding insert. Claude Code can ask the dashboard's data plain SQL questions through the read-only claude ClickHouse role. The 10 security signals and 6 threat-intel heuristics keep firing on the same cadence they always did, just against a store that fits.

The MVP is a substrate, not an end state. Section 12 of the design spec lists deferred work with target dates, in priority order:

| Capability | Target | Why |
| --- | --- | --- |
| macOS Unified Log onboarding via Vector file source | Within 1 month | Highest-value Phase 2 work; Mac mini compromise is the highest-value target on this network |
| NetFlow / IPFIX from UniFi switches | Within 3 months | Largest detection-coverage gap (lateral movement, IoT-to-LAN scans) |
| UniFi Protect motion / event polling | Within 3 months | Physical-layer events in the same lake as IPS events |
| DGA / beaconing detection rule on client_dns_query | Opportunistic | Domain entropy, request periodicity, low-success-rate ratios |
| Grafana sidecar with a ClickHouse data source | No commitment | Pure want |
| Sigma rule transpilation | If/when | Unblocks third-party detection content |

That is the forward path. The shape that made this project work, the persona-rule track record from 11 phases of work, generalizes past SIEM. The rule set is short: read the existing client before writing the new one, read the library source before ratifying the framing, treat every callback as advisory until you have probed state, treat the cutover gate as a hard stop. Each rule was earned by something getting caught.

The substrate caught the same things last week it would have caught this week. It just does it on a store that fits, with a SQL surface a coding agent can use for forensics in real time, on hardware that is going to outlast the next three SSDs. The point of going custom was never the SIEM. The point was getting back the part of the stack that the off-the-shelf options ask you to give up: the right to investigate your own data in plain SQL, on your own hardware, on your own schedule.

That is what shipped.
