This guide is for platform and SRE maintainers who run the search process alongside a Phoenix app using Scrypath. It complements Sync Modes and Visibility (application semantics) and Operator Support (library maintainer first response).
## Scope
- Scrypath v1 publicly targets Meilisearch first. Runbooks here assume a Meilisearch cluster or single node your app reaches over HTTP. Other engines need different metrics and failure modes; Scrypath does not abstract them on the public path.
- Goal: a small set of metrics and alerts so teams notice user-visible failure and capacity risk without paging on normal variance.
## Two layers: app vs search engine
| Layer | What you own | What breaks first when users complain |
|---|---|---|
| Application | Scrypath sync/search/hydration paths, Oban queues, DB | Wrong or stale results, timeouts, 5xx from your app |
| Meilisearch | Process health, disk, RAM, version, task pipeline | Search down, writes stuck, index corruption risk under disk pressure |
Instrument both. Do not page only on Meilisearch CPU; pair with Scrypath search error rate and end-to-end latency from the app.
## Scrypath telemetry (application signals)
Scrypath emits `:telemetry.span/3`-style events (see the Telemetry guide for `:start` / `:stop` / `:exception` and duration measurements on `:stop`). Keep dashboards low cardinality: use `schema`, `backend`, `index`, and `sync_mode`; avoid high-cardinality tags such as raw query text or primary keys on alert rules.
Stable event prefixes (each has `:start`, `:stop`, and on failure `:exception` where applicable):
| Event prefix | When | Useful aggregates |
|---|---|---|
| `[:scrypath, :search]` | Common-path search | p95/p99 duration, error rate, `hit_count` from stop metadata |
| `[:scrypath, :hydration]` | Repo batch load after search | Duration vs `hit_count` / `record_count` / `missing_count` (drift indicator when `missing_count` grows) |
| `[:scrypath, :sync, :upsert]` / `[:scrypath, :sync, :delete]` | Document sync | Error rate, `document_count`, `:noop` ratio (noisy if alerted per call) |
| `[:scrypath, :meilisearch, :request]` | HTTP to Meilisearch | `status_code`, `method`, path pattern — alert on sustained 5xx / connection errors |
| `[:scrypath, :meilisearch, :task_wait]` | Waiting for Meilisearch task completion | `poll_count`, `final_status` — large `poll_count` or non-`:succeeded` trends |
| `[:scrypath, :reindex, :settings_verified]` | Post-apply settings read-back | Stop metadata result tag (`:parity`, `:drift`, etc.) |
| `[:scrypath, :reindex, :verify_skipped]` | Execute only — settings verify skipped by opt | Rare; spikes may be intentional deploys; correlate with logs |
| `[:scrypath, :operator, :failed_work, :observed]` | Each failed-work row materialized from backend tasks or Oban jobs | Useful for dashboards and structured logs; high volume on noisy data — treat as a diagnostic signal and aggregate; do not page per event |
Dashboard-first signals: sync upsert volume, search QPS, hydration `missing_count` distribution, Meilisearch request latency.
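A minimal sketch of a low-cardinality handler for the `[:scrypath, :search, :stop]` event, assuming the standard `:telemetry` package. The metadata keys used (`schema`, `index`, `hit_count`) follow the table above; the log-line shape and attach id are illustrative, and you would swap `Logger` for your metrics sink:

```elixir
defmodule MyApp.ScrypathTelemetry do
  require Logger

  # Attach once at application start (assumes the :telemetry hex package).
  def attach do
    :telemetry.attach(
      "scrypath-search-stop",
      [:scrypath, :search, :stop],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event(_event, measurements, metadata, _config) do
    measurements
    |> to_metric(metadata)
    # Swap Logger for a real metrics sink (StatsD, Prometheus, etc.).
    |> Logger.info()
  end

  # Pure: build a low-cardinality metric line from span data.
  # Deliberately no raw query text or primary keys in the tags.
  def to_metric(%{duration: native}, metadata) do
    ms = System.convert_time_unit(native, :native, :millisecond)

    "scrypath.search schema=#{metadata[:schema]} index=#{metadata[:index]} " <>
      "duration_ms=#{ms} hit_count=#{metadata[:hit_count]}"
  end
end
```

Keeping the measurement-to-metric conversion in a pure function (`to_metric/2`) makes the handler trivial to unit test without executing real telemetry events.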
## Meilisearch infrastructure (minimal signals)
Prioritize signals that predict outage, data loss risk, or unbounded backlog. Exact metric names depend on your exporter (Prometheus sidecar, cloud vendor agent, or logs). Map these concepts to your stack:
- Process up / ready — HTTP `GET /health` (or vendor equivalent) from the same network path as the app. Page when unreachable for longer than a short window (e.g. two failed checks), not on single blips.
- Disk free — Meilisearch persists indexes; running out of disk is a top cause of corruption and wedged tasks. Alert on free space percentage or absolute GB with headroom for compactions and reindexes.
- Memory pressure — Large batches and concurrent indexing drive RSS. Page on OOM kills or sustained memory limit pressure from your orchestrator, not one-off spikes during planned reindex.
- Task failures — Meilisearch indexes work through a task queue. Sustained failed tasks (not every transient validation error) indicate a bad deploy, schema mismatch, or upstream bug. Prefer a rate or count over a window, not every single failure.
- Replication / multi-node (if used) — split brain or lag between nodes is a separate product surface; follow Meilisearch’s own HA docs for your version.
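The "two failed checks" rule from the process-health bullet above can be sketched as a pure decision over recent probe results; the probe itself (an HTTP `GET /health` returning 200) is left to your checker, and the module name and threshold are illustrative:

```elixir
defmodule HealthPager do
  # Page only after N consecutive failures, not on a single blip.
  @consecutive_failures_to_page 2

  # Decide whether to page from a list of recent probe results,
  # newest first. :up / :down come from e.g. GET /health checks.
  def page?(results) when is_list(results) do
    results
    |> Enum.take_while(&(&1 == :down))
    |> length()
    |> Kernel.>=(@consecutive_failures_to_page)
  end
end
```

Because only the *leading* run of `:down` results counts, a recovered check resets the decision, which is the behavior you want to avoid paging on transient blips.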
Avoid alert fatigue: do not page on single slow searches, one failed document in a batch, or Meilisearch 202 Accepted enqueue latency alone. Those belong on dashboards or SLO burn-rate rules with long windows.
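For the long-window SLO burn-rate rules mentioned above, the core check compares the windowed error ratio against the error budget times a burn factor. A minimal sketch; the default budget (99.9% SLO) and the 14.4x factor are illustrative, not prescribed by Scrypath:

```elixir
defmodule BurnRate do
  # error_count / total_count over a window, compared against an SLO
  # error budget multiplied by a burn factor.
  # e.g. a 99.9% SLO gives budget 0.001; a 14.4x burn over a 1h window
  # is a common fast-burn paging threshold.
  def breach?(error_count, total_count, slo_budget \\ 0.001, burn_factor \\ 14.4)

  # No traffic in the window: nothing to page on.
  def breach?(_error_count, 0, _slo_budget, _burn_factor), do: false

  def breach?(error_count, total_count, slo_budget, burn_factor) do
    error_count / total_count > slo_budget * burn_factor
  end
end
```

Feed it counts aggregated over a long window (e.g. from the `[:scrypath, :search]` error rate), never per-request results.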
## Footguns (Meilisearch + Scrypath-shaped)
- `filter` and `facetFilters` AND together — Users can think they cleared facets while a base `filter` still narrows results. Document this in your UI and ops playbooks; see the faceted search guide appendix.
- Reindex + disk — A full reindex can temporarily double the index footprint until old data is dropped. Plan disk headroom before `Scrypath.reindex/2` on large corpora.
- Settings verify skipped — `skip_settings_verification?: true` speeds emergencies but hides drift until the next verify. Treat it as a temporary flag; do not leave it on silently.
- Sync mode semantics — `:oban` means durable enqueue, not "search is updated." Paging on queue depth without checking search visibility misdiagnoses user impact; see the sync modes guide.
- Version skew — Meilisearch minor versions change task and index behavior. Pin server and client expectations per environment; roll upgrades through a canary before production.
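The first footgun can be made concrete: Meilisearch ANDs filter clauses together, so a server-side base filter keeps narrowing results even when the user clears every facet in the UI. A hypothetical helper (module name and clause shapes are illustrative) that builds the final filter string:

```elixir
defmodule SearchFilters do
  # Meilisearch combines filter clauses with AND; a base filter set
  # server-side still applies even when facet_clauses is empty.
  def build(base_filter, facet_clauses) do
    [base_filter | facet_clauses]
    |> Enum.reject(&(&1 in [nil, ""]))
    |> Enum.map(&"(#{&1})")
    |> Enum.join(" AND ")
  end
end
```

Note that `build("status = published", [])` still yields a narrowing filter — exactly the "I cleared my facets but results are still restricted" surprise described above.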
## What to run before you tune alerts
From the repo root (maintainer checks):
- `mix verify.phase13` (with integration when you have `SCRYPATH_MEILISEARCH_URL`) — focused operator-flow checks against a real Meilisearch, matching the CI-style job that runs with live integration enabled.
- Application-level: `Scrypath.sync_status/2`, `Scrypath.failed_sync_work/2`, `Scrypath.reconcile_sync/2` for a human-readable posture before you change indexing.
## Related docs
- ARCHITECTURE.md — drift, reindex order, and sync guarantees
- guides/sync-modes-and-visibility.md — `:inline` / `:oban` / `:manual`
- guides/operator-mix-tasks.md — thin Mix wrappers over `Scrypath.*`
- guides/relevance-tuning.md — settings and verify-applied semantics