Search backend operations (SRE view)

This guide is for platform and SRE maintainers who run the search process alongside a Phoenix app using Scrypath. It complements Sync Modes and Visibility (application semantics) and Operator Support (library maintainer first response).

Scope

  • Scrypath v1 publicly targets Meilisearch first. Runbooks here assume a Meilisearch cluster or single node your app reaches over HTTP. Other engines need different metrics and failure modes; Scrypath does not abstract them on the public path.
  • Goal: a small set of metrics and alerts so teams notice user-visible failure and capacity risk without paging on normal variance.

Two layers: app vs search engine

| Layer | What you own | What breaks first when users complain |
| --- | --- | --- |
| Application | Scrypath sync/search/hydration paths, Oban queues, DB | Wrong or stale results, timeouts, 5xx from your app |
| Meilisearch | Process health, disk, RAM, version, task pipeline | Search down, writes stuck, index corruption risk under disk pressure |

Instrument both. Do not page only on Meilisearch CPU; pair with Scrypath search error rate and end-to-end latency from the app.

Scrypath telemetry (application signals)

Scrypath emits :telemetry.span/3-style events (see Telemetry for start / stop / exception and duration measurements on stop). Keep dashboards low cardinality: use schema, backend, index, sync_mode — avoid high-cardinality tags such as raw query text or primary keys on alert rules.

Stable event prefixes (each has :start, :stop, and on failure :exception where applicable):

| Event prefix | When | Useful aggregates |
| --- | --- | --- |
| `[:scrypath, :search]` | Common-path search | p95/p99 duration, error rate, `hit_count` from stop metadata |
| `[:scrypath, :hydration]` | Repo batch load after search | Duration vs `hit_count` / `record_count` / `missing_count` (drift indicator when `missing_count` grows) |
| `[:scrypath, :sync, :upsert]` / `[:scrypath, :sync, :delete]` | Document sync | Error rate, `document_count`, `:noop` ratio (noisy if alerted per call) |
| `[:scrypath, :meilisearch, :request]` | HTTP to Meilisearch | `status_code`, `method`, path pattern; alert on sustained 5xx / connection errors |
| `[:scrypath, :meilisearch, :task_wait]` | Waiting for Meilisearch task completion | `poll_count`, `final_status`; watch for large `poll_count` or non-`:succeeded` trends |
| `[:scrypath, :reindex, :settings_verified]` | Post-apply settings read-back | Stop metadata result tag (`:parity`, `:drift`, etc.) |
| `[:scrypath, :reindex, :verify_skipped]` | Execute only; settings verify skipped by opt | Rare; spikes may be intentional deploys, so correlate with logs |
| `[:scrypath, :operator, :failed_work, :observed]` | Each failed-work row materialized from backend tasks or Oban jobs | Useful for dashboards and structured logs; high volume on noisy data. Do not page on every event; aggregate as a diagnostic signal |
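The event prefixes above can be consumed with a standard `:telemetry.attach_many/4` handler. A minimal sketch, assuming the handler module name and the `MyApp.Metrics` sink are placeholders in your app (not Scrypath APIs):

```elixir
defmodule MyApp.ScrypathTelemetry do
  # Hypothetical handler module; attach once at application start.
  # Event names come from the table above; stop events carry a
  # :duration measurement per :telemetry.span/3 conventions.
  def attach do
    :telemetry.attach_many(
      "scrypath-ops-handler",
      [
        [:scrypath, :search, :stop],
        [:scrypath, :search, :exception],
        [:scrypath, :hydration, :stop],
        [:scrypath, :meilisearch, :request, :stop]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:scrypath, :search, :stop], %{duration: duration}, meta, _config) do
    # Keep tags low cardinality: schema/backend/index only,
    # never raw query text or primary keys.
    ms = System.convert_time_unit(duration, :native, :millisecond)
    MyApp.Metrics.histogram("scrypath.search.duration_ms", ms,
      tags: Map.take(meta, [:schema, :index])
    )
  end

  def handle_event(_event, _measurements, _meta, _config), do: :ok
end
```

The catch-all final clause keeps the handler from crashing (and being detached) when an event you subscribed to carries a shape you did not match.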

Dashboard-first: sync upsert volume, search QPS, hydration missing_count distribution, Meilisearch request latency.
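If you use the `telemetry_metrics` package with a reporter (Prometheus, StatsD, LiveDashboard), those dashboard signals can be declared directly from the documented events. A sketch, assuming `missing_count` is exposed on the hydration stop event as described above:

```elixir
import Telemetry.Metrics

# Hypothetical metrics list for your reporter; metric names are derived
# from the Scrypath event prefixes, tag choices are yours.
def scrypath_metrics do
  [
    # Search QPS (event count) and latency distribution.
    counter("scrypath.search.stop.duration"),
    distribution("scrypath.search.stop.duration",
      unit: {:native, :millisecond}, tags: [:schema]),
    # Hydration drift: missing_count growing over time signals
    # index/DB divergence.
    summary("scrypath.hydration.stop.missing_count", tags: [:schema]),
    # Meilisearch HTTP latency, split by status code.
    distribution("scrypath.meilisearch.request.stop.duration",
      unit: {:native, :millisecond}, tags: [:status_code])
  ]
end
```

Wire the list into your reporter's `:metrics` option; keep alert rules on the aggregated series, not on individual events.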

Meilisearch infrastructure (minimal signals)

Prioritize signals that predict outage, data loss risk, or unbounded backlog. Exact metric names depend on your exporter (Prometheus sidecar, cloud vendor agent, or logs). Map these concepts to your stack:

  1. Process up / ready — HTTP GET /health (or vendor equivalent) from the same network path as the app. Page when unreachable for longer than a short window (e.g. two failed checks), not on single blips.
  2. Disk free — Meilisearch persists indexes; running out of disk is a top cause of corruption and wedged tasks. Alert on free space percentage or absolute GB with headroom for compactions and reindexes.
  3. Memory pressure — Large batches and concurrent indexing drive RSS. Page on OOM kills or sustained memory limit pressure from your orchestrator, not one-off spikes during planned reindex.
  4. Task failures — Meilisearch indexes work through a task queue. Sustained failed tasks (not every transient validation error) indicate a bad deploy, schema mismatch, or upstream bug. Prefer a rate or count over a window, not every single failure.
  5. Replication / multi-node (if used) — split brain or lag between nodes is a separate product surface; follow Meilisearch’s own HA docs for your version.
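Signals 1-4 can be expressed as long-window alert rules. A Prometheus sketch; the metric names (`probe_success`, `node_filesystem_*`) and the mount point are assumptions that depend on your exporter, so map them to your stack:

```yaml
groups:
  - name: meilisearch
    rules:
      - alert: MeilisearchDown
        # Blackbox probe of GET /health from the app's network path.
        expr: probe_success{job="meilisearch-health"} == 0
        for: 2m   # roughly two failed checks, not a single blip
      - alert: MeilisearchDiskLow
        # Keep headroom for compactions and reindex doubling.
        expr: >
          node_filesystem_avail_bytes{mountpoint="/meili_data"}
          / node_filesystem_size_bytes{mountpoint="/meili_data"} < 0.20
        for: 10m
```

Memory pressure and task-failure rates follow the same pattern: sustained conditions with a `for:` window, never instantaneous spikes.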

Avoid alert fatigue: do not page on single slow searches, one failed document in a batch, or Meilisearch 202 Accepted enqueue latency alone. Those belong on dashboards or SLO burn-rate rules with long windows.

Footguns (Meilisearch + Scrypath-shaped)

  • `filter` and `facetFilters` AND together — Users can think they cleared facets while a base filter still narrows results. Document in your UI and ops playbooks; see the faceted search guide appendix.
  • Reindex + disk — Full reindex can temporarily double index footprint until old data is dropped. Plan disk headroom before Scrypath.reindex/2 on large corpora.
  • Settings verify skipped — `skip_settings_verification?: true` speeds emergencies but hides drift until the next verify. Treat as a temporary flag; do not leave it on silently.
  • Sync mode semantics — `:oban` means durable enqueue, not “search is updated.” Paging on queue depth without checking search visibility misdiagnoses user impact; see sync modes guide.
  • Version skew — Meilisearch minor versions change task and index behavior. Pin server and client expectations per environment; roll upgrades in a canary before production.
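For the version-skew footgun, a canary step can compare the server's reported version against what the environment expects. A sketch assuming the Req HTTP client and hypothetical environment variables; `GET /version` returning `pkgVersion` is Meilisearch's own endpoint:

```elixir
# Fail fast in a canary/deploy check when server version drifts
# from the pinned expectation for this environment.
expected = System.fetch_env!("MEILI_EXPECTED_VERSION")
url = System.fetch_env!("MEILI_URL")

%{"pkgVersion" => actual} =
  Req.get!("#{url}/version",
    auth: {:bearer, System.fetch_env!("MEILI_API_KEY")}
  ).body

if actual != expected do
  raise "Meilisearch version skew: expected #{expected}, got #{actual}"
end
```

Run this per environment before rolling the app, so client expectations and server behavior move together.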

What to run before you tune alerts

From the repo root (maintainer checks):