Search backend operations (SRE view)

This guide is for platform and SRE maintainers who run the search process alongside a Phoenix app using Scrypath. It complements Sync Modes and Visibility (application semantics) and Operator Support (library maintainer first response).

Scope

  • Scrypath v1 publicly targets Meilisearch first. Runbooks here assume a Meilisearch cluster or single node your app reaches over HTTP. Other engines need different metrics and failure modes; Scrypath does not abstract them on the public path.
  • Goal: a small set of metrics and alerts so teams notice user-visible failure and capacity risk without paging on normal variance.

Two layers: app vs search engine

| Layer | What you own | What breaks first when users complain |
| --- | --- | --- |
| Application | Scrypath sync/search/hydration paths, Oban queues, DB | Wrong or stale results, timeouts, 5xx from your app |
| Meilisearch | Process health, disk, RAM, version, task pipeline | Search down, writes stuck, index corruption risk under disk pressure |

Instrument both. Do not page only on Meilisearch CPU; pair with Scrypath search error rate and end-to-end latency from the app.

Scrypath telemetry (application signals)

Scrypath emits :telemetry.span/3-style events (see Telemetry for start / stop / exception and duration measurements on stop). Keep dashboards low cardinality: use schema, backend, index, sync_mode — avoid high-cardinality tags such as raw query text or primary keys on alert rules.

Stable event prefixes (each has :start, :stop, and on failure :exception where applicable):

| Event prefix | When | Useful aggregates |
| --- | --- | --- |
| `[:scrypath, :search]` | Common-path search | p95/p99 duration, error rate, `hit_count` from stop metadata |
| `[:scrypath, :hydration]` | Repo batch load after search | Duration vs `hit_count` / `record_count` / `missing_count` (drift indicator when `missing_count` grows) |
| `[:scrypath, :sync, :upsert]` / `[:scrypath, :sync, :delete]` | Document sync | Error rate, `document_count`, `:noop` ratio (noisy if alerted per call) |
| `[:scrypath, :meilisearch, :request]` | HTTP to Meilisearch | `status_code`, `method`, path pattern; alert on sustained 5xx / connection errors |
| `[:scrypath, :meilisearch, :task_wait]` | Waiting for Meilisearch task completion | `poll_count`, `final_status`; watch for large `poll_count` or non-`:succeeded` trends |
| `[:scrypath, :reindex, :settings_verified]` | Post-apply settings read-back | Stop metadata result tag (`:parity`, `:drift`, etc.) |
| `[:scrypath, :reindex, :verify_skipped]` | Execute only; settings verify skipped by opt | Rare; spikes may be intentional deploys, so correlate with logs |
| `[:scrypath, :operator, :failed_work, :observed]` | Each failed-work row materialized from backend tasks or Oban jobs | Useful for dashboards and structured logs; high volume on noisy data. Do not page on every event; aggregate as a diagnostic signal |
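The event prefixes above can be consumed with a standard `:telemetry.attach_many/4` handler. A minimal sketch, assuming the handler module name and the `MyApp.Metrics` sink are placeholders in your app (not Scrypath APIs):

```elixir
defmodule MyApp.ScrypathTelemetry do
  # Hypothetical handler module; attach once at application start.
  # Event names come from the table above; stop events carry a
  # :duration measurement per :telemetry.span/3 conventions.
  def attach do
    :telemetry.attach_many(
      "scrypath-ops-handler",
      [
        [:scrypath, :search, :stop],
        [:scrypath, :search, :exception],
        [:scrypath, :hydration, :stop],
        [:scrypath, :meilisearch, :request, :stop]
      ],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:scrypath, :search, :stop], %{duration: duration}, meta, _config) do
    # Keep tags low cardinality: schema/backend/index only,
    # never raw query text or primary keys.
    ms = System.convert_time_unit(duration, :native, :millisecond)
    MyApp.Metrics.histogram("scrypath.search.duration_ms", ms,
      tags: Map.take(meta, [:schema, :index])
    )
  end

  def handle_event(_event, _measurements, _meta, _config), do: :ok
end
```

The catch-all final clause keeps the handler from crashing (and being detached) when an event you subscribed to carries a shape you did not match.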

Dashboard-first: sync upsert volume, search QPS, hydration missing_count distribution, Meilisearch request latency.
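If you use the `telemetry_metrics` package with a reporter (Prometheus, StatsD, LiveDashboard), those dashboard signals can be declared directly from the documented events. A sketch, assuming `missing_count` is exposed on the hydration stop event as described above:

```elixir
import Telemetry.Metrics

# Hypothetical metrics list for your reporter; metric names are derived
# from the Scrypath event prefixes, tag choices are yours.
def scrypath_metrics do
  [
    # Search QPS (event count) and latency distribution.
    counter("scrypath.search.stop.duration"),
    distribution("scrypath.search.stop.duration",
      unit: {:native, :millisecond}, tags: [:schema]),
    # Hydration drift: missing_count growing over time signals
    # index/DB divergence.
    summary("scrypath.hydration.stop.missing_count", tags: [:schema]),
    # Meilisearch HTTP latency, split by status code.
    distribution("scrypath.meilisearch.request.stop.duration",
      unit: {:native, :millisecond}, tags: [:status_code])
  ]
end
```

Wire the list into your reporter's `:metrics` option; keep alert rules on the aggregated series, not on individual events.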

Meilisearch infrastructure (minimal signals)

Prioritize signals that predict outage, data loss risk, or unbounded backlog. Exact metric names depend on your exporter (Prometheus sidecar, cloud vendor agent, or logs). Map these concepts to your stack:

  1. Process up / ready — HTTP GET /health (or vendor equivalent) from the same network path as the app. Page when unreachable for longer than a short window (e.g. two failed checks), not on single blips.
  2. Disk free — Meilisearch persists indexes; running out of disk is a top cause of corruption and wedged tasks. Alert on free space percentage or absolute GB with headroom for compactions and reindexes.
  3. Memory pressure — Large batches and concurrent indexing drive RSS. Page on OOM kills or sustained memory limit pressure from your orchestrator, not one-off spikes during planned reindex.
  4. Task failures — Meilisearch indexes work through a task queue. Sustained failed tasks (not every transient validation error) indicate a bad deploy, schema mismatch, or upstream bug. Prefer a rate or count over a window, not every single failure.
  5. Replication / multi-node (if used) — split brain or lag between nodes is a separate product surface; follow Meilisearch’s own HA docs for your version.
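Signals 1-4 can be expressed as long-window alert rules. A Prometheus sketch; the metric names (`probe_success`, `node_filesystem_*`) and the mount point are assumptions that depend on your exporter, so map them to your stack:

```yaml
groups:
  - name: meilisearch
    rules:
      - alert: MeilisearchDown
        # Blackbox probe of GET /health from the app's network path.
        expr: probe_success{job="meilisearch-health"} == 0
        for: 2m   # roughly two failed checks, not a single blip
      - alert: MeilisearchDiskLow
        # Keep headroom for compactions and reindex doubling.
        expr: >
          node_filesystem_avail_bytes{mountpoint="/meili_data"}
          / node_filesystem_size_bytes{mountpoint="/meili_data"} < 0.20
        for: 10m
```

Memory pressure and task-failure rates follow the same pattern: sustained conditions with a `for:` window, never instantaneous spikes.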

Avoid alert fatigue: do not page on single slow searches, one failed document in a batch, or Meilisearch 202 Accepted enqueue latency alone. Those belong on dashboards or SLO burn-rate rules with long windows.

Footguns (Meilisearch + Scrypath-shaped)

  • `filter` and `facetFilters` AND together — Users can think they cleared facets while a base filter still narrows results. Document in your UI and ops playbooks; see the faceted search guide appendix.
  • Reindex + disk — Full reindex can temporarily double index footprint until old data is dropped. Plan disk headroom before Scrypath.reindex/2 on large corpora.
  • Settings verify skipped — `skip_settings_verification?: true` speeds emergencies but hides drift until the next verify. Treat as a temporary flag; do not leave it on silently.
  • Sync mode semantics — `:oban` means durable enqueue, not “search is updated.” Paging on queue depth without checking search visibility misdiagnoses user impact; see sync modes guide.
  • Version skew — Meilisearch minor versions change task and index behavior. Pin server and client expectations per environment; roll upgrades in a canary before production.
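For the version-skew footgun, a canary step can compare the server's reported version against what the environment expects. A sketch assuming the Req HTTP client and hypothetical environment variables; `GET /version` returning `pkgVersion` is Meilisearch's own endpoint:

```elixir
# Fail fast in a canary/deploy check when server version drifts
# from the pinned expectation for this environment.
expected = System.fetch_env!("MEILI_EXPECTED_VERSION")
url = System.fetch_env!("MEILI_URL")

%{"pkgVersion" => actual} =
  Req.get!("#{url}/version",
    auth: {:bearer, System.fetch_env!("MEILI_API_KEY")}
  ).body

if actual != expected do
  raise "Meilisearch version skew: expected #{expected}, got #{actual}"
end
```

Run this per environment before rolling the app, so client expectations and server behavior move together.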

What to run before you tune alerts

From the repo root (maintainer checks):