Changelog

All notable changes to this project will be documented in this file.

[0.3.4] - 2025-12-26

Fixed

  • Future polling: added opt-in, configurable backoff for 408/5xx responses.
  • TrainingClient: fixed background task monitoring so that crashes surface and the mailbox does not bloat.
  • Circuit breaker registry: fixed a concurrency bug that could lose updates under concurrent failures.
  • SamplingDispatch/RetrySemaphore: busy loops now use tunable, jittered exponential backoff.
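
As a rough illustration of replacing a busy loop with jittered exponential backoff, the sketch below retries an acquire function and sleeps a random, growing interval between attempts. The module name, option names, and defaults are hypothetical and are not taken from the Tinkex source.

```elixir
defmodule AcquireSketch do
  # Hypothetical helper, not part of Tinkex: retries `try_acquire` with
  # jittered exponential backoff instead of spinning in a tight loop.
  def acquire(try_acquire, opts \\ []) when is_function(try_acquire, 0) do
    base = Keyword.get(opts, :base_ms, 2)
    cap = Keyword.get(opts, :max_ms, 50)
    max_attempts = Keyword.get(opts, :max_attempts, 10)
    # The sleep function is injectable so tests do not have to wait.
    sleep = Keyword.get(opts, :sleep, &Process.sleep/1)
    loop(try_acquire, 0, base, cap, max_attempts, sleep)
  end

  defp loop(_fun, attempt, _base, _cap, max_attempts, _sleep)
       when attempt >= max_attempts,
       do: {:error, :busy}

  defp loop(fun, attempt, base, cap, max_attempts, sleep) do
    if fun.() do
      {:ok, attempt}
    else
      # Cap the exponent so the power never grows without bound,
      # then cap the resulting delay as well.
      ceiling = max(min(base * Integer.pow(2, min(attempt, 16)), cap), 1)
      # Jitter: pick a uniform integer delay in 1..ceiling.
      sleep.(:rand.uniform(ceiling))
      loop(fun, attempt + 1, base, cap, max_attempts, sleep)
    end
  end
end
```

Randomising the delay spreads out callers that failed at the same moment, which is the usual reason to add jitter on top of plain exponential growth.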

[0.3.3] - 2025-12-26

Added

  • Streaming Sampling: New SamplingClient.sample_stream/4 function for real-time token streaming via Server-Sent Events (SSE)

    • Added SampleStreamChunk type for incremental token processing
    • Added API.Sampling.sample_stream/2 endpoint implementation
    • Stream supports token-by-token enumeration with finish reasons and metadata
  • OpenTelemetry Integration: Opt-in W3C Trace Context propagation for distributed tracing

    • Added Telemetry.Otel module for traceparent/tracestate header injection
    • Enable with otel_propagate: true in config or TINKEX_OTEL_PROPAGATE=true environment variable
    • Gracefully degrades when OpenTelemetry packages not installed
  • Circuit Breaker: Per-endpoint circuit breaker pattern for resilient API calls

    • Added CircuitBreaker module with three states: closed, open, half-open
    • Added CircuitBreaker.Registry for ETS-based multi-process circuit breaker management
    • Configurable failure thresholds and reset timeouts
    • Automatic failure detection and recovery
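
The three circuit breaker states listed above can be pictured as a small pure state machine. This is a minimal sketch under assumed names and defaults (threshold of 3 failures, 1s reset timeout); it does not reproduce the CircuitBreaker module or its ETS-backed registry.

```elixir
defmodule BreakerSketch do
  # Hypothetical struct and defaults, for illustration only.
  defstruct state: :closed, failures: 0, opened_at: nil, threshold: 3, reset_ms: 1_000

  def new(opts \\ []), do: struct!(__MODULE__, opts)

  # Called before each request. An open breaker rejects calls until the
  # reset timeout has elapsed, then lets one probe through as half-open.
  def before_call(%__MODULE__{state: :open} = b, now_ms) do
    if now_ms - b.opened_at >= b.reset_ms do
      {:ok, %{b | state: :half_open}}
    else
      {:rejected, b}
    end
  end

  def before_call(%__MODULE__{} = b, _now_ms), do: {:ok, b}

  # Any success closes the breaker and clears the failure count.
  def record_success(%__MODULE__{} = b),
    do: %{b | state: :closed, failures: 0, opened_at: nil}

  # A failed probe in half-open reopens immediately.
  def record_failure(%__MODULE__{state: :half_open} = b, now_ms),
    do: %{b | state: :open, opened_at: now_ms}

  # In closed state, open once the failure threshold is reached.
  def record_failure(%__MODULE__{} = b, now_ms) do
    failures = b.failures + 1

    if failures >= b.threshold do
      %{b | state: :open, failures: failures, opened_at: now_ms}
    else
      %{b | failures: failures}
    end
  end
end
```

Passing the current time in as an argument keeps the transitions deterministic, which makes the open to half-open timeout easy to test.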

Changed

  • Future Polling (Python SDK Parity): Polling loop now handles transient errors internally instead of delegating to HTTP retry layer
    • Continues polling on 408 (Request Timeout) and 5xx (Server Errors) until poll_timeout
    • Retries connection errors with exponential backoff
    • Passes max_retries: 0 to HTTP layer; polling loop manages all retry logic
    • Fixed backoff overflow by capping iteration exponent
  • Polling Defaults: Future polling now defaults to a 45s per-request HTTP timeout (Python parity) and sampling futures inherit that default unless explicitly overridden.
  • Retry Config Parity: Tinkex.API.RetryConfig now defaults to max_retries: 10 to match the Python SDK.
  • Telemetry Metadata: config.user_metadata is now merged into HTTP + Future telemetry metadata; per-request telemetry_metadata still overrides on conflicts.
  • Examples/Docs: Task await samples now default to :infinity in examples and README, matching the cookbook’s no-timeout behavior.
  • Examples: examples/run_all.sh now logs timestamps and durations for each script run.
  • Logo Update: Redesigned project logo with modern coral/salmon gradient aesthetic, neural network iconography, and hexagonal frame
  • Code Quality: Extensive refactoring across 50+ files for improved readability and maintainability
    • Simplified conditional logic throughout codebase
    • Extracted helper functions to reduce complexity
    • Improved error handling patterns
    • Reduced cyclomatic complexity in high-traffic modules

Documentation

  • Added comprehensive implementation guide for fresh agents
  • Added gaps analysis document identifying enhancement opportunities
  • Added current state documentation (v0.3.2 baseline)
  • Added model registry planning document (deferred feature)
  • Added test infrastructure overhaul/refactor plan with verification checklist

Tests

  • Added test/tinkex/future_test.exs for Future polling behavior (408/5xx/connection error handling, Python SDK parity)
  • Refactored async test suite with Supertester v0.4 isolation (telemetry/logger/ETS) and Bypass-based HTTP cases to eliminate flakiness

[0.3.2] - 2025-12-15

  • Breaking: Tinkex.Config now requires api_key to start with the tml- prefix (Python SDK parity); tests and guides updated accordingly.
  • Fix: validate loss_fn and normalize/validate loss_fn_config client-side for TrainingClient.forward/4 and TrainingClient.forward_backward/4, matching Python SDK expectations and preventing avoidable server 500s.

[0.3.1] - 2025-12-13

  • Examples: multimodal sampling example now prefers Qwen3-VL models when available (Qwen/Qwen3-VL-30B-A3B-Instruct, Qwen/Qwen3-VL-235B-A22B-Instruct).
  • Examples: multimodal sampling example uses examples/assets/vision_sample.png by default and supports TINKER_IMAGE_PATH / TINKER_IMAGE_EXPECTED_TOKENS.
  • Version reporting: session sdk_version, telemetry sdk_version, and x-stainless-package-version are pinned to the official Python Tinker SDK version configured in mix.exs (removed TINKEX_SDK_VERSION/sdk_version overrides).
  • Fix: HuggingFace tokenizer downloads no longer emit :httpc notices about autoredirect.

[0.3.0] - 2025-12-12

Added

  • Kimi K2 tokenization support via tiktoken_ex (loads tiktoken.model + tokenizer_config.json from HuggingFace and caches the encoding).
  • New live example: examples/kimi_k2_sampling_live.exs (end-to-end sampling with Kimi K2 when available).
  • New guide: docs/guides/kimi_k2_tokenization.md.

Changed

  • EXLA is optional and is not started automatically; docs now show starting :exla before setting Nx.default_backend/1.

Fixed

  • Tokenizer downloads are escript-safe by using OTP-provided CA certs for HuggingFace requests.

[0.2.2] - 2025-12-08

Fixed

  • Recovery ServiceStub now isolates state per service_pid instead of using a shared persistent_term map, so concurrent tests cannot erase each other's failure counts or recipients; the executor/monitor recovery tests are no longer flaky.

[0.2.1] - 2025-12-07

Changed

  • AdamParams now includes weight_decay and grad_clip_norm fields with validation and JSON encoding to match Python SDK defaults.
  • Queue backpressure propagates queue_state_reason through TryAgainResponse/Future telemetry and observers, preferring server-supplied reasons in logs with updated human-readable messages.
  • Training chunking uses a shared byte estimator (images/assets/text/tensors) with 1024-item / 5MB caps instead of count-based limits.
  • Sampling dispatch adds byte-aware throttling (layered semaphores, 20× penalties after recent backoff) with size-based 429 backoffs (1s for ≤128KB, 5s otherwise).
  • Future polling now emits telemetry for timeouts and API/connection/request failures, and treats HTTP 410 “promise expired” responses as retryable with clearer messaging.
  • Retry semaphores support caller-provided keys; SamplingClient scopes retry capacity per session instead of sharing by limit value.
  • Sampling queue-state debounce entries can be cleared (clear_queue_state_debounce/1) and are cleaned up on client termination to avoid persistent_term growth.
  • New live examples: adam_and_chunking_live.exs (AdamParams + byte chunking) and queue_reasons_and_sampling_throttling.exs (queue reasons + dispatch throttling) added to examples/run_all.sh and README.
  • Major refactor of god files: Split 4 large modules into smaller, focused sub-modules for improved maintainability:
    • Tinkex.CLI (2,155 → 82 lines): Extracted to CLI.Commands.{Checkpoint, Run, Sample, Version}, CLI.Formatting, CLI.Pagination, CLI.Parser
    • Tinkex.TrainingClient (1,762 → 1,031 lines): Extracted to TrainingClient.{Operations, Polling, DataProcessor, Tokenizer, Observer}
    • Tinkex.API (1,115 → 317 lines): Extracted to API.{Request, ResponseHandler, Retry, Headers, URL, Compression}
    • Tinkex.Telemetry.Reporter (784 → 380 lines): Extracted to Reporter.{Queue, Events, ExceptionHandler, Serializer, Backoff}
  • All public APIs remain unchanged (transparent refactor)
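
To make the byte-aware training chunking entry above concrete, here is a minimal sketch that splits a list into chunks bounded by both an item count and a byte budget. The module name and the size callback are hypothetical, and reading "5MB" as 5 × 1024 × 1024 bytes is an assumption.

```elixir
defmodule ChunkSketch do
  # Caps taken from the changelog entry; the exact byte value is assumed.
  @max_items 1024
  @max_bytes 5 * 1024 * 1024

  # `size_fun` estimates the byte size of one item (hypothetical callback).
  def chunk(items, size_fun, max_items \\ @max_items, max_bytes \\ @max_bytes) do
    {chunks, current, _count, _bytes} =
      Enum.reduce(items, {[], [], 0, 0}, fn item, {chunks, current, count, bytes} ->
        size = size_fun.(item)

        # Start a new chunk when adding this item would break either cap.
        # An oversized single item still gets a chunk of its own.
        if current != [] and (count + 1 > max_items or bytes + size > max_bytes) do
          {[Enum.reverse(current) | chunks], [item], 1, size}
        else
          {chunks, [item | current], count + 1, bytes + size}
        end
      end)

    chunks = if current == [], do: chunks, else: [Enum.reverse(current) | chunks]
    Enum.reverse(chunks)
  end
end
```

For example, `ChunkSketch.chunk(data, &byte_size/1)` would apply both caps to a list of binaries.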

[0.2.0] - 2025-12-04

Added

  • Opt-in recovery automation: Tinkex.Recovery.Policy/Monitor/Executor restart corrupted runs from checkpoints with callbacks, telemetry, and capped concurrency (disabled by default).
  • Recovery telemetry events (:detected, :started, :checkpoint_selected, :client_created, :completed, :failed, :exhausted) and config wiring (Config.recovery) for reuse in supervisors.
  • NxPenalties-backed regularizer adapters (L1, L2, Elastic Net, Entropy, KL divergence, Consistency, Orthogonality, Gradient Penalty) under Tinkex.Regularizers.
  • Structured regularizer examples (offline + live) exercise all adapters with reference/pair data wiring, KL direction/symmetric variants, and entropy temperature scaling.

Changed

  • Regularizers.KLDivergence now forwards NxPenalties :direction (:forward/:reverse) and :symmetric options and surfaces the chosen mode in metrics.
  • Regularizers.Entropy now accepts a :temperature option for entropy scaling.
  • Checkpoint timestamps now normalize to DateTime when ISO-8601 strings are returned (original strings are preserved on parse failure).
  • README and guides updated for 0.2.0 install and the new regularizer examples.
  • Dependency note: uses nx_penalties ~> 0.1.2 for tensor primitives.

[0.1.20] - 2025-12-03

Changed

  • Default parity: Config now defaults to Python parity (timeout: 60_000, max_retries: 10); use parity_mode: :beam or TINKEX_PARITY=beam to opt back into the 120s/2-retry defaults.
  • Datum dtype parity: loss_fn_inputs lists use the Python _key_to_type map (target_tokens → int64; weights/advantages/logprobs/clip_* → float32) and raise on list values under unknown keys; typed tensors remain allowed for custom keys.
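
The key-based dtype rule can be sketched as a small lookup that raises for list values under unknown keys. The module name is hypothetical and `"clip_low"` is only a stand-in for whatever `clip_*` keys exist.

```elixir
defmodule DtypeSketch do
  # Mirrors the mapping described in the entry above; illustration only.
  def dtype_for("target_tokens"), do: :int64
  def dtype_for(key) when key in ["weights", "advantages", "logprobs"], do: :float32
  # Any key starting with "clip_" is treated as float32.
  def dtype_for("clip_" <> _rest), do: :float32

  # Plain lists under unknown keys are rejected instead of being guessed.
  def dtype_for(key) do
    raise ArgumentError, "cannot infer dtype for list under unknown key #{inspect(key)}"
  end
end
```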

Added

  • Tests covering the key-based dtype map and error behavior for unknown list keys; docs/README bumped to 0.1.20.

[0.1.19] - 2025-12-03

Added

  • Metric reducer parity: hash_unordered reducer in Tinkex.MetricsReduction for order-insensitive hash aggregation, matching Python's chunked metrics behavior.
  • ModelInput builders: empty/0, append/2, and append_int/2 helpers for building ModelInput incrementally. append_int/2 is token-aware: extends the last EncodedTextChunk when possible, otherwise creates a new chunk.
  • TensorData.tolist/1: Returns the flat data list for API parity with Python's TensorData.tolist().
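
The token-aware behaviour of append_int/2 can be illustrated with plain maps standing in for the real ModelInput and EncodedTextChunk structs. The shapes and the module name below are assumptions, not the Tinkex types.

```elixir
defmodule ModelInputSketch do
  # Plain maps stand in for ModelInput and its chunk structs.
  def empty, do: %{chunks: []}

  def append(%{chunks: chunks} = input, chunk), do: %{input | chunks: chunks ++ [chunk]}

  def append_int(%{chunks: chunks} = input, token) when is_integer(token) do
    case List.last(chunks) do
      # Last chunk is encoded text: extend it in place.
      %{type: :encoded_text, tokens: tokens} = last ->
        updated = %{last | tokens: tokens ++ [token]}
        %{input | chunks: List.replace_at(chunks, -1, updated)}

      # Empty input or a non-text last chunk: start a new text chunk.
      _ ->
        append(input, %{type: :encoded_text, tokens: [token]})
    end
  end
end
```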

Documentation

  • Gap analysis docs corrected per review: parity matrices now reflect Elixir regularizer/telemetry support, sampling retry_config, Python-only tensor backends, and the remaining reducer/model-input gaps.
  • Captured verification results in docs/20251203/gap_analysis_claude/REVIEW_RESULTS.md and refreshed README/install snippets/prompts to 0.1.19 for consistency.
  • Updated gap analysis to mark hash_unordered reducer and ModelInput builders as resolved.

[0.1.18] - 2025-12-03

Added

  • Public Tinkex.Types.ParsedCheckpointTinkerPath helper exposes structured tinker:// components and user-category errors; REST helpers and CLI checkpoint commands reuse it for consistent validation.

Changed

  • feature_gates now default to ["async_sampling"] when opts/app/env do not supply a value (Python parity); env snapshots/config defaults reflect the new baseline while keeping opts > app config > env precedence.
  • Environment and checkpoint guides note the default gate and the new path parser; README/install snippets bump to 0.1.18.

[0.1.17] - 2025-12-03

Added

  • Log-level parity: TINKER_LOG / config :tinkex, :log_level now set Logger at application start (Python parity) while keeping HTTP header dumps redacted unless explicitly enabled.
  • Telemetry capture: Service/Training/Sampling client entrypoints wrap exceptions through Tinkex.Telemetry.Capture/Reporter and emit session-end events on fatal paths.
  • Config knobs: Added default_headers, default_query, custom http_client/http_pool overrides, and env keys (TINKEX_DEFAULT_HEADERS, TINKEX_DEFAULT_QUERY, TINKEX_HTTP_CLIENT, TINKEX_HTTP_POOL) with opts > app config > env precedence and secret redaction for headers.

Changed

  • Session heartbeats pin a 10_000ms timeout with max_retries: 0 for Python parity while preserving debounce/warning semantics.
  • Finch pool isolation now routes per pool type (session/training/sampling/futures/telemetry) with Python-aligned sizes and deterministic pool names derived from http_pool + base_url.

[0.1.16] - 2025-12-03

Added

  • CLI parity for checkpoint/run management: --format json/--json output with total/shown counts, checkpoint list run filters, --limit 0 fetch-all semantics, and pagination progress indicators for lists.
  • Checkpoint info now surfaces checkpoint type, size, visibility, timestamps, and training run IDs alongside base model/LoRA metadata; run list/info surface owner, LoRA rank, statuses, last checkpoints, and user metadata in both table and JSON formats.

Changed

  • Checkpoint pagination cursors now use the typed Tinkex.Types.Cursor struct; CLI table outputs include the richer checkpoint/run fields.
  • Examples and docs updated for new CLI flags and cursor accessors; version bumped to 0.1.16 in mix/README/getting_started.

[0.1.15] - 2025-12-03

Added

  • Checkpoint archive responses now surface expires; CheckpointArchiveUrlResponse parses timestamps and Rest/RestClient expose ID-based archive/delete helpers alongside tinker-path variants.
  • Sampling backpressure parity: dispatch semaphore (default 400) gates sampling dispatch and 429s without Retry-After trigger a 1s shared backoff.

Changed

  • list_user_checkpoints/2 now defaults to limit: 100 (Python parity).
  • RetryConfig default max_connections raised to 1000 to match Python connection limits.
  • ServiceClient validation requires base_model or model_path for sampling clients and at least one of train_mlp/train_attn/train_unembed when creating training clients.

Documentation

  • README/version bump; checkpoint management guide covers archive expirations + ID helpers; retry guide reflects the new max_connections default.

[0.1.14] - 2025-12-02

Added

  • ServiceClient.create_training_client_from_state_with_optimizer/3 (+ async) for weight+optimizer restore with documented weights-only default on the existing helper.
  • CLI checkpoint delete now accepts multiple tinker:// paths with upfront validation, a single confirmation prompt (--yes to skip), per-path progress output, and aggregated failure reporting.
  • New examples:
    • examples/multimodal_resume_and_cleanup.exs (multimodal with expected_tokens, optimizer resume helper, multi-delete CLI usage)
    • examples/checkpoint_multi_delete_live.exs (creates two checkpoints and deletes both live with one CLI call)
    • examples/llama3_tokenizer_override_live.exs (live Llama-3 sampling plus encode/decode using the override tokenizer)
  • DELETE HTTP responses now accept empty bodies without raising JSON decode errors, fixing CLI multi-delete against endpoints that return 204/empty responses.
  • Retry handler no longer trips progress timeout before the first attempt (attempt 0 is allowed to run).

Changed

  • Image chunks: removed height/width/tokens fields from ImageChunk and ImageAssetPointerChunk, added expected_tokens, and made .length/1 raise when missing; JSON encoders now match Python. ModelInput length follows the new contract and training batching counts images by base64/location length to avoid missing expected tokens.
  • Retries: default progress timeout increased to 120 minutes and retries now run until that timeout (no attempt cap by default); RetryConfig/RetryHandler defaults and docs updated.
  • Tokenizer override: Llama-3 models now use thinkingmachineslabinc/meta-llama-3-tokenizer to avoid gated downloads.
  • Documentation updates for CLI multi-delete, optimizer resume helper, retry defaults, expected_tokens schema, and new example references.

[0.1.13] - 2025-12-01

Documentation release: clarifies project status and attribution.

Documentation

  • Added disclaimer clarifying Tinkex is an independent, community-maintained project not affiliated with or endorsed by Thinking Machines Lab.
  • Added links to official Thinking Machines Lab homepage and Tinker documentation.
  • Improved intro section clarity around Tinker's provenance.

[0.1.12] - 2025-11-27

Custom loss training now mirrors the Python SDK while adding multipart uploads, proxy-aware HTTP, streaming checkpoint downloads, queue observers, and richer capabilities metadata.

Added

  • Custom loss training parity: TrainingClient.forward_backward_custom/4 preserves per-datum logprobs as Nx tensors, computes gradients locally, sends them back as synthetic weights so the backend trains, and returns ForwardBackwardOutput ready for optim_step/2.
  • Structured capabilities metadata: Capabilities responses expose SupportedModel structs with model_id, model_name, and arch; model_names/1 offers backward-compatible name extraction and the capabilities example now prints the full metadata.
  • Multipart uploads: Tinkex.API.post/3 normalizes file inputs, flattens nested bodies into bracketed form fields, generates multipart boundaries, and sets Content-Type automatically when :files are present. Added path-based upload helpers and a runnable multipart demo.
  • Queue observers: Sampling and training clients register queue state observers that integrate with Future.poll/2, emitting debounced warnings with human-readable reasons for rate limiting or capacity throttling, aligned with the Python SDK.
  • Proxy-aware HTTP: Tinkex.Config now wires proxy settings through Finch pool configuration, supporting tuple proxies plus TINKEX_PROXY and TINKEX_PROXY_HEADERS while masking credentials in inspect/log output.
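
As an illustration of flattening nested bodies into bracketed form fields for multipart uploads, the sketch below turns a nested map into a sorted list of `{field, value}` pairs. The module name is hypothetical, and encoding list elements with an empty `[]` suffix is an assumed convention that the changelog does not specify.

```elixir
defmodule FormFlattenSketch do
  # Flattens plain maps and lists only; structs are out of scope here.
  def flatten(body) when is_map(body) do
    body
    |> Enum.flat_map(fn {k, v} -> flatten_value(to_string(k), v) end)
    |> Enum.sort()
  end

  # Nested maps extend the field name with [key].
  defp flatten_value(prefix, value) when is_map(value) do
    Enum.flat_map(value, fn {k, v} -> flatten_value("#{prefix}[#{k}]", v) end)
  end

  # Lists repeat the field name with an empty [] suffix (assumed convention).
  defp flatten_value(prefix, value) when is_list(value) do
    Enum.flat_map(value, fn v -> flatten_value("#{prefix}[]", v) end)
  end

  # Scalars become string form values.
  defp flatten_value(prefix, value), do: [{prefix, to_string(value)}]
end
```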

Changed

  • Checkpoint downloads: Archive downloads stream via Finch instead of loading into memory, maintaining O(1) memory usage while preserving progress callbacks and extraction semantics.

Documentation

  • Guides and examples refreshed for custom loss training, streaming checkpoint downloads, multipart uploads, async client creation, and the richer server capabilities response.

[0.1.11] - 2025-11-27

Achieves full behavioral parity with Python SDK (tinker v1.x) across retry semantics, HTTP connection pooling, and missing type definitions.

Breaking Changes

  • ServiceClient.create_lora_training_client/2/3: base_model is now a required second argument instead of being passed in opts:

    # Before (0.1.10)
    create_lora_training_client(service, base_model: "meta-llama/Llama-3.1-8B")
    
    # After (0.1.11)
    create_lora_training_client(service, "meta-llama/Llama-3.1-8B", opts)
  • TrainingClient.save_weights_for_sampler/2/3: name is now a required second argument mapping to the server path field:

    # Before (0.1.10)
    save_weights_for_sampler(training, name: "checkpoint-001")
    
    # After (0.1.11)
    save_weights_for_sampler(training, "checkpoint-001", opts)

Added

Python SDK Retry Parity

  • HTTP retry now matches Python _base_client.py behavior:
    • Retries on 408/409/429/5xx (added 409 Conflict support)
    • Uses Python jitter formula (0.75–1.0 range) instead of ±25%
    • Caps delay at 10s instead of 8s
    • Removes 30s wall-clock timeout; max_retries governs retry attempts
  • New Tinkex.API.RetryConfig module with Python parity formulas
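
The retry rules above can be sketched in a few lines. The module name is hypothetical, and the 0.5s base delay is an assumption borrowed from the 0.1.9 sampling retry entry, since this entry states only the jitter range and the 10s cap.

```elixir
defmodule RetryParitySketch do
  # Base delay is assumed; the cap comes from the entry above.
  @base_ms 500
  @cap_ms 10_000

  # Retry on 408, 409, 429 and any 5xx status.
  def retryable?(status) when status in [408, 409, 429], do: true
  def retryable?(status) when status in 500..599, do: true
  def retryable?(_status), do: false

  # `rand` returns a float in [0.0, 1.0); injectable for deterministic tests.
  def delay_ms(attempt, rand \\ &:rand.uniform/0) when is_integer(attempt) and attempt >= 0 do
    # Exponential growth with a capped exponent, then the delay cap.
    backoff = min(@base_ms * Integer.pow(2, min(attempt, 20)), @cap_ms)
    # Multiplicative jitter factor in the 0.75 to 1.0 range.
    jitter = 1.0 - 0.25 * rand.()
    round(backoff * jitter)
  end
end
```

Unlike a symmetric ±25% jitter, this factor only ever shortens the delay, so the cap is never exceeded.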

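HTTP Pool Parity

  • HTTP pool sizing: pool_size × pool_count = 1000, matching Python max_connections (verified in test/tinkex/pool_config_parity_test.exs).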

Missing Type Structs

  • FutureRetrieveRequest - request type for future polling
  • RequestFailedResponse - structured error response type
  • SessionHeartbeatRequest / SessionHeartbeatResponse - heartbeat wire types
  • TelemetryResponse - telemetry send response type
  • Tinkex.Types.TypeAliases module for ModelInputChunk, LossFnInputs, LossFnOutput union types
  • SupportedModel - structured model metadata with model_id, model_name, and arch fields

Server Capabilities Type Improvement

  • GetServerCapabilitiesResponse.supported_models now returns [SupportedModel.t()] instead of [String.t()]
  • Preserves full model metadata (model_id, model_name, arch) from the API response
  • Added GetServerCapabilitiesResponse.model_names/1 convenience helper for backward-compatible name extraction
  • Backward compatible: parser handles both object and string formats

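API Helpers

  • with_raw_response and with_streaming_response helpers (covered by test/tinkex/api/helpers_test.exs).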

TensorDtype Helpers

  • TensorDtype.from_nx_type/1 now emits warnings for float64→float32 downcast and u64→int64 overflow
  • TensorDtype.from_nx_type_quiet/1 - silent conversion without warnings
  • TensorDtype.check_precision_loss/1 - explicit precision loss checking

Fixed

  • Sampling nil-field fix: SampleRequest JSON encoder now omits nil for optional fields like prompt_logprobs instead of encoding as null (server rejects null). Uses drop_nil?: true in Transform layer.
  • Tokenizer resolution: Now strips /variant suffix from three-part model names; added Kimi K2 tokenizer with pinned revision
  • CLI checkpoint command: Generates default checkpoint names from base_model when --name not provided
  • Example bug fixes:
    • Fixed .samples → .sequences in examples and docs (field was renamed in response type)
    • Fixed prompt.tokens → prompt.chunks[0].tokens access pattern in telemetry_live.exs

Changed

  • Training loop example adds detailed step timing and clearer output formatting

Documentation

  • Updated all guides to use new create_lora_training_client/3 and save_weights_for_sampler/3 signatures
  • Added docs/20251126/gaps_05/gap-analysis-python-to-elixir.md: comprehensive gap analysis showing feature-complete parity with Python SDK
  • Added docs/20251126/gaps_05/remaining_gaps_fix_plan.md: concrete remaining gaps and fixes to reach 100% parity
  • Updated version references from 0.1.10 to 0.1.11 in README and mix.exs

Tests

  • Added test/tinkex/api/retry_parity_test.exs: validates Python retry formula, jitter range (0.75–1.0), 10s delay cap, and 409 retry support
  • Added test/tinkex/pool_config_parity_test.exs: verifies pool_size × pool_count = 1000 matching Python max_connections
  • Added test/tinkex/types/missing_types_parity_test.exs: round-trip tests for new type structs
  • Added test/tinkex/api/helpers_test.exs: with_raw_response and with_streaming_response helpers
  • Updated integration tests for new TrainingClient API signatures

[0.1.10] - 2025-11-27

Added

RestClient Async API

  • All REST methods now have *_async variants returning Task.t() for parallel requests:
    • get_session_async/2, list_sessions_async/2, get_sampler_async/2
    • get_weights_info_by_tinker_path_async/2, list_checkpoints_async/2
    • list_user_checkpoints_async/2, get_checkpoint_archive_url_async/2
    • delete_checkpoint_async/2, get_training_run_async/2
    • get_training_run_by_tinker_path_async/2, list_training_runs_async/2
    • publish_checkpoint_async/2, unpublish_checkpoint_async/2
    • Plus convenience aliases (delete_checkpoint_by_tinker_path_async/2, publish_checkpoint_from_tinker_path_async/2, etc.)

TrainingClient Tokenizer Helpers

  • TrainingClient.get_tokenizer/2 - fetches tokenizer using model info with ETS caching
  • TrainingClient.encode/3 - convenience wrapper for tokenizer encoding from training client
  • TrainingClient.decode/3 - convenience wrapper for tokenizer decoding from training client

Config Parity Mode

  • New parity_mode: :python option to align timeout/retry defaults with Python SDK:
    • Set via options: Tinkex.Config.new(parity_mode: :python)
    • Set via application config: config :tinkex, parity_mode: :python
    • Set via environment variable: TINKEX_PARITY=python
  • Python parity defaults: timeout: 60_000 (1 min), max_retries: 10 (11 total attempts)
  • BEAM-conservative defaults (unchanged): timeout: 120_000 (2 min), max_retries: 2 (3 total attempts)
  • New helper functions: Tinkex.Config.default_timeout/0, Tinkex.Config.default_max_retries/0, Tinkex.Config.python_timeout/0, Tinkex.Config.python_max_retries/0
  • Tinkex.Env.parity_mode/1 - reads TINKEX_PARITY environment variable
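
The two default sets and the "total attempts" arithmetic can be written out as follows. The module and function names are hypothetical; the numbers are the ones listed above. Accepting the string `"beam"` is an assumption based on the later 0.1.20 entry.

```elixir
defmodule ParitySketch do
  # Defaults copied from the entry above; names are illustrative only.
  def defaults(:python), do: %{timeout: 60_000, max_retries: 10}
  def defaults(:beam), do: %{timeout: 120_000, max_retries: 2}

  # One initial attempt plus max_retries retries.
  def total_attempts(mode), do: defaults(mode).max_retries + 1

  # Parses a TINKEX_PARITY-style value; nil or anything unrecognised
  # falls back to the caller-supplied mode.
  def parse_mode(value, fallback) do
    case value && String.downcase(String.trim(value)) do
      "python" -> :python
      "beam" -> :beam
      _ -> fallback
    end
  end
end
```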

Typed Telemetry Events

  • New structs under Tinkex.Types.Telemetry:
    • EventType - enum for SESSION_START, SESSION_END, UNHANDLED_EXCEPTION, GENERIC_EVENT
    • Severity - enum for DEBUG, INFO, WARNING, ERROR, CRITICAL
    • GenericEvent, SessionStartEvent, SessionEndEvent, UnhandledExceptionEvent - typed event structs
    • TelemetryEvent - union type with to_map/1, from_map/1, event_type/1 dispatch helpers
    • TelemetryBatch - batch grouping with to_list/1, from_list/2, size/1
    • TelemetrySendRequest - request structure for API transmission

Changed

Tests

  • Added test/tinkex/types/telemetry_types_test.exs - 53 tests for telemetry type round-trips and conversions
  • Added test/tinkex/rest_client_async_test.exs - 14 tests for async REST client with Bypass
  • Added test/tinkex/training_client_tokenizer_test.exs - 11 tests for tokenizer helper wiring
  • Added test/tinkex/config_parity_test.exs - 10 tests for parity mode configuration
  • Extended test/tinkex/env_test.exs - 4 tests for parity_mode/1

[0.1.9] - 2025-11-26

Added

Env and Configuration Parity

  • Tinkex.Env: Introduced as the single source of truth for environment-driven configuration knobs.
  • Config precedence: Wired Tinkex.Config.new/1 to use opts > app config > env > defaults.
  • New env vars: Support for TINKER_TAGS, TINKER_FEATURE_GATES, TINKER_TELEMETRY, TINKER_LOG, TINKEX_DUMP_HEADERS, CLOUDFLARE_ACCESS_CLIENT_ID, and CLOUDFLARE_ACCESS_CLIENT_SECRET.
  • Secret masking: Added masking for API key and Cloudflare secrets in inspect and HTTP dumps.
  • Config accessors: Exposed tags, feature_gates, telemetry_enabled?, log_level, dump_headers? on Tinkex.Config; documented in environment_configuration.md.
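
The opts > app config > env > defaults precedence can be sketched as a first-match lookup across four sources. Plain maps stand in for the real option, application config, and environment sources, and the module name is hypothetical.

```elixir
defmodule PrecedenceSketch do
  # Returns the first non-nil value for `key`, checking sources in
  # precedence order: opts, app config, env, defaults.
  def resolve(key, opts, app_config, env, defaults) do
    found =
      Enum.find_value([opts, app_config, env, defaults], fn source ->
        case Map.fetch(source, key) do
          # Wrap the value so an explicit `false` is not skipped.
          {:ok, value} when not is_nil(value) -> {:found, value}
          _ -> nil
        end
      end)

    case found do
      {:found, value} -> value
      nil -> nil
    end
  end
end
```

Wrapping the hit in a tuple matters for boolean settings: an explicit `false` in opts should win over a `true` further down the chain.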

Cloudflare Access and Header Redaction

  • CF headers: Inject CF-Access-Client-Id and CF-Access-Client-Secret from config/env into all requests via build_headers/4.
  • Redaction: Redact both x-api-key and cf-access-client-secret in request dump logs.
  • Tests: Added tests asserting Cloudflare headers are present when configured and that secrets never appear in logs.
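
Header redaction of this kind usually amounts to a case-insensitive match on header names before logging. The sketch below assumes headers are `{name, value}` string tuples; the module name and the `"[REDACTED]"` placeholder are hypothetical.

```elixir
defmodule RedactSketch do
  # Header names to mask, compared case-insensitively.
  @secret_headers ~w(x-api-key cf-access-client-secret)

  def redact(headers) do
    Enum.map(headers, fn {name, value} ->
      if String.downcase(name) in @secret_headers do
        {name, "[REDACTED]"}
      else
        {name, value}
      end
    end)
  end
end
```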

Heartbeat Path and SessionManager Robustness

  • Heartbeat alignment: Changed Elixir heartbeat endpoint to POST /api/v1/session_heartbeat (matching Python) instead of /api/v1/heartbeat.
  • Warning threshold: Introduced heartbeat_warning_after_ms (default 120,000 ms); warnings are now emitted when heartbeats have failed for longer than this window.
  • Resilient sessions: Stop silently dropping sessions on 4xx; keep sessions alive and continue retrying while surfacing failures via logs.
  • ETS persistence: Persist session state in a protected ETS table :tinkex_sessions and reload on SessionManager init so restarts preserve heartbeat state.
  • Timer cleanup: Ensure heartbeat timers are tracked and cancelled on terminate.

Sampling Retries, Backpressure, and Connection Limiting

  • RetryConfig: Added Tinkex.RetryConfig with max_retries, base_delay_ms, max_delay_ms, jitter_pct, progress_timeout_ms, max_connections, and enable_retry_logic.
  • SamplingClient integration: Integrated RetryConfig into SamplingClient, allowing per-client configs and a simple keyword shorthand retry_config: [...].
  • RetryHandler: Use RetryHandler.from_config/1 to wrap sampling requests with retry logic matching Python semantics (0.5s base, 10s cap, 25% jitter, long progress timeout).
  • RetrySemaphore: Introduced Tinkex.RetrySemaphore to cap concurrent sampling attempts per client using an ETS-backed semaphore and max_connections.
  • Retry defaults: Removed hardcoded max_retries: 0 in Sampling API; now respects configured max_retries from RetryConfig or opts.
  • Tests: Added tests for RetryConfig, RetryHandler jitter, RetrySemaphore, and integration cases for enabled/disabled retries.

Training Persistence and Checkpoint Workflows

  • LoadWeightsRequest fix: Fixed wire protocol by renaming load_optimizer_state to optimizer to match Python and server expectations.
  • TrainingClient.save_state/3: Implemented to save training checkpoints via /api/v1/save_weights, returning SaveWeightsResponse with tinker:// path.
  • TrainingClient.load_state/3: Implemented to load weights via /api/v1/load_weights.
  • TrainingClient.load_state_with_optimizer/3: Implemented to load weights with optimizer state.
  • ServiceClient.create_training_client_from_state/3: Uses RestClient.get_weights_info_by_tinker_path/2 to derive base_model and LoRA rank, creates a TrainingClient, then loads the checkpoint.
  • SaveWeightsForSamplerResponse: Extended to support both path and sampling_session_id-only responses, matching ephemeral sampler flows.
  • TrainingClient.save_weights_and_get_sampling_client/2: Added to save sampler weights or ephemeral sessions and return a SamplingClient.
  • TrainingClient.save_weights_and_get_sampling_client_sync/2: Synchronous helper variant.
  • Docs & examples: Added training_persistence.md guide and examples training_persistence_live.exs and save_weights_and_sample.exs.

Model Info and Unload Endpoints

  • Types: Added ModelData, GetInfoRequest, GetInfoResponse, UnloadModelRequest, and UnloadModelResponse types.
  • API endpoints: Implemented Tinkex.API.Models.get_info/2 and unload_model/2 on top of /api/v1/get_info and /api/v1/unload_model.
  • TrainingClient.get_info/1: Wired to the typed get_info endpoint, returning GetInfoResponse for tokenizer_id and architecture.
  • TrainingClient.unload_model/1: Wired to the unload_model endpoint with support for both immediate and future-based responses.
  • Example & guide: Added model_info_and_unload.exs example and model_info_unload.md guide.

REST Surface Parity and Tinker-Path Helpers

  • RestClient.get_sampler/2: Exposed on top of Rest.get_sampler/2, returning typed GetSamplerResponse.
  • RestClient.get_weights_info_by_tinker_path/2: Returns WeightsInfoResponse.
  • New helpers:
    • get_training_run_by_tinker_path/2
    • delete_checkpoint_by_tinker_path/2
    • publish_checkpoint_from_tinker_path/2
    • unpublish_checkpoint_from_tinker_path/2
    • get_checkpoint_archive_url_by_tinker_path/2
  • Docs: Extended checkpoint_management.md to use the new helpers.

Telemetry and Task Supervision

  • Task.Supervisor: Introduced top-level Tinkex.TaskSupervisor and routed telemetry HTTP sends through it instead of Task.start/1.
  • Child specs: Made SamplingClient and TrainingClient child_spec use restart: :temporary to avoid restart storms and match user-managed lifecycle semantics.
  • Telemetry toggle: Keep telemetry_enabled? driven by Tinkex.Config.telemetry_enabled? with TINKER_TELEMETRY as the env fallback.

Docs, Guides, and Examples

  • New guides: environment_configuration.md, advanced configuration updates describing env precedence, Cloudflare Access, and heartbeat tuning.
  • Retry guide: Extended retry_and_error_handling.md with SamplingClient retry_config details and connection limiting behavior.
  • Persistence guides: Added training_persistence.md and model_info_unload.md focused on checkpoints, optimizer state, and model metadata APIs.
  • Examples README: Updated examples/README.md and run_all.sh to include new examples: heartbeat_probe.exs, model_info_and_unload.exs, training_persistence_live.exs, save_weights_and_sample.exs.

Tests

  • Added unit tests for Env, Config precedence, Cloudflare redaction, RetryConfig, RetryHandler, RetrySemaphore, ModelInfo and Unload types, LoadWeightsRequest wire format, and updated rate limiter behavior.
  • Added integration tests covering:
    • Multi-client concurrency with retry_config enabled/disabled
    • Sampling workflows under rate limits and transient errors
    • Training loop + checkpoint flows with save/load and from_state
    • SessionManager heartbeat path and warning behavior

[0.1.8] - 2025-11-26

Added

  • NotGiven + transform layer: Introduced omit/not-given sentinels and request transformation with aliasing/formatting to mirror Python serialization semantics.
  • REST client parity: Added RestClient.get_sampler/2, RestClient.get_weights_info_by_tinker_path/2, and tinker-path convenience aliases (training run, delete/archive/publish/unpublish) to match Python ergonomics.
  • Response wrappers & SSE: Added Tinkex.API.Response with metadata (headers, status, URL, elapsed, retries) plus SSE decoding helpers and streaming response support for event-stream endpoints.
  • Typed responses: New structs for weight save/load responses, training runs, cursors, server capabilities, and health checks; REST training run endpoints now decode into typed structs.
  • Service endpoints: Implemented /api/v1/get_server_capabilities and /api/v1/healthz in Tinkex.API.Service.
  • Sampling helper: Added SamplingClient.compute_logprobs/3 convenience for prompt token logprobs.
  • CLI management: Added checkpoint management subcommands (list/info/publish/unpublish/delete/download) and run management subcommands (list/info) with corresponding tests and docs.
  • Live example: New examples/live_capabilities_and_logprobs.exs showing capabilities + health probes and prompt logprobs; included in examples/run_all.sh.
  • Centralized env + Cloudflare headers: Tinkex.Env feeds Tinkex.Config/HTTP defaults (API key, base URL, tags, feature gates, telemetry, log level, dump headers, Cloudflare Access) with redaction helpers, matching Python env behavior and ADR-002.
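
The omit/not-given idea from the NotGiven + transform layer entry can be illustrated with a minimal sketch. It is not the library's implementation: the module name `NotGivenSketch`, the sentinel atom, and the alias map are all hypothetical. The point is only that a "not given" field is dropped from the request, while an explicit `nil` is still sent.

```elixir
defmodule NotGivenSketch do
  # Hypothetical sentinel; the real library's representation may differ.
  @not_given :__not_given__

  def not_given, do: @not_given

  # Drops keys whose value is the sentinel, keeps explicit nils,
  # and renames keys according to the (hypothetical) alias map.
  def transform(params, aliases \\ %{}) do
    params
    |> Enum.reject(fn {_key, value} -> value == @not_given end)
    |> Map.new(fn {key, value} -> {Map.get(aliases, key, key), value} end)
  end
end

params = %{prompt: "hi", seed: NotGivenSketch.not_given(), stop: nil}

# :seed is omitted entirely; :stop is renamed and keeps its explicit nil.
IO.inspect(NotGivenSketch.transform(params, %{stop: :stop_sequences}))
```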

Fixed

  • RestClient training run endpoints now return typed structs, and publish/unpublish checkpoint helpers are wired through REST with CLI wrappers.
  • Session heartbeat parity: Heartbeats now POST to /api/v1/session_heartbeat (matching Python), continue retrying on all errors, and emit warnings after sustained failure windows instead of silently dropping sessions on 4xx. Added a guarded probe script to verify that /api/v1/session_heartbeat returns 200 and /api/v1/heartbeat returns 404 against a real server.

[0.1.7] - 2025-11-26

Added

  • Telemetry Reporter: New Tinkex.Telemetry.Reporter batches client-side telemetry (session start/end, HTTP, queue, custom events, exceptions) with configurable flush interval/threshold, HTTP timeout, and retry/backoff, plus wait-until-drained semantics and fatal-exception flushing. ServiceClient boots a reporter automatically and exposes it via telemetry_reporter/1; telemetry can be disabled with TINKER_TELEMETRY=0.
  • Telemetry examples: Added examples/telemetry_live.exs and examples/telemetry_reporter_demo.exs, documented in READMEs and included in examples/run_all.sh, showcasing reporter lifecycle, custom events, retries, drain/wait, and graceful shutdown.
  • Coverage: Added test/tinkex/telemetry_reporter_test.exs for reporter lifecycle, backoff, exception handling, and drain semantics; tests disable backend telemetry via TINKER_TELEMETRY=0.

Changed

  • Telemetry attribution: Sampling/training/client APIs now merge optional :telemetry_metadata (including session, sampling session, and model sequence IDs) into HTTP telemetry so backend events are session-scoped; telemetry POSTs honor configurable timeouts.
  • Docs: README highlights the telemetry reporter, backend shipping flow, and metadata tagging; installation snippet bumped to ~> 0.1.7.

[0.1.6] - 2025-11-25

Added

  • Metrics aggregation via Tinkex.Metrics: counters for total/success/failure HTTP requests, latency histograms with p50/p95/p99, snapshot/0, reset/0, and flush/0.
  • Automatic telemetry wiring: the metrics server starts under Tinkex.Application and subscribes to [:tinkex, :http, :request, :stop] by default.
  • Live metrics example: examples/metrics_live.exs runs a sampling call against the live API and prints the metrics snapshot; added to examples/run_all.sh and documented in examples/README.md.
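
For readers unfamiliar with the p50/p95/p99 figures mentioned above, here is a minimal nearest-rank percentile sketch over a list of latency samples. It is an illustration only; `PercentileSketch` is a hypothetical module, and the changelog does not say how Tinkex.Metrics computes its histograms.

```elixir
defmodule PercentileSketch do
  # Nearest-rank percentile over latency samples (milliseconds).
  # `p` is an integer in 1..100. Returns nil for an empty list.
  def percentile([], _p), do: nil

  def percentile(samples, p) when is_integer(p) and p >= 1 and p <= 100 do
    sorted = Enum.sort(samples)
    # Integer ceiling of p * n / 100, avoiding float rounding.
    rank = div(p * length(sorted) + 99, 100)
    Enum.at(sorted, rank - 1)
  end

  def summary(samples) do
    %{
      p50: percentile(samples, 50),
      p95: percentile(samples, 95),
      p99: percentile(samples, 99)
    }
  end
end

latencies_ms = [12, 15, 11, 240, 18, 14, 13, 16, 19, 17]
IO.inspect(PercentileSketch.summary(latencies_ms))
```

A single slow request (240 ms here) dominates p95 and p99 while leaving p50 untouched, which is why the tail percentiles are reported alongside the median.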

Changed

  • Docs: README now highlights metrics and shows a quick snapshot snippet; installation version bumped to ~> 0.1.6.

[0.1.5] - 2025-11-25

Added

  • Structured Regularizer Composition: New TrainingClient.forward_backward_custom/4 API for computing custom loss functions with composable regularizers in Elixir/Nx.
  • RegularizerSpec type (lib/tinkex/types/regularizer_spec.ex): Typed configuration struct for regularizers with validation, supporting:
    • fn - Regularizer function of arity 2 returning {loss_tensor, metrics_map}
    • weight - Non-negative float multiplier for loss contribution
    • name - String identifier for telemetry and metrics
    • async - Boolean flag for Task-returning regularizers
  • RegularizerOutput type (lib/tinkex/types/regularizer_output.ex): Metrics struct for individual regularizer results including value, weight, contribution, and optional gradient norms.
  • CustomLossOutput type (lib/tinkex/types/custom_loss_output.ex): Structured output from composed loss computation with base loss metrics, per-regularizer outputs, and totals. Implements Jason.Encoder for serialization.
  • Regularizer behaviour (lib/tinkex/regularizer/regularizer.ex): Formal interface with @callback compute/3 and optional @callback name/0 for module-based regularizers.
  • GradientTracker (lib/tinkex/regularizer/gradient_tracker.ex): Nx-based gradient norm computation using Nx.Defn.grad for monitoring training dynamics:
    • compute_grad_norm/2 - L2 norm for arbitrary loss functions
    • grad_norm_for_regularizer/3 - Per-regularizer gradient norms
    • total_grad_norm/4 - Combined gradient norm for full loss composition
  • Executor (lib/tinkex/regularizer/executor.ex): Parallel/sequential regularizer execution using Task.async_stream/3:
    • Configurable parallelism with max_concurrency option
    • Timeout handling with task cleanup
    • Support for async (Task-returning) regularizers
    • Telemetry emission for start/stop/exception events
  • Pipeline (lib/tinkex/regularizer/pipeline.ex): Orchestration module coordinating base loss and regularizer execution:
    • Input validation and duplicate name detection
    • Parallel execution by default
    • Optional gradient norm tracking
    • Comprehensive telemetry events
  • Regularizer Telemetry (lib/tinkex/regularizer/telemetry.ex): Convenience helpers for attaching telemetry handlers to regularizer events:
    • [:tinkex, :custom_loss, :start | :stop | :exception]
    • [:tinkex, :regularizer, :compute, :start | :stop | :exception]
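
The composition described in this release can be sketched without Nx, using plain floats in place of tensors. This is a hedged illustration, not the Pipeline module itself: `RegularizerSketch` is hypothetical, the `(data, logprobs)` argument order is assumed, and so is the rule that each contribution equals `weight * value` and the total equals the base loss plus all contributions.

```elixir
defmodule RegularizerSketch do
  # Each spec is a map with :fn (arity 2, returning {loss, metrics}),
  # :weight, and :name, mirroring the fields listed for RegularizerSpec.
  def compose(base_loss, specs, data, logprobs) do
    names = Enum.map(specs, fn %{name: name} -> name end)

    # Reject duplicate names, as the Pipeline entry describes.
    if length(Enum.uniq(names)) != length(names) do
      raise ArgumentError, "duplicate regularizer names"
    end

    outputs =
      Enum.map(specs, fn %{fn: fun, weight: weight, name: name} ->
        {value, _metrics} = fun.(data, logprobs)
        # Assumption: contribution is the weighted regularizer value.
        %{name: name, value: value, weight: weight, contribution: weight * value}
      end)

    total = Enum.reduce(outputs, base_loss, fn out, acc -> acc + out.contribution end)
    %{base_loss: base_loss, regularizers: outputs, total: total}
  end
end

specs = [
  %{
    fn: fn _data, logprobs -> {Enum.sum(Enum.map(logprobs, &abs/1)), %{}} end,
    weight: 0.5,
    name: "l1"
  }
]

result = RegularizerSketch.compose(2.0, specs, [], [-1.0, -3.0])
IO.inspect(result.total)
```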

Changed

  • TrainingClient: Added forward_backward_custom/4 public function and corresponding handle_call clause for custom loss computation with regularizers.

[0.1.4] - 2025-11-25

  • Added EXLA dependency ({:exla, "~> 0.7"}) and configured Nx to use EXLA.Backend by default, enabling GPU/CPU-accelerated tensor operations for custom loss computation in Elixir.
  • Introduced TrainingClient.forward/4 for forward-only inference without backward pass, returning logprobs that can be converted to Nx tensors via TensorData.to_nx/1.
  • Added Training.forward_future/2 API endpoint for server-side future-based forward pass requests.
  • Created forward_inference.exs example demonstrating the forward-only API, Nx tensor conversion, and EXLA-accelerated operations.
  • Foundation for structured regularizer pipelines where custom loss functions and gradients are computed in Elixir/Nx rather than on the server.

[0.1.3] - 2025-11-25

  • Made SessionManager.stop_session/2 a synchronous GenServer call so heartbeat removal finishes before the client returns. Refined heartbeat error handling to drop sessions silently on client-visible errors (e.g., 404), matching the Python SDK.
  • Added REST endpoints for fetching samplers, weights metadata, and training runs, along with the new GetSamplerResponse and WeightsInfoResponse structs, ImageChunk.expected_tokens, LoadWeightsRequest.load_optimizer_state, and the :cispo/:dro LossFnType variants. Expanded tests, serialization round-trips, and .gitignore coverage to match the increased API parity.
  • Introduced the weightsinspection.exs example showing how to query checkpoint metadata, LoRA ranks, and sampler state via REST. Published detailed architectural documentation and a Structured Regularizers design doc outlining custom loss/telemetry workflows that require Nx/EXLA gradient computation support.

[0.1.2] - 2025-11-22

  • Fixed future polling to handle direct ForwardBackwardOutput payloads (no status/type wrapper) by normalizing them into completed responses, unblocking TrainingClient.forward_backward/3 result parsing.

[0.1.1] - 2025-11-21

  • Added Tinkex.RestClient for synchronous session and checkpoint management (list/get/delete and archive URL retrieval) with typed response structs.
  • Added Tinkex.CheckpointDownload to fetch and extract checkpoint archives with overwrite and progress callback options.
  • Added async client factories (ServiceClient.create_sampling_client_async/2, SamplingClient.create_async/2, TrainingClient.create_sampling_client_async/3) to parallelize client creation.
  • Expanded docs and examples to cover sessions, checkpoint management, downloads, and async workflows; included the changelog in HexDocs extras for publishing.
  • Fixed REST URL construction so query parameters (e.g., checkpoint archive URLs) are sent correctly, resolving 404s when fetching checkpoint downloads.
  • Avoided over-encoding tinker:// paths in checkpoint archive/delete endpoints so sampler checkpoints resolve correctly.
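
The two URL fixes above concern query parameters and single (not double) encoding of tinker:// paths. The sketch below shows one way to build such a URL with the standard `URI` module. The module name, host, endpoint path, and parameter name are all hypothetical and are not taken from the library.

```elixir
defmodule QueryUrlSketch do
  # Appends query parameters to base_url <> path. Values are encoded exactly
  # once by URI.encode_query/1, so callers pass the raw tinker:// path.
  def build(base_url, path, params) do
    uri = URI.parse(base_url <> path)
    query = if map_size(params) == 0, do: nil, else: URI.encode_query(params)
    URI.to_string(%{uri | query: query})
  end
end

# Hypothetical endpoint and parameter name, for illustration only.
url =
  QueryUrlSketch.build(
    "https://example.invalid",
    "/api/v1/checkpoints/archive",
    %{"path" => "tinker://run-1/weights/0001"}
  )

IO.puts(url)

# Decoding the query recovers the original path, confirming single encoding.
IO.inspect(URI.decode_query(URI.parse(url).query))
```

Encoding the value a second time would turn `%3A` into `%253A`, so the server would no longer see the original tinker:// path after decoding once.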

[0.1.0]

  • Initial release.