Changelog
View SourceAll notable changes to this project will be documented in this file.
[0.3.4] - 2025-12-26
Fixed
- Configurable polling backoff for 408/5xx in Future polling (opt-in).
- TrainingClient background task monitoring fix to surface crashes and avoid mailbox bloat.
- Circuit breaker registry concurrency fix to prevent lost updates under concurrent failures.
- Jittered exponential backoff for SamplingDispatch/RetrySemaphore busy loops (tunable).
[0.3.3] - 2025-12-26
Added
Streaming Sampling: New
SamplingClient.sample_stream/4function for real-time token streaming via Server-Sent Events (SSE)- Added
SampleStreamChunktype for incremental token processing - Added
API.Sampling.sample_stream/2endpoint implementation - Stream supports token-by-token enumeration with finish reasons and metadata
- Added
OpenTelemetry Integration: Opt-in W3C Trace Context propagation for distributed tracing
- Added
Telemetry.Otelmodule for traceparent/tracestate header injection - Enable with
otel_propagate: truein config orTINKEX_OTEL_PROPAGATE=trueenvironment variable - Gracefully degrades when OpenTelemetry packages not installed
- Added
Circuit Breaker: Per-endpoint circuit breaker pattern for resilient API calls
- Added
CircuitBreakermodule with three states: closed, open, half-open - Added
CircuitBreaker.Registryfor ETS-based multi-process circuit breaker management - Configurable failure thresholds and reset timeouts
- Automatic failure detection and recovery
- Added
Changed
- Future Polling (Python SDK Parity): Polling loop now handles transient errors internally instead of delegating to HTTP retry layer
- Continues polling on 408 (Request Timeout) and 5xx (Server Errors) until
poll_timeout - Retries connection errors with exponential backoff
- Passes
max_retries: 0to HTTP layer; polling loop manages all retry logic - Fixed backoff overflow by capping iteration exponent
- Continues polling on 408 (Request Timeout) and 5xx (Server Errors) until
- Polling Defaults: Future polling now defaults to a 45s per-request HTTP timeout (Python parity) and sampling futures inherit that default unless explicitly overridden.
- Retry Config Parity:
Tinkex.API.RetryConfignow defaults tomax_retries: 10to match the Python SDK. - Telemetry Metadata:
config.user_metadatais now merged into HTTP + Future telemetry metadata; per-requesttelemetry_metadatastill overrides on conflicts. - Examples/Docs: Task await samples now default to
:infinityin examples and README, matching the cookbook’s no-timeout behavior. - Examples:
examples/run_all.shnow logs timestamps and durations for each script run. - Logo Update: Redesigned project logo with modern coral/salmon gradient aesthetic, neural network iconography, and hexagonal frame
- Code Quality: Extensive refactoring across 50+ files for improved readability and maintainability
- Simplified conditional logic throughout codebase
- Extracted helper functions to reduce complexity
- Improved error handling patterns
- Reduced cyclomatic complexity in high-traffic modules
Documentation
- Added comprehensive implementation guide for fresh agents
- Added gaps analysis document identifying enhancement opportunities
- Added current state documentation (v0.3.2 baseline)
- Added model registry planning document (deferred feature)
- Added test infrastructure overhaul/refactor plan with verification checklist
Tests
- Added
test/tinkex/future_test.exsfor Future polling behavior (408/5xx/connection error handling, Python SDK parity) - Refactored async test suite with Supertester v0.4 isolation (telemetry/logger/ETS) and Bypass-based HTTP cases to eliminate flakiness
[0.3.2] - 2025-12-15
- Breaking:
Tinkex.Confignow requiresapi_keyto start with thetml-prefix (Python SDK parity); tests and guides updated accordingly. - Fix: validate
loss_fnand normalize/validateloss_fn_configclient-side forTrainingClient.forward/4andTrainingClient.forward_backward/4, matching Python SDK expectations and preventing avoidable server 500s.
[0.3.1] - 2025-12-13
- Examples: multimodal sampling example now prefers Qwen3-VL models when available (
Qwen/Qwen3-VL-30B-A3B-Instruct,Qwen/Qwen3-VL-235B-A22B-Instruct). - Examples: multimodal sampling example uses
examples/assets/vision_sample.pngby default and supportsTINKER_IMAGE_PATH/TINKER_IMAGE_EXPECTED_TOKENS. - Version reporting: session
sdk_version, telemetrysdk_version, andx-stainless-package-versionare pinned to the official Python Tinker SDK version configured inmix.exs(removedTINKEX_SDK_VERSION/sdk_versionoverrides). - Fix: HuggingFace tokenizer downloads no longer emit
:httpcnotices aboutautoredirect.
[0.3.0] - 2025-12-12
Added
- Kimi K2 tokenization support via
tiktoken_ex(loadstiktoken.model+tokenizer_config.jsonfrom HuggingFace and caches the encoding). - New live example:
examples/kimi_k2_sampling_live.exs(end-to-end sampling with Kimi K2 when available). - New guide:
docs/guides/kimi_k2_tokenization.md.
Changed
- EXLA is optional and is not started automatically; docs now show starting
:exlabefore settingNx.default_backend/1.
Fixed
- Tokenizer downloads are escript-safe by using OTP-provided CA certs for HuggingFace requests.
[0.2.2] - 2025-12-08
Fixed
- Recovery ServiceStub now isolates state per
service_pidinstead of a sharedpersistent_termmap, so concurrent tests cannot erase each other's failure counts or recipients; executor/monitor recovery tests no longer flaky.
[0.2.1] - 2025-12-07
Changed
AdamParamsnow includesweight_decayandgrad_clip_normfields with validation and JSON encoding to match Python SDK defaults.- Queue backpressure propagates
queue_state_reasonthroughTryAgainResponse/Futuretelemetry and observers, preferring server-supplied reasons in logs with updated human-readable messages. - Training chunking uses a shared byte estimator (images/assets/text/tensors) with 1024-item / 5MB caps instead of count-based limits.
- Sampling dispatch adds byte-aware throttling (layered semaphores, 20× penalties after recent backoff) with size-based 429 backoffs (1s for ≤128KB, 5s otherwise).
- Future polling now emits telemetry for timeouts and API/connection/request failures, and treats HTTP 410 “promise expired” responses as retryable with clearer messaging.
- Retry semaphores support caller-provided keys; SamplingClient scopes retry capacity per session instead of sharing by limit value.
- Sampling queue-state debounce entries can be cleared (
clear_queue_state_debounce/1) and are cleaned up on client termination to avoid persistent_term growth. - New live examples:
adam_and_chunking_live.exs(AdamParams + byte chunking) andqueue_reasons_and_sampling_throttling.exs(queue reasons + dispatch throttling) added toexamples/run_all.shand README. - Major refactor of god files: Split 4 large modules into smaller, focused sub-modules for improved maintainability:
Tinkex.CLI(2,155 → 82 lines): Extracted toCLI.Commands.{Checkpoint, Run, Sample, Version},CLI.Formatting,CLI.Pagination,CLI.ParserTinkex.TrainingClient(1,762 → 1,031 lines): Extracted toTrainingClient.{Operations, Polling, DataProcessor, Tokenizer, Observer}Tinkex.API(1,115 → 317 lines): Extracted toAPI.{Request, ResponseHandler, Retry, Headers, URL, Compression}Tinkex.Telemetry.Reporter(784 → 380 lines): Extracted toReporter.{Queue, Events, ExceptionHandler, Serializer, Backoff}
- All public APIs remain unchanged (transparent refactor)
[0.2.0] - 2025-12-04
Added
- Opt-in recovery automation:
Tinkex.Recovery.Policy/Monitor/Executorrestart corrupted runs from checkpoints with callbacks, telemetry, and capped concurrency (disabled by default). - Recovery telemetry events (
:detected,:started,:checkpoint_selected,:client_created,:completed,:failed,:exhausted) and config wiring (Config.recovery) for reuse in supervisors. - NxPenalties-backed regularizer adapters (L1, L2, Elastic Net, Entropy, KL divergence, Consistency, Orthogonality, Gradient Penalty) under
Tinkex.Regularizers. - Structured regularizer examples (offline + live) exercise all adapters with reference/pair data wiring, KL direction/symmetric variants, and entropy temperature scaling.
Changed
Regularizers.KLDivergencenow forwards NxPenalties:direction(:forward/:reverse) and:symmetricoptions and surfaces the chosen mode in metrics.Regularizers.Entropynow accepts a:temperatureoption for entropy scaling.- Checkpoint timestamps now normalize to
DateTimewhen ISO-8601 strings are returned (original strings are preserved on parse failure). - README and guides updated for 0.2.0 install and the new regularizer examples.
- Dependency note: uses
nx_penalties ~> 0.1.2for tensor primitives.
[0.1.20] - 2025-12-03
Changed
- Default parity: Config now defaults to Python parity (
timeout: 60_000,max_retries: 10); useparity_mode: :beamorTINKEX_PARITY=beamto opt back into the 120s/2-retry defaults. - Datum dtype parity:
loss_fn_inputslists use the Python_key_to_typemap (target_tokens→ int64;weights/advantages/logprobs/clip_*→ float32) and raise on list values under unknown keys; typed tensors remain allowed for custom keys.
Added
- Tests covering the key-based dtype map and error behavior for unknown list keys; docs/README bumped to 0.1.20.
[0.1.19] - 2025-12-03
Added
- Metric reducer parity:
hash_unorderedreducer inTinkex.MetricsReductionfor order-insensitive hash aggregation, matching Python's chunked metrics behavior. - ModelInput builders:
empty/0,append/2, andappend_int/2helpers for building ModelInput incrementally.append_int/2is token-aware: extends the last EncodedTextChunk when possible, otherwise creates a new chunk. - TensorData.tolist/1: Returns the flat data list for API parity with Python's
TensorData.tolist().
Documentation
- Gap analysis docs corrected per review: parity matrices now reflect Elixir regularizer/telemetry support, sampling
retry_config, Python-only tensor backends, and the remaining reducer/model-input gaps. - Captured verification results in
docs/20251203/gap_analysis_claude/REVIEW_RESULTS.mdand refreshed README/install snippets/prompts to 0.1.19 for consistency. - Updated gap analysis to mark
hash_unorderedreducer and ModelInput builders as resolved.
[0.1.18] - 2025-12-03
Added
- Public
Tinkex.Types.ParsedCheckpointTinkerPathhelper exposes structuredtinker://components and user-category errors; REST helpers and CLI checkpoint commands reuse it for consistent validation.
Changed
feature_gatesnow default to["async_sampling"]when opts/app/env do not supply a value (Python parity); env snapshots/config defaults reflect the new baseline while keeping opts > app config > env precedence.- Environment and checkpoint guides note the default gate and the new path parser; README/install snippets bump to 0.1.18.
[0.1.17] - 2025-12-03
Added
- Log-level parity:
TINKER_LOG/config :tinkex, :log_levelnow set Logger at application start (Python parity) while keeping HTTP header dumps redacted unless explicitly enabled. - Telemetry capture: Service/Training/Sampling client entrypoints wrap exceptions through
Tinkex.Telemetry.Capture/Reporterand emit session-end events on fatal paths. - Config knobs: Added
default_headers,default_query, customhttp_client/http_pooloverrides, and env keys (TINKEX_DEFAULT_HEADERS,TINKEX_DEFAULT_QUERY,TINKEX_HTTP_CLIENT,TINKEX_HTTP_POOL) with opts > app config > env precedence and secret redaction for headers.
Changed
- Session heartbeats pin a 10_000ms timeout with
max_retries: 0for Python parity while preserving debounce/warning semantics. - Finch pool isolation now routes per pool type (session/training/sampling/futures/telemetry) with Python-aligned sizes and deterministic pool names derived from
http_pool+base_url.
[0.1.16] - 2025-12-03
Added
- CLI parity for checkpoint/run management:
--format json/--jsonoutput withtotal/showncounts, checkpoint list run filters,--limit 0fetch-all semantics, and pagination progress indicators for lists. - Checkpoint info now surfaces checkpoint type, size, visibility, timestamps, and training run IDs alongside base model/LoRA metadata; run list/info surface owner, LoRA rank, statuses, last checkpoints, and user metadata in both table and JSON formats.
Changed
- Checkpoint pagination cursors now use the typed
Tinkex.Types.Cursorstruct; CLI table outputs include the richer checkpoint/run fields. - Examples and docs updated for new CLI flags and cursor accessors; version bumped to 0.1.16 in mix/README/getting_started.
[0.1.15] - 2025-12-03
Added
- Checkpoint archive responses now surface
expires;CheckpointArchiveUrlResponseparses timestamps and Rest/RestClient expose ID-based archive/delete helpers alongside tinker-path variants. - Sampling backpressure parity: dispatch semaphore (default 400) gates sampling dispatch and 429s without
Retry-Aftertrigger a 1s shared backoff.
Changed
list_user_checkpoints/2now defaults tolimit: 100(Python parity).RetryConfigdefaultmax_connectionsraised to 1000 to match Python connection limits.- ServiceClient validation requires
base_modelormodel_pathfor sampling clients and at least one oftrain_mlp/train_attn/train_unembedwhen creating training clients.
Documentation
- README/version bump; checkpoint management guide covers archive expirations + ID helpers; retry guide reflects the new
max_connectionsdefault.
[0.1.14] - 2025-12-02
Added
ServiceClient.create_training_client_from_state_with_optimizer/3(+ async) for weight+optimizer restore with documented weights-only default on the existing helper.- CLI
checkpoint deletenow accepts multipletinker://paths with upfront validation, a single confirmation prompt (--yesto skip), per-path progress output, and aggregated failure reporting. - New examples:
examples/multimodal_resume_and_cleanup.exs(multimodal withexpected_tokens, optimizer resume helper, multi-delete CLI usage)examples/checkpoint_multi_delete_live.exs(creates two checkpoints and deletes both live with one CLI call)examples/llama3_tokenizer_override_live.exs(live Llama-3 sampling plus encode/decode using the override tokenizer)
- DELETE HTTP responses now accept empty bodies without raising JSON decode errors, fixing CLI multi-delete against endpoints that return 204/empty responses.
- Retry handler no longer trips progress timeout before the first attempt (attempt 0 is allowed to run).
Changed
- Image chunks: removed
height/width/tokensfields fromImageChunkandImageAssetPointerChunk, addedexpected_tokens, and made.length/1raise when missing; JSON encoders now match Python. ModelInput length follows the new contract and training batching counts images by base64/location length to avoid missing expected tokens. - Retries: default progress timeout increased to 120 minutes and retries now run until that timeout (no attempt cap by default); RetryConfig/RetryHandler defaults and docs updated.
- Tokenizer override: Llama-3 models now use
thinkingmachineslabinc/meta-llama-3-tokenizerto avoid gated downloads. - Documentation updates for CLI multi-delete, optimizer resume helper, retry defaults, expected_tokens schema, and new example references.
[0.1.13] - 2025-12-01
Documentation release: clarifies project status and attribution.
Documentation
- Added disclaimer clarifying Tinkex is an independent, community-maintained project not affiliated with or endorsed by Thinking Machines Lab.
- Added links to official Thinking Machines Lab homepage and Tinker documentation.
- Improved intro section clarity around Tinker's provenance.
[0.1.12] - 2025-11-27
Custom loss training now mirrors the Python SDK while adding multipart uploads, proxy-aware HTTP, streaming checkpoint downloads, queue observers, and richer capabilities metadata.
Added
- Custom loss training parity:
TrainingClient.forward_backward_custom/4preserves per-datum logprobs as Nx tensors, computes gradients locally, sends them back as synthetic weights so the backend trains, and returnsForwardBackwardOutputready foroptim_step/2. - Structured capabilities metadata: Capabilities responses expose
SupportedModelstructs withmodel_id,model_name, andarch;model_names/1offers backward-compatible name extraction and the capabilities example now prints the full metadata. - Multipart uploads: Tinkex.API.post/3 normalizes file inputs, flattens nested bodies into bracketed form fields, generates multipart boundaries, and sets
Content-Typeautomatically when:filesare present. Added path-based upload helpers and a runnable multipart demo. - Queue observers: Sampling and training clients register queue state observers that integrate with
Future.poll/2, emitting debounced warnings with human-readable reasons for rate limiting or capacity throttling, aligned with the Python SDK. - Proxy-aware HTTP:
Tinkex.Confignow wires proxy settings through Finch pool configuration, supporting tuple proxies plusTINKEX_PROXYandTINKEX_PROXY_HEADERSwhile masking credentials in inspect/log output.
Changed
- Checkpoint downloads: Archive downloads stream via Finch instead of loading into memory, maintaining O(1) memory usage while preserving progress callbacks and extraction semantics.
Documentation
- Guides and examples refreshed for custom loss training, streaming checkpoint downloads, multipart uploads, async client creation, and the richer server capabilities response.
[0.1.11] - 2025-11-27
Achieves full behavioral parity with Python SDK (tinker v1.x) across retry semantics, HTTP connection pooling, and missing type definitions.
Breaking Changes
ServiceClient.create_lora_training_client/2→/3:base_modelis now a required second argument instead of being passed in opts:# Before (0.1.10) create_lora_training_client(service, base_model: "meta-llama/Llama-3.1-8B") # After (0.1.11) create_lora_training_client(service, "meta-llama/Llama-3.1-8B", opts)TrainingClient.save_weights_for_sampler/2→/3:nameis now a required second argument mapping to the serverpathfield:# Before (0.1.10) save_weights_for_sampler(training, name: "checkpoint-001") # After (0.1.11) save_weights_for_sampler(training, "checkpoint-001", opts)
Added
Python SDK Retry Parity
- HTTP retry now matches Python
_base_client.pybehavior:- Retries on 408/409/429/5xx (added 409 Conflict support)
- Uses Python jitter formula (0.75–1.0 range) instead of ±25%
- Caps delay at 10s instead of 8s
- Removes 30s wall-clock timeout;
max_retriesgoverns retry attempts
- New
Tinkex.API.RetryConfigmodule with Python parity formulas
HTTP Pool Parity
- Pool defaults now align with Python's
httpx.Limits:pool_size: 50,pool_count: 20for 1000 total connections- Matches Python
max_connections=1000
Tinkex.Env.pool_size/1andpool_count/1forTINKEX_POOL_SIZE/TINKEX_POOL_COUNTenv varsTinkex.Application.default_pool_size/0anddefault_pool_count/0exposed;Application.start/2respects pool config from env or app config
Missing Type Structs
FutureRetrieveRequest- request type for future pollingRequestFailedResponse- structured error response typeSessionHeartbeatRequest/SessionHeartbeatResponse- heartbeat wire typesTelemetryResponse- telemetry send response typeTinkex.Types.TypeAliasesmodule forModelInputChunk,LossFnInputs,LossFnOutputunion typesSupportedModel- structured model metadata withmodel_id,model_name, andarchfields
Server Capabilities Type Improvement
GetServerCapabilitiesResponse.supported_modelsnow returns[SupportedModel.t()]instead of[String.t()]- Preserves full model metadata (
model_id,model_name,arch) from the API response - Added
GetServerCapabilitiesResponse.model_names/1convenience helper for backward-compatible name extraction - Backward compatible: parser handles both object and string formats
API Helpers
Tinkex.API.Helpers.with_raw_response/1- Python-style raw response access patternTinkex.API.Helpers.with_streaming_response/1- Python-style streaming response access pattern
TensorDtype Helpers
TensorDtype.from_nx_type/1now emits warnings for float64→float32 downcast and u64→int64 overflowTensorDtype.from_nx_type_quiet/1- silent conversion without warningsTensorDtype.check_precision_loss/1- explicit precision loss checking
Fixed
- Sampling nil-field fix:
SampleRequestJSON encoder now omitsnilfor optional fields likeprompt_logprobsinstead of encoding asnull(server rejects null). Usesdrop_nil?: truein Transform layer. - Tokenizer resolution: Now strips
/variantsuffix from three-part model names; added Kimi K2 tokenizer with pinned revision - CLI checkpoint command: Generates default checkpoint names from
base_modelwhen--namenot provided - Example bug fixes:
- Fixed
.samples→.sequencesin examples and docs (field was renamed in response type) - Fixed
prompt.tokens→prompt.chunks[0].tokensaccess pattern intelemetry_live.exs
- Fixed
Changed
- Training loop example adds detailed step timing and clearer output formatting
Documentation
- Updated all guides to use new
create_lora_training_client/3andsave_weights_for_sampler/3signatures - Added
docs/20251126/gaps_05/gap-analysis-python-to-elixir.md: comprehensive gap analysis showing feature-complete parity with Python SDK - Added
docs/20251126/gaps_05/remaining_gaps_fix_plan.md: concrete remaining gaps and fixes to reach 100% parity - Updated version references from 0.1.10 to 0.1.11 in README and mix.exs
Tests
- Added
test/tinkex/api/retry_parity_test.exs: validates Python retry formula, jitter range (0.75–1.0), 10s delay cap, and 409 retry support - Added
test/tinkex/pool_config_parity_test.exs: verifiespool_size × pool_count = 1000matching Pythonmax_connections - Added
test/tinkex/types/missing_types_parity_test.exs: round-trip tests for new type structs - Added
test/tinkex/api/helpers_test.exs:with_raw_responseandwith_streaming_responsehelpers - Updated integration tests for new TrainingClient API signatures
[0.1.10] - 2025-11-27
Added
RestClient Async API
- All REST methods now have
*_asyncvariants returningTask.t()for parallel requests:get_session_async/2,list_sessions_async/2,get_sampler_async/2get_weights_info_by_tinker_path_async/2,list_checkpoints_async/2list_user_checkpoints_async/2,get_checkpoint_archive_url_async/2delete_checkpoint_async/2,get_training_run_async/2get_training_run_by_tinker_path_async/2,list_training_runs_async/2publish_checkpoint_async/2,unpublish_checkpoint_async/2- Plus convenience aliases (
delete_checkpoint_by_tinker_path_async/2,publish_checkpoint_from_tinker_path_async/2, etc.)
TrainingClient Tokenizer Helpers
TrainingClient.get_tokenizer/2- fetches tokenizer using model info with ETS cachingTrainingClient.encode/3- convenience wrapper for tokenizer encoding from training clientTrainingClient.decode/3- convenience wrapper for tokenizer decoding from training client
Config Parity Mode
- New
parity_mode: :pythonoption to align timeout/retry defaults with Python SDK:- Set via options:
Tinkex.Config.new(parity_mode: :python) - Set via application config:
config :tinkex, parity_mode: :python - Set via environment variable:
TINKEX_PARITY=python
- Set via options:
- Python parity defaults:
timeout: 60_000(1 min),max_retries: 10(11 total attempts) - BEAM-conservative defaults (unchanged):
timeout: 120_000(2 min),max_retries: 2(3 total attempts) - New helper functions: Tinkex.Config.default_timeout/0, Tinkex.Config.default_max_retries/0, Tinkex.Config.python_timeout/0, Tinkex.Config.python_max_retries/0
Tinkex.Env.parity_mode/1- readsTINKEX_PARITYenvironment variable
Typed Telemetry Events
- New structs under
Tinkex.Types.Telemetry:EventType- enum for SESSION_START, SESSION_END, UNHANDLED_EXCEPTION, GENERIC_EVENTSeverity- enum for DEBUG, INFO, WARNING, ERROR, CRITICALGenericEvent,SessionStartEvent,SessionEndEvent,UnhandledExceptionEvent- typed event structsTelemetryEvent- union type withto_map/1,from_map/1,event_type/1dispatch helpersTelemetryBatch- batch grouping withto_list/1,from_list/2,size/1TelemetrySendRequest- request structure for API transmission
Changed
Tinkex.Telemetry.Reporternow emits typed structs internally and converts to wire format maps on sendTinkex.Env.snapshot/1now includesparity_modefield
Tests
- Added
test/tinkex/types/telemetry_types_test.exs- 53 tests for telemetry type round-trips and conversions - Added
test/tinkex/rest_client_async_test.exs- 14 tests for async REST client with Bypass - Added
test/tinkex/training_client_tokenizer_test.exs- 11 tests for tokenizer helper wiring - Added
test/tinkex/config_parity_test.exs- 10 tests for parity mode configuration - Extended
test/tinkex/env_test.exs- 4 tests forparity_mode/1
[0.1.9] - 2025-11-26
Added
Env and Configuration Parity
- Tinkex.Env: Introduced as the single source of truth for environment-driven configuration knobs.
- Config precedence: Wired
Tinkex.Config.new/1to use opts > app config > env > defaults. - New env vars: Support for
TINKER_TAGS,TINKER_FEATURE_GATES,TINKER_TELEMETRY,TINKER_LOG,TINKEX_DUMP_HEADERS,CLOUDFLARE_ACCESS_CLIENT_ID, andCLOUDFLARE_ACCESS_CLIENT_SECRET. - Secret masking: Added masking for API key and Cloudflare secrets in inspect and HTTP dumps.
- Config accessors: Exposed
tags,feature_gates,telemetry_enabled?,log_level,dump_headers?onTinkex.Config; documented inenvironment_configuration.md.
Cloudflare Access and Header Redaction
- CF headers: Inject
CF-Access-Client-IdandCF-Access-Client-Secretfrom config/env into all requests viabuild_headers/4. - Redaction: Redact both
x-api-keyandcf-access-client-secretin request dump logs. - Tests: Added tests asserting Cloudflare headers are present when configured and that secrets never appear in logs.
Heartbeat Path and SessionManager Robustness
- Heartbeat alignment: Changed Elixir heartbeat endpoint to POST
/api/v1/session_heartbeat(matching Python) instead of/api/v1/heartbeat. - Warning threshold: Introduced
heartbeat_warning_after_ms(default 120,000 ms) and emit warnings when heartbeats have failed for longer than this window. - Resilient sessions: Stop silently dropping sessions on 4xx; keep sessions alive and continue retrying while surfacing failures via logs.
- ETS persistence: Persist session state in a protected ETS table
:tinkex_sessionsand reload onSessionManagerinit so restarts preserve heartbeat state. - Timer cleanup: Ensure heartbeat timers are tracked and cancelled on terminate.
Sampling Retries, Backpressure, and Connection Limiting
- RetryConfig: Added
Tinkex.RetryConfigwithmax_retries,base_delay_ms,max_delay_ms,jitter_pct,progress_timeout_ms,max_connections, andenable_retry_logic. - SamplingClient integration: Integrated
RetryConfigintoSamplingClient; allow per-client configs and a simple keyword shorthandretry_config: [...]. - RetryHandler: Use
RetryHandler.from_config/1to wrap sampling requests with retry logic matching Python semantics (0.5s base, 10s cap, 25% jitter, long progress timeout). - RetrySemaphore: Introduced
Tinkex.RetrySemaphoreto cap concurrent sampling attempts per client using an ETS-backed semaphore andmax_connections. - Retry defaults: Removed hardcoded
max_retries: 0in Sampling API; now respects configuredmax_retriesfromRetryConfigor opts. - Tests: Added tests for
RetryConfig,RetryHandlerjitter,RetrySemaphore, and integration cases for enabled/disabled retries.
Training Persistence and Checkpoint Workflows
- LoadWeightsRequest fix: Fixed wire protocol by renaming
load_optimizer_statetooptimizerto match Python and server expectations. - TrainingClient.save_state/3: Implemented to save training checkpoints via
/api/v1/save_weights, returningSaveWeightsResponsewithtinker://path. - TrainingClient.load_state/3: Implemented to load weights via
/api/v1/load_weights. - TrainingClient.load_state_with_optimizer/3: Implemented to load weights with optimizer state.
- ServiceClient.create_training_client_from_state/3: Uses
RestClient.get_weights_info_by_tinker_path/2to derive base_model and LoRA rank, creates aTrainingClient, then loads the checkpoint. - SaveWeightsForSamplerResponse: Extended to support both
pathandsampling_session_id-only responses, matching ephemeral sampler flows. - TrainingClient.save_weights_and_get_sampling_client/2: Added to save sampler weights or ephemeral sessions and return a
SamplingClient. - TrainingClient.save_weights_and_get_sampling_client_sync/2: Synchronous helper variant.
- Docs & examples: Added
training_persistence.mdguide and examplestraining_persistence_live.exsandsave_weights_and_sample.exs.
Model Info and Unload Endpoints
- Types: Added
ModelData,GetInfoRequest,GetInfoResponse,UnloadModelRequest, andUnloadModelResponsetypes. - API endpoints: Implemented
Tinkex.API.Models.get_info/2andunload_model/2on top of/api/v1/get_infoand/api/v1/unload_model. - TrainingClient.get_info/1: Wired to the typed
get_infoendpoint, returningGetInfoResponsefor tokenizer_id and architecture. - TrainingClient.unload_model/1: Wired to the
unload_modelendpoint with support for both immediate and future-based responses. - Example & guide: Added
model_info_and_unload.exsexample andmodel_info_unload.mdguide.
REST Surface Parity and Tinker-Path Helpers
- RestClient.get_sampler/2: Exposed on top of
Rest.get_sampler/2, returning typedGetSamplerResponse. - RestClient.get_weights_info_by_tinker_path/2: Returns
WeightsInfoResponse. - New helpers:
get_training_run_by_tinker_path/2delete_checkpoint_by_tinker_path/2publish_checkpoint_from_tinker_path/2unpublish_checkpoint_from_tinker_path/2get_checkpoint_archive_url_by_tinker_path/2
- Docs: Extended
checkpoint_management.mdto use the new helpers.
Telemetry and Task Supervision
- Task.Supervisor: Introduced top-level
Tinkex.TaskSupervisorand route telemetry HTTP sends through it instead ofTask.start/1. - Child specs: Made
SamplingClientandTrainingClientchild_specuserestart: :temporaryto avoid restart storms and match user-managed lifecycle semantics. - Telemetry toggle: Keep
telemetry_enabled?driven byTinkex.Config.telemetry_enabled?withTINKER_TELEMETRYas the env fallback.
Docs, Guides, and Examples
- New guides:
environment_configuration.md, advanced configuration updates describing env precedence, Cloudflare Access, and heartbeat tuning. - Retry guide: Extended
retry_and_error_handling.mdwithSamplingClientretry_configdetails and connection limiting behavior. - Persistence guides: Added
training_persistence.mdandmodel_info_unload.mdfocused on checkpoints, optimizer state, and model metadata APIs. - Examples README: Updated
examples/README.mdandrun_all.shto include new examples:heartbeat_probe.exs,model_info_and_unload.exs,training_persistence_live.exs,save_weights_and_sample.exs.
Tests
- Added unit tests for
Env,Configprecedence, Cloudflare redaction,RetryConfig,RetryHandler,RetrySemaphore,ModelInfoandUnloadtypes,LoadWeightsRequestwire format, and updated rate limiter behavior. - Added integration tests covering:
- Multi-client concurrency with
retry_configenabled/disabled - Sampling workflows under rate limits and transient errors
- Training loop + checkpoint flows with save/load and
from_state SessionManagerheartbeat path and warning behavior
- Multi-client concurrency with
[0.1.8] - 2025-11-26
Added
- NotGiven + transform layer: Introduced omit/not-given sentinels and request transformation with aliasing/formatting to mirror Python serialization semantics.
- REST client parity: Added
RestClient.get_sampler/2,RestClient.get_weights_info_by_tinker_path/2, and tinker-path convenience aliases (training run, delete/archive/publish/unpublish) to match Python ergonomics. - Response wrappers & SSE: Added
Tinkex.API.Responsewith metadata (headers, status, URL, elapsed, retries) plus SSE decoding helpers and streaming response support for event-stream endpoints. - Typed responses: New structs for weight save/load responses, training runs, cursors, server capabilities, and health checks; REST training run endpoints now decode into typed structs.
- Service endpoints: Implemented
/api/v1/get_server_capabilitiesand/api/v1/healthzinTinkex.API.Service. - Sampling helper: Added
SamplingClient.compute_logprobs/3convenience for prompt token logprobs. - CLI management: Added checkpoint management subcommands (list/info/publish/unpublish/delete/download) and run management subcommands (list/info) with corresponding tests and docs.
- Live example: New
examples/live_capabilities_and_logprobs.exsshowing capabilities + health probes and prompt logprobs; included inexamples/run_all.sh. - Centralized env + Cloudflare headers:
Tinkex.EnvfeedsTinkex.Config/HTTP defaults (API key, base URL, tags, feature gates, telemetry, log level, dump headers, Cloudflare Access) with redaction helpers, matching Python env behavior and ADR-002.
Fixed
RestClienttraining run endpoints now return typed structs, and publish/unpublish checkpoint helpers are wired through REST with CLI wrappers.- Session heartbeat parity: Heartbeats now POST to
/api/v1/session_heartbeat(matching Python), continue retrying on all errors, and emit warnings after sustained failure windows instead of silently dropping sessions on 4xx. Added a guarded probe script to verify/api/v1/session_heartbeat= 200 and/api/v1/heartbeat= 404 against a real server.
[0.1.7] - 2025-11-26
Added
- Telemetry Reporter: New
Tinkex.Telemetry.Reporterbatches client-side telemetry (session start/end, HTTP, queue, custom events, exceptions) with configurable flush interval/threshold, HTTP timeout, and retry/backoff, plus wait-until-drained semantics and fatal-exception flushing.ServiceClientboots a reporter automatically and exposes it viatelemetry_reporter/1; telemetry can be disabled withTINKER_TELEMETRY=0. - Telemetry examples: Added
examples/telemetry_live.exsandexamples/telemetry_reporter_demo.exs, documented in READMEs and included inexamples/run_all.sh, showcasing reporter lifecycle, custom events, retries, drain/wait, and graceful shutdown. - Coverage: Added
test/tinkex/telemetry_reporter_test.exsfor reporter lifecycle, backoff, exception handling, and drain semantics; tests disable backend telemetry viaTINKER_TELEMETRY=0.
Changed
- Telemetry attribution: Sampling/training/client APIs now merge optional
:telemetry_metadata(including session, sampling session, and model sequence IDs) into HTTP telemetry so backend events are session-scoped; telemetry POSTs honor configurable timeouts. - Docs: README highlights the telemetry reporter, backend shipping flow, and metadata tagging; installation snippet bumped to
~> 0.1.7.
[0.1.6] - 2025-11-25
Added
- Metrics aggregation via
Tinkex.Metrics: counters for total/success/failure HTTP requests, latency histograms with p50/p95/p99,snapshot/0,reset/0, andflush/0. - Automatic telemetry wiring: the metrics server starts under
Tinkex.Applicationand subscribes to[:tinkex, :http, :request, :stop]by default. - Live metrics example:
examples/metrics_live.exsruns a sampling call against the live API and prints the metrics snapshot; added toexamples/run_all.shand documented inexamples/README.md.
Changed
- Docs: README now highlights metrics and shows a quick snapshot snippet; installation version bumped to
~> 0.1.6.
[0.1.5] - 2025-11-25
Added
- Structured Regularizer Composition: New
TrainingClient.forward_backward_custom/4API for computing custom loss functions with composable regularizers in Elixir/Nx. - RegularizerSpec type (
lib/tinkex/types/regularizer_spec.ex): Typed configuration struct for regularizers with validation, supporting:fn- Regularizer function of arity 2 returning{loss_tensor, metrics_map}weight- Non-negative float multiplier for loss contributionname- String identifier for telemetry and metricsasync- Boolean flag for Task-returning regularizers
- RegularizerOutput type (
lib/tinkex/types/regularizer_output.ex): Metrics struct for individual regularizer results including value, weight, contribution, and optional gradient norms. - CustomLossOutput type (
lib/tinkex/types/custom_loss_output.ex): Structured output from composed loss computation with base loss metrics, per-regularizer outputs, and totals. ImplementsJason.Encoderfor serialization. - Regularizer behaviour (
lib/tinkex/regularizer/regularizer.ex): Formal interface with@callback compute/3and optional@callback name/0for module-based regularizers. - GradientTracker (
lib/tinkex/regularizer/gradient_tracker.ex): Nx-based gradient norm computation usingNx.Defn.gradfor monitoring training dynamics:compute_grad_norm/2- L2 norm for arbitrary loss functionsgrad_norm_for_regularizer/3- Per-regularizer gradient normstotal_grad_norm/4- Combined gradient norm for full loss composition
- Executor (
lib/tinkex/regularizer/executor.ex): Parallel/sequential regularizer execution usingTask.async_stream/3:- Configurable parallelism with
max_concurrencyoption - Timeout handling with task cleanup
- Support for async (Task-returning) regularizers
- Telemetry emission for start/stop/exception events
- Configurable parallelism with
- Pipeline (
lib/tinkex/regularizer/pipeline.ex): Orchestration module coordinating base loss and regularizer execution:- Input validation and duplicate name detection
- Parallel execution by default
- Optional gradient norm tracking
- Comprehensive telemetry events
- Regularizer Telemetry (
lib/tinkex/regularizer/telemetry.ex): Convenience helpers for attaching telemetry handlers to regularizer events:[:tinkex, :custom_loss, :start | :stop | :exception][:tinkex, :regularizer, :compute, :start | :stop | :exception]
Changed
- TrainingClient: Added
forward_backward_custom/4public function and correspondinghandle_callclause for custom loss computation with regularizers.
[0.1.4] - 2025-11-25
- Added EXLA dependency (
{:exla, "~> 0.7"}) and configured Nx to useEXLA.Backendby default, enabling GPU/CPU-accelerated tensor operations for custom loss computation in Elixir. - Introduced
TrainingClient.forward/4for forward-only inference without backward pass, returning logprobs that can be converted to Nx tensors viaTensorData.to_nx/1. - Added
Training.forward_future/2API endpoint for server-side future-based forward pass requests. - Created
forward_inference.exsexample demonstrating the forward-only API, Nx tensor conversion, and EXLA-accelerated operations. - Foundation for structured regularizer pipelines where custom loss functions and gradients are computed in Elixir/Nx rather than on the server.
[0.1.3] - 2025-11-25
- Made
SessionManager.stop_session/2a synchronous GenServer call so heartbeat removal finishes before the client returns and refined heartbeat error handling to drop sessions silently on client-visible errors (e.g., 404) just like the Python SDK. - Added REST endpoints for fetching samplers, weights metadata, and training runs along with the new
GetSamplerResponseandWeightsInfoResponsestructs,ImageChunk.expected_tokens,LoadWeightsRequest.load_optimizer_state, and the:cispo/:droLossFnTypevariants; expanded tests, serialization rounds, and.gitignorecoverage to match the higher API parity. - Introduced the
weightsinspection.exsexample showing how to query checkpoint metadata, LoRA ranks, and sampler state via REST plus published detailed architectural documentation and a Structured Regularizers design doc that outlines custom loss/telemetry workflows requiring Nx/EXLA gradient computation support.
[0.1.2] - 2025-11-22
- Fixed future polling to handle direct
ForwardBackwardOutputpayloads (nostatus/typewrapper) by normalizing them into completed responses, unblockingTrainingClient.forward_backward/3result parsing.
[0.1.1] - 2025-11-21
- Added
Tinkex.RestClientfor synchronous session and checkpoint management (list/get/delete and archive URL retrieval) with typed response structs. - Added
Tinkex.CheckpointDownloadto fetch and extract checkpoint archives with overwrite and progress callback options. - Added async client factories (
ServiceClient.create_sampling_client_async/2,SamplingClient.create_async/2,TrainingClient.create_sampling_client_async/3) to parallelize client creation. - Expanded docs and examples to cover sessions, checkpoint management, downloads, and async workflows; included the changelog in HexDocs extras for publishing.
- Fixed REST URL construction so query parameters (e.g., checkpoint archive URLs) are sent correctly, resolving 404s when fetching checkpoint downloads.
- Avoided over-encoding
tinker://paths in checkpoint archive/delete endpoints so sampler checkpoints resolve correctly.
[0.1.0]
- Initial release.