zero-dependency C/CUDA machine learning library
Thymos is a machine learning library written in C with CUDA backends. No Python. No PyTorch. No dependencies. Designed to run LLM inference on memory-constrained hardware (Jetson Orin Nano, 8GB) where the Python ecosystem can't fit.
It includes classical ML models from published research, a GGUF-based inference engine with custom memory allocation, and optional cuBLAS integration. The architecture follows typed morphism patterns — every operation has explicit inputs, outputs, and context. No hidden state.
Also includes production implementations of SVR, Least-Squares SVR, and the p-Laplacian semi-supervised regression algorithm from published biomedical research.
Arena allocator, context management, tensor primitives.
ThymosArena *arena = thymos_arena_create(512 * MB);
ThymosCtx *ctx = thymos_ctx_init(arena);
ThymosTensor *a = thymos_tensor_alloc(ctx, THYMOS_F32, (int[]){4, 4}, 2);
ThymosTensor *b = thymos_tensor_alloc(ctx, THYMOS_F32, (int[]){4, 4}, 2);
ThymosTensor *c = thymos_matmul(ctx, a, b);
thymos_arena_destroy(arena); // single free
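The pattern behind the single free above can be sketched as a bump allocator: one allocation up front, pointer-bump allocation inside it, one free at the end. This is an illustrative sketch, not Thymos's internals; all names here are hypothetical.

```c
#include <stdlib.h>
#include <stddef.h>
#include <stdint.h>

// Minimal bump arena: one malloc up front, pointer-bump allocation,
// one free at the end.
typedef struct {
    uint8_t *base;
    size_t   cap;
    size_t   used;
} Arena;

static Arena *arena_create(size_t cap) {
    Arena *a = malloc(sizeof(Arena));
    if (!a) return NULL;
    a->base = malloc(cap);
    if (!a->base) { free(a); return NULL; }
    a->cap  = cap;
    a->used = 0;
    return a;
}

static void *arena_alloc(Arena *a, size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15;   // 16-byte alignment
    if (a->used + aligned > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

static void arena_destroy(Arena *a) {          // the "single free"
    free(a->base);
    free(a);
}
```

Because every tensor lives inside the arena, teardown is O(1) and there is no per-object free bookkeeping.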
BLAS-style operations, matrix decomposition.
// C = alpha * A * B + beta * C
thymos_gemm(ctx, CblasNoTrans, CblasNoTrans,
            m, n, k, 1.0f, A, B, 0.0f, C);
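For reference, the semantics of a BLAS-style gemm call in the no-transpose, row-major case can be sketched as a naive triple loop. This is a sketch of what the call computes, not the library's kernel, which would dispatch to tuned code or cuBLAS.

```c
#include <stddef.h>

// Reference sgemm, no-transpose, row-major:
// C = alpha * A * B + beta * C, where A is m x k and B is k x n.
static void sgemm_ref(int m, int n, int k, float alpha,
                      const float *A, const float *B,
                      float beta, float *C) {
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```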
Primal-dual solvers, convergence criteria.
ThymosSolver solver = thymos_solver_init(ctx, THYMOS_PRIMAL_DUAL);
thymos_solver_set_tol(&solver, 1e-8);
thymos_solver_run(&solver, &problem);
Morphism types, composition, typed operation signatures.
ThymosMorphism f = thymos_morph(ctx, domain_A, codomain_B, transform_fn);
ThymosMorphism g = thymos_morph(ctx, domain_B, codomain_C, transform_fn2);
ThymosMorphism h = thymos_compose(ctx, g, f); // h = g ∘ f
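The idea of typed composition can be sketched in plain C: each morphism carries explicit domain and codomain tags, and composition is only valid when the types line up. The names and the `TypeId` tag are illustrative, not the Thymos API.

```c
// A morphism with explicit domain/codomain tags and a transform.
typedef int TypeId;   // stand-in for a shape/dtype tag

typedef struct {
    TypeId domain, codomain;
    float (*fn)(float);
} Morphism;

static float dbl(float x) { return 2.0f * x; }
static float inc(float x) { return x + 1.0f; }

// Apply g after f (g ∘ f); refuses to run if f's codomain does not
// match g's domain, returning 0 and leaving *out untouched.
static int apply_composed(Morphism g, Morphism f, float x, float *out) {
    if (f.codomain != g.domain) return 0;  // type mismatch
    *out = g.fn(f.fn(x));
    return 1;
}
```

Making the type check part of composition is what "no hidden state" buys: an ill-typed pipeline fails at composition time, not deep inside a forward pass.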
SVR, LS-SVR, p-Laplacian semi-supervised regression.
ThymosModel *svr = thymos_svr_train(ctx, X_train, y_train, &params);
thymos_predict(ctx, svr, X_test, y_pred);
GGUF model loading (Llama-family, Qwen). Custom arena-based memory management — no malloc/free churn. Quantized inference (Q4_0, Q4_1, Q8_0). KV-cache with explicit lifetime management. Transformer forward pass in ~1500 lines of C. Optional CUDA backend with custom kernels + cuBLAS.
┌─────────────┐ ┌──────────────┐ ┌──────────────┐
│ GGUF Parse │───▶│ Arena Alloc │───▶│ Forward Pass │
│ (model.c) │ │ (ml.h) │ │ (infer.c) │
└─────────────┘ └──────────────┘ └──────────────┘
│ │
▼ ▼
┌─────────────┐ ┌──────────────┐
│ Quantized │ │ KV Cache │
│ Weights │ │ (arena-mgd) │
└─────────────┘ └──────────────┘
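Of the quantization formats listed above, Q4_0 is the simplest: weights are stored in blocks of 32 values, each block holding one scale plus 32 4-bit quants packed two per byte, with value = (quant − 8) × scale. The sketch below follows the standard GGUF/llama.cpp layout (low nibbles are elements 0–15, high nibbles 16–31); in the file the scale is fp16, and a float is used here only to keep the sketch short.

```c
#include <stdint.h>

#define Q4_0_BLOCK 32

// One Q4_0 block: a per-block scale and 32 packed 4-bit quants.
typedef struct {
    float   d;                   // scale (fp16 on disk)
    uint8_t qs[Q4_0_BLOCK / 2];  // two nibbles per byte
} BlockQ4_0;

// Dequantize one block into 32 floats: value = (quant - 8) * scale.
static void dequant_q4_0(const BlockQ4_0 *blk, float *out) {
    for (int i = 0; i < Q4_0_BLOCK / 2; i++) {
        int lo = (blk->qs[i] & 0x0F) - 8;  // low nibble  -> out[i]
        int hi = (blk->qs[i] >> 4)   - 8;  // high nibble -> out[i + 16]
        out[i]                  = lo * blk->d;
        out[i + Q4_0_BLOCK / 2] = hi * blk->d;
    }
}
```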
ThymosCtx *ctx = thymos_init(&(ThymosConfig){
.arena_size = 512 * 1024 * 1024, // 512MB arena
.n_threads = 4,
.use_cuda = true,
});
ThymosModel *model = thymos_load_gguf(ctx, "qwen2.5-3b-q4_0.gguf");
ThymosResult res = thymos_generate(ctx, model, "explain arena allocation", 256);
printf("%s\n", res.text);
thymos_free(ctx); // single free, entire arena
git clone https://github.com/zach15james/thymos
cd thymos
make            # CPU-only build
make cuda=1     # with CUDA backends
make test       # run test suite
Target platforms: Linux x86_64, aarch64 (Jetson).
In progress. Target numbers on real hardware:
model               device              tok/s   memory
─────────────────────────────────────────────────────────
qwen2.5-3b-q4_0     Jetson Orin 8GB     —       —
qwen2.5-3b-q4_0     RTX 3090            —       —
llama-3.2-1b-q8_0   CPU (i7-12700)      —       —