θ thymos

zero-dependency C/CUDA machine learning library

Thymos is a machine learning library written in C with CUDA backends. No Python. No PyTorch. No dependencies. Designed to run LLM inference on memory-constrained hardware (Jetson Orin Nano, 8GB) where the Python ecosystem can't fit.

It includes classical ML models from published research, a GGUF-based inference engine with custom memory allocation, and optional cuBLAS integration. The architecture follows typed morphism patterns — every operation has explicit inputs, outputs, and context. No hidden state.

It also ships production implementations of SVR, least-squares SVR (LS-SVR), and the p-Laplacian semi-supervised regression algorithm from published biomedical research.

the library

ml.h — core runtime

Arena allocator, context management, tensor primitives.

ThymosArena *arena = thymos_arena_create(512 * MB);
ThymosCtx   *ctx   = thymos_ctx_init(arena);

ThymosTensor *a = thymos_tensor_alloc(ctx, THYMOS_F32, (int[]){4, 4}, 2);
ThymosTensor *b = thymos_tensor_alloc(ctx, THYMOS_F32, (int[]){4, 4}, 2);
ThymosTensor *c = thymos_matmul(ctx, a, b);

thymos_arena_destroy(arena);  // single free
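The arena is the core of the memory model: one upfront allocation, pointer-bump allocation inside it, one free at the end. A minimal sketch of the pattern (illustrative only, not Thymos's actual internals; all names here are hypothetical):

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t *base;   /* single malloc'd block */
    size_t   size;   /* total capacity in bytes */
    size_t   used;   /* bump offset */
} Arena;

static Arena arena_create(size_t size) {
    Arena a = { malloc(size), size, 0 };
    return a;
}

/* Bump-allocate n bytes at a 16-byte-aligned offset; NULL when full. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t off = (a->used + 15) & ~(size_t)15;
    if (a->base == NULL || off + n > a->size) return NULL;
    a->used = off + n;
    return a->base + off;
}

static void arena_destroy(Arena *a) {
    free(a->base);   /* releases every allocation at once */
    a->base = NULL;
}
```

Because every tensor lives inside the arena, teardown is one free no matter how many tensors were allocated, and there is no per-tensor bookkeeping to leak.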

la.h — linear algebra

BLAS-style operations, matrix decomposition.

thymos_gemm(ctx, CblasNoTrans, CblasNoTrans,
            m, n, k, 1.0f, A, B, 0.0f, C);
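The semantics follow the BLAS convention C = αAB + βC (no-transpose case shown). A naive row-major reference, for grounding, with no claim about Thymos's actual kernel:

```c
/* Reference GEMM: C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all row-major. */
static void gemm_ref(int m, int n, int k, float alpha,
                     const float *A, const float *B,
                     float beta, float *C) {
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float acc = 0.0f;
            for (int p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```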

opt.h — optimization

Primal-dual solvers, convergence criteria.

ThymosSolver solver = thymos_solver_init(ctx, THYMOS_PRIMAL_DUAL);
thymos_solver_set_tol(&solver, 1e-8);
thymos_solver_run(&solver, &problem);
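The tolerance controls the stopping rule: iterate until a residual (here, the gradient norm) falls below tol. A toy sketch of that criterion using plain gradient descent on f(x) = (x - 3)², purely to illustrate the stopping condition, not the primal-dual method itself:

```c
#include <math.h>

/* Toy tolerance-based stopping rule: gradient descent on
 * f(x) = (x - 3)^2, stopping once |f'(x)| < tol. Illustrates the
 * convergence criterion only; not Thymos's primal-dual solver. */
static double solve_toy(double x0, double tol, int max_iter) {
    double x = x0;
    for (int i = 0; i < max_iter; i++) {
        double grad = 2.0 * (x - 3.0);
        if (fabs(grad) < tol) break;   /* converged */
        x -= 0.1 * grad;               /* fixed step size */
    }
    return x;
}
```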

act.h — category theory foundations

Morphism types, composition, typed operation signatures.

ThymosMorphism f = thymos_morph(ctx, domain_A, codomain_B, transform_fn);
ThymosMorphism g = thymos_morph(ctx, domain_B, codomain_C, transform_fn2);
ThymosMorphism h = thymos_compose(ctx, g, f);  // h = g ∘ f
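The underlying idea: an operation is a (domain, codomain, function) triple, and composition g ∘ f is only defined when cod(f) = dom(g). A self-contained sketch of that pattern (types and names here are illustrative, not act.h's):

```c
#include <assert.h>

/* A morphism is a typed triple: domain, codomain, transform. */
typedef enum { TY_A, TY_B, TY_C } Ty;

typedef struct {
    Ty dom, cod;
    double (*fn)(double);
} Morph;

static double add_one(double x) { return x + 1.0; }
static double twice(double x)   { return 2.0 * x; }

static Morph f = { TY_A, TY_B, add_one };  /* f : A -> B */
static Morph g = { TY_B, TY_C, twice };    /* g : B -> C */

/* Apply (g . f)(x), checking that the types line up first. */
static double compose_apply(const Morph *g_, const Morph *f_, double x) {
    assert(f_->cod == g_->dom);   /* composition is type-checked */
    return g_->fn(f_->fn(x));
}
```

The explicit domain/codomain check is what "no hidden state" buys: an ill-typed composition fails loudly at the seam instead of corrupting data downstream.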

models/ — classical ML

SVR, LS-SVR, p-Laplacian semi-supervised regression.

ThymosModel *svr = thymos_svr_train(ctx, X_train, y_train, &params);
thymos_predict(ctx, svr, X_test, y_pred);
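SVR trains against the ε-insensitive loss: errors smaller than ε cost nothing, and beyond that the cost grows linearly. A sketch of the loss itself (not the trainer):

```c
#include <math.h>

/* Epsilon-insensitive loss: max(0, |y - y_hat| - eps).
 * SVR minimizes this plus a regularization term. */
static double eps_insensitive(double y, double y_hat, double eps) {
    double err = fabs(y - y_hat) - eps;
    return err > 0.0 ? err : 0.0;
}
```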

inference engine

GGUF model loading (Llama-family, Qwen). Custom arena-based memory management — no malloc/free churn. Quantized inference (Q4_0, Q4_1, Q8_0). KV-cache with explicit lifetime management. Transformer forward pass in ~1500 lines of C. Optional CUDA backend with custom kernels + cuBLAS.

┌──────────────┐    ┌───────────────┐    ┌────────────────┐
│  GGUF Parse  │───▶│  Arena Alloc  │───▶│  Forward Pass  │
│  (model.c)   │    │  (ml.h)       │    │  (infer.c)     │
└──────────────┘    └───────────────┘    └────────────────┘
        │                                        │
        ▼                                        ▼
┌──────────────┐                         ┌────────────────┐
│  Quantized   │                         │    KV Cache    │
│  Weights     │                         │  (arena-mgd)   │
└──────────────┘                         └────────────────┘

ThymosCtx *ctx = thymos_init(&(ThymosConfig){
    .arena_size = 512 * 1024 * 1024,  // 512MB arena
    .n_threads  = 4,
    .use_cuda   = true,
});

ThymosModel  *model = thymos_load_gguf(ctx, "qwen2.5-3b-q4_0.gguf");
ThymosResult  res   = thymos_generate(ctx, model, "explain arena allocation", 256);

printf("%s\n", res.text);
thymos_free(ctx);  // single free, entire arena
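For context on the quantized formats: Q4_0 stores weights in blocks of 32, each block a per-block scale plus 32 unsigned 4-bit quants offset by 8. A dequantization sketch of that layout (using a float scale for clarity; GGUF stores it as fp16, and this is not Thymos's kernel):

```c
#include <stdint.h>

#define QK 32   /* weights per Q4_0 block */

typedef struct {
    float   d;          /* per-block scale */
    uint8_t qs[QK / 2]; /* 32 x 4-bit quants, two per byte */
} BlockQ4;

/* Expand one block into 32 floats: low nibbles fill out[0..15],
 * high nibbles fill out[16..31]; value = (quant - 8) * scale. */
static void dequant_q4(const BlockQ4 *b, float *out) {
    for (int i = 0; i < QK / 2; i++) {
        int lo = (b->qs[i] & 0x0F) - 8;
        int hi = (b->qs[i] >> 4) - 8;
        out[i]          = lo * b->d;
        out[i + QK / 2] = hi * b->d;
    }
}
```

One scale per 32 weights is why Q4_0 lands near 4.5 bits/weight, which is what lets a 3B model fit alongside its KV cache in 8GB.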

build

git clone https://github.com/zach15james/thymos
cd thymos
make            # CPU-only build
make cuda=1     # with CUDA backends
make test       # run test suite

Target platforms: Linux x86_64, aarch64 (Jetson).

benchmarks

In progress. Target numbers on real hardware:

model                    device          tok/s    memory
─────────────────────────────────────────────────────────
qwen2.5-3b-q4_0         Jetson Orin 8GB   —       —
qwen2.5-3b-q4_0         RTX 3090          —       —
llama-3.2-1b-q8_0       CPU (i7-12700)    —       —