inference

8 guides tagged “inference”.

Getting Started with ONNX: Train and Deploy Custom Models

A practical, end-to-end guide to ONNX: what it is, how to export models from PyTorch and TensorFlow, run fast inference with ONNX Runtime, and ship to production.

May 28, 2026 · 5 min read

Production ML Workflows: How We Serve an ONNX Model with FastAPI

A real, honest production architecture: an ONNX image classifier served by FastAPI on Railway, loaded from object storage at startup, with one shared inference session on CPU — and what we'd improve.

May 25, 2026 · 6 min read

mlops production onnx fastapi inference

Server-Side vs On-Device ML Inference: How to Choose

The core trade-off in ML deployment: run inference on a central server or push it to the device. Real case: why TrichAi chose the server, what it cost, and when you'd choose differently.

May 24, 2026 · 5 min read

inference deployment mlops machine-learning

Quantizing ONNX Models: When It's Actually Worth It

A practical guide to INT8 and FP16 quantization for ONNX models — how much you save, what you risk, and a real decision: should an unquantized production model be quantized?

May 23, 2026 · 4 min read

onnx quantization inference optimization

The Silent Accuracy Killer: Preprocessing Mismatch

Your model scores 94% in the notebook and falls apart in production. The cause is usually not the model — it's a preprocessing mismatch between training and inference. Here's how to find and prevent it.

May 22, 2026 · 4 min read

inference debugging machine-learning production

Choosing an ONNX Runtime Execution Provider: CPU, CUDA, TensorRT, CoreML

ONNX Runtime can dispatch the same model to very different hardware backends. A practical guide to execution providers — what each is for, how the fallback chain works, and how to choose.

May 21, 2026 · 4 min read

onnx inference optimization deployment

Choosing a Model Format: ONNX vs TorchScript vs SavedModel

Once a model is trained, how you serialize it shapes everything downstream. A practical comparison of ONNX, TorchScript, and TensorFlow SavedModel — portability, performance, and lock-in.

May 19, 2026 · 4 min read

onnx deployment machine-learning inference

Batching Inference Requests: Throughput vs Latency

Processing requests one at a time wastes hardware; batching them trades a little latency for a lot of throughput. How dynamic batching works, when it helps, and when a single shared session is enough.

May 17, 2026 · 4 min read

inference optimization mlops production