I build the machinery that makes data and intelligence move — from billion-row pipelines to the inside of an LLM's KV cache.
I'm a generalist by instinct. Over the last eight years I've moved up and down the stack — wrangling billions of records a day through data lakehouses, shipping ML systems to production, building the platforms that hold them up, and lately optimizing the guts of LLM inference. I'm drawn to the hard, ambiguous problems that live between disciplines — the ones where nobody's quite sure whose job it is. Right now that means inference infrastructure at IO.net, and a growing fascination with web3, DePIN, and what happens when compute itself becomes a marketplace.
Squeezing latency and cost out of LLM serving — KV cache optimization, distributed cache offloading, disaggregated prefill / decode.
Lakehouses and streaming frameworks moving 1–5B records/day at sub-second query latency, benchmarked to 100B.
End-to-end ML systems following MLOps best practices — research-to-production pipelines, retraining, model promotion.
RAG pipelines, MCP servers, and production AI agents with end-to-end observability, tool use, and fault tolerance.
Managed compute platforms on EMR, EKS and bare metal — Ray clusters, container-as-a-service, orchestration with Temporal.
Block-reward systems for DePIN GPU suppliers — designing and A/B testing distribution formulas behind a token launch on Solana.
Contributions and published packages across distributed computing and LLM tooling.
Three hands-on studies of Preferred Networks' PLaMo model family — speculative decoding, LoRA fine-tuning for function calling, and prefill/decode batch-scaling — all measured on a single laptop.
A measured negative result: draft-model and n-gram speculation for the Mamba-hybrid plamo-2-8b on Apple Silicon — plus a novel quantization bug found along the way.
Fine-Tuning · LoRA: The first open-weights PLaMo with function calling (a LoRA on PFN's hidden control tokens), and two evaluation traps that took argument accuracy from 55% to 100%.
Serving · Batching: Where the time goes on a laptop. Prefill is compute-bound, decode is bandwidth-bound (19× slower/token), and what continuous batching buys each architecture.
What, why & how MLVajra helps take machine-learning models to deployment.
Data Eng: A brief tour of how Apache Airflow really works beneath the DAGs.
Machine Learning: How learning algorithms differ from classic programmatic ones (TSP, knapsack, Dijkstra and friends).
Machine Learning: Why a forest beats a single tree, and the intuition behind the ensemble.
Machine Learning: How gradient descent quietly teaches a decision tree to do better.
Machine Learning: A field guide to gradient-boosting variants and where each one earns its keep.
Machine Learning: Notes and documentation from building a recommendation engine.