LLM Chat
A private chat assistant running on local open-weight models via Ollama, reverse-proxied to a GPU server on my network. Prompts and files never leave the LAN — no third-party API.
Try the live chat demo →
Capable AI on your own hardware — keeping prompts, documents, and conversations inside your network, and replacing per-seat AI licensing with a single CAPEX server.
A private chat assistant running on local open-weight models via Ollama, reverse-proxied to a GPU server on my network. Prompts and files never leave the LAN — no third-party API.
Try the live chat demo →A browser workspace pairing a live SSH terminal with an AI coding assistant (powered by Anthropic’s Claude). Operate infrastructure from anywhere, with per-user access and every action logged.
Open Tower (beta) Learn moreWant a private build for your organization? Request a free audit →
I don’t just deploy models — I build the tooling around them. Tower is a browser-based agentic workspace I built and run: a live SSH terminal paired with an AI coding assistant (Claude Code / Codex), with voice input, file attachments, and per-user key-based access. It’s how I operate infrastructure day to day — and it’s the same kind of private, auditable AI tooling I build for clients.
Prompts and documents never leave your premises. Sensitive data — payroll, HR files, client engagements, source code — stays on hardware you own, on a network you control.
Commercial AI subscriptions are billed per user per month. At $200–$300+ per seat, a 30-person org spends $6k–$9k monthly. A capable on-prem GPU server is a one-time CAPEX cost that pays itself back in months.
Models grounded in your actual data — SharePoint libraries, file shares, ticket history, internal wikis. Retrieval-augmented assistants answer from real company context, not generic web text.
Logs, rate limits, and access controls aligned with the rest of your security stack. AI usage is observable, reviewable, and policy-controlled — not a side-channel that bypasses governance.
Open-weight models (Llama, Qwen, DeepSeek, etc.) running via Ollama, vLLM, or llama.cpp behind a hardened reverse proxy. Swap models, upgrade GPUs, or migrate hardware without rewriting your stack.
MCP servers, Claude Code-style coding agents, and custom orchestrators wired into ops, support, and back-office tasks. AI as a force multiplier across the work your team already does.
Concrete scenarios where keeping the model and data on-prem pays off.
Contract review, redaction, and clause search across privileged documents. Client matter material never touches a third-party API, which keeps the work inside the engagement letter.
Clinical-note summarisation, intake triage, and policy lookup over PHI without sending it to a SaaS endpoint — relevant under HIPAA, PIPEDA, and equivalent regimes.
M&A document review, internal advisor copilots, and analyst drafting tools grounded in deal data that legal will not let leave the firewall.
Code review, architecture Q&A, and runbook search grounded in private repos and internal docs. Coding agents (Claude Code-style) can run against on-prem checkouts of source.
Ticket triage, KB-grounded reply drafting, and escalation summaries. Support context (PII, account data, prior tickets) stays inside the helpdesk.
Runbook search, incident playbook lookup, log triage, and configuration Q&A across firewall, AD, VPN, and SaaS estates — the same surface ops already manages.
A practical example for a 30-user organization. Numbers are illustrative — exact rates depend on the SaaS plan and the workload.
Open-weight models that self-host cleanly. Models can be swapped or upgraded as a config change — no rebuild, no vendor migration.
Llama 3.1 8B for fast, lightweight workloads; Llama 3.3 70B for higher-quality reasoning and longer context. Strong general-purpose baseline for most business uses.
Qwen 2.5 / Qwen 3 across sizes from 7B to 72B. Coding-strong variants (Qwen Coder) make this a frequent default for engineering and ops chat assistants.
DeepSeek V3 and the R1 reasoning family. Distilled smaller variants (R1-Distill-Qwen, R1-Distill-Llama) bring step-by-step reasoning to consumer-class GPUs.
Efficient dense models (Mistral 7B, Small, Large) and Mixture-of- Experts variants (Mixtral 8x7B / 8x22B) that punch above their parameter weight on throughput.
Compact, well-licensed models (Gemma 2 / 3 in 2B / 9B / 27B). Good fit when latency and footprint matter more than absolute quality.
Small models tuned for reasoning quality per parameter. Useful for edge deployments and embedded assistants where a 70B server is overkill.
Embeddings and reranking models (BGE, E5, Jina, Voyage open variants) drop in alongside for retrieval-augmented assistants.
Three reference tiers. Each assumes a single host with reverse-proxy, auth, and basic monitoring already wired in. Numbers are illustrative — final spec depends on the model, context length, and concurrency target.
~$4k–$8k · small team or pilot
~$8k–$15k · small org production
$20k+ · org-wide or RAG-heavy
Deliverables of a typical engagement — sized to fit a small org build or scaled up for a multi-department deployment.
A mid-tier build (single 24–48 GB GPU) reliably serves 30–50 light users on 13B–32B models, or 10–15 heavier RAG users. Scaling past that is a config change (vLLM tensor-parallel) or a second host.
Open-weight models release on the order of weeks. Swapping is a config change, not a rebuild — the runbook covers it. Old models stay available so you can A/B before cutting over.
Yes. Auth ties into Entra ID / Google / OIDC, and the assistant surfaces wire into M365 (SharePoint, Teams), Slack, Jira, GitHub, and most line-of-business apps via API.
Yes. The server lives on your LAN. Internet is only needed to pull new model weights, and that can be done once or via an internal mirror. Day-to-day inference is fully offline.
Prompts and uploaded documents never leave the box. There is no telemetry to a vendor. Audit logs are local and inspectable. You control retention, redaction, and who can see what.
Optional retainers cover model upgrades, capacity planning, prompt and RAG tuning, and feature builds (new agents, new integrations). One-shot builds with a handover are also fine for in-house teams.
A live demo of the on-prem chat UI runs right here on this server — a private LLM answering in real time, with nothing leaving the network.
Want a private LLM build for your organization? Book a free audit → — we’ll scope it together.
Tell me what your team does and what data you'd want it grounded in. I'll reply the same day with whether on-prem AI is a fit — plus a rough shape and cost.
Prefer email? info@sd-techsolutions.com · Same-day reply.