stackschmiede.de
Sovereign AI · On-prem · RAG

AI is a tool. Not an OpenAI subscription.

Most "AI integrations" are thin wrappers around ChatGPT. It works — but it ships every prompt, every document, every customer conversation to a US provider. For legal, medical, public-sector or R&D contexts, that's not an option.

Why not just OpenAI or Gemini?

Because you hand yourself over to a provider who can unilaterally change prices, terms of use, API behavior and regions, without your input. Over the last two years, the major LLM vendors have repeatedly changed prices, deprecated models and tightened rate limits, and customers could respond only by adapting or paying.

On-prem LLMs on your server are insurance: predictable cost (hosting instead of token roulette), data stays in-house, features can't be cancelled overnight. That's not ideology — that's business continuity management.

02 / Sovereign AI

Your documents. Your model. Your server.

Mistral Small 3.1 and Qdrant on-prem, grounded on your contracts, tickets and wiki articles. No data to OpenAI, no per-query token bills — just your infrastructure.

Mistral Small 3.1 · Qdrant · LlamaIndex · FastAPI · Docker
01 Ingest: PDF · MD · SharePoint
02 Chunk: Semantic · 512 tok
03 Embed: BGE-M3 · 1024 dim
04 Retrieve: Hybrid · BM25 + Vec
05 Generate: Mistral · on-prem
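
The Retrieve step lists hybrid BM25 plus vector search but does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF). The sketch below assumes that technique; it is not confirmed as the method used here, and the document IDs and the helper name `reciprocal_rank_fusion` are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so the result is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for a single query.
bm25_hits = ["contract_12", "ticket_7", "wiki_3"]
vector_hits = ["wiki_3", "contract_12", "ticket_99"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a hit found by only one retriever still stays in the candidate set passed to the Generate step.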
< 1.4s p50 response time
0 data to US cloud
100% GDPR compliant

Common questions

Does "sovereign AI" really mean no data goes to OpenAI or Gemini?

Yes — the default setup runs the entire LLM on your server (or on one of my GPU servers in Germany). There is no fallback to external APIs unless you explicitly configure that for low-sensitivity use cases.
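
A "no external fallback unless explicitly configured" rule can be enforced in code rather than by convention. The following is a minimal sketch under stated assumptions: the sensitivity levels, the `RoutingPolicy` fields and the `choose_backend` function are hypothetical names, not part of any actual setup described here.

```python
from dataclasses import dataclass

# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = ["public", "internal", "confidential"]


@dataclass(frozen=True)
class RoutingPolicy:
    # External APIs stay off unless someone deliberately enables them.
    allow_external: bool = False
    # Highest sensitivity level that may ever leave the server.
    external_max_sensitivity: str = "public"


def choose_backend(sensitivity, policy=RoutingPolicy()):
    """Return "external" only if the policy explicitly permits it, else "on_prem"."""
    if sensitivity not in LEVELS:
        raise ValueError(f"unknown sensitivity level: {sensitivity!r}")
    within_limit = LEVELS.index(sensitivity) <= LEVELS.index(policy.external_max_sensitivity)
    if policy.allow_external and within_limit:
        return "external"
    return "on_prem"


print(choose_backend("public"))
print(choose_backend("public", RoutingPolicy(allow_external=True)))
print(choose_backend("confidential", RoutingPolicy(allow_external=True)))
```

The design choice is that the safe path is the default: an empty or missing configuration can only ever route to the on-prem model.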

Does Mistral Small 3.1 reach GPT-4 quality?

For structured domain tasks (document extraction, summarization, RAG answers): yes, and with fine-tuning sometimes better. For long-form creative writing it is slightly behind. I evaluate this in the context of each project. For code-specific workflows I use Codestral, for voice-to-text Voxtral.

Do I need my own hardware?

No. Dedicated GPU servers in Germany from ~€200/month are the standard path. If you prefer running it in-house: my AI-workshop packages deliver ready-made on-prem systems starting at €3,499. Your own hardware only makes sense for very high load or specific compliance requirements.

What are operating costs after launch?

GPU hosting costs €150 to €500 per month depending on model size and load, plus monitoring and updates. That is typically 20 to 40% cheaper than an equivalent OpenAI bill, and predictable.
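
Whether flat hosting beats per-token billing depends on volume, so it helps to compute the break-even point for your own numbers. The sketch below is illustrative only: the token price, tokens per query and query volume are placeholder assumptions, not quotes from any provider, and the €300 hosting figure is simply a value inside the range given above.

```python
def monthly_api_cost(queries_per_month, tokens_per_query, price_per_million_tokens):
    """Cost of a pay-per-token API for one month, in the same currency as the price."""
    total_tokens = queries_per_month * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


def break_even_queries(hosting_per_month, tokens_per_query, price_per_million_tokens):
    """Monthly query volume at which flat hosting and per-token billing cost the same."""
    return hosting_per_month * 1_000_000 / (tokens_per_query * price_per_million_tokens)


# Placeholder assumptions; replace with your own measurements and price list.
HOSTING_EUR = 300.0        # flat GPU hosting per month
TOKENS_PER_QUERY = 4_000   # prompt + retrieved context + answer
PRICE_PER_MILLION = 5.0    # blended API price per million tokens

print(break_even_queries(HOSTING_EUR, TOKENS_PER_QUERY, PRICE_PER_MILLION))
print(monthly_api_cost(20_000, TOKENS_PER_QUERY, PRICE_PER_MILLION))
```

Below the break-even volume a token-billed API is cheaper on paper; above it, the flat hosting fee wins and the bill no longer grows with usage.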

How does it integrate with my existing stack?

Via REST, GraphQL or WebSocket. Standard patterns: chat widget, document upload, batch processing, webhooks. Also as an MCP server (Model Context Protocol).
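
From the client side, a REST integration can be as small as one authenticated POST. The sketch below uses only the Python standard library; the host `rag.example.internal`, the `/ask` path, the JSON field `question` and the bearer-token scheme are all assumptions for illustration, not a documented API.

```python
import json
import urllib.request


def build_ask_request(base_url, question, api_key):
    """Build (but do not send) a POST request to a hypothetical /ask endpoint."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/ask",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_ask_request(
    "https://rag.example.internal",
    "What is the notice period in contract 12?",
    "dummy-key",
)
print(req.get_method(), req.full_url)

# To actually send it against a real deployment:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     answer = json.load(resp)
```

Keeping request construction separate from sending makes the call easy to unit-test and to reuse from a chat widget, a batch job or a webhook handler.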

What about the EU AI Act?

On-prem LLMs are easier to document with respect to transparency. For high-risk applications I refer you to lawyers who specialize in AI; legal assessments aren’t my trade.

06 / Contact

Let’s talk.

Three channels, one contact. Reply within 24 hours on business days.

  • Phone (on request via email)
    Number shared after a short email pre-clarification.
  • Form
    On this page, with room for project context.
Response: < 24h on business days
Data transfer: encrypted (TLS 1.3)
Spam protection: Cloudflare Turnstile (no reCAPTCHA)