
Specialized Models

Guides for deploying task-specific AI models on Spheron GPU instances. These models are purpose-built for document processing, audio generation, and visual code intelligence rather than general-purpose chat.

Available guides

Chandra OCR

State-of-the-art document processing model with 83.1% accuracy on the olmOCR benchmark, outperforming GPT-4o, Gemini Flash 2, and Mistral OCR. Converts images and PDFs to structured Markdown, HTML, or JSON while preserving layout, tables, and formulas.

Best for: Document digitization, text extraction from PDFs and scanned documents, OCR pipelines with 40+ language support.

Deployment: scales from CPU (dev/test) to distributed H100 (enterprise). Supports local inference via HuggingFace Transformers and high-throughput serving via a vLLM server.
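When serving through vLLM, requests follow the OpenAI-compatible chat format with the page image embedded as a base64 data URL. The sketch below only builds that request payload; the model id (`datalab-to/chandra`), endpoint, and prompt wording are assumptions, not taken from this page.

```python
# Minimal sketch: building an OpenAI-style vision request for a vLLM server
# running Chandra OCR. Model id and prompt text are assumptions.
import base64
import json


def build_ocr_request(image_bytes: bytes, model: str = "datalab-to/chandra") -> dict:
    """Build a chat-completions payload asking the model to emit Markdown."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
                {"type": "text",
                 "text": "Convert this page to structured Markdown."},
            ],
        }],
    }


# POST this dict as JSON to the server's /v1/chat/completions endpoint.
payload = build_ocr_request(b"\x89PNG...")  # replace with real image bytes
print(json.dumps(payload)[:40])
```

The same payload shape works for PDFs rendered page-by-page to images; only the data URL contents change.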

SoulX Podcast-1.7B

Multi-speaker podcast generation model (1.7B parameters). Generates 60+ minute dialogues with natural speaker switching, zero-shot voice cloning from 10–30 second samples, and paralinguistic expressions (laughter, sighs, intonation).

Best for: Audio content generation, podcast production, multi-speaker voice synthesis, and zero-shot voice cloning.

Deployment: RTX 4060 (testing) to H100 (production). Runs via a Gradio web interface.
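Zero-shot cloning quality depends on the reference clip landing in the 10–30 second window mentioned above, so it is worth validating audio before uploading it to the interface. This helper is hypothetical (not part of the SoulX codebase); the 16 kHz default sample rate is an assumption.

```python
# Hypothetical pre-flight check for zero-shot voice cloning: SoulX expects a
# 10-30 second reference sample, so reject clips outside that window early.

MIN_SECONDS, MAX_SECONDS = 10.0, 30.0


def clip_is_usable(num_samples: int, sample_rate: int = 16_000) -> bool:
    """Return True if the clip's duration falls within the 10-30 s window."""
    duration = num_samples / sample_rate
    return MIN_SECONDS <= duration <= MAX_SECONDS


print(clip_is_usable(20 * 16_000))  # 20 s clip -> True
print(clip_is_usable(5 * 16_000))   # 5 s clip -> too short, False
```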

Janus CoderV-8B

8B multimodal code intelligence model trained on JANUSCODE-800K, the largest multimodal code dataset. Generates HTML, CSS, and React components from screenshots, mockups, charts, and animations. Supports 32K token context.

Best for: Visual-to-code translation, UI mockup generation, layout bug fixing from screenshots, and chart-to-code conversion.

Deployment: scales from RTX 4090 (standard) to H100 (high throughput). Supports 8-bit quantization to reduce VRAM usage.

Hardware overview

| Model | Min VRAM | Recommended GPU | Notes |
|---|---|---|---|
| Chandra OCR (4-bit) | 6GB | RTX 3060 / 4060 Ti | Quantized |
| Chandra OCR (BF16) | 16GB | RTX 4090 / L40S | Full precision |
| SoulX Podcast-1.7B | 4GB | RTX 4060+ | RTX 4090 recommended |
| Janus CoderV-8B | 16GB | RTX 4090 (24GB) | 8-bit: ~12GB |

What's next