Specialized Models
Guides for deploying task-specific AI models on Spheron GPU instances. These models are purpose-built for document processing, audio generation, and visual code intelligence rather than general-purpose chat.
Available guides
Chandra OCR
State-of-the-art document processing model with 83.1% accuracy on the olmOCR benchmark, outperforming GPT-4o, Gemini Flash 2, and Mistral OCR. Converts images and PDFs to structured Markdown, HTML, or JSON while preserving layout, tables, and formulas.
Best for: Document digitization, text extraction from PDFs and scanned documents, OCR pipelines with 40+ language support.
Deployment: CPU (dev/test) through distributed H100 (enterprise). Supports local inference via HuggingFace transformers and high-throughput via vLLM server.
SoulX Podcast-1.7B
Multi-speaker podcast generation model (1.7B parameters). Generates 60+ minute dialogues with natural speaker switching, zero-shot voice cloning from 10–30 second samples, and paralinguistic expressions (laughter, sighs, intonation).
Best for: Audio content generation, podcast production, multi-speaker voice synthesis, and zero-shot voice cloning.
Deployment: RTX 4060 (testing) to H100 (production). Runs via a Gradio web interface.
Janus CoderV-8B
8B multimodal code intelligence model trained on JANUSCODE-800K, the largest multimodal code dataset. Generates HTML, CSS, and React components from screenshots, mockups, charts, and animations. Supports 32K token context.
Best for: Visual-to-code translation, UI mockup generation, layout bug fixing from screenshots, and chart-to-code conversion.
Deployment: RTX 4090 (standard) through H100 (high throughput). Supports 8-bit quantization for reduced VRAM.
Hardware overview
| Model | Min VRAM | Recommended GPU | Notes |
|---|---|---|---|
| Chandra OCR (4-bit) | 6GB | RTX 3060/4060 Ti | Quantized |
| Chandra OCR (BF16) | 16GB | RTX 4090 / L40S | Full precision |
| SoulX Podcast-1.7B | 4GB | RTX 4060+ | RTX 4090 recommended |
| Janus CoderV-8B | 16GB | RTX 4090 (24GB) | 8-bit: ~12GB |
What's next
- Text Models: General-purpose LLM guides
- Multimodal Models: Vision-language model guides
- Instance Types: GPU selection
- Getting Started: Spheron account and instance setup