Introduction
The Clarifai platform has evolved considerably. Earlier generations of the platform relied on many small, task-specific models for visual classification, detection, OCR, text classification and segmentation. These legacy models were built on older architectures that were sensitive to domain shift, required separate training pipelines and didn't generalize well outside their original conditions.
The ecosystem has moved on. Modern large language models and vision-language models are trained on broader multimodal data, cover multiple tasks within a single model family and deliver more stable performance across different input types. As part of the platform upgrade, we're standardizing around these newer model types.
With this update, several legacy task-specific models are being deprecated and will no longer be available. Their functionality is still fully supported on the platform, but is now provided by more capable and general model families. Compute Orchestration manages scheduling, scaling and resource allocation for these models so that workloads behave consistently across open-source and custom model deployments.
This blog outlines the core task categories supported today, the recommended models for each and how to use them across the platform. It also clarifies which older models are being retired and how their capabilities map to the current model families.
Recommended Models for Core Vision and NLP Tasks
Visual Classification and Recognition
Visual classification and recognition involve identifying objects, scenes and concepts in an image. These tasks power product tagging, content moderation, semantic search, retrieval indexing and general scene understanding.
Modern vision-language models handle these tasks well in zero-shot mode. Instead of training separate classifiers, you define the taxonomy in the prompt and the model returns labels directly, which reduces the need for task-specific training and simplifies updates.
Models on the platform suited to visual classification, recognition and moderation
The models below offer strong visual understanding and perform well for classification, recognition, concept extraction and image moderation workflows, including sensitive safety taxonomy setups.
MiniCPM-o 2.6
A compact VLM that handles images, video and text. Performs well for versatile classification workloads where speed, cost efficiency and coverage must be balanced.
Qwen2.5-VL-7B-Instruct
Optimized for visual recognition, localized reasoning and structured visual understanding. Strong at identifying concepts in images with multiple objects and extracting structured information.
Moderation with MM-Poly-8B
A large portion of real-world visual classification work involves moderation. Many customer workloads are built around determining whether an image is safe, sensitive or banned according to a specific policy. Unlike general classification, moderation requires a strict taxonomy, conservative thresholds and consistent rule-following. This is where MM-Poly-8B is particularly effective.
MM-Poly-8B is Clarifai's multimodal model designed for detailed, prompt-driven analysis across images, text, audio and video. It performs well when the classification logic needs to be explicit and tightly controlled. Moderation teams often rely on layered instructions, examples and edge-case handling. MM-Poly-8B supports this pattern directly and behaves predictably when given structured policies or rule sets.
Key capabilities:
- Accepts image, text, audio and video inputs
- Handles detailed taxonomies and multi-level decision logic
- Supports example-driven prompting
- Produces consistent classifications for safety-critical use cases
- Works well when the moderation policy requires conservative interpretation and a bias toward safety
Because MM-Poly-8B is tuned to follow instructions faithfully, it is suited to moderation scenarios where false negatives carry higher risk and models must err on the side of caution. It can be prompted to classify content using your policy, identify violations, return structured reasoning or generate confidence-based outputs.
If you want to demonstrate a moderation workflow, you can prompt the model with a clear taxonomy and ruleset. For example:
"Evaluate this image according to the categories Safe, Suggestive, Explicit, Drug and Gore. Apply a strict safety policy and classify the image into the most appropriate category."
For more advanced use cases, you can provide the model with a detailed set of moderation rules, decision criteria and examples that define how each category should be applied. This lets you verify how the model behaves under stricter, policy-driven conditions and how it can be integrated into production-grade moderation pipelines.
MM-Poly-8B is available on the platform and can be used through the Playground or accessed programmatically via the OpenAI-compatible API.
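As a rough sketch, a policy-driven moderation call through the OpenAI-compatible API might look like the following. The base URL, the `mm-poly-8b` model identifier and the `CLARIFAI_PAT` environment variable are assumptions for illustration; confirm the exact values in the platform documentation.

```python
# Sketch of a policy-driven moderation request. The model identifier,
# base URL and env var below are placeholders, not confirmed values.
import base64

POLICY_PROMPT = (
    "Evaluate this image according to the categories Safe, Suggestive, "
    "Explicit, Drug and Gore. Apply a strict safety policy and classify "
    "the image into the most appropriate category. Return only the label."
)

def build_moderation_request(image_bytes: bytes, model: str = "mm-poly-8b") -> dict:
    """Assemble an OpenAI-style chat payload for image moderation."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,        # hypothetical identifier
        "temperature": 0,      # deterministic labels for safety-critical use
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": POLICY_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }

# Sending it would look roughly like this (requires the `openai` package):
# client = OpenAI(base_url="https://api.clarifai.com/v2/ext/openai/v1",
#                 api_key=os.environ["CLARIFAI_PAT"])
# resp = client.chat.completions.create(**build_moderation_request(img))
```

Pinning the temperature to zero keeps labels repeatable, which matters when the same image must always map to the same policy outcome.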
Note: If you want to access the above models like MiniCPM-o 2.6 and Qwen2.5-VL-7B-Instruct directly, you can deploy them to your own dedicated compute using the Platform and access them via API just like any other model.
How to access these models
All models described above can be accessed through Clarifai's OpenAI-compatible API. Send an image and a prompt in a single request and receive either plain text or structured JSON, which is useful when you need consistent labels or want to feed the results into downstream pipelines.
For details on structured JSON output, check the documentation here.
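For instance, a minimal zero-shot setup might put the taxonomy in the prompt and validate the model's JSON reply before it enters a pipeline. The taxonomy and response shape below are illustrative, not a platform schema:

```python
# Zero-shot classification with a prompt-defined taxonomy, plus a
# validation step so hallucinated labels never reach downstream code.
import json

TAXONOMY = ["product", "person", "vehicle", "animal", "scenery"]

def build_classification_prompt(labels) -> str:
    """Embed the allowed label set directly in the prompt."""
    return (
        "Classify the image using only these labels: "
        + ", ".join(labels)
        + '. Respond as JSON: {"labels": [...], "confidence": <0-1>}.'
    )

def parse_labels(raw: str, allowed) -> list:
    """Keep only labels that belong to the declared taxonomy."""
    data = json.loads(raw)
    return [label for label in data.get("labels", []) if label in allowed]

# Validating a mock model reply:
reply = '{"labels": ["vehicle", "spaceship"], "confidence": 0.92}'
print(parse_labels(reply, TAXONOMY))  # -> ['vehicle']
```

Because the taxonomy lives in the prompt, adding or renaming a category is a one-line change rather than a retraining job.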
Training your own classifier (fine-tuning)
If your application requires domain-specific labels, industry-specific concepts or a dataset that differs from general web imagery, you can train a custom classifier using Clarifai's visual classification templates. These templates provide configurable training pipelines with adjustable hyperparameters, allowing you to build models tailored to your use case.
Available templates include:
- MMClassification ResNet 50 RSB A1
- Clarifai InceptionBatchNorm
- Clarifai InceptionV2
- Clarifai ResNeXt
- Clarifai InceptionTransferEmbedNorm
You can upload your dataset, configure hyperparameters and train your own classifier through the UI or API. Check out the Fine-tuning Guide on the platform.
Document Intelligence and OCR
Document intelligence covers OCR, layout understanding and structured field extraction across scanned pages, forms and text-heavy images. The legacy OCR pipeline on the platform relied on language-specific PaddleOCR variants. These models were narrow in scope, sensitive to formatting issues and required separate maintenance for each language. They are now being decommissioned.
Models being decommissioned
These models were single-language engines with limited robustness. Modern OCR and multimodal systems support multilingual extraction by default and handle noisy scans, mixed formats and documents that combine text and visual elements without requiring separate pipelines.
Open-source OCR model on the platform
DeepSeek OCR
DeepSeek OCR is the primary open-source option. It supports multilingual documents, processes noisy scans reasonably well and can handle structured and unstructured documents. However, it is not perfect. Benchmarks show inconsistent accuracy on messy handwriting, irregular layouts and low-resolution scans. It also has input size constraints that can limit performance on large documents or multi-page flows. While it is stronger than the earlier language-specific engines, it is not the best choice for high-stakes extraction on complex documents.
Third-party multimodal models for OCR-style tasks
The platform also supports several multimodal models that combine OCR with visual reasoning. These models can extract text, interpret tables, identify key fields and summarize content even when structure is complex. They are more capable than DeepSeek OCR, especially for long documents or workflows requiring reasoning.
Gemini 2.5 Pro
Handles text-heavy documents, receipts, forms and complex layouts with strong multimodal reasoning.
Claude Opus 4.5
Performs well on dense, complex documents, including table interpretation and structured extraction.
Claude Sonnet 4.5
A faster option that still produces reliable field extraction and summarization for scanned pages.
GPT-5.1
Reads documents, extracts fields, interprets tables and summarizes multi-section pages with strong semantic accuracy.
Gemini 2.5 Flash
Lightweight and optimized for speed. Suitable for common forms, receipts and simple document extraction.
These models perform well across languages, handle complex layouts and understand document context. The tradeoffs matter: they are closed-source, require third-party inference and are more expensive to operate at scale compared to an open-source OCR engine. They are ideal for high-accuracy extraction and reasoning, but not always cost-efficient for large batch OCR workloads.
How to access these models
Using the Playground
Upload your document image or scanned page in the Playground and run it with DeepSeek OCR or any of the multimodal models listed above. These models return Markdown-formatted text, which preserves structure such as headings, paragraphs, lists or table-like formatting. This makes it easier to render the extracted content directly or process it in downstream document workflows.

Using the API (OpenAI-compatible)
All these models are also accessible through Clarifai's OpenAI-compatible API. Send the image and prompt in a single request, and the model returns the extracted content in Markdown. This makes it easy to use directly in downstream pipelines. Check out the detailed guide on accessing DeepSeek OCR via the API.
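Since the reply comes back as Markdown, downstream code can work with its structure directly. As one small sketch, extracted pages can be split by top-level headings so each section is routed independently; the sample page here is illustrative, not real model output:

```python
# Group an OCR model's Markdown reply by its top-level headings so
# each section can feed a different downstream step.

def split_markdown_sections(markdown: str) -> dict:
    """Map each '# ' heading to the text beneath it."""
    sections, current = {}, "_preamble"
    for line in markdown.splitlines():
        if line.startswith("# "):
            current = line[2:].strip()
            sections[current] = []
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(body).strip() for name, body in sections.items()}

page = "# Invoice\nTotal: $42\n\n# Terms\nNet 30"
print(split_markdown_sections(page))  # -> {'Invoice': 'Total: $42', 'Terms': 'Net 30'}
```

Real documents will need more robust parsing (nested headings, tables), but the Markdown output makes that a text-processing problem rather than a layout-analysis problem.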
Text Classification and NLP
Text classification is used in moderation, topic labeling, intent detection, routing, and broader text understanding. These tasks require models that follow instructions reliably, generalize across domains, and support multilingual input without needing task-specific retraining.
Instruction-tuned language models make this much easier. They can perform classification in a zero-shot manner, where you define the classes or rules directly in the prompt and the model returns the label without needing a dedicated classifier. This makes it easy to update categories, experiment with different label sets and deploy the same logic across multiple languages. If you need deeper domain alignment, these models can also be fine-tuned.
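A minimal sketch of this pattern for support-ticket routing follows; the route names and descriptions are made up for illustration:

```python
# Zero-shot intent routing: the classes live in the prompt, so changing
# the label set is a prompt edit rather than a retraining job.
ROUTES = {
    "billing": "questions about invoices, payments or refunds",
    "technical": "bugs, errors or integration problems",
    "sales": "pricing, plans or upgrade questions",
}

def build_routing_prompt(ticket: str, routes: dict = ROUTES) -> str:
    """Build a prompt that constrains the answer to one route name."""
    rules = "\n".join(f"- {name}: {desc}" for name, desc in routes.items())
    return (
        "Route the support ticket to exactly one class.\n"
        f"Classes:\n{rules}\n"
        f"Ticket: {ticket}\n"
        "Answer with the class name only."
    )

print(build_routing_prompt("My card was charged twice"))
```

The same prompt works unchanged with any of the instruction-tuned models below, which makes it easy to benchmark them against each other on your own label set.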
Below are some of the stronger models on the platform for text classification and NLP:
- Gemma 3 (12B): A recent open model from Google, tuned for efficiency and high-quality language understanding. Strong at zero-shot classification, multilingual reasoning, and following prompt instructions across varied classification tasks.
- MiniCPM-4 8B: A compact, high-performing model built for instruction following. Works well on classification, QA, and general-purpose language tasks with competitive performance at lower latency.
- Qwen3-14B: A multilingual model trained on a wide range of language tasks. Excels at zero-shot classification, text routing, and multi-language moderation and topic identification.
Note: If you want to access the above open-source models like Gemma 3, MiniCPM-4 or Qwen3 directly, you can deploy them to your own dedicated compute using the Platform and access them via API just like any other model on the platform.
There are also many more third-party and open-source models available in the Community section, including GPT-5.1 family variants, Gemini 2.5 Pro, and several other high-quality models. You can explore these based on your scale and domain-specific needs.
Custom Model Deployment
In addition to the models listed above, the platform also lets you bring your own models or deploy open-source models from the Community using Compute Orchestration (CO). This is helpful when you need a model that isn't already available on the platform, or when you want full control over how a model runs in production.
CO handles the operational details required to serve models reliably. It containerizes models automatically, applies GPU fractioning so multiple models can share the same hardware, manages autoscaling and uses optimized scheduling to reduce latency under load. This lets you scale custom or open-source models without needing to manage the underlying infrastructure.
CO supports deployment across multiple cloud environments such as AWS, Azure and GCP, which helps avoid vendor lock-in and gives you flexibility in how and where your models run. Check out the guide here on importing and deploying your own custom models.
Conclusion
The model families outlined in this guide represent the most reliable and scalable way to handle visual classification, detection, moderation, OCR and text-understanding workloads on the platform today. By consolidating these tasks around stronger multimodal and language-model architectures, developers can avoid maintaining many narrow, task-specific legacy models and instead work with tools that generalize well, support zero-shot instructions and adapt cleanly to new use cases.
You can explore more open-source and third-party models in the Community section and use the documentation to get started with the Playground, API or fine-tuning workflows. If you need help planning a migration or selecting the right model for your workload, you can reach out to us on Discord or contact our support team here.
