Hate Speech Policy

Safety Oracle for Rating Hate Speech [BETA]

Assess whether user-generated social content contains hate speech using Contextual AI's state-of-the-art agentic RAG system.

Contextual's Safety Oracle classifications are steerable and explainable because they are grounded in a policy document rather than in parametric knowledge. For comparison, this app also returns ratings from LlamaGuard 3.0, the OpenAI Moderation API, and Google Jigsaw's Perspective API. Feedback is welcome as we work with design partners to bring this to production; reach out to Aravind Mohan, Head of Data Science, at aravind.mohan@contextual.ai.
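For reference, two of the hosted comparison models can be queried directly. The sketch below is a minimal example, assuming `OPENAI_API_KEY` and `PERSPECTIVE_API_KEY` environment variables are set, and uses the publicly documented OpenAI Moderation and Perspective API endpoints; it illustrates how a comparison rating can be fetched and is not this app's implementation.

```python
# Minimal sketch: querying two of the comparison moderators directly.
# Assumes OPENAI_API_KEY and PERSPECTIVE_API_KEY are set in the environment.
import os

import requests
from openai import OpenAI


def openai_moderation(text: str) -> bool:
    """Return True if the OpenAI Moderation API flags the text."""
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    result = client.moderations.create(model="omni-moderation-latest", input=text)
    return result.results[0].flagged


def perspective_toxicity(text: str) -> float:
    """Return the Perspective API TOXICITY score (0-1) for the text."""
    url = (
        "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"
        f"?key={os.environ['PERSPECTIVE_API_KEY']}"
    )
    payload = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "languages": ["en"],
    }
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]


if __name__ == "__main__":
    sample = "Example user-generated comment."
    print("OpenAI Moderation flagged:", openai_moderation(sample))
    print("Perspective toxicity score:", perspective_toxicity(sample))
```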

Instructions

Enter user-generated content to receive an assessment from all four models, or use the 'Random Test Case' button to generate an example. Safety warning: Some of the randomly generated test cases contain hateful language, which some readers may find offensive or upsetting.

How it works

Document-grounded evaluations tie every rating directly to our hate speech policy document, in contrast to classifiers whose decision criteria are opaque.
Adaptable policies mean the system can be updated to match your requirements by editing the policy document, with no retraining.
Clear rationales accompany every decision, referencing the specific policy sections that explain why content was approved or flagged.
Continuous improvement comes from feedback loops that enhance retrieval accuracy and reduce misclassifications over time.
Our approach combines Contextual's state-of-the-art steerable reranker, grounded language model, and agent specialization to deliver superhuman performance in content evaluation tasks; a simplified sketch of the retrieve-then-rate flow follows.
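To make the document-grounded flow concrete, here is a minimal, hypothetical sketch of the retrieve-then-rate loop. The function names, the toy word-overlap retriever, and the prompt format are illustrative assumptions only; the production system uses Contextual's reranker and grounded language model rather than the placeholder steps shown here.

```python
# Hypothetical sketch of a policy-grounded rating loop: retrieve the most relevant
# policy sections, then build a grounded prompt that asks a model to rate the
# content and cite the sections it relied on. Names are illustrative, not
# Contextual AI's actual API.
from dataclasses import dataclass


@dataclass
class PolicySection:
    section_id: str
    text: str


def retrieve_policy_sections(
    content: str, policy: list[PolicySection], k: int = 3
) -> list[PolicySection]:
    """Toy lexical retriever: rank policy sections by word overlap with the content."""
    content_words = set(content.lower().split())
    ranked = sorted(
        policy,
        key=lambda s: len(content_words & set(s.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rate_with_policy(content: str, sections: list[PolicySection]) -> dict:
    """Build a grounded prompt; a grounded LM call would produce the actual rating."""
    evidence = "\n".join(f"[{s.section_id}] {s.text}" for s in sections)
    prompt = (
        "Using ONLY the policy sections below, decide whether the content violates "
        "the hate speech policy, and cite section ids in your rationale.\n"
        f"Policy:\n{evidence}\n\nContent:\n{content}"
    )
    # Placeholder result: in production this is where the grounded model is invoked.
    return {
        "prompt": prompt,
        "cited_sections": [s.section_id for s in sections],
        "rating": "pending-model-call",
    }
```

Because the rating is conditioned only on retrieved policy text, changing a policy section changes future ratings immediately, which is what makes the behaviour steerable without retraining.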

Output panels: 🌟 Contextual Safety Oracle (view policy), LlamaGuard 3.0 (view model card), OpenAI Moderation (view model card), and Perspective API (view docs). Each panel displays that model's rating for the submitted content.