Toxicity detection

Use of natural-language-processing models to score user-generated content for hostility, hate, harassment or abusive language, typically as the input to a moderation decision.

Toxicity detection is the practice of running every user contribution through a machine-learning model that outputs a score: how toxic, hostile or abusive does this message read? In a press comment system, toxicity detection is the first filter, applied before any other moderation rule.

What “toxic” actually means

There is no single definition. Most models score content along several axes:

  • Severe toxicity: explicit insults, slurs, calls to violence.
  • Hate: targeted hostility against a group (defined by ethnicity, religion, politics or sexual orientation).
  • Harassment: repeated abuse of a specific individual.
  • Spam: irrelevant promotional content.
  • Threats: credible threats of harm.

Each axis usually outputs a score from 0 to 1. The publisher’s moderation policy sets the thresholds: for example, auto-reject above 0.9 on severe toxicity and send to the human queue between 0.6 and 0.9.
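A threshold policy of this kind can be sketched as a small routing function. This is a minimal illustration in Python, not any vendor's API: the `Thresholds` dataclass, the `route` function and the axis key `severe_toxicity` are hypothetical names, and the 0.6 and 0.9 cut-offs are simply the example values from the text. It assumes a score of exactly 0.9 goes to the human queue and that a missing axis counts as 0, neither of which the text specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Per-axis cut-offs chosen by the publisher (hypothetical structure)."""
    review: float  # at or above this score, send to the human queue
    reject: float  # strictly above this score, auto-reject


# Example policy using the values from the text; other axes would get their own entries.
POLICY = {"severe_toxicity": Thresholds(review=0.6, reject=0.9)}


def route(scores: dict[str, float], policy: dict[str, Thresholds] = POLICY) -> str:
    """Return 'reject', 'review' or 'approve' for one comment.

    The strictest outcome across all configured axes wins.
    """
    decision = "approve"
    for axis, limits in policy.items():
        score = scores.get(axis, 0.0)  # assumption: a missing axis counts as 0
        if score > limits.reject:
            return "reject"
        if score >= limits.review:
            decision = "review"
    return decision
```

Keeping the thresholds in data rather than in the function body lets a publisher tune each axis without touching the routing logic.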

Why generic models fail on press content

Most off-the-shelf toxicity models (Perspective API, OpenAI moderation, etc.) were trained on data from social media and similar platforms: Reddit, Wikipedia, Twitter. They work well on those domains but underperform on press comments because:

  • News comments often contain strong political opinion that the model misreads as toxicity.
  • Quoted material (extracts from articles, polemical citations) gets flagged even when contextually fine.
  • The sarcasm, rhetorical questions and irony common in opinion pieces confuse the model.
  • Press audiences are often not English-speaking, and generic models degrade significantly outside English.

Logora’s toxicity model is trained on more than 1 million comments from European newsrooms in French, German, Italian, Spanish, Portuguese and English. Its false-positive rate on press content is significantly lower than that of generic models.
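A publisher can check a false-positive claim like this on its own data. The sketch below computes the false-positive rate, FP / (FP + TN), for a model at a given rejection threshold, using comments that human moderators have already labelled. The function name, the input shape and the sample scores are illustrative assumptions, not Logora's evaluation method or results.

```python
def false_positive_rate(samples: list[tuple[float, bool]], threshold: float) -> float:
    """Share of acceptable comments that the model would wrongly flag.

    samples: (model_score, is_toxic_per_human) pairs.
    A comment is flagged when its score is strictly above `threshold`.
    """
    acceptable = [score for score, is_toxic in samples if not is_toxic]
    if not acceptable:
        raise ValueError("need at least one comment labelled as not toxic")
    false_positives = sum(1 for score in acceptable if score > threshold)
    return false_positives / len(acceptable)


# Made-up scores for illustration only.
labelled = [(0.95, True), (0.92, False), (0.40, False), (0.10, False), (0.70, False)]
print(false_positive_rate(labelled, threshold=0.9))  # 1 of 4 acceptable comments flagged
```

Running the same labelled sample through two models at the same threshold gives a like-for-like comparison of how often each one blocks legitimate comments.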

Toxicity score + human queue = the 85/15 rule

In production, toxicity detection alone is not enough. The model decides about 85% of comments on its own and hands the remaining 15% to a human:

  • The clean 65%: auto-approve and publish.
  • The clearly toxic 20%: auto-reject with a user-facing statement of reasons.
  • The ambiguous 15%: escalate to a human moderator with the toxicity score, the model’s reasoning, and a fast-action UI.
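The arithmetic behind the name: 65% auto-approved plus 20% auto-rejected gives 85% handled without a human, leaving 15% for the queue. A publisher could monitor that split on its own traffic with something like the sketch below. The decision labels and the function name are hypothetical, and the batch is synthetic.

```python
from collections import Counter


def decision_shares(decisions: list[str]) -> dict[str, float]:
    """Fraction of comments per outcome, plus the combined automated share."""
    if not decisions:
        raise ValueError("no decisions to summarise")
    counts = Counter(decisions)
    total = len(decisions)
    shares = {label: counts[label] / total for label in ("approve", "reject", "review")}
    # Automated = everything decided without a human moderator.
    shares["automated"] = (counts["approve"] + counts["reject"]) / total
    return shares


# Synthetic batch matching the split described above.
batch = ["approve"] * 65 + ["reject"] * 20 + ["review"] * 15
print(decision_shares(batch))
```

If the review share drifts well above 15%, that suggests the thresholds or the model need retuning for the publication's audience.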

This is the operational model Der Spiegel and Milenio run with.

See the Logora vs Disqus comparison for how toxicity models differ between vendors.
