This comes from a Shared Task at PAN 2021 (PAN is a series of scientific events and shared tasks on digital text forensics and stylometry).

https://pan.webis.de/clef21/pan21-web/author-profiling.html


Task: Given a Twitter feed, determine whether its author spreads hate speech.

Data

Input:
- Timelines of users sharing hate speech towards, for instance, immigrants and women.
- English and Spanish, 200 training cases/authors each (with 200 tweets per author) [data]

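Construction process:

Search by keyword, mainly for keywords expressing hate towards women and immigrants.
Starting from known haters (for example, users who have been banned), crawl their following networks.
Once the users are identified, collect their timelines and manually annotate the tweets that convey hate.
Users with more than 10 hateful tweets are labeled as "keen to spread hate speech".
For each labeled user, 100 tweets are collected to build the final dataset.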
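The labeling rule in the construction process can be sketched as follows. This is only an illustration of the "more than 10 hateful tweets" threshold described above, not the organizers' actual code; the function `label_users`, the `(user_id, is_hateful)` input format, and the 0/1 label encoding are assumptions made for the example.

```python
from collections import Counter

# From the notes: a user with MORE than 10 hateful tweets is labeled
# as "keen to spread hate speech".
HATEFUL_TWEET_THRESHOLD = 10


def label_users(annotations):
    """Map each user to 1 (hate speech spreader) or 0 (not).

    `annotations` is a hypothetical input format: an iterable of
    (user_id, is_hateful) pairs, one pair per manually annotated tweet.
    """
    hateful_counts = Counter()
    users = set()
    for user_id, is_hateful in annotations:
        users.add(user_id)
        if is_hateful:
            hateful_counts[user_id] += 1
    # Strictly greater than the threshold, matching "more than 10".
    return {u: int(hateful_counts[u] > HATEFUL_TWEET_THRESHOLD) for u in users}


# Toy annotations (made up for illustration, not real data).
annotations = (
    [("u1", True)] * 11 + [("u1", False)] * 5   # 11 hateful tweets
    + [("u2", True)] * 10                        # exactly 10 hateful tweets
    + [("u3", False)] * 3                        # no hateful tweets
)
labels = label_users(annotations)
```

Note that a user with exactly 10 hateful tweets (`u2` here) falls below the cut-off under this reading of the rule.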

Dataset size:

[Image: dataset statistics table]

Baseline

Character n-grams with n ranging from 2 to 6 and Logistic Regression;
Word n-grams with n ranging from 1 to 3 and SVM;
Universal Sentence Encoder (USE) embeddings fed into a BiLSTM;
XLM-RoBERTa (XLMR) transformer embeddings fed into a BiLSTM;
Multilingual BERT (MBERT) transformer embeddings fed into a BiLSTM.
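To make the first two baselines more concrete, here is a minimal sketch of the n-gram features they rely on, written with the standard library only. It is not the organizers' implementation: the helper names `char_ngrams` and `word_ngrams`, the whitespace tokenization, the lowercasing, and the choice to join all of an author's tweets into one document are assumptions for illustration.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """Count character n-grams for every n in [n_min, n_max]."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text, n_min=1, n_max=3):
    """Count word n-grams for every n in [n_min, n_max].

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    tokens = text.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Toy "timeline" (made up). One author = one document, built here by
# joining the tweets with spaces.
tweets = ["good morning", "good night"]
author_doc = " ".join(tweets)

char_feats = char_ngrams(author_doc)   # features for the char 2-6-gram baseline
word_feats = word_ngrams(author_doc)   # features for the word 1-3-gram baseline
```

In practice these counts would be turned into vectors and passed to a classifier. One common route, assuming scikit-learn is available, is `TfidfVectorizer(analyzer="char", ngram_range=(2, 6))` followed by `LogisticRegression`, and `TfidfVectorizer(analyzer="word", ngram_range=(1, 3))` followed by a linear SVM; whether the baselines used TF-IDF or raw counts is not stated in these notes.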