A shared task from PAN 2021 (PAN is a series of scientific events and shared tasks on digital text forensics and stylometry).
https://pan.webis.de/clef21/pan21-web/author-profiling.html

Task: Given a Twitter feed, determine whether its author spreads hate speech.
Data
- Input:
  - Timelines of users sharing hate speech towards, for instance, immigrants and women.
  - English and Spanish, 200 training cases/authors each (with 200 tweets per author)
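Construction process:
- Keyword-based search, mainly for keywords expressing hate towards women and immigrants
- Starting from known haters (for example, users who had been banned), crawl their following networks
- Once the users are identified, collect their timelines and manually annotate the tweets that convey hate
- Users with more than 10 hateful tweets are labeled as "keen to spread hate speech"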
- For each labeled user, 200 tweets are collected to build the final dataset.
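
The labeling rule in the construction process can be sketched as follows. This is a minimal illustration, not the organizers' code: the names `AnnotatedUser`, `hateful_flags`, and `label_users` are hypothetical, and the only detail taken from the notes is the threshold of more than 10 hateful tweets.

```python
from dataclasses import dataclass
from typing import Dict, List

# From the notes: a user needs strictly more than 10 hateful tweets.
HATEFUL_TWEET_THRESHOLD = 10


@dataclass
class AnnotatedUser:
    """Hypothetical container for one manually annotated timeline."""
    user_id: str
    hateful_flags: List[bool]  # one manual annotation per tweet


def is_keen_to_spread_hate(user: AnnotatedUser) -> bool:
    """Return True if the user has more than the threshold of hateful tweets."""
    return sum(user.hateful_flags) > HATEFUL_TWEET_THRESHOLD


def label_users(users: List[AnnotatedUser]) -> Dict[str, int]:
    """Map each user id to 1 (keen to spread hate speech) or 0 (otherwise)."""
    return {u.user_id: int(is_keen_to_spread_hate(u)) for u in users}
```

Under this sketch, a user with exactly 10 hateful tweets falls in the negative class, because the rule is "more than 10".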
Dataset size:

Baseline
- Character n-grams with n ranging from 2 to 6, classified with Logistic Regression;
- Word n-grams with n ranging from 1 to 3, classified with an SVM;
- Universal Sentence Encoder (USE) embeddings fed into a BiLSTM;
- XLM-RoBERTa (XLMR) transformer representations fed into a BiLSTM;
- Multilingual BERT (MBERT) transformer representations fed into a BiLSTM.
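
To make the first two baselines concrete, here is a rough sketch of the n-gram feature extraction only, in plain Python. Several choices are assumptions rather than details from the task description: raw counts are used instead of any weighting scheme, words are split on whitespace after lowercasing, and n-grams are counted per tweet so they never span two tweets. The function names are hypothetical, and the classifier step (Logistic Regression or SVM) is not shown.

```python
from collections import Counter
from typing import Callable, Iterable


def char_ngrams(text: str, n_min: int = 2, n_max: int = 6) -> Counter:
    """Count character n-grams for every n in [n_min, n_max]."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def word_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count word n-grams for every n in [n_min, n_max].

    Assumption: lowercasing plus whitespace tokenization.
    """
    tokens = text.lower().split()
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def author_profile(tweets: Iterable[str],
                   extractor: Callable[[str], Counter]) -> Counter:
    """Sum per-tweet n-gram counts into one feature bag per author."""
    profile: Counter = Counter()
    for tweet in tweets:
        profile.update(extractor(tweet))
    return profile
```

Because the task is author-level, `author_profile` aggregates all of an author's tweets into a single feature bag, which a linear classifier could then consume after vectorization.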