Google asked Indian annotators to label a Reddit comment dataset, and the error rate is as high as 30%?
2022-07-20 22:05:00 【Heart of Machine】
Report by Heart of Machine
Editor: Egg Sauce
One thing is certain: the human annotators simply did not get the in-jokes of Reddit users.
Last year, Google released the GoEmotions dataset, which contains 58K human-annotated Reddit comments covering 27 emotion categories.
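For readers who want to poke at the data themselves, here is a minimal sketch (not from the original article), assuming the dataset is mirrored on the Hugging Face hub as `go_emotions` with its default "simplified" configuration:

```python
# Minimal sketch: load GoEmotions (assumed Hugging Face mirror "go_emotions",
# default "simplified" config) and inspect its label set and class balance.
from collections import Counter

from datasets import load_dataset

train = load_dataset("go_emotions", split="train")
label_names = train.features["labels"].feature.names  # 27 emotions + "neutral"

print(len(train), "training comments")  # part of the ~58K total across splits
print(label_names[:5])

# Count how often each label appears, to get a feel for class imbalance.
counts = Counter(label_names[i] for row in train for i in row["labels"])
for name, n in counts.most_common(5):
    print(f"{name}: {n}")
```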
But a machine learning engineer named Edwin Chen happened to spot some absurd mistakes while working with this dataset.
His team had been trying to train models on GoEmotions when they noticed what looked like deep-seated quality problems. So they randomly sampled 1,000 comments and found serious labeling errors in 308 of them.
Here are some representative examples :
- "aggressively tells friend I love them", labeled as "anger"
- "Yay, cold McDonald's. My favorite.", labeled as "love"
- "Hard to be sad these days when I got this guy with me", labeled as "sadness"
- "Nobody has the money to. What a joke", labeled as "joy"
- ...
From the sampled comments alone, they counted mislabeled examples across 25 of the emotion categories.
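To put that audit number in context, here is a back-of-the-envelope calculation of my own (not part of Chen's audit): with 308 errors in a random sample of 1,000, a simple binomial confidence interval suggests the roughly 30% headline figure is not a sampling fluke.

```python
# Back-of-the-envelope: how precise is an error-rate estimate of 308 / 1,000?
import math

errors, sample_size = 308, 1_000
p_hat = errors / sample_size  # observed error rate: 0.308

# 95% normal-approximation confidence interval for a binomial proportion.
std_err = math.sqrt(p_hat * (1 - p_hat) / sample_size)
low, high = p_hat - 1.96 * std_err, p_hat + 1.96 * std_err
print(f"estimated error rate: {p_hat:.1%} (95% CI about {low:.1%} to {high:.1%})")
# -> estimated error rate: 30.8% (95% CI about 27.9% to 33.7%)
```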
In the field of artificial intelligence, data annotation is basic yet critical work. Good data is essential for training models; when the data contains errors this outrageous, how is one supposed to train a model and evaluate its performance?
Edwin Chen closed with a question: "Can we really trust Google to build fair, real-world AI?"
So what caused these problems?
Some wondered: "Is it possible that they didn't hire human annotators at all, or that the annotators weren't fluent in English?"
As it turns out, the GoEmotions dataset was indeed annotated by humans, but the annotators were "native English speakers from India".
Section 3.3 of the paper states: "We assigned three raters to each example. For those examples where the raters did not agree, we assigned two additional raters. All raters are native English speakers from India."
The justification was that, according to the findings of Cowen et al. (2019b), English speakers in India and in the United States judge emotions along largely the same dimensions.
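As a rough illustration of that protocol (a hypothetical sketch, not Google's actual pipeline; the function and example ratings below are invented), labels can be aggregated by majority vote, with two extra raters pulled in only when the first three disagree:

```python
# Hypothetical sketch of the Section 3.3 protocol: three raters per comment,
# plus two additional raters when the first three do not agree on a label.
from collections import Counter
from typing import Optional


def majority_label(ratings: list, min_agreement: int = 2) -> Optional[str]:
    """Return the most common label if enough raters agree, else None."""
    label, count = Counter(ratings).most_common(1)[0]
    return label if count >= min_agreement else None


ratings = ["anger", "joy", "love"]  # the three initial raters all disagree
if majority_label(ratings) is None:
    ratings += ["anger", "anger"]   # so two additional raters are assigned
print(majority_label(ratings))      # -> anger
```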
The reality, however, is that even annotators with fluent English may not understand the cultural and social context of the text they are labeling. And that is precisely the crux: for NLP datasets in particular, annotators need sufficient cultural awareness.
In other words, given that many annotators may lack the necessary background knowledge, the fact that raters agreed on most of the labels (as shown in the figure above) does not mean those labels are correct.
Another important cause of the problem is that the dataset carries no additional metadata (such as the author or the subreddit name). The original paper acknowledges this as well.
Language does not exist in a vacuum; information such as which subreddit a comment was posted in matters a great deal, and Google set it aside when building the dataset.
This is not an isolated case. As the author also pointed out, if even a company with Google's resources finds it hard to create an accurate dataset, the quality of the many other datasets out there is even harder to imagine.
The good news is that researchers have started paying attention to the problem. Last month, Andrew Ng launched a "Data-Centric AI" campaign, arguing that focusing on improving the quality of the data in AI systems will help unlock their full power.
If you want to deploy machine learning models that actually work in the real world, it is time to focus on high-quality datasets rather than ever-larger models.
Reference links:
https://arxiv.org/pdf/2005.00547.pdf
https://www.surgehq.ai/blog/30-percent-of-googles-reddit-emotions-dataset-is-mislabeled