Relevance Dataset 16th-20th April 2016 (941 records)


This dataset contains 941 social media posts (Facebook posts, Facebook comments, and tweets) published between the 16th and 20th of April 2016, written in English, containing 8 to 100 words, and free of profanity.

Only a few days after their publication, each post was classified by contributors of the Crowdflower platform, according to the following criteria:

- Interestingness (1-not interesting to 5-interesting)
- Controversy (1-not controversial to 5-controversial)
- Meaningfulness (1-not meaningful to 5-meaningful)
- Novelty (1-old to 5-new)
- Reliability (1-unreliable to 5-reliable)
- Scope (1-restricted scope to 5-wide scope)
- Relevance (1-irrelevant to 5-relevant)

A total of 840 posts were classified by three contributors each, and the remaining 101 by seven.
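Because every post carries several judgments per criterion, the ratings usually need to be combined before use. The following is only a minimal sketch, assuming the judgments have already been loaded as (post id, rating) pairs. The function name `aggregate_ratings`, the sample identifiers, and the choice of the arithmetic mean are illustrative assumptions, not a description of the file layout or of the aggregation used in the original work.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(judgments):
    """Average the 1-5 ratings each post received from its contributors.

    `judgments` is an iterable of (post_id, rating) pairs; this layout is
    an assumption made for illustration only.
    """
    by_post = defaultdict(list)
    for post_id, rating in judgments:
        if not 1 <= rating <= 5:
            raise ValueError(f"rating out of range: {rating}")
        by_post[post_id].append(rating)
    # The mean is one possible aggregate; the median or a majority vote
    # would be equally reasonable choices.
    return {post_id: mean(ratings) for post_id, ratings in by_post.items()}


# Hypothetical judgments for two posts, three contributors each.
judgments = [
    ("p1", 4), ("p1", 5), ("p1", 3),
    ("p2", 2), ("p2", 2), ("p2", 1),
]
scores = aggregate_ratings(judgments)
```

The same function can be applied once per criterion (interestingness, controversy, and so on), giving one aggregated score per post and criterion.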

This dataset was originally used for predicting the relevance of the posts, from a journalistic point of view, by exploiting linguistic features.
This was done either directly, or indirectly by first predicting the other criteria and then combining them in an ensemble.

Download CSV file with dataset

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

 

 

Relevance Dataset 7th-14th September 2016 (9988 records)


This dataset contains 9988 social media posts (Facebook posts, Facebook comments, and tweets) published between the 7th and 14th of September 2016, written in English, containing 8 to 100 words, and free of profanity.

Only a few days after their publication, each post was classified by contributors of the Crowdflower platform. Each post was evaluated by one contributor, although several additional questions were asked to ensure the quality of the answers. The questions included in the task were:

- Have you had knowledge of this content through another source? (Yes/No)
- Choose (from the provided text) three distinct words that best summarize it (3 text input fields)
- Can you find, in the provided text, information that can be considered relevant? (Yes/No)
- Do you consider yourself to be a person who is aware of the news? (Likert scale ranging from -4 to 4)
- The sentiment expressed in this text is (Likert scale ranging from -5 to 5)
- Choose (from the provided text) the word that best supports your previous answer (text input field)

This dataset was originally used for predicting the relevance (from a journalistic point of view) and the sentiment of the posts.

Download CSV file with dataset

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

 

 

Pre-tokenized Facebook posts (3393 records)


This dataset consists of pre-tokenized posts from Facebook (7 Facebook posts and 63 Facebook comments), resulting in a total of 3393 tokens. The tokenization followed Ritter’s tokenization method for tweets.

Download txt file with dataset

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

 

 

Manually annotated social media corpus (3393 records)


This dataset is a manually annotated social media corpus for named entity recognition. It contains 3393 tokens, obtained from the tokenization of 70 posts (7 Facebook posts and 63 Facebook comments). The tokenization followed Ritter’s tokenization method for tweets. The dataset is a tab-separated file, where the first column (before the tab) contains the tokens and the second column contains the ground-truth entity labels (PERSON, LOCATION and ORGANIZATION).
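Reading this two-column format can be sketched as follows. This is only an illustration, assuming one token and one label per non-empty line, separated by a single tab. The function name `read_annotations` and the sample lines are hypothetical, and the label used for tokens outside any entity is not stated here, so the `O` in the sample is an assumption.

```python
from collections import Counter


def read_annotations(lines):
    """Parse 'token<TAB>label' lines into a list of (token, label) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines, if the file contains any
        # Split on the first tab only: token before it, label after it.
        token, _, label = line.partition("\t")
        pairs.append((token, label.strip()))
    return pairs


# Hypothetical sample lines; "O" as the non-entity label is an assumption.
sample = [
    "Maria\tPERSON\n",
    "visited\tO\n",
    "Lisbon\tLOCATION\n",
    "\n",
]
pairs = read_annotations(sample)
label_counts = Counter(label for _, label in pairs)
```

For the real file, the same function can be given an open file object, for example `read_annotations(open(path, encoding="utf-8"))`, where `path` points to the downloaded dataset.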

Download CSV file with dataset

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

Project: UTAP-ICDT/EEI-CTP/0022/2014
Period: 27-04-2015 to 10-11-2017
