Title: Estimated The New York Times paragraph- and article-level sentiment on China topics, and aggregated survey attitude toward China

Authors: Junming Huang, Gavin G. Cook, Yu Xie

Issue date: Jul 2021

Here we publish the data used in the paper "Junming Huang, Gavin Cook, and Yu Xie, Large-scale Quantitative Evidence of Media Impact on Public Opinion toward China". This dataset includes estimated sentiments of The New York Times on China in eight topics from 1970 to 2019, and a time series of public attitude toward China aggregated from surveys.


(1) Estimated sentiments of The New York Times on China in eight topics from 1970 to 2019

We estimate sentiments of The New York Times articles on China with a three-stage procedure. First, two human coders annotate 873 randomly selected articles with a total of 18,598 paragraphs as expressing either positive, negative, or neutral sentiment in each of eight topics (ideology, government & administration, democracy, economic development, marketization, welfare and well-being, globalization, and culture). We treat irrelevant articles as expressing neutral sentiment. Second, we fine-tune BERT (Bidirectional Encoder Representations from Transformers (Devlin et al., 2018)), a natural language processing model, with the human-coded labels. The model uses a deep neural network with 12 layers. It accepts paragraphs (i.e., word sequences of no more than 128 words) as input and outputs a probability for each category. We end up with two binary classifiers for each topic, for a grand total of 16 classifiers: an assignment classifier that determines whether a paragraph expresses sentiment in a given topic domain, and a sentiment classifier that then distinguishes positive from negative sentiment in a paragraph classified as belonging to a given topic domain. Third, we run the 16 trained classifiers on each paragraph in our corpus and assign category probabilities to every paragraph. We then use the probabilities of all the paragraphs in an article to determine the article’s overall sentiment category (i.e., positive, negative, or neutral) in every topic.


Filename: topic-0-paragraph-pred.tsv
Description: estimated paragraph-level sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: paragraphs
Columns: 
	url: url of an article (string)
	date: date of an article (YYYY-MM-DD)
	article_id: unique ID we assign to an article (int). This is for internal use only and has no association with The New York Times
	paragraph_id: zero-based index of a paragraph in an article (int)
	assignment_prediction_score: probability that this paragraph expresses a positive or negative sentiment toward China on ideology (float). A value close to 1 means that this paragraph is very likely to express a positive or negative sentiment. A value close to 0 means that this paragraph is very unlikely to express a positive or negative sentiment, i.e., it is neutral or irrelevant.
	sentiment_prediction_score: probability that this paragraph expresses a positive sentiment toward China on ideology (float). A value close to 1 means that this paragraph is very likely to express a positive sentiment. A value close to 0 means that this paragraph is very likely to express a negative sentiment. This value is not meaningful when assignment_prediction_score is close to 0.
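
The two scores are meant to be read together. A minimal sketch of one way to turn them into a paragraph label is given here; the function name and the 0.5 cutoff are illustrative assumptions, not the thresholds used in the paper.

```python
def paragraph_label(assignment_score, sentiment_score, cutoff=0.5):
    """Return 1 (positive), -1 (negative), or 0 (neutral or irrelevant).

    The 0.5 cutoff is a hypothetical choice for illustration only.
    """
    if assignment_score < cutoff:
        # The paragraph is unlikely to express sentiment on this topic,
        # so sentiment_score is not meaningful and is ignored.
        return 0
    return 1 if sentiment_score >= cutoff else -1
```

For example, a paragraph with a low assignment score is labeled 0 regardless of its sentiment score.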


Filename: topic-0-article-pred.tsv
Description: estimated article-level sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: articles
Columns: 
	url: url of an article (string)
	ss1_prediction: estimated sentiment of an article on China's ideology (int). 0 if this article is estimated to express a neutral sentiment on China's ideology, or is irrelevant to China's ideology. 1 if this article is estimated to express a positive sentiment. -1 if this article is estimated to express a negative sentiment.


Filename: topic-0-trend.tsv
Description: estimated daily sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: dates
Columns: 
	date: date (YYYY-MM-DD)
	num_articles: number of The New York Times articles on this date (int)
	num_positive_articles: number of The New York Times articles on this date that are estimated to express positive sentiments on China's ideology (int)
	num_negative_articles: number of The New York Times articles on this date that are estimated to express negative sentiments on China's ideology (int)
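
The daily counts can be rolled up into yearly fractions of positive and negative articles. The sketch below shows one way to do that, assuming the file has a header row with the column names listed above and that the denominator is num_articles; both are assumptions, and the sample rows are made up rather than taken from the dataset.

```python
import csv
import io
from collections import defaultdict


def yearly_fractions(tsv_text):
    """Map year -> (fraction positive, fraction negative, difference)."""
    totals = defaultdict(lambda: [0, 0, 0])  # articles, positive, negative
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        year = int(row["date"][:4])  # date is YYYY-MM-DD
        totals[year][0] += int(row["num_articles"])
        totals[year][1] += int(row["num_positive_articles"])
        totals[year][2] += int(row["num_negative_articles"])
    result = {}
    for year, (n, pos, neg) in sorted(totals.items()):
        if n == 0:
            continue  # no articles that year, fractions are undefined
        result[year] = (pos / n, neg / n, (pos - neg) / n)
    return result


# Made-up rows for illustration only.
sample = (
    "date\tnum_articles\tnum_positive_articles\tnum_negative_articles\n"
    "1980-01-01\t4\t1\t2\n"
    "1980-01-02\t6\t2\t1\n"
)
print(yearly_fractions(sample))
```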


The estimated sentiments on other topics are recorded in similar format, including
	topic-1-*.tsv: sentiments on China's government & administration
	topic-2-*.tsv: sentiments on China's democracy
	topic-3-*.tsv: sentiments on China's economic development
	topic-4-*.tsv: sentiments on China's marketization
	topic-5-*.tsv: sentiments on China's welfare and well-being
	topic-6-*.tsv: sentiments on China's globalization
	topic-7-*.tsv: sentiments on China's culture


Filename: estimated-media-sentiment-variables.tsv
Description: an all-in-one table of the estimated sentiments on all topics in all years.
Format: tab-separated values
Rows: years
Columns:
	index: year (int)
	aggregated_survey: public attitude aggregated from 101 surveys on China (float)
	0: yearly sentiment on China's ideology (float). This is the difference between the fractions of positive and negative articles on China's ideology in a year.
	0fp: fraction of positive articles on China's ideology in a year (float).
	0fn: fraction of negative articles on China's ideology in a year (float).
	0fp-1 / 0fp-2 / 0fp-3 / 0fp-4 / 0fp-5: fraction of positive articles on China's ideology 1/2/3/4/5 years earlier (float). These are used in the greedy search to examine the lagged effect of media sentiment on public opinion.

	1fp, 1fn, 1fp-*, 1fn-*: China's government & administration
	2fp, 2fn, 2fp-*, 2fn-*: China's democracy
	3fp, 3fn, 3fp-*, 3fn-*: China's economic development
	4fp, 4fn, 4fp-*, 4fn-*: China's marketization
	5fp, 5fn, 5fp-*, 5fn-*: China's welfare and well-being
	6fp, 6fn, 6fp-*, 6fn-*: China's globalization
	7fp, 7fn, 7fp-*, 7fn-*: China's culture
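
The lagged columns such as 0fp-1 hold the value of the same variable from earlier years. A minimal sketch of that shift follows; the function name is hypothetical, and using None for years without an earlier value is an assumption, since how the file represents missing lags is not stated here.

```python
def add_lags(series, max_lag=5):
    """Given {year: value}, return {year: {lag: value from `lag` years earlier}}.

    Years with no earlier observation get None (an assumption for this sketch).
    """
    return {
        year: {lag: series.get(year - lag) for lag in range(1, max_lag + 1)}
        for year in series
    }
```

With this layout, the entry for lag 1 in a given year corresponds to a column like 0fp-1 in that year's row.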
	

Filename: bert-parameters.txt
Description: BERT settings used to train on labeled The New York Times articles on China and to predict on all remaining articles. Model files can be downloaded from Google's GitHub repository: https://github.com/google-research/bert .


(2) Public attitude aggregated from surveys on China

This time series is aggregated from 101 cross-sectional surveys, conducted from 1974 to 2019, that asked relevant questions about attitudes toward China. Values range from -100% to 100%, with 1974 as the baseline. Years with values above zero show a more favorable attitude than in 1974. Years with values below zero show a less favorable attitude than in 1974, with the lowest level of -24% in 1976. The time series is estimated with a 95% confidence interval. The method is described in detail in "Wang D, Xie Y, Huang J (2021) Latent attitude method for trend analysis with pooled survey data. SocArXiv https://doi.org/10.31235/osf.io/atsq2".

Filename: aggregated-survey.tsv
Description: attitude aggregated from surveys
Format: tab-separated values
Rows: years
Columns: 
	year: year (int)
	Estimates: aggregated attitude value (float)
	ul: upper bound of 95% confidence interval (float)
	ll: lower bound of 95% confidence interval (float)
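
A minimal sketch of reading this table and finding the year with the least favorable attitude, assuming the file has a header row with the column names listed above. The function name is hypothetical and the sample rows are made up, not values from the dataset.

```python
import csv
import io


def least_favorable_year(tsv_text):
    """Return (year, estimate) for the row with the smallest Estimates value."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(int(r["year"]), float(r["Estimates"])) for r in reader]
    return min(rows, key=lambda pair: pair[1])


# Made-up rows for illustration only.
sample = (
    "year\tEstimates\tul\tll\n"
    "2001\t0.10\t0.15\t0.05\n"
    "2002\t-0.20\t-0.10\t-0.30\n"
)
print(least_favorable_year(sample))
```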