Title: Estimated The New York Times paragraph- and article-level sentiment on China topics, and aggregated survey attitude toward China
Authors: Junming Huang, Gavin G. Cook, Yu Xie
Issue date: Jul 2021

Here we publish the data used in the paper "Junming Huang, Gavin Cook, and Yu Xie, Large-scale Quantitative Evidence of Media Impact on Public Opinion toward China". This dataset includes estimated sentiments of The New York Times on China in eight topics from 1970 to 2019, and a time series of public attitude toward China aggregated from surveys.

(1) Estimated sentiments of The New York Times on China in eight topics from 1970 to 2019

We estimate sentiments of The New York Times articles on China with a three-stage procedure.

First, two human coders annotate 873 randomly selected articles with a total of 18,598 paragraphs as expressing either positive, negative, or neutral sentiment in each of eight topics (ideology, government & administration, democracy, economic development, marketization, welfare and well-being, globalization, and culture). We treat irrelevant articles as expressing neutral sentiment.

Secondly, we fine-tune the natural language processing model BERT (Bidirectional Encoder Representations from Transformers; Devlin et al., 2018) with the human-coded labels. The model uses a deep neural network with 12 layers. It accepts paragraphs (i.e., word sequences of no more than 128 words) as input and outputs a probability for each category. We end up with two binary classifiers for each topic, for a grand total of 16 classifiers: an assignment classifier that determines whether a paragraph expresses sentiment in a given topic domain, and a sentiment classifier that then distinguishes positive from negative sentiment in a paragraph classified as belonging to that topic domain.

Thirdly, we run the 16 trained classifiers on each paragraph in our corpus and assign category probabilities to every paragraph.
We then use the probabilities of all the paragraphs in an article to determine the article’s overall sentiment category (i.e., positive, negative, or neutral) in every topic.

Filename: topic-0-paragraph-pred.tsv
Description: estimated paragraph-level sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: paragraphs
Columns:
  url: URL of an article (string)
  date: date of an article (YYYY-MM-DD)
  article_id: unique ID we assign to an article (int). This is for internal use only and has no association with The New York Times.
  paragraph_id: zero-based index of a paragraph in an article (int)
  assignment_prediction_score: probability that this paragraph expresses a positive or negative sentiment toward China on ideology (float). A value close to 1 means that this paragraph is very likely to express a positive or negative sentiment. A value close to 0 means that this paragraph is very unlikely to express a positive or negative sentiment, i.e., it is neutral or irrelevant.
  sentiment_prediction_score: probability that this paragraph expresses a positive sentiment toward China on ideology (float). A value close to 1 means that this paragraph is very likely to express a positive sentiment. A value close to 0 means that this paragraph is very likely to express a negative sentiment. This value is not meaningful when assignment_prediction_score is close to 0.

Filename: topic-0-article-pred.tsv
Description: estimated article-level sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: articles
Columns:
  url: URL of an article (string)
  ss1_prediction: estimated sentiment of an article on China's ideology (int). 0 if this article is estimated to express a neutral sentiment on China's ideology or is irrelevant to China's ideology. 1 if this article is estimated to express a positive sentiment. -1 if this article is estimated to express a negative sentiment.
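
As an illustration of how the two paragraph-level scores fit together, the sketch below reads rows in the column layout of topic-0-paragraph-pred.tsv and turns each pair of scores into a label of 1, -1, or 0. It is not the procedure used to produce the article-level files. The 0.5 cutoff, the presence of a header row, the function names, and the sample rows are all assumptions made for this example.

```python
import csv
import io

# Hypothetical sample rows in the documented column layout (not real data).
# Assumption: the real file starts with a header row naming these columns.
SAMPLE_TSV = (
    "url\tdate\tarticle_id\tparagraph_id\t"
    "assignment_prediction_score\tsentiment_prediction_score\n"
    "https://example.com/a\t1999-01-02\t1\t0\t0.92\t0.81\n"
    "https://example.com/a\t1999-01-02\t1\t1\t0.10\t0.95\n"
    "https://example.com/a\t1999-01-02\t1\t2\t0.77\t0.12\n"
)


def label_paragraph(assignment, sentiment, threshold=0.5):
    """Return 1 (positive), -1 (negative), or 0 (neutral or irrelevant).

    The 0.5 threshold is an illustrative assumption, not a documented rule.
    """
    if assignment < threshold:
        # The sentiment score is not meaningful when the assignment score is low.
        return 0
    return 1 if sentiment >= threshold else -1


def read_paragraph_labels(handle, threshold=0.5):
    """Read tab-separated rows and return (article_id, paragraph_id, label) tuples."""
    labels = []
    for row in csv.DictReader(handle, delimiter="\t"):
        labels.append((
            int(row["article_id"]),
            int(row["paragraph_id"]),
            label_paragraph(
                float(row["assignment_prediction_score"]),
                float(row["sentiment_prediction_score"]),
                threshold,
            ),
        ))
    return labels


labels = read_paragraph_labels(io.StringIO(SAMPLE_TSV))
print(labels)
```

To try this on the published file, pass an open file handle instead of the in-memory sample, for example open("topic-0-paragraph-pred.tsv", newline="").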

Filename: topic-0-trend.tsv
Description: estimated daily sentiment of The New York Times on China's ideology
Format: tab-separated values
Rows: dates
Columns:
  date: date (YYYY-MM-DD)
  num_articles: number of The New York Times articles on this date (int)
  num_positive_articles: number of The New York Times articles that are estimated to express positive sentiments on China's ideology (int)
  num_negative_articles: number of The New York Times articles that are estimated to express negative sentiments on China's ideology (int)

The estimated sentiments on other topics are recorded in the same format:
  topic-1-*.tsv: sentiments on China's government & administration
  topic-2-*.tsv: sentiments on China's democracy
  topic-3-*.tsv: sentiments on China's economic development
  topic-4-*.tsv: sentiments on China's marketization
  topic-5-*.tsv: sentiments on China's welfare and well-being
  topic-6-*.tsv: sentiments on China's globalization
  topic-7-*.tsv: sentiments on China's culture

Filename: estimated-media-sentiment-variables.tsv
Description: an all-in-one table of the estimated sentiments on all topics in all years
Format: tab-separated values
Rows: years
Columns:
  index: year (int)
  aggregated_survey: public attitude aggregated from 101 surveys on China (float)
  0: yearly sentiment on China's ideology (float). This is the difference between the fractions of positive and negative articles on China's ideology in a year.
  0fp: yearly fraction of positive articles on China's ideology (float)
  0fn: yearly fraction of negative articles on China's ideology (float)
  0fp-1 / 0fp-2 / 0fp-3 / 0fp-4 / 0fp-5: yearly fraction of positive articles on China's ideology 1/2/3/4/5 years ago (float). These are used in the greedy search to examine the lagged effect of media sentiment on public opinion.
  1fp, 1fn, 1fp-*, 1fn-*: China's government & administration
  2fp, 2fn, 2fp-*, 2fn-*: China's democracy
  3fp, 3fn, 3fp-*, 3fn-*: China's economic development
  4fp, 4fn, 4fp-*, 4fn-*: China's marketization
  5fp, 5fn, 5fp-*, 5fn-*: China's welfare and well-being
  6fp, 6fn, 6fp-*, 6fn-*: China's globalization
  7fp, 7fn, 7fp-*, 7fn-*: China's culture

Filename: bert-parameters.txt
Description: settings used to train BERT on the labeled The New York Times articles on China and to predict on all remaining articles. Model files can be downloaded from Google's GitHub repository: https://github.com/google-research/bert

(2) Public attitude aggregated from surveys on China

This time series is aggregated from 101 cross-sectional surveys from 1974 to 2019 that asked relevant questions about attitudes toward China. Values range from -100% to 100%, with 1974 as the baseline. Years with attitudes above zero show a more favorable attitude than that in 1974. Years with attitudes below zero show a less favorable attitude than that in 1974, with the lowest level of -24% in 1976. The time series is estimated with a 95% confidence interval. The detailed method is described in "Wang D, Xie Y, Huang J (2021) Latent attitude method for trend analysis with pooled survey data. SocArXiv https://doi.org/10.31235/osf.io/atsq2".

Filename: aggregated-survey.tsv
Description: attitude aggregated from surveys
Format: tab-separated values
Rows: years
Columns:
  year: year (int)
  Estimates: aggregated attitude value (float)
  ul: upper bound of 95% confidence interval (float)
  ll: lower bound of 95% confidence interval (float)
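
The yearly variables in estimated-media-sentiment-variables.tsv are described above as fractions of positive and negative articles, their difference, and lagged copies of those fractions. The sketch below shows one way such values could be derived from daily counts in the layout of topic-0-trend.tsv. It is a sketch under assumptions and not the authors' code: it assumes num_articles is the denominator of each fraction, that the file has a header row, and that a lagged value is simply the value from the earlier calendar year. The function names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical daily counts in the documented column layout (not real data).
SAMPLE_TREND_TSV = (
    "date\tnum_articles\tnum_positive_articles\tnum_negative_articles\n"
    "1980-01-01\t4\t1\t2\n"
    "1980-06-15\t6\t2\t1\n"
    "1981-03-10\t5\t0\t3\n"
)


def yearly_fractions(handle):
    """Aggregate daily counts by year and return {year: {"fp", "fn", "sentiment"}}."""
    totals = defaultdict(lambda: [0, 0, 0])  # [articles, positive, negative]
    for row in csv.DictReader(handle, delimiter="\t"):
        year = int(row["date"][:4])
        totals[year][0] += int(row["num_articles"])
        totals[year][1] += int(row["num_positive_articles"])
        totals[year][2] += int(row["num_negative_articles"])

    result = {}
    for year, (n, pos, neg) in sorted(totals.items()):
        if n == 0:
            continue  # no articles that year, so fractions are undefined
        fp = pos / n  # assumed definition: positive articles / all articles
        fn = neg / n  # assumed definition: negative articles / all articles
        result[year] = {"fp": fp, "fn": fn, "sentiment": fp - fn}
    return result


def lagged_fp(series, year, lag):
    """Return the positive fraction `lag` years before `year`, or None if absent."""
    earlier = series.get(year - lag)
    return None if earlier is None else earlier["fp"]


series = yearly_fractions(io.StringIO(SAMPLE_TREND_TSV))
for year, values in series.items():
    print(year, values)
print("fp one year before 1981:", lagged_fp(series, 1981, 1))
```

Returning None for a missing earlier year keeps gaps visible instead of silently treating them as zero, which matters if the lagged values are later used as regressors.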