This readme.txt file was generated by Junming Huang on 20 May 2023

-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: Data for "Caught in the crossfire: Fears of Chinese-American scientists"

Author Information

Yu Xie
Paul and Marcia Wythes Center on Contemporary China, Princeton University, Princeton, NJ 08544, United States

Xihong Lin
Department of Biostatistics and Department of Statistics, Harvard University, 655 Huntington Avenue, Boston, MA 02115, United States

Ju Li
Department of Nuclear Science & Engineering and Department of Materials Science & Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, United States

Qian He
Paul and Marcia Wythes Center on Contemporary China, Princeton University, Princeton, NJ 08544, United States

Junming Huang
Paul and Marcia Wythes Center on Contemporary China, Princeton University, Princeton, NJ 08544, United States

Date of data collection: 2000 - 2021-12-31

Description:
This dataset encompasses two distinct sets of data analyzed in the study: Asian American Scholar Forum survey data and Microsoft Academic Graph bibliometric data.

The first part of the dataset comprises survey data collected from the Asian American Scholar Forum survey. To protect the privacy of the survey respondents, the raw survey data are confidential and are not released publicly. Researchers interested in obtaining access to the data are encouraged to contact the authors directly for an authorized copy. The summary statistics derived from the survey data can be found in the Supplementary Materials and suffice to replicate the results presented in the paper.

The second part of the dataset comprises bibliometric data obtained from the Microsoft Academic Graph, identifying and counting Chinese-descent scientists who started their careers in the US.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Licenses/restrictions placed on the data, or limitations of reuse: CC BY 4.0

Recommended citation for the data:
Huang, J., Xie, Y., Lin, X., Li, J., & He, Q. (2023). Data for "Caught in the Crossfire: Fears of Chinese-American Scientists" [Data set]. Princeton University. https://doi.org/10.34770/362D-QM10

Please cite this paper if you use this dataset for research:
Yu Xie, Xihong Lin, Ju Li, Qian He, Junming Huang, Caught in the Crossfire: Fears of Chinese-American Scientists, Proceedings of the National Academy of Sciences, in press (2023). DOI: 10.1073/pnas.2216248120

Please cite Microsoft Academic Graph if you also use their original data:
Arnab Sinha et al., An Overview of Microsoft Academic Service (MAS) and Applications, in Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion), ACM, New York, NY, 243-246 (2015). DOI: 10.1145/2740908.2742839

Links to other publicly accessible locations of the data:
The data are available at yuxie.com (https://yuxie.scholar.princeton.edu/share-files/data-files-caught-crossfire-fears-chinese-american-scientists) and Princeton University DataSpace (https://doi.org/10.34770/362d-qm10).
The survey data are administered by the Asian American Scholar Forum.
The bibliometric data are published by Microsoft under the Open Data Commons Attribution License (ODC-By).

--------------------
DATA & FILE OVERVIEW
--------------------

File list:
Chinese-descent-scientists-destination.csv
Chinese-descent-scientists-destination-count.csv

Relationship between files, if important for context:
[Chinese-descent-scientists-destination-count.csv] summarizes the moving scientists in [Chinese-descent-scientists-destination.csv]. Due to the small sample size, scientists labeled in the "Statistics" discipline were excluded from the count.

If data was derived from another source, list source:
The bibliometric data were obtained from the Microsoft Academic Graph, which indexed 208,440,142 scientists from 27,077 institutions authoring 205,203,354 scientific publications dated until December 2021. The database was sourced from the publicly available snapshot retrieved from https://openalex.org/ in early 2022, after Microsoft Academic Graph announced its retirement in December 2021.

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data:

We identified Chinese-descent scientists by their surnames. We first collected 832 common Chinese surnames from Wikipedia (https://en.wikipedia.org/wiki/List_of_common_Chinese_surnames), including those in Chinese characters and romanized names, in Hanyu Pinyin (the system of Chinese romanization mostly used by mainland Chinese scientists) and Wade-Giles (the system mostly used by Cantonese-speaking and Taiwanese scientists). This method does not count Chinese-descent scientists who have changed their surnames (usually women after marriage), which leads to an undercount.

We searched for those surnames in the authors’ full names recorded in Microsoft Academic Graph to identify Chinese-descent scientists. To retain a high degree of reliability in individual identification, we removed scientists with a gap of more than 5 years between consecutive publications, which we believed to be false results in which Microsoft Academic Graph’s name disambiguation algorithm had incorrectly merged multiple individuals. We ended up with 25,202 Chinese-descent scientists who published their first papers with US affiliations, later dropped their US affiliations, and subsequently published at least one paper affiliated with China.

We used the Google Maps API to parse all 27,077 institution names in Microsoft Academic Graph and retrieve their country labels.
Therefore, we could label every Chinese-descent scientist’s working country in any publishing year. Specifically, we focused on Chinese-descent scientists leaving the US, i.e., those who were trained in the US (first paper affiliated with the US) and who subsequently moved from the US to China (i.e., stopped using US affiliations and started to use Chinese affiliations). For each such scientist, we determined the year range of all his/her papers affiliated with the US and with China, and annotated his/her leaving year as the year of his/her first paper published after his/her most recent use of a US affiliation. This was more accurate than simply using his/her last year with a US affiliation, which might produce false positives by counting scientists who are still based in the US.

We further identified two groups of interest among US-based Chinese-descent scientists:
“junior” scientists: those who had published their first papers in the US, started publishing with Chinese affiliations within 5 years thereafter, and finally left the US within 7 years thereafter;
“experienced” scientists: those who had published over 25 papers in their whole career and outperformed 97% of scientists.

For additional information on the processing of the survey data and bibliometric data, please refer to the Supplementary Materials of "Caught in the crossfire: Fears of Chinese-American scientists".

Software-specific information needed to interpret the data, including software and hardware version numbers:
python>=3.9, pandas>=1.14, numpy>=1.23

--------------------------
DATA-SPECIFIC INFORMATION: Chinese-descent-scientists-destination.csv
--------------------------

This file provides the destination country or region for each of the 25,202 Chinese-descent scientists, along with their respective discipline labels. Scientists migrating to mainland China, Hong Kong, and Taiwan are recorded separately.

Number of variables:

Number of cases/rows:

Variable list, defining any abbreviations, units of measure, codes or symbols used:

Missing data codes:

Specialized formats or other abbreviations used:

--------------------------
DATA-SPECIFIC INFORMATION: Chinese-descent-scientists-destination-count.csv
--------------------------

This file reports the number of Chinese-descent scientists who migrated to China, categorized by year, discipline, and stage (junior/experienced). Due to the small sample size, scientists labeled in the "Statistics" discipline were excluded from the count.

Rows: 22 years.
Columns: 4 scientific fields (Engineering and computer science, Formal and physical science, Life science, Social sciences) x 3 career stages (All, Junior, Experienced).
Entries: Number of scientists leaving the US for China.
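
The yearly entries described above are tallies of scientists by leaving year. As a rough illustration only, the sketch below shows one way such a tally could be derived from per-paper affiliation records, following the rules stated under METHODOLOGICAL INFORMATION: drop scientists with a gap of more than 5 years between consecutive publications, and take the leaving year as the year of the first paper after the most recent US-affiliated paper. It is not the authors' code. The record layout, the toy values, the two-letter country codes, and all function names are assumptions made for this example, and the discipline and career-stage breakdowns are omitted.

```python
from collections import Counter

# Hypothetical toy records: (scientist_id, publication_year, affiliation_country).
# These values are made up for illustration and are not rows from the dataset.
RECORDS = [
    ("a", 2005, "US"), ("a", 2007, "US"), ("a", 2009, "CN"), ("a", 2010, "CN"),
    ("b", 2001, "US"), ("b", 2012, "CN"),  # 11-year gap between papers
    ("c", 2010, "US"), ("c", 2011, "CN"), ("c", 2012, "US"), ("c", 2014, "CN"),
    ("d", 2015, "US"), ("d", 2016, "US"),  # no paper after the last US paper
]

MAX_GAP_YEARS = 5  # threshold stated in the methodology


def has_large_gap(years, max_gap=MAX_GAP_YEARS):
    """Return True if two consecutive publication years differ by more than max_gap."""
    ordered = sorted(years)
    return any(later - earlier > max_gap
               for earlier, later in zip(ordered, ordered[1:]))


def leaving_year(papers):
    """Return the year of the first paper after the most recent US-affiliated paper.

    `papers` is a list of (year, country) tuples. Returns None if the earliest
    publication year has no US affiliation, or if nothing was published after
    the last US-affiliated year. The destination country is not checked here.
    """
    if not papers:
        return None
    first_year = min(year for year, _ in papers)
    if not any(country == "US" for year, country in papers if year == first_year):
        return None
    last_us_year = max(year for year, country in papers if country == "US")
    later_years = [year for year, _ in papers if year > last_us_year]
    return min(later_years) if later_years else None


def count_leavers_by_year(records):
    """Tally scientists by leaving year after applying the publication-gap filter."""
    by_scientist = {}
    for scientist_id, year, country in records:
        by_scientist.setdefault(scientist_id, []).append((year, country))

    counts = Counter()
    for papers in by_scientist.values():
        if has_large_gap([year for year, _ in papers]):
            continue  # treated as a likely name-disambiguation error
        year = leaving_year(papers)
        if year is not None:
            counts[year] += 1
    return dict(sorted(counts.items()))


yearly_counts = count_leavers_by_year(RECORDS)
print(yearly_counts)
```

In this toy input, scientist "c" returns to a US affiliation in 2012 after a 2011 paper affiliated with China, so the leaving year is taken from the paper after 2012 rather than from the first China-affiliated paper. This mirrors the rationale given in the methodology for anchoring on the most recent US affiliation.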