# Thermodynamic and Dynamics Data for Coarse-grained Intrinsically Disordered Proteins Generated by Active Learning

## Author Information
Name: Mike Webb
Email: mawebb@princeton.edu

The content is available under the CC BY-NC-ND 4.0 license.

Dataset citation: Webb, M., Oliver, W., Jacobs, W., & An, Y. (2023). Thermodynamic and Dynamics Data for Coarse-grained Intrinsically Disordered Proteins Generated by Active Learning [Data set]. Princeton University. https://doi.org/10.34770/6TNM-7B56

This distribution compiles thermodynamic and (where available) dynamic properties of short protein sequences as obtained from coarse-grained molecular dynamics simulations. Some pertinent summative details include:
    * The dataset features 2114 protein sequences with sequence lengths ranging from N=20 up to N=50 amino acids. The simulation and analysis of these sequences are described in "Active learning of the thermodynamics-dynamics tradeoff in protein condensates" by Yaxin An, Michael A. Webb*, and William M. Jacobs* (arXiv preprint arXiv:2306.03696, 2023, https://doi.org/10.48550/arXiv.2306.03696).
    * Of the 2114 protein sequences, 80 are homomeric polypeptides (replicating a single amino acid for N = 20, 30, 40, and 50), 1266 are sourced from version 9.0 of the DisProt database, and the remaining 768 sequences are novel sequences generated during an active learning campaign described in the aforementioned manuscript.
    * The simulations were performed using the LAMMPS molecular dynamics engine. 
    * The interactions used for simulation are obtained from R. M. Regy, J. Thompson, Y. C. Kim, and J. Mittal, "Improved coarse-grained model for studying sequence dependent phase separation of disordered proteins," Protein Sci., 2021, 1371-1379.
    * Properties in distribution include second virial coefficients, pressure-density data, expectation for phase behavior at 300 K, estimated condensed-phase densities at 300 K (if predicted to exist), and condensed-phase self-diffusion coefficients at 300 K (if a condensed phase is predicted to exist).

## File Descriptions

The distribution contains nine files:
    * README
    * seq_homomeric.txt
    * seq_heteromeric.txt
    * features_homomeric.csv
    * features_heterometric.csv
    * labels_homomeric.csv
    * labels_heteromeric.csv
    * EOS_homomeric.csv
    * EOS_heteromeric.csv

The contents of each file are summarized below:

    * README --- this file
    * sequences_homomeric.txt --- human-readable text file with the homomeric protein sequences expressed via one-letter amino-acid code. One sequence is listed per line. There are 80 lines.
    * sequences_heteromeric.txt --- human-readable text file with the heteromeric protein sequences expressed via one-letter amino-acid code. One sequence is listed per line. There are 2034 lines. The first 1266 sequences are sourced from DisProt, while the remaining 768 are generated during the course of active learning. 
    * features_homomeric.csv --- human-readable .csv file with numerous sequence characteristics of homomeric polypeptides. Ignoring the header, each row of the .csv file contains values for a specific sequence. The rows are index/line-matched to the sequences presented in `sequences_homomeric.txt` (i.e., the first row corresponds to characteristics of the sequence on the first line of `sequences_homomeric.txt`, the second row corresponds to characteristics of the sequence on the second line, and so on). See the section on `Sequence Characteristics` for a description of headers.
    * features_heterometric.csv --- the same file structure as `features_homomeric.csv` but for the set of heteromeric polypeptides. The rows are index/line-matched to the sequences presented in `sequences_heteromeric.txt`.
    * labels_homomeric.csv --- human-readable .csv file with physical properties of homomeric protein sequences and associated condensed phases, as appropriate. The rows are index/line-matched to the sequences presented in `sequences_homomeric.txt`. See the section on `Physical Properties` for a description of the headers.
    * labels_heteromeric.csv --- the same file structure as `labels_homomeric.csv` but for the set of heteromeric polypeptides. The rows are index/line-matched to the sequences presented in `sequences_heteromeric.txt`.
    * EOS_homomeric.csv --- human-readable .csv file with pressure-density equation-of-state data for homomeric protein sequences. The rows are index/line-matched to the sequences presented in `sequences_homomeric.txt`. See the section on `Physical Properties` for a description of the headers.
    * EOS_heteromeric.csv --- the same file structure as `EOS_homomeric.csv` but for the set of heteromeric polypeptides. The rows are index/line-matched to the sequences presented in `sequences_heteromeric.txt`.
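
The index/line matching described above can be sketched in Python as follows. This is a minimal illustration, not part of the distribution: the function name `load_matched` is hypothetical, and the sketch assumes one sequence per non-empty line and a single header row in each .csv file.

```python
import csv

def load_matched(seq_path, csv_path):
    """Pair each sequence with the .csv row at the same index (header ignored).

    Hypothetical helper: assumes one sequence per non-empty line and exactly
    one header row in the .csv file.
    """
    with open(seq_path) as fh:
        sequences = [line.strip() for line in fh if line.strip()]
    with open(csv_path, newline="") as fh:
        rows = list(csv.DictReader(fh))  # the header row becomes the dict keys
    if len(sequences) != len(rows):
        raise ValueError(f"{len(sequences)} sequences but {len(rows)} data rows")
    return list(zip(sequences, rows))

# Example call (paths are placeholders for the files in this distribution):
# pairs = load_matched("path/to/sequence_file.txt", "path/to/labels_file.csv")
```

Raising an error on a row-count mismatch is a design choice of this sketch; it guards against silently pairing a sequence with the wrong row.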

## Sequence Characteristics

The sequence characteristics contained in the features_*.csv files correspond to those used in the construction of 30-dimensional feature vectors described in the reference paper (An, Webb, Jacobs). 
The 30 columns are ordered as 
    * 1-20) the number of amino acids (A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) in the protein sequence
    * 21) `length', the sequence length (should equal the sum of columns 1-20)
    * 22) `B2 (MFT)', a mean-field estimate of the second virial coefficient in units of cubic Angstroms
    * 23) `SCD', the "sequence charge decoration" parameter (see reference paper for formula)
    * 24) `SHD', the "sequence hydropathy decoration" parameter (see reference paper for formula)
    * 25) `|net charge|', the absolute value of the total charge on the protein sequence
    * 26) `sum lambda', the sum of hydropathy values for the amino acids comprising the protein sequence
    * 27) `beads(+)', the net positive charge on the protein sequence
    * 28) `beads(i)', the net negative charge on the protein sequence
    * 29) `shan ent', the Shannon entropy computed for the protein sequence (see reference paper for formula)
    * 30) `mol wt', the total molecular weight of the protein sequence
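
As an illustration of columns 1-21, the composition counts can be recomputed from a sequence and compared against a features row. This is a sketch under assumptions: the helper names are hypothetical, and a row is treated positionally (values in column order) because the exact header strings of columns 1-20 are not stated here.

```python
from collections import Counter

# Column order 1-20 as listed above.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(sequence):
    """Return the 20 amino-acid counts of a one-letter-code sequence."""
    counts = Counter(sequence)
    unknown = set(counts) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"unexpected residue codes: {sorted(unknown)}")
    return [counts[aa] for aa in AMINO_ACIDS]

def check_feature_row(sequence, row_values):
    """Check columns 1-21 of a features row (given in column order).

    Returns True when the listed counts match the sequence and the
    `length' column equals both their sum and the sequence length.
    """
    counts = composition(sequence)
    listed = [int(float(v)) for v in row_values[:20]]
    length = int(float(row_values[20]))
    return counts == listed and sum(counts) == length == len(sequence)
```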

## Physical Properties

In all cases, valid entries are indicated as floats or integers. Entries of 'N/A' indicate data is not available or irrelevant. 

The labels_*.csv files report up to 8 quantities for corresponding sequences:
    * `B2' is the second virial coefficient in units of Angstrom^3; this is computed using the adaptive biasing force method as reported in the reference paper. This should be present for all sequences.
    * `B2_std' is the standard deviation of B2 values computed from 30 independent replicate simulations.
    * `diff' is the self-diffusion coefficient D in a condensed phase of proteins at an estimated coexistence density at 300 K. The units are 10^(-9) m^2/s. This should only be present for sequences that are predicted to phase separate. The density used in the calculation is estimated after construction of an approximate equation of state. This equation-of-state construction is done for all sequences that possess B2 < 0.
    * `diff_std' is the standard deviation of D values computed from 30 independent simulations containing 100 chains.
    * `density' is the estimated condensed-phase coexistence density at 300 K. This density is set as the largest density rho that satisfies P(rho)=0 for a given sequence based on cubic spline interpolation. If P(rho) > 0 for all rho (i.e., no zero is predicted), then the sequence is presumed not to phase separate. The units are g/mL. 
    * `molarity_chains' is the molarity expressed as (moles of protein/L)
    * `molarity_AA' is the molarity expressed as (moles of amino acids/L)
    * `psp' is a binary indicator variable that is `0' if no phase separation is expected to occur and `1' if phase separation is expected to occur. A value of `1' should coincide with valid entries for `diff', `diff_std', `density', and `molarity_*'.
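
A minimal sketch of how the `N/A' convention and the `psp' rule above might be handled when reading a labels row. The function names are hypothetical; the input is assumed to be a dict of strings keyed by the headers listed above (for example, one row from `csv.DictReader`).

```python
def parse_entry(text):
    """Convert a .csv entry to float, mapping 'N/A' to None."""
    text = text.strip()
    return None if text == "N/A" else float(text)

# Quantities that should be valid whenever psp is 1, per the list above.
CONDENSED_KEYS = ("diff", "diff_std", "density", "molarity_chains", "molarity_AA")

def parse_label_row(row):
    """Parse every entry of a labels row (dict of header -> string)."""
    return {key: parse_entry(value) for key, value in row.items()}

def psp_is_consistent(parsed):
    """True unless psp is 1 while a condensed-phase entry is missing.

    Only the psp = 1 direction is checked; rows with psp = 0 pass.
    """
    if parsed["psp"] != 1:
        return True
    return all(parsed[key] is not None for key in CONDENSED_KEYS)
```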

In addition, the labels_*.csv files include a column with header `generation' that indicates the origin of the sequence with respect to the reference paper:
    * a label of `-1' indicates that the sequence is simply a homomeric polypeptide
    * a label of `0' indicates that the sequence was sourced from the DisProt dataset
    * labels of 1, 2, 3, 4, 5, 6, 7, or 8 indicate the iteration of active learning during which the sequence was generated

The EOS_*.csv files report average pressures (and standard errors) obtained from simulations run at a series of fixed densities. These are structured in the following manner:
    * Possible density values are 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.5, and 2.0 g/mL.
    * Sequences that exhibit B2 > 0 do not have any valid entries.
    * Sequences with B2 < 0 do not necessarily possess data at all densities (particularly high densities).
    * Columns alternate between pressures and standard errors. For example, the column of `rho_0.2' will provide the pressure (in units of atmospheres) at a density of 0.2 g/mL, while `std_err_0.2' provides the corresponding standard error as determined using bootstrapping. The headers of other columns are interpretable in analogous fashion.
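
The following sketch shows one way to read the pressure-density pairs from an EOS row and to locate the largest density at which the pressure crosses zero. It is an illustration under assumptions, not the procedure used to produce `density': the function names are hypothetical, headers are assumed to follow the `rho_<density>' pattern, and linear interpolation between neighboring points is used as a simpler stand-in for the cubic spline interpolation described in the Physical Properties section, so the resulting values will generally differ from the reported ones.

```python
def eos_points(row):
    """Collect (density, pressure) pairs from an EOS row, skipping 'N/A'.

    `row` is a dict of header -> string; densities are read from the
    header suffix after 'rho_' (assumed naming pattern).
    """
    points = []
    for header, value in row.items():
        if header.startswith("rho_") and value.strip() != "N/A":
            points.append((float(header[len("rho_"):]), float(value)))
    return sorted(points)

def largest_zero_crossing(points):
    """Largest density with P(rho) = 0, using linear interpolation.

    Scans neighboring points from high to low density and returns None
    if the pressure never reaches or crosses zero.
    """
    for i in range(len(points) - 2, -1, -1):
        (r0, p0), (r1, p1) = points[i], points[i + 1]
        if p1 == 0:
            return r1
        if p0 * p1 < 0:  # sign change between the two densities
            return r0 + (r1 - r0) * (-p0) / (p1 - p0)
    if points and points[0][1] == 0:
        return points[0][0]
    return None
```

A return value of `None` corresponds to the case in which no zero is predicted, i.e., the sequence is presumed not to phase separate.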