Ckavity Library
The Ckavity Library is a collection of novel genes templated on the four-helix bundle, S-824, designed for overexpression in E. coli. The top of the bundle was targeted for the creation of a cavity and potential functional site. Within this variable region, several individual base positions were varied combinatorially: Catalytic/Core residues were designed to encode a mix of polar/catalytic and hydrophobic amino acids, while Loop-Forming residues were designed to encode a mix of hydrophilic and flexible amino acids to facilitate loop formation. Based on this design, the collection could contain 3.53 x 10^12 possible unique sequences, and experimental results suggest that a minimum of 10^6 unique sequences were obtained. Next-Generation Sequencing was performed to investigate the quality and diversity of the library.
_________
Next-Generation Sequencing
The quality and diversity of the library were assessed by high-throughput sequencing (Genomics Core Facility, Lewis-Sigler Institute for Integrative Genomics, Princeton University). Briefly, PCR was performed on plasmid DNA to generate amplicons for sequencing (MiSeq Micro, 350 cycles, read length 300 nt). The chosen amplicon covered all the variable regions so that each read retains the information for a complete individual sequence. These data were translated to the corresponding amino acids and analyzed in depth for each of the variable positions, with a constant position as a control; specifically, the percent occurrence of each amino acid at a particular position was determined.
_________
Table of Contents
CKavity_Lib_Reference: Reference information explaining the library design at the DNA level and the encoded amino acids at the protein level.
1-Raw_Data_20180425: Contains raw data obtained from sequencing, in the commonly used FASTQ format. Two separate reads are provided in the contained files.
2-Results_Summary: Analyses performed on the raw data.
  ANALYSIS: Contains text files that can be opened in Excel.
    Filtered results exclude frameshifts while raw results do not.
    AA_FREQS: Amino acid frequencies for each of the designed variable codons.
    NT_FREQS: Nucleotide frequencies for each of the designed variable base positions.
  PIPELINE: Scripts used in the generation of these data.
  QC: Quality control information generated during DNA sequencing.
  REFERENCE: Material to assist in interpretation of the data.
    amplicon: FASTA DNA file showing the linear PCR product derived from the CKavity library, which was then used for all sequencing.
    probes: Sequences of the oligonucleotide probes used for Next-Generation Sequencing.
3-Random_Invariant: Contains CSV files repeating this analysis for random positions not designed to have combinatorial diversity, for comparison purposes.
_________
File Types
FASTQ files are the standard format for high-throughput sequencing output. They are text files containing each read's nucleotide sequence and its per-base quality scores.
FASTA files convey nucleotide sequences in single-letter codes, such as the amplicon and probes used for Next-Generation Sequencing. These files can be opened in any sequence viewer, such as SnapGene, or in a plain text editor.
.csv files, viewable in Excel, were obtained by analysing the data contained in the FASTQ files, according to the scripts presented in PIPELINE.
HTML files for QC data can be viewed using a web browser.
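_________
Illustrative Example
The per-position analysis described under Next-Generation Sequencing (translate each read, then tabulate the percent occurrence of each amino acid at one codon position) can be sketched in a few lines of Python. This is a minimal sketch and not the code in PIPELINE: the function names `translate` and `aa_percentages` are hypothetical, and dropping reads whose length differs from the expected amplicon length is only an assumed stand-in for the frameshift filter that separates filtered from raw results.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(dna):
    """Translate an in-frame DNA string; unrecognised codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def aa_percentages(reads, codon_index, expected_length=None):
    """Percent occurrence of each amino acid at one 0-based codon position.

    If expected_length is given, reads of any other length are skipped.
    This is an assumed, simplified proxy for excluding frameshifts.
    """
    counts = Counter()
    for read in reads:
        if expected_length is not None and len(read) != expected_length:
            continue
        protein = translate(read)
        if codon_index < len(protein):
            counts[protein[codon_index]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: 100.0 * n / total for aa, n in counts.items()}
```

Calling `aa_percentages(reads, codon_index)` without `expected_length` corresponds to a raw tally, while passing the amplicon length gives a filtered one. Reading the sequences out of the FASTQ files is left out here; any FASTQ parser that yields the read strings would do.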