Citation
For all resources provided on the Intensification website, including the motifs, amino acid frequencies and supplementary materials, please cite:
Chen J, Wang B, Regan L, Gerstein M. Intensification: A resource for amplifying population-genetic signals with protein repeats (in submission)
Querying the Database
Please follow the instructions on the 'Query' page to get SNV information regarding motifs of the repeat protein domains (RPDs) from our database. Users can now (1) input a genomic region or an SNV position (1-based), (2) choose one of our 12 SMART database RPDs, or (3) input a PDB ID, to find SNVs within an RPD-containing protein. For more information on the SMART motifs, please visit the 'Download' page.
Resource
For each repeat protein motif, the following are available in the zip file under the 'Download' section:
Amino acid frequencies
(1) .aamat - amino acid frequency table, containing the raw amino acid counts at each position of the most common motif in the class of repeat domain
- aacount
- file contains: > row = sequence position
> col = amino acid + others (the "others" column shows which other "amino acids" are also present in the sequences)
> entry = raw counts
- amino acid table, containing the raw frequency of each amino acid at each position of the most common motif in the class of repeat domain
- aafreq = aacount / numSequences
- file contains: > first 2 lines: the number of sequences and length of sequence
> row = sequence position
> col = amino acid, ranked by raw frequency
> entry = raw frequency of the amino acid at that position
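The relation aafreq = aacount / numSequences can be sketched in a few lines. This is a minimal illustration, not the code used to build the resource: the function names, the toy sequences, and the rule that any symbol outside the 20 standard amino acids goes to "others" are all assumptions made for the example.

```python
from collections import Counter

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"

def position_counts(sequences):
    """Count residues at each motif position (hypothetical helper).

    Symbols outside the 20 standard amino acids are pooled under "others".
    All sequences are assumed to have the same length as the first one.
    """
    length = len(sequences[0])
    counts = []
    for pos in range(length):
        column = Counter()
        for seq in sequences:
            residue = seq[pos]
            key = residue if residue in STANDARD_AA else "others"
            column[key] += 1
        counts.append(column)
    return counts

def position_frequencies(counts, num_sequences):
    """aafreq = aacount / numSequences, computed per position."""
    return [
        {aa: n / num_sequences for aa, n in column.items()}
        for column in counts
    ]

# Toy aligned motif instances, invented for illustration only.
toy_motifs = ["GADV", "GNTA", "GXTP", "GADP"]
aacount = position_counts(toy_motifs)
aafreq = position_frequencies(aacount, len(toy_motifs))
```

Here `aacount[1]` holds the raw counts for the second motif position and `aafreq[1]` the corresponding raw frequencies.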
- amino acid table, containing the global propensities at each position of the most common motif in the class of repeat domain; global propensities are calculated by normalizing each amino acid frequency by its natural occurrence in the human proteome (NCBI, non-redundant, downloaded Jan 2012): ('L'=>0.09975, 'A'=>0.07013, 'S'=>0.08326, 'V'=>0.05961, 'G'=>0.06577, 'K'=>0.05723, 'T'=>0.05346, 'I'=>0.04332, 'E'=>0.07096, 'P'=>0.06316, 'R'=>0.05650, 'D'=>0.04728, 'F'=>0.03658, 'Q'=>0.04758, 'N'=>0.03586, 'Y'=>0.02653, 'C'=>0.02307, 'H'=>0.02639, 'M'=>0.02131, 'W'=>0.01216)
- aafreq / weight, where weight is the amino acid's proteome frequency listed above
- file contains: > first 2 lines: the number of sequences and length of sequence (calculated from the first sequence)
> row = sequence position
> col = amino acid, ranked by global propensities
> entry = global propensity of aa at that position
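As a rough sketch of the aafreq / weight calculation, the snippet below divides the raw frequencies at one position by the proteome weights listed above and ranks the result. The function name and the example column are hypothetical; only the weights come from this page.

```python
# Background amino acid frequencies in the human proteome, as listed above.
BACKGROUND = {
    "L": 0.09975, "A": 0.07013, "S": 0.08326, "V": 0.05961, "G": 0.06577,
    "K": 0.05723, "T": 0.05346, "I": 0.04332, "E": 0.07096, "P": 0.06316,
    "R": 0.05650, "D": 0.04728, "F": 0.03658, "Q": 0.04758, "N": 0.03586,
    "Y": 0.02653, "C": 0.02307, "H": 0.02639, "M": 0.02131, "W": 0.01216,
}

def global_propensity(aafreq_column):
    """Return (amino acid, aafreq / weight) pairs, highest propensity first.

    Keys without a background weight (for example "others") are skipped,
    which is an assumption of this sketch.
    """
    scores = {
        aa: freq / BACKGROUND[aa]
        for aa, freq in aafreq_column.items()
        if aa in BACKGROUND
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented raw frequencies at a single motif position.
column = {"G": 0.5, "W": 0.25, "L": 0.25}
ranked = global_propensity(column)
```

In this toy column, tryptophan ranks first even though glycine is more frequent, because tryptophan is rare in the background proteome.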
- amino acid table, containing the relative entropy at each position of the most common motif in the class of repeat domain; relative entropy is calculated from each amino acid frequency and its occurrence in the human proteome (the weight given above)
- aafreq * log(aafreq / weight)
- file contains: > first 2 lines: the number of sequences and length of sequence
> row = sequence position
> col = amino acid, ranked by relative entropy
> entry = relative entropy of aa at that position
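The per-amino-acid term aafreq * log(aafreq / weight) could be computed as follows. This is a sketch under stated assumptions: the natural logarithm is used because the page does not give the log base, a frequency of zero is given a term of 0 by convention, and the function name and example column are invented.

```python
import math

def relative_entropy_terms(aafreq_column, background):
    """Compute aafreq * log(aafreq / weight) for each amino acid.

    Assumptions: natural log; a zero frequency contributes 0.
    """
    terms = {}
    for aa, freq in aafreq_column.items():
        if freq > 0:
            terms[aa] = freq * math.log(freq / background[aa])
        else:
            terms[aa] = 0.0
    return terms

# Subset of the proteome weights listed above, enough for this example.
weights = {"G": 0.06577, "W": 0.01216, "L": 0.09975, "A": 0.07013}

# Invented raw frequencies at a single motif position.
column = {"G": 0.5, "W": 0.25, "L": 0.25, "A": 0.0}
terms = relative_entropy_terms(column, weights)

# Rank amino acids by their relative entropy term, highest first.
ranked = sorted(terms, key=terms.get, reverse=True)
```

Unlike the global propensity, this term also scales with the frequency itself, so a common residue that is moderately enriched can outrank a rare one that is strongly enriched.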
- sequence logo generated by WebLogo
SNV profiles, with ExAC SNV, SMART domain and VEP annotation
(6) .sift.bed - BED file with SIFT scores, combining ExAC SNV, SMART domain and VEP annotation; README explains column definitions
- tab-delimited file, with Rare/Common, Non-Synonymous/Synonymous ratios
- file contains: > first line = header
> row = each position on motif
> col = counts of rare (R), common (C), non-synonymous (NS) and synonymous (S) SNVs, with and without singletons (noS), plus R/C and NS/S ratios
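The two ratios in this file are simple quotients of the per-position counts. The sketch below shows one way to compute them; the function names are hypothetical, and returning None when the denominator is zero is an assumption, since this page does not say how the file represents that case.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None if the denominator is zero."""
    return numerator / denominator if denominator else None

def position_ratios(rare, common, nonsyn, syn):
    """Compute the R/C and NS/S ratios for one motif position."""
    return {
        "R/C": safe_ratio(rare, common),
        "NS/S": safe_ratio(nonsyn, syn),
    }

# Invented counts for two motif positions.
pos_a = position_ratios(rare=30, common=3, nonsyn=12, syn=8)
pos_b = position_ratios(rare=5, common=0, nonsyn=4, syn=2)
```

The same function can be applied to the counts without singletons (noS) to obtain the corresponding ratios.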
- tab-delimited file, with header, with ancestral allele (1000 Genomes Phase 1), DAF and delta DAF information (ExAC)
- file contains: > first line = header
> row = SNV found with ancestral allele (AA) in 1000 Genomes Phase 1
> col = SNV-related features, including AA, derived allele frequency (DAF) and delta DAF between populations derived from ExAC; README explains column definitions
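For readers unfamiliar with DAF, the sketch below shows a common way to derive it from an alternate allele frequency and an ancestral allele call. It is an illustration only: the function names are hypothetical, the case-insensitive allele comparison is an assumption, and defining delta DAF as the plain difference between two populations is also an assumption, so the README remains the reference for how the columns in this file were computed.

```python
def derived_allele_freq(ref, alt, ancestral, alt_freq):
    """Return the derived allele frequency for a biallelic SNV.

    If the ancestral allele matches the reference, the alternate allele is
    derived and DAF equals the alternate allele frequency. If it matches the
    alternate, the reference is derived and DAF is 1 - alt_freq. Otherwise
    the derived state cannot be assigned and None is returned.
    """
    ancestral = ancestral.upper()
    if ancestral == ref.upper():
        return alt_freq
    if ancestral == alt.upper():
        return 1.0 - alt_freq
    return None

def delta_daf(daf_pop_a, daf_pop_b):
    """Difference in DAF between two populations (assumed definition)."""
    if daf_pop_a is None or daf_pop_b is None:
        return None
    return daf_pop_a - daf_pop_b

# Invented example: one SNV with alternate allele frequencies in two populations.
daf_a = derived_allele_freq(ref="A", alt="G", ancestral="g", alt_freq=0.1)
daf_b = derived_allele_freq(ref="A", alt="G", ancestral="g", alt_freq=0.9)
difference = delta_daf(daf_a, daf_b)
```

In this example the ancestral allele matches the alternate allele, so the reference allele is the derived one and each DAF is one minus the alternate allele frequency.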
Data sources
Genomic Variation:
1000 Genomes Project (Abecasis G. et al., Nature 2012); PMID: 23128226
ESP6500, Exome Sequencing Project (Tennessen et al., Science 2012); PMID: 22604720
ExAC (Exome Aggregation Consortium et al., bioRxiv, 2016); DOI: http://dx.doi.org/10.1101/030338
Protein Motifs:
SMART database (Letunic I. et al., Nucleic Acids Res, 2014); PMID: 25300481
Ensembl (Yates A. et al., Nucleic Acids Res, 2016); PMID: 26687719