-
A simple statistical test to infer the causality of target/phenotype correlation from small molecule phenotypic screens
Motivation: Cell-based phenotypic screens using small molecule inhibitors are an important technology for early drug discovery, provided that the relationship between the disease-related cellular phenotype and the inhibitors' biological targets can be determined. However, chemical inhibitors are rightly believed to be less specific than perturbation by biological agents, such as antibodies and small interfering RNA. Because of the off-target effects of the inhibitors, it is therefore often a challenge in small molecule phenotypic screening to infer causality between a particular cellular phenotype and the inactivation of the responsible protein.
Results: In this article, we present a Roche in-house effort of screening 746 structurally diverse compounds for their cytotoxicity in HeLa cells, measured by high content imaging technology. These compounds were also systematically profiled for their on-target and off-target binding affinity to a panel of 25 pre-selected protein kinases in a cell-free system. In searching for the kinases whose activities are crucial for cell survival, we found that a simple association method such as the chi-square test yields a large number of false positives, because the observed cytotoxic phenotype is likely to result from the promiscuous action of less specific inhibitors rather than from the inactivation of a single relevant target. We demonstrated that a stratified categorical data analysis technique such as the Cochran–Mantel–Haenszel test is an effective approach to extract meaningful biological connections from the spurious correlations resulting from confounding covariates. This study indicates that, empowered by appropriate statistical adjustment, small molecule inhibitor perturbation remains a powerful tool to pin down the relevant biomarkers for drug safety and efficacy research.
Contact: xin.wei@roche.com
Supplementary information: Supplementary data are available at Bioinformatics online.
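To illustrate the contrast drawn in the Results above, the following Python sketch compares a pooled chi-square test with a Cochran–Mantel–Haenszel test on stratified 2x2 tables. The counts, the choice of compound promiscuity as the stratifying covariate, and the function names are hypothetical and are not taken from the study; the p-values assume the usual chi-square approximation with one degree of freedom and no continuity correction.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def pearson_chi2(table):
    """Pearson chi-square (no continuity correction) for one 2x2 table ((a, b), (c, d))."""
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return stat, chi2_sf_1df(stat)


def cmh_test(strata):
    """Cochran-Mantel-Haenszel statistic (no continuity correction) over 2x2 strata."""
    observed = expected = variance = 0.0
    for (a, b), (c, d) in strata:
        n = a + b + c + d
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    stat = (observed - expected) ** 2 / variance
    return stat, chi2_sf_1df(stat)


# Hypothetical toy counts. Rows: compound inhibits kinase K (yes, no).
# Columns: compound is cytotoxic (yes, no). Strata: compound promiscuity.
selective = ((2, 18), (10, 90))    # 10% cytotoxic regardless of K inhibition
promiscuous = ((45, 45), (5, 5))   # 50% cytotoxic regardless of K inhibition

# Pooling the strata ignores promiscuity and creates a spurious association.
pooled = tuple(
    tuple(x + y for x, y in zip(row_s, row_p))
    for row_s, row_p in zip(selective, promiscuous)
)

pooled_stat, pooled_p = pearson_chi2(pooled)
cmh_stat, cmh_p = cmh_test([selective, promiscuous])
print(f"pooled chi-square: {pooled_stat:.2f} (p = {pooled_p:.2g})")
print(f"CMH statistic:     {cmh_stat:.2f} (p = {cmh_p:.2g})")
```

In this toy data set, kinase inhibition and cytotoxicity are independent within each stratum, yet the pooled table shows a strong association (a statistic of about 23), which the stratified statistic (0) removes.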
-
InFiRe -- a novel computational method for the identification of insertion sites in transposon mutagenized bacterial genomes
Motivation: InFiRe, Insertion Finder via Restriction digest, is a novel software tool that allows for the computational identification of transposon insertion sites in known bacterial genome sequences after transposon mutagenesis experiments. The approach is based on the fact that restriction endonuclease digestions of bacterial DNA yield a unique pattern of DNA fragments with defined sizes. Transposon insertion changes the size of the hosting DNA fragment by a known number of base pairs. The exact size of this fragment can be determined by Southern blot hybridization. Subsequently, the position of insertion can be identified with computational analysis. The outlined method provides a solid basis for the establishment of a new high-throughput technology.
Availability and implementation: The software is freely available on our web server at www.infire.tu-bs.de. The algorithm was implemented in the statistical programming language R. For the most flexible use, InFiRe is provided in two different versions. A web interface offers convenient use in a web browser. In addition, the software and source code are freely available for download as R packages on our website.
Contact: m.steinert@tu-bs.de
Supplementary information: Supplementary data are available at Bioinformatics online.
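The idea described in the Motivation above can be sketched in a few lines of Python. This is not the InFiRe code (which is implemented in R); the toy genome, the transposon length, the size tolerance and all function names are assumptions made for illustration. The sketch treats the genome as linear, places each cut at the start of the recognition site, and assumes the enzyme does not cut within the transposon.

```python
import re


def fragments(genome, site):
    """Return (start, end) intervals of a complete digest of a linear sequence."""
    # The lookahead also finds overlapping occurrences of the site.
    cuts = [m.start() for m in re.finditer(f"(?={re.escape(site)})", genome)]
    bounds = [0] + cuts + [len(genome)]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]


def candidate_regions(genome, site, observed_size, transposon_len, tolerance):
    """Fragments whose native size matches the observed size minus the transposon."""
    native_size = observed_size - transposon_len
    return [
        (s, e)
        for s, e in fragments(genome, site)
        if abs((e - s) - native_size) <= tolerance
    ]


def intersect(regions_a, regions_b):
    """Overlaps between two lists of half-open intervals."""
    overlaps = []
    for s1, e1 in regions_a:
        for s2, e2 in regions_b:
            s, e = max(s1, s2), min(e1, e2)
            if s < e:
                overlaps.append((s, e))
    return overlaps


# Hypothetical 198 bp toy genome with two sites each for two enzymes.
ECORI, BAMHI = "GAATTC", "GGATCC"
genome = "A" * 30 + ECORI + "A" * 50 + BAMHI + "A" * 20 + ECORI + "A" * 40 + BAMHI + "A" * 34

TRANSPOSON_LEN = 1000  # assumed known insert size in bp
# Hypothetical fragment sizes read from the blot for each digest.
eco_hits = candidate_regions(genome, ECORI, 1082, TRANSPOSON_LEN, tolerance=2)
bam_hits = candidate_regions(genome, BAMHI, 1072, TRANSPOSON_LEN, tolerance=2)
narrowed = intersect(eco_hits, bam_hits)
print(eco_hits, bam_hits, narrowed)
```

Combining the candidates from two independent digests narrows the insertion to the region shared by both fragments, here positions 86 to 112 of the toy genome.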
-
SomaticSniper: identification of somatic point mutations in whole genome sequencing data
Motivation: The sequencing of tumors and their matched normals is frequently used to study the genetic composition of cancer. Despite this fact, there remains a dearth of available software tools designed to compare sequences in pairs of samples and identify sites that are likely to be unique to one sample.
Results: In this article, we describe the mathematical basis of our SomaticSniper software for comparing tumor and normal pairs. We estimate its sensitivity and precision, and present several common sources of error resulting in miscalls.
Availability and implementation: SomaticSniper is implemented in C and supported on Linux and Mac OS X. Binaries are freely available for download at http://gmt.genome.wustl.edu/somatic-sniper/current/.
Contact: delarson@wustl.edu; lding@wustl.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
-
Detection of microRNAs in color space
Motivation: Deep sequencing provides inexpensive opportunities to characterize the transcriptional diversity of known genomes. The AB SOLiD technology generates millions of short sequencing reads in color-space; that is, the raw data is a sequence of colors, where each color represents 2 nt and each nucleotide is represented by two consecutive colors. This strategy is purported to have several advantages, including increased ability to distinguish sequencing errors from polymorphisms. Several programs have been developed to map short reads to genomes in color space. However, a number of previously unexplored technical issues arise when using SOLiD technology to characterize microRNAs.
Results: Here we explore these technical difficulties. First, since the sequenced reads are longer than the biological sequences, every read is expected to contain linker fragments. The color-calling error rate increases toward the 3' end of the read such that recognizing the linker sequence for removal becomes problematic. Second, mapping in color space may lead to the loss of the first nucleotide of each read. We propose a sequential trimming and mapping approach to map small RNAs. Using our strategy, we reanalyze three published insect small RNA deep sequencing datasets and characterize 22 new microRNAs.
Availability and implementation: A bash shell script to perform the sequential trimming and mapping procedure, called SeqTrimMap, is available at: http://www.mirbase.org/tools/seqtrimmap/
Contact: antonio.marco@manchester.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
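The color-space representation described in the Motivation above can be illustrated with a small Python sketch. It assumes the standard two-base encoding, in which a color is the XOR of 2-bit codes for adjacent nucleotides (A=0, C=1, G=2, T=3), and it passes the first base explicitly. The function names and sequences are hypothetical, and the sketch is unrelated to the SeqTrimMap script.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def encode(seq):
    """Return the first base and one color per pair of adjacent nucleotides."""
    colors = [CODE[a] ^ CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors


def decode(first_base, colors):
    """Rebuild the nucleotide sequence from the first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(BASES[CODE[bases[-1]] ^ color])
    return "".join(bases)


reference = "ACGTAC"
first, ref_colors = encode(reference)

# A true single-nucleotide change (G -> T at index 2) alters two adjacent colors.
_, snp_colors = encode("ACTTAC")
snp_diffs = [i for i, (x, y) in enumerate(zip(ref_colors, snp_colors)) if x != y]

# A single miscalled color (index 2) corrupts every base decoded after it.
error_colors = ref_colors[:2] + [2] + ref_colors[3:]
decoded_with_error = decode(first, error_colors)

print(ref_colors, snp_colors, snp_diffs, decoded_with_error)
```

A polymorphism therefore appears as two consistent adjacent color changes, whereas an isolated color change is more likely a sequencing error. The decoding step also shows why a read whose first base is unknown cannot be translated back to nucleotides directly.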
-
SCPC: a method to structurally compare protein complexes
Motivation: Protein–protein interactions play vital functional roles in various biological phenomena. Physical contacts between proteins have been revealed using experimental approaches that have solved the structures of protein complexes at atomic resolution. To examine the huge number of protein complexes available in the Protein Data Bank, an efficient automated method that compares protein complexes is required.
Results: We have developed Structural Comparison of Protein Complexes (SCPC), a novel method to structurally compare protein complexes. SCPC compares the spatial arrangements of subunits in a complex with those in another complex using secondary structure elements. Similar substructures are detected in two protein complexes and the similarity is scored. SCPC was applied to dimers, homo-oligomers and haemoglobins. SCPC estimated structural similarities between the dimers examined as well as an existing method, MM-align, did. Conserved substructures were detected in a homo-tetramer and a homo-hexamer composed of homologous proteins. Classification of quaternary structures of haemoglobins using SCPC was consistent with the conventional classification. The results demonstrate that SCPC is a valuable tool to investigate the structures of protein complexes.
Availability: SCPC is available at http://idp1.force.cs.is.nagoya-u.ac.jp/scpc/.
Contact: rkoike@is.nagoya-u.ac.jp
Supplementary information: Supplementary data are available at Bioinformatics online.
-
Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors
Motivation: Nucleotides are multifunctional molecules that are essential for numerous biological processes. They serve as sources of chemical energy, participate in cellular signaling and are involved in enzymatic reactions. Knowledge of nucleotide–protein interactions helps with the annotation of protein functions and finds applications in drug design.
Results: We propose a novel ensemble of accurate high-throughput predictors of binding residues from the protein sequence for ATP, ADP, AMP, GTP and GDP. Empirical tests show that our NsitePred method significantly outperforms existing predictors and approaches based on sequence alignment and residue conservation scoring. NsitePred accurately finds more binding residues and binding sites, and it performs particularly well for sites with residues that are clustered close together in the sequence. The high predictive quality stems from the use of novel, comprehensive and custom-designed inputs that utilize information extracted from the sequence, evolutionary profiles, several sequence-predicted structural descriptors and sequence alignment. Analysis of the predictive model reveals several sequence-derived hallmarks of nucleotide-binding residues: they are usually conserved and flanked by less conserved residues, and they are associated with certain arrangements of secondary structures and amino acid pairs in specific neighboring positions in the sequence.
Availability: http://biomine.ece.ualberta.ca/nSITEpred/
Contact: lkurgan@ece.ualberta.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
-
HydroPaCe: understanding and predicting cross-inhibition in serine proteases through hydrophobic patch centroids
Motivation: Protein–protein interfaces contain important information about molecular recognition. The discovery of conserved patterns is essential for understanding how substrates and inhibitors are bound and for predicting molecular binding. When an inhibitor binds to different enzymes (e.g. enzymes with dissimilar sequences, structures or mechanisms), which we call cross-inhibition, identification of invariants is a difficult task for which traditional methods may fail.
Results: To clarify how cross-inhibition happens, we model the problem and propose and evaluate a methodology called HydroPaCe to detect conserved patterns. Interfaces are modeled as graphs of atomic apolar interactions; hydrophobic patches are computed and summarized by centroids (HP-centroids), and their conservation is detected. Despite sequence and structure dissimilarity, our method achieves an appropriate level of abstraction to obtain invariant properties in cross-inhibition. We show examples in which HP-centroids successfully predicted enzymes that could be inhibited by the studied inhibitors according to the BRENDA database.
Availability: www.dcc.ufmg.br/~raquelcm/hydropace
Contact: valdetemg@ufmg.br; raquelcm@dcc.ufmg.br; santoro@icb.ufmg.br
Supplementary information: Supplementary data are available at Bioinformatics online.
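The graph-and-centroid modeling described in the Results above can be sketched as follows. This is a simplified illustration, not the HydroPaCe implementation: it assumes that graph nodes are apolar atoms, that an edge joins two atoms closer than a distance cutoff, and that a hydrophobic patch is a connected component of that graph. The coordinates, the 4.0 cutoff and the function names are hypothetical.

```python
import math
from itertools import combinations


def patches(coords, cutoff):
    """Group atoms into connected components of the distance-cutoff graph."""
    parent = list(range(len(coords)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(coords)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def centroid(coords, members):
    """Arithmetic mean of the coordinates of the atoms in one patch."""
    n = len(members)
    return tuple(sum(coords[i][k] for i in members) / n for k in range(3))


# Hypothetical apolar atom coordinates forming two well-separated groups.
apolar_atoms = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (20, 0, 0), (23, 0, 0)]
found = patches(apolar_atoms, cutoff=4.0)
hp_centroids = [centroid(apolar_atoms, members) for members in found]
print(found, hp_centroids)
```

Summarizing each patch by a single point discards atom-level detail, which is one way to compare interfaces whose sequences and structures differ.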
-
Inhibition of HIV-1 protease: the rigidity perspective
Motivation: HIV-1 protease is a key drug target due to its role in the life cycle of the HIV-1 virus. Rigidity analysis using the software First is a computationally inexpensive method for inferring functional information from protein crystal structures. We evaluate the rigidity of 206 high-resolution (2 Å or better) X-ray crystal structures of HIV-1 protease and compare the effects of different inhibitors binding to the enzyme.
Results: Inhibitor binding has little effect on the overall rigidity of the protein homodimer, including the rigidity of the active site. The principal effect of inhibitor binding on rigidity is to constrain the flexibility of the β-hairpin flaps, which move to allow access to the active site of the enzyme. We show that commercially available antiviral drugs which target HIV-1 protease can be divided into two classes, those which significantly affect flap rigidity and those which do not. The non-peptidic inhibitor tipranavir is distinctive in its consistently strong effect on flap rigidity.
Contact: jack.heal@warwick.ac.uk; r.roemer@warwick.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
-
M3: an improved SNP calling algorithm for Illumina BeadArray data
Summary: Genotype calling from high-throughput platforms such as Illumina and Affymetrix is a critical step in data processing, so that accurate information on genetic variants can be obtained for phenotype–genotype association studies. A number of algorithms have been developed to infer genotypes from data generated through the Illumina BeadStation platform, including GenCall, GenoSNP, Illuminus and CRLMM. Most of these algorithms, such as GenCall with the GenTrain clustering algorithm, are built on population-based statistical models that genotype every SNP in turn, and they require a large reference population to perform well. These approaches may not work well for rare variants, where only a small proportion of the individuals carry the variant. A fundamentally different approach, implemented in GenoSNP, adopts a single nucleotide polymorphism (SNP)-based model to infer genotypes of all the SNPs in one individual, making it an appealing alternative for calling rare variants. However, compared to the population-based strategies, more SNPs in GenoSNP may fail the Hardy–Weinberg Equilibrium test. To take advantage of both strategies, we propose a two-stage SNP calling procedure, named the modified mixture model (M3), to improve call accuracy for both common and rare variants. The effectiveness of our approach is demonstrated through applications to genotype calling on a set of HapMap samples used for quality control purposes in a large case–control study of cocaine dependence. The increase in power with M3 is greater for rare variants than for common variants, depending on the model.
Availability: M3 algorithm: http://bioinformatics.med.yale.edu/group.
Contact: name@bio.com; hongyu.zhao@yale.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
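For readers unfamiliar with the Hardy–Weinberg Equilibrium test mentioned in the Summary above, the following Python sketch shows a basic chi-square version for one biallelic SNP. It is not part of M3; the genotype counts and the function name are hypothetical, and the sketch assumes both alleles are observed. The chi-square approximation with one degree of freedom is unreliable when expected counts are small, as for rare variants, where an exact test is usually preferred.

```python
import math


def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions from three genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2.0))


# Hypothetical genotype calls for two SNPs in 100 individuals.
in_equilibrium = hwe_chi2(25, 50, 25)   # matches the expected 1:2:1 proportions
het_deficit = hwe_chi2(40, 20, 40)      # too few heterozygotes
print(in_equilibrium, het_deficit)
```

A SNP with a strong heterozygote deficit or excess, like the second example, is often flagged as a possible genotype calling problem.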
-
Gaussian process modelling for bicoid mRNA regulation in spatio-temporal Bicoid profile
Motivation: Bicoid protein molecules, translated from maternally provided bicoid mRNA, establish a concentration gradient in Drosophila early embryonic development. There is experimental evidence that the synthesis and subsequent destruction of this protein is regulated at source by precise control of the stability of the maternal mRNA. Can we infer the driving function at the source from noisy observations of the spatio-temporal protein profile? We use non-parametric Gaussian process regression for modelling the propagation of Bicoid in the embryo and infer aspects of source regulation as a posterior function.
Results: With synthetic data from a 1D diffusion model with a source simulated to model mRNA stability regulation, our results establish that the Gaussian process method can accurately infer the driving function and capture the spatio-temporal dynamics of embryonic Bicoid propagation. On real data from the FlyEx database, the reconstructed source function is likewise indicative of stability regulation, but it is temporally smoother than we expected, partly because the dataset is only partially observed. To be in line with recent thinking on the subject, we also analyse this model with a spatial gradient of maternal mRNA, rather than with the mRNA fixed only at the anterior pole.
Contact: m.niranjan@southampton.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
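As background to the non-parametric Gaussian process regression mentioned in the Motivation above, the sketch below computes a Gaussian process posterior mean in plain Python. It is a generic illustration only and does not reproduce the inference of the source function: the squared-exponential kernel, its hyperparameters, the noise variance, the exponential toy profile and all function names are assumptions.

```python
import math


def rbf(x1, x2, length_scale, variance):
    """Squared-exponential covariance between two scalar inputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def gp_posterior_mean(x_train, y_train, x_test, length_scale, variance, noise_var):
    """Posterior mean k_*^T (K + noise_var I)^{-1} y of a zero-mean GP."""
    k = [
        [rbf(a, b, length_scale, variance) + (noise_var if i == j else 0.0)
         for j, b in enumerate(x_train)]
        for i, a in enumerate(x_train)
    ]
    alpha = solve(k, y_train)
    return [
        sum(rbf(xs, xt, length_scale, variance) * al for xt, al in zip(x_train, alpha))
        for xs in x_test
    ]


# Hypothetical toy profile: an exponential gradient sampled along one axis.
x_train = [i / 10 for i in range(11)]
y_train = [math.exp(-x / 0.2) for x in x_train]
x_test = [0.2, 0.25]
mean = gp_posterior_mean(x_train, y_train, x_test,
                         length_scale=0.2, variance=1.0, noise_var=1e-4)
print(mean, [math.exp(-x / 0.2) for x in x_test])
```

The posterior mean interpolates the sampled profile at unobserved positions. The kernel length scale controls how smooth the reconstruction is, which is relevant when a reconstructed function appears smoother than expected.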