Gene expression and genome-wide association data have provided researchers the opportunity to study many complex traits and diseases. When designing prognostic and predictive models capable of phenotypic classification in this area, significant reduction of dimensionality through stringent filtering and/or feature selection is often deemed imperative. Here, this work challenges that presumption through both theoretical and empirical analysis. It demonstrates that, through a proper compromise between the structure of the selected model and the number of features, one can achieve better performance even at high dimensionality. The inclusion of many genes/variants in the classification rules can help shed new light on the analysis of complex traits: traits that are typically determined by many causal variants with small effect sizes.
The field of synthetic biology holds an inspiring vision for the future; it integrates computational analysis, biological data and the systems engineering paradigm in the design of new biological machines and systems. These biological machines are built from basic biomolecular components analogous to electrical devices, and the information flow among these components requires the augmentation of biological insight with the power of a formal approach to information management. Here we review the informatics challenges in synthetic biology along three dimensions: in silico, in vitro and in vivo. First, we describe the state of the art of in silico support for synthetic biology, from the specific data exchange formats to the most popular software platforms and algorithms. Next, we cast in vitro synthetic biology in terms of information flow, and discuss genetic fidelity in DNA manipulation, development strategies of biological parts and the regulation of biomolecular networks. Finally, we explore how the engineering chassis can manipulate biological circuitries in vivo to give rise to future artificial organisms.
The Wilms' tumor suppressor 1 (WT1) gene encodes a DNA- and RNA-binding protein that plays an essential role in nephron progenitor differentiation during renal development. To identify WT1 target genes that might regulate nephron progenitor differentiation in vivo, we performed chromatin immunoprecipitation (ChIP) coupled to mouse promoter microarray (ChIP-chip) using chromatin prepared from embryonic mouse kidney tissue. We identified 1663 genes bound by WT1, 86% of which contain a previously identified, conserved, high-affinity WT1 binding site. To investigate functional interactions between WT1 and candidate target genes in nephron progenitors, we used a novel, modified WT1 morpholino loss-of-function model in embryonic mouse kidney explants to knock down WT1 expression in nephron progenitors ex vivo. Low doses of WT1 morpholino resulted in reduced WT1 target gene expression specifically in nephron progenitors, whereas high doses of WT1 morpholino arrested kidney explant development and were associated with increased nephron progenitor cell apoptosis, reminiscent of the phenotype observed in Wt1(-/-) embryos. Collectively, our results provide a comprehensive description of endogenous WT1 target genes in nephron progenitor cells in vivo, as well as insights into the transcriptional signaling networks controlled by WT1 that might direct nephron progenitor fate during renal development.
BACKGROUND: Identification of expression quantitative trait loci (eQTLs) is an emerging area in genomic study. The task requires an integrated analysis of genome-wide single nucleotide polymorphism (SNP) data and gene expression data, raising a new computational challenge due to the tremendous size of the data. RESULTS: We develop a method to identify eQTLs. The method represents eQTLs as information flux between genetic variants and transcripts. We use information theory to simultaneously interrogate SNP and gene expression data, resulting in a Transcriptional Information Map (TIM) which captures the network of transcriptional information that links genetic variations, gene expression and regulatory mechanisms. These maps are able to identify both cis- and trans-regulating eQTLs. Application of the method to a dataset of leukemia patients identifies eQTLs in the regions of the GART, PCP4, DSCAM, and RIPK4 genes that regulate ADAMTS1, a known leukemia correlate. CONCLUSIONS: The information theory approach presented in this paper is able to infer the dependence networks between SNPs and transcripts, which in turn can identify cis- and trans-eQTLs. The application of our method to the leukemia study explains how genetic variants and gene expression are linked to leukemia.
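The abstract above frames eQTL discovery as information flux between genetic variants and transcripts. As a minimal, illustrative sketch of that idea (not the paper's actual TIM construction), the mutual information between a SNP's genotype calls and discretized expression levels can be estimated directly from co-occurrence counts; the toy data below are hypothetical:

```python
import math
from collections import Counter

def mutual_information(genotypes, expression_bins):
    """Estimate I(G; E) in bits between a SNP's genotype calls (0/1/2)
    and discretized expression levels for the same samples."""
    n = len(genotypes)
    joint = Counter(zip(genotypes, expression_bins))
    pg = Counter(genotypes)
    pe = Counter(expression_bins)
    mi = 0.0
    for (g, e), c in joint.items():
        # p(g,e) * log2( p(g,e) / (p(g) * p(e)) )
        mi += (c / n) * math.log2((c / n) / ((pg[g] / n) * (pe[e] / n)))
    return mi

# Toy data: the expression bin tracks genotype perfectly, so I(G;E) = H(G)
g = [0, 0, 1, 1, 2, 2]
e = ["low", "low", "mid", "mid", "high", "high"]
print(round(mutual_information(g, e), 3))  # log2(3) ≈ 1.585
```

A high mutual information between a SNP and a transcript would flag a candidate eQTL; in practice multiple-testing correction and cis/trans distance annotation would follow.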
Alterovitz G, Xiang M, Hill DP, Lomax J, Liu J, Cherkassky M, Dreyfuss J, Mungall C, Harris MA, Dolan ME, et al. Ontology engineering. Nat Biotechnol. 2010;28(2):128-30.
Although the measurement of fetal proteins in maternal serum is part of standard prenatal screening for aneuploidy and neural tube defects, attempts to better understand the extent of feto-maternal protein trafficking and its clinical and biological significance have been hindered by the presence of abundant maternal proteins. The objective of this study was to circumvent maternal protein interference by using a computational predictive approach for the development of a noninvasive, comprehensive, protein network analysis of the developing fetus in maternal whole blood. From a set of 157 previously identified fetal gene transcripts, 46 were classified into known protein networks, and 222 downstream proteins were predicted. Statistically significantly over-represented pathways were diverse and included T-cell biology, neurodevelopment and cancer biology. Western blot analyses validated the computational predictive model and confirmed the presence of specific downstream fetal proteins in the whole blood of pregnant women and their newborns, with absence or reduced detection of the protein in the maternal postpartum samples. This work demonstrates that extensive feto-maternal protein trafficking occurs during pregnancy, and can be predicted and verified to develop novel noninvasive biomarkers. This study raises important questions regarding the biological effects of fetal proteins on the pregnant woman.
INTRODUCTION: The goal of personalised medicine in the intensive care unit (ICU) is to predict which diagnostic tests, monitoring interventions and treatments translate to improved outcomes given the variation between patients. Unfortunately, processes such as gene transcription and drug metabolism are dynamic in the critically ill; that is, information obtained during static non-diseased conditions may have limited applicability. We propose an alternative way of personalising medicine in the ICU on a real-time basis using information derived from the application of artificial intelligence on a high-resolution database. Calculation of maintenance fluid requirement at the height of systemic inflammatory response was selected to investigate the feasibility of this approach. METHODS: The Multi-parameter Intelligent Monitoring in Intensive Care II (MIMIC II) is a database of patients admitted to the Beth Israel Deaconess Medical Center ICU in Boston. Patients who were on vasopressors for more than six hours during the first 24 hours of admission were identified from the database. Demographic and physiological variables that might affect fluid requirement or reflect the intravascular volume during the first 24 hours in the ICU were extracted from the database. The outcome to be predicted is the total amount of fluid given during the second 24 hours in the ICU, including all the fluid boluses administered. RESULTS: We represented the variables by learning a Bayesian network from the underlying data. Using 10-fold cross-validation repeated 100 times, the accuracy of the model in predicting the outcome is 77.8%. The network generated has a threshold Bayes factor of seven representing the posterior probability of the model given the observed data. This Bayes factor translates into p < 0.05 assuming a Gaussian distribution of the variables. CONCLUSIONS: Based on the model, the probability that a patient would require a certain range of fluid on day two can be predicted. In the presence of a larger database, analysis may be limited to patients with identical clinical presentation, demographic factors, co-morbidities, current physiological data and those who did not develop complications as a result of fluid administration. By better predicting maintenance fluid requirements based on the previous day's physiological variables, one might be able to prevent hypotensive episodes requiring fluid boluses during the course of the following day.
UNLABELLED: Many bioinformatics solutions suffer from the lack of a usable interface/platform from which results can be analyzed and visualized. Overcoming this hurdle would allow for more widespread dissemination of bioinformatics algorithms within the biological and medical communities. The algorithms should be accessible without extensive technical support or programming knowledge. Here, we propose a dynamic wizard platform that provides users with a Graphical User Interface (GUI) for most Java bioinformatics library toolkits. The application interface is generated in real-time based on the original source code. This platform lets developers focus on designing algorithms and biologists/physicians on testing hypotheses and analyzing results. AVAILABILITY: The open source code can be downloaded from: http://bcl.med.harvard.edu/proteomics/proj/APBA/.
The identification of reliable peripheral biomarkers for clinical diagnosis, patient prognosis, and biological functional studies would allow for access to biological information currently available only through invasive methods. Traditional approaches have so far considered aspects of tissues and biofluid markers independently. Here we introduce an information theoretic framework for biomarker discovery, integrating biofluid and tissue information. This allows us to identify tissue information in peripheral biofluids. We treat tissue-biofluid interactions as an information channel through functional space using 26 proteomes from 45 different sources to determine quantitatively the correspondence of each biofluid to specific tissues via relative entropy calculation of proteomes mapped onto phenotype, function, and drug space. Next, we identify candidate biofluids and biomarkers responsible for functional information transfer (p < 0.01). A total of 851 unique candidate biomarker proxies were identified. The biomarkers were found to be significant functional tissue proxies compared to random proteins (p < 0.001). This proxy link is found to be further enhanced by filtering the biofluid proteins to include only significant tissue-biofluid information channels and is further validated by gene expression. Furthermore, many of the candidate biomarkers are novel and have yet to be explored. In addition to characterizing proteins and their interactions with a systemic perspective, our work can be used as a roadmap to guide biomedical investigation, from suggesting biofluids for study to constraining the search for biomarkers. This work has applications in disease screening, diagnosis, and protein function studies.
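The relative entropy calculation mentioned above can be illustrated with a toy example. This is a generic Kullback-Leibler divergence over hypothetical functional-category frequencies, not the paper's actual proteome-to-function mapping:

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits over a shared set of
    functional categories; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log2(pk / q[k]) for k, pk in p.items() if pk > 0)

# Hypothetical functional-category frequencies for a tissue and a biofluid
tissue   = {"signaling": 0.5, "metabolism": 0.3, "transport": 0.2}
biofluid = {"signaling": 0.4, "metabolism": 0.4, "transport": 0.2}
print(round(relative_entropy(tissue, biofluid), 4))  # 0.0365
```

A small divergence means the biofluid's functional profile closely mirrors the tissue's, making it a better candidate channel for tissue-proxy biomarkers.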
Biological and medical data have been growing exponentially over the past several years [1, 2]. In particular, proteomics has seen automation dramatically change the rate at which data are generated. Analysis that systemically incorporates prior information is becoming essential to making inferences about the myriad, complex data [4-6]. A Bayesian approach can help capture such information and incorporate it seamlessly through a rigorous, probabilistic framework. This paper starts with a review of the background mathematics behind the Bayesian methodology: from parameter estimation to Bayesian networks. The article then goes on to discuss how emerging Bayesian approaches have already been successfully applied to research across proteomics, a field for which Bayesian methods are particularly well suited [7-9]. After reviewing the literature on the subject of Bayesian methods in biological contexts, the article discusses some of the recent applications in proteomics and emerging directions in the field.
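As a concrete instance of the conjugate updating at the heart of Bayesian parameter estimation, the Beta-Binomial model revises a prior on a detection probability in closed form. This is an illustrative sketch; the scenario and numbers are hypothetical, not drawn from the review:

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Conjugate update: a Beta(alpha, beta) prior on a probability plus a
    Binomial likelihood yields a Beta(alpha + s, beta + f) posterior."""
    return alpha + successes, beta + failures

def posterior_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

# Uniform prior Beta(1, 1); a peptide is detected in 7 of 10 replicate runs
a, b = beta_binomial_update(1, 1, successes=7, failures=3)
print(round(posterior_mean(a, b), 3))  # 8/12 ≈ 0.667
```

The same prior-times-likelihood logic, extended over many variables, is what a Bayesian network encodes via its conditional probability tables.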
The discovery of fetal mRNA transcripts in the maternal circulation holds great promise for noninvasive prenatal diagnosis. To identify potential fetal biomarkers, we studied whole blood and plasma gene transcripts that were common to 9 term pregnant women and their newborns but absent or reduced in the mothers postpartum. RNA was isolated from peripheral or umbilical blood and hybridized to gene expression arrays. Gene expression, paired Student's t test, and pathway analyses were performed. In whole blood, 157 gene transcripts met statistical significance. These fetal biomarkers included 27 developmental genes, 5 sensory perception genes, and 22 genes involved in neonatal physiology. Transcripts were predominantly expressed or restricted to the fetus, the embryo, or the neonate. Real-time RT-PCR amplification confirmed the presence of specific gene transcripts; SNP analysis demonstrated the presence of 3 fetal transcripts in maternal antepartum blood. Comparison of whole blood and plasma samples from the same pregnant woman suggested that placental genes are more easily detected in plasma. We conclude that fetal and placental mRNA circulates in the blood of pregnant women. Transcriptional analysis of maternal whole blood identifies a unique set of biologically diverse fetal genes and has a multitude of clinical applications.
Gene Ontology (GO) has been widely used to infer functional significance associated with sets of genes in order to automate discoveries within large-scale genetic studies. A level in GO's directed acyclic graph structure is often assumed to be indicative of its terms' specificities, although other work has suggested this assumption does not hold. Unfortunately, quantitative analysis of biological functions based on nodes at the same level (as is common in gene enrichment analysis tools) can lead to incorrect conclusions as well as missed discoveries due to inefficient use of available information. This paper addresses these issues using an information theoretic approach encoded in the GO Partition Database that is guaranteed to maximize information for gene enrichment analysis. The GO Partition Database was designed to feature ontology partitions with GO terms of similar specificity. The GO partitions comprise varying numbers of nodes and present relevant information theoretic statistics, so researchers can choose to analyze datasets at arbitrary levels of specificity. The GO Partition Database, featuring GO partition sets for functional analysis of genes from human and 10 other commonly studied organisms with a total of 131,972 genes, is available on the internet at: bcl.med.harvard.edu/proj/gopart. The site also includes an online tutorial.
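A common way to quantify a GO term's specificity, which the partitioning idea above builds on, is its information content: the negative log of its annotation frequency. The sketch below uses made-up term counts, not the GO Partition Database's actual statistics, and a simple binning rule in place of the database's information-maximizing partitions:

```python
import math

def information_content(term_counts, total):
    """IC of each term: -log2(annotation frequency); rarer terms are more
    specific and carry more information."""
    return {t: -math.log2(c / total) for t, c in term_counts.items()}

def partition_by_specificity(ic, bin_width=1.0):
    """Group terms whose IC falls in the same bin, i.e. of similar specificity."""
    bins = {}
    for term, val in ic.items():
        bins.setdefault(int(val // bin_width), []).append(term)
    return bins

# Made-up annotation counts out of 1024 annotated genes
counts = {"biological_process": 1024, "signaling": 128, "Wnt_signaling": 8}
ic = information_content(counts, total=1024)
print(ic["Wnt_signaling"])  # -log2(8/1024) = 7.0
print(partition_by_specificity(ic))
```

Grouping by information content rather than by graph depth avoids the flawed assumption that terms at the same DAG level are equally specific.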
Myelodysplastic syndromes (MDS) are among the most frequent hematologic malignancies. Patients have a short survival and often progress to acute myeloid leukemia. The diagnosis of MDS can be difficult; there is a paucity of molecular markers, and the pathophysiology is largely unknown. Therefore, we conducted a multicenter study investigating whether serum proteome profiling may serve as a noninvasive platform to discover novel molecular markers for MDS. We generated serum proteome profiles from 218 individuals by MS and identified a profile that distinguishes MDS from non-MDS cytopenias in a learning sample set. This profile was validated by testing its ability to predict MDS in a first independent validation set and a second, prospectively collected, independent validation set run 5 months apart. Accuracy was 80.5% in the first and 79.0% in the second validation set. Peptide mass fingerprinting and quadrupole TOF MS identified two differential proteins: CXC chemokine ligands 4 (CXCL4) and 7 (CXCL7), both of which had significantly decreased serum levels in MDS, as confirmed with independent antibody assays. Western blot analyses of platelet lysates for these two platelet-derived molecules revealed a lack of CXCL4 and CXCL7 in MDS. Subtype analyses revealed that these two proteins have decreased serum levels in advanced MDS, suggesting the possibility of a concerted disturbance of transcription or translation of these chemokines in advanced MDS.
Powerful engineering tools can help solve today’s complex biological and biomedical research challenges – and this first-of-its-kind guide is paving the way. This trail-blazing work gives engineers a quantitative systems approach to bioinformatics research using computational tools drawn from technical disciplines. It presents biological processes in an engineering context to help engineers use their technical skills in solving novel biological problems and also to facilitate reverse engineering from biology in developing synthetic biological devices. This pioneering volume explores how the knowledge bases of various technical disciplines relate to, and are observed in, biological systems. It discusses signal processing techniques used in biological data analysis, explains cellular regulatory systems and their similarities to traditional control systems, and explores protein and gene networks, inference networks, and network dynamics. A major milestone in systems biology, this groundbreaking work points engineers to new frontiers in the convergence of engineering and biological research.
The speed of the human genome project (Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C. et al., Nature 2001, 409, 860-921) was made possible, in part, by developments in automation of sequencing technologies. Before these technologies, sequencing was a laborious, expensive, and personnel-intensive task. Similarly, automation and robotics are changing the field of proteomics today. Proteomics is defined as the effort to understand and characterize proteins in the categories of structure, function and interaction (Englbrecht, C. C., Facius, A., Comb. Chem. High Throughput Screen. 2005, 8, 705-715). As such, this field nicely lends itself to automation technologies since these methods often require large economies of scale in order to achieve cost and time-saving benefits. This article describes some of the technologies and methods being applied in proteomics in order to facilitate automation within the field as well as in linking proteomics-based information with other related research areas.
High-throughput generation of new types of relational biological datasets is creating a demand for methods to provide insights into their complexity. Such networks are often too large to interpret visually and too complicated to be explained solely based on local topological properties. One way to try to make sense of such complex networks would be to transform them into discernable abstracts, or summaries, of the original networks. Then, important components could become more readily visible. This work presents such an approach for understanding networks via abstraction of global network connectivity using compression. This enabled the discovery of a new type of topological class, referred to herein as a guild, that captures global connectivity similarity. Lastly, the correspondence of these guilds to biological function is validated via an E. coli gene regulation network. This resulted in biological findings that could not be derived from local topology of the original network.
Surface-enhanced laser desorption/ionization (SELDI) time-of-flight mass spectrometry with protein arrays has facilitated the discovery of disease-specific protein profiles in serum. Such results raise hopes that protein profiles may become a powerful diagnostic tool. To this end, reliable and reproducible protein profiles need to be generated from many samples, accurate mass peak heights are necessary, and the experimental variation of the profiles must be known. We adapted the entire processing of protein arrays to a robotics system, thus improving the intra-assay coefficients of variation (CVs) from 45.1% to 27.8% (p<0.001). In addition, we assessed up to 16 technical replicates, and demonstrated that analysis of 2-4 replicates significantly increases the reliability of the protein profiles. A recent report on limited long-term reproducibility appeared consistent with our initial inter-assay CVs, which varied widely and reached up to 56.7%. However, we discovered that the inter-assay CV is strongly dependent on the drying time before application of the matrix molecule. Therefore, we devised a standardized drying process and demonstrated that our optimized SELDI procedure generates reliable and long-term reproducible protein profiles with CVs ranging from 25.7% to 32.6%, depending on the signal-to-noise ratio threshold used.
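The coefficient of variation reported throughout these SELDI studies is simply the sample standard deviation expressed as a percentage of the mean, computed per peak across replicates. A minimal sketch with hypothetical replicate peak heights:

```python
import statistics

def coefficient_of_variation(values):
    """CV as a percentage: sample standard deviation over the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical peak heights for one m/z peak across four technical replicates
replicates = [10.0, 12.0, 11.0, 13.0]
print(round(coefficient_of_variation(replicates), 1))  # 11.2
```

Averaging 2-4 replicates lowers the effective CV, which is why replicate analysis improved profile reliability in the study above.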
Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI or SELDI-TOF MS) with protein arrays has facilitated the discovery of disease-specific protein profiles in serum. As array technologies in bioinformatics and proteomics multiply the quantity of data being generated, more automated hardware and computational methods will become necessary in order to keep up. Robot automated sample preparation and analysis pipeline for proteomics (Raspap) in SELDI provides a solution from the lab bench to the desktop. In this approach, the entire processing of protein arrays is delegated to a robotics system and the bioinformatics automated pipeline (BAP) performs data mining after SELDI analysis. A key part of BAP is the creation of a journal-styled report in HTML (with text, embedded figures, and references) which can be automatically emailed back to the engineers/scientists for review. An object-oriented tree-based structure allows for the derivation of conclusions about the data and comparison of multiple analyses within the generated report. Testing yielded improvement in the resulting assay coefficients of variation (CV) from 45.1% (when done manually) to 27.8% (P<0.001). A large biological dataset was also examined with the Raspap approach and consequent results are discussed.
This research analyzed both engineering and nontechnical issues involved in the use of Induction Loop Amplification (ILA) devices in auditoriums or large gathering places for hard-of-hearing individuals. A variety of parameters need to be taken into account to determine an optimal shape/configuration for the ILA device. In many cases, an optimal configuration is different from those proposed for classroom use (Ross, 1969; Hodgson, 1986; Clevenger, 1992). Experimental results were obtained for a double-loop configuration in such a setting (a university gymnasium/auditorium in this case). The results demonstrate that a double-loop configuration is a viable possibility for auditorium use. Several variables of this configuration were examined experimentally. Various implications, including consequent nontechnical issues specific to this application, are discussed as well. Technical and nontechnical aspects of the ILA configuration need to be examined together when designing an optimal system.
OBJECTIVE: As more sensors are added to increasingly technology-dependent operating rooms (OR), physicians such as anesthesiologists must sift through an ever-increasing number of patient parameters every few seconds as part of their OR duties. To the extent these many parameters are correlated and redundant, manually monitoring all of them may not be an optimal physician strategy for assessing patient state or predicting future changes to guide their actions. METHODS: The method is illustrated by application to seventy-six anesthetized patients for which thirty-two fundamental and derived variables were recorded at 20-second intervals. The Iterative Order and Noise estimation algorithm (ION) estimated the noise on each parameter. The performance of principal components analysis (PCA) was improved by normalizing the noise estimated by ION to unity. A linear regression of the resulting seven high signal-to-noise ratio principal components (PCs) predicted tachycardia 140 seconds in advance. RESULTS: ION estimated the noise on each parameter with sufficient accuracy to increase the number of significant PCs from two to seven, all of which had identifiable physiological correlates. The resulting receiver operating characteristic (ROC) suggested that a 70 percent prediction rate with 5 percent false alarms could be achieved. CONCLUSIONS: This paper illustrates the use of ION to significantly improve the performance of PCA in the efficient representation of patient state and in improving the performance of linear predictors of clinically significant parameters.
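The pipeline above rescales each monitored parameter by its estimated noise level before PCA, so that all channels contribute unit noise variance and high signal-to-noise components stand out. Below is a pure-Python sketch of that normalization step followed by power-iteration extraction of the first principal component; ION itself and the clinical variables are not reproduced, and the two-channel data are synthetic:

```python
import math
import random

def noise_normalize(X, noise_sd):
    """Divide each channel by its estimated noise SD so that noise
    contributes unit variance, as in the ION-then-PCA pipeline."""
    return [[x / s for x, s in zip(row, noise_sd)] for row in X]

def first_principal_component(X, iters=200):
    """First PC of centered data via power iteration on the covariance
    matrix (illustrative; real pipelines would use SVD)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Synthetic data: two noisy copies of one signal, second channel noisier
random.seed(1)
signal = [random.gauss(0, 1) for _ in range(500)]
X = [[s + random.gauss(0, 0.1), s + random.gauss(0, 0.5)] for s in signal]
pc1 = first_principal_component(noise_normalize(X, [0.1, 0.5]))
print(pc1)
```

After normalization the cleaner channel dominates the first component, mirroring how ION-scaled PCA concentrates patient-state information into a few high signal-to-noise PCs.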