Jurisica, I., D. A. Wigle, and B. Wong. Cancer Informatics in the Post Genomic Era; Toward Information-Based Medicine
Series: Cancer Treatment and Research, Volume 137, Springer Verlag, July 2007.
Less than 50% of diagnosed cancers are cured using current treatment modalities. Many common cancers can already be fractionated into such therapeutic subsets with unique prognostic outcomes based on characteristic molecular phenotypes. It is widely expected that treatment approaches of complex cancer will soon be revolutionized by combining molecular profiling and computational analysis, which will result in the introduction of novel therapeutics and treatment decision algorithms that target the underlying molecular mechanisms of cancer.
The sequencing of the human genome was the first step in understanding the ways in which we are wired. However, this genetic blueprint provides only a “parts list”, and neither information about how the human organism is actually working, nor insight into function or interactions among the ~30 thousand constitutive parts that comprise our genome. Considering that the 30 years of worldwide molecular biology efforts have only annotated about 10% of this gene set, and we know even less about proteins, it is comforting to know that high-throughput data generation and analysis is now widely available.
By arraying tens of thousands of genes and analyzing abundance of and interaction among proteins, it is now possible to measure the relative activity of genes and proteins in normal and diseased tissue. The technology and datasets of such profiling-based analyses will be described along with the mathematical challenges that face the mining of the resulting datasets. We describe the issues related to using this information in the clinical setting, and the future steps that will lead to drug design and development to cure complex diseases such as cancer.
Jurisica, I. and D. Wigle. Knowledge Discovery in Proteomics. Mathematical & Computational Biology Series, Volume 8, Chapman & Hall/CRC Press, 2006.
Who knows useful things, not many things, is wise. Aeschylus (ca. 525-456 BC)
The nascent fields of bioinformatics and computational biology are currently an odd amalgam of everything from biologists with a computational bent, through physicists and mathematicians, to computer scientists and engineers sifting through the myriad of data and grappling with biological questions. Much of the excitement comes from a collective sense that there is something truly new evolving. Hardware and software limitations are declaring themselves as major challenges to managing and interpreting the avalanche of data from high-throughput biological platforms. This "drinking from the fire hydrant" sensation continues to spark interest and draw technical skill from other domains. As we move forward to true systems biology experimentation, it is increasingly obvious that experts in robotics, engineering, mathematics, physics, and computer science have become key players alongside traditional molecular biology.
Life sciences applications are typically characterized by multimodal representations, lack of complete and consistent domain theories, rapid evolution of domain knowledge, high dimensionality, and large amounts of missing information. Data in these domains require robust approaches to deal with missing and noisy information. Modern proteomics is no exception. As our understanding of protein structure and function becomes ever more complicated, we have reached a point in time where the actual management of data is a major hurdle to knowledge discovery. Many of the browse-through applications of yesterday are clearly not useful for computational manipulation. If the data was not created having data mining and decision support in mind, how well can it serve that purpose?
We felt this book was a timely discussion of some of the key issues in the field. In subsequent chapters we discuss a number of examples from our own experience that represent some of the challenges of knowledge discovery in high-throughput proteomics. This discussion is by no means comprehensive, and does not attempt to highlight all relevant domains. However, we hope to provide the reader with an overview of what we envision as an important and emerging field in its own right by discussing the challenges and potential solutions to the problems presented. We have selected five specific domains to discuss: (1) Mass spectrometry based protein analysis; (2) Protein-protein interaction network analysis; (3) Systematic high-throughput protein crystallization; (4) A systematic and integrated analysis of multiple data repositories using a diverse set of algorithms and tools; and (5) Systems biology. In each of these areas, we describe the challenges created by the type of data produced, and potential solutions to the problem of data mining within the domain. We hope this stimulates even more discussion, and newer and better ways to deal with the problems at hand.
2012


2010
Gene expression profiling was conducted using primary human nasopharyngeal carcinoma (NPC) biopsy samples to improve the understanding of the molecular pathways defining NPC and to identify novel potential therapeutic targets. RNA samples were extracted from 36 patients suspected to have NPC and hybridized onto the Affymetrix U133A chip. NPC was diagnosed in 19 patients, 11 had lymphoid hyperplasia (LH), and 6 were "normal" biopsies. Clinical stages for these NPC patients ranged from I-IV, including one M1. All NPC patients (except the M1) were treated with curative intent, which included radiotherapy alone (4 patients), or combined with chemotherapy (14 patients). Unsupervised clustering demonstrated a distinct NPC expression pattern, compared to normal biopsies. Subsequent Significance Analysis of Microarrays (SAM) derived from 14 NPC and 6 normal samples discovered 1089 differentially regulated genes. Pathway analyses revealed novel insights into the mechanisms leading to NPC, whereby up-regulation of NFkB2 and survivin play central roles in increasing resistance to apoptosis, and changes in integrin and WNT/β-catenin signaling leading to uncontrolled proliferation. The role of survivin in resisting apoptosis in NPC was confirmed by RNA interference. Our data provide novel insights into the development and progression of NPC, and suggest survivin as a novel therapeutic target for NPC.
The product of the MYC oncogene is widely deregulated in cancer and functions as a regulator of gene transcription. Despite an extensive profile of regulated genes, the transcriptional targets of c-Myc essential for transformation remain unclear. In this study we show that c-Myc significantly induces the expression of the H19 non-coding RNA in several diverse cell types including breast epithelial, glioblastoma and fibroblast cells. C-Myc binds to evolutionary conserved E boxes in the imprinting control region to facilitate histone acetylation and transcriptional initiation of the H19 promoter. In addition, c-Myc downregulates the expression of the IGF2, the reciprocally imprinted gene at the H19/IGF2 locus. Evidence shows that c-Myc regulates these two genes independently and does not affect the imprinting of H19. Indeed, allele-specific chromatin immunoprecipitation and expression analyses indicate that c-Myc binds and drives the expression of only the maternal H19 allele. The role of H19 in transformation is addressed using a knockdown approach and shows that downregulation of H19 significantly decreases breast and lung cancer cell clonogenicity and anchorage independent growth. In addition, c-Myc and H19 expression shows strong association in primary breast and lung carcinomas. This work indicates that c-Myc induction of the H19 gene product holds an important role in transformation.
STAT2 is a critical component of interferon-. (IFN) signaling. To identify genes regulated by IFN-inducible STAT2-DNA binding, cDNA from IFN-treated cells expressing intact STAT2 or a DNA-binding mutant STAT2 were analyzed by Affymetrix microarrays. IFN-inducible expression of genes regulated by IFN-stimulated gene factor 3 (ISGF3), wherein STAT2 functions as a transactivator, including 2′-5′ OAS, Mx, ISG15, 9-27 and MHC-I, is similar in both cell types. Nineteen genes were identified whose expression was higher in IFN-treated cells expressing intact STAT2 compared with cells expressing the mutant STAT2. Using quantitative PCR, we confirmed that ISGF3-dependent gene transcription is unaffected in cells expressing mutant STAT2 but that a subset of IFN-inducible genes is differentially regulated in these cells: CLDN4, BF, DGFK, MSR1 and TLR3, containing γ-activated sequence (GAS)-like elements in their 5′ flanking sequences. Our data indicate that the DNA binding domain of STAT2 is required for full IFN-inducible activation of (GAS)-regulated target genes.
Proteomics, the science of globally detecting proteins in cells, tissues or organisms under defined conditions has highly benefited from recent developments in mass spectrometry (MS). It is now possible to detect hundreds to thousands of proteins with high confidence in a single experiment. In this review, we summarize the basic MS technologies currently used by laboratories around the world to identify proteins in complex biological samples. We further provide the reader with a short overview of useful separation strategies to minimize the initial complexity of biological samples, and the multitude of bioinformatics tools essential to manage large-scale proteomics data to obtain meaningful biological insight. Finally, we summarize recent advances in three main areas of medical proteomics; proteomics in cancer research, proteomics of the heart, and proteomics in diabetes research.
Algorithmic and modeling advances in the area of protein-protein interaction (PPI) network analysis could contribute to the understanding of biological processes. Local structure of networks can be measured by the frequency distribution of graphlets, small connected non-isomorphic induced subgraphs. This measure of local structure has been used to show that high-confidence PPI networks have local structure of geometric random graphs. Finding graphlets exhaustively in a large network is computationally intensive. More complete PPI networks, as well as PPI networks of higher organisms, will thus require efficient heuristic approaches.
We propose two efficient and scalable heuristics for finding graphlets in high-confidence PPI networks. We show that both PPI networks and their model geometric random networks have defined boundaries that are sparser than the "inner parts" of the networks. In addition, these networks exhibit "uniformity" of local structure inside the networks. Our first heuristic exploits these two structural properties of PPI and geometric random networks to find good estimates of graphlet frequency distributions in these networks up to 690 times faster than the exhaustive searches. Our second heuristic is a variant of a more standard sampling technique and it produces accurate approximate results up to 377 times faster than the exhaustive searches. We indicate how the combination of these approaches may result in an even better heuristic.
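To make the contrast between exhaustive and sampled graphlet counting concrete, here is a minimal sketch restricted to 3-node graphlets (open paths and triangles). The toy graph, the function names, and the naive uniform sampling of node triples are illustrative assumptions; this is not either of the two heuristics described above.

```python
from itertools import combinations
import random


def edges_among(adj, a, b, c):
    """Number of edges present among three distinct nodes."""
    return (b in adj[a]) + (c in adj[a]) + (c in adj[b])


def count_3node_graphlets(adj):
    """Exhaustively count induced 3-node graphlets (open paths and triangles)."""
    counts = {"path": 0, "triangle": 0}
    for a, b, c in combinations(sorted(adj), 3):
        edges = edges_among(adj, a, b, c)
        if edges == 3:
            counts["triangle"] += 1
        elif edges == 2:
            counts["path"] += 1  # two edges on three nodes is always a connected path
    return counts


def sample_3node_graphlets(adj, n_samples, seed=0):
    """Tally graphlets over uniformly sampled node triples (naive sampling)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    counts = {"path": 0, "triangle": 0}
    for _ in range(n_samples):
        a, b, c = rng.sample(nodes, 3)
        edges = edges_among(adj, a, b, c)
        if edges == 3:
            counts["triangle"] += 1
        elif edges == 2:
            counts["path"] += 1
    return counts


# Hypothetical toy network: a triangle (1, 2, 3) with a tail 3-4-5.
toy_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5)]
toy = {n: set() for n in range(1, 6)}
for u, v in toy_edges:
    toy[u].add(v)
    toy[v].add(u)

exact = count_3node_graphlets(toy)
approx = sample_3node_graphlets(toy, n_samples=50)
```

The exhaustive count inspects every node triple, which grows cubically with network size; the sampled tally trades exactness for a fixed budget of triples, and only the ratio between graphlet types is meaningful in its output.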
Identifying protein-protein interactions is a key problem in molecular biology. Currently, interactions cannot be reliably predicted on a proteome-wide scale but direct and indirect evidence for interactions is increasingly available from high-throughput interaction detection methods, gene expression microarrays, and protein annotation projects. In this paper we propose an association mining approach to integrating these diverse types of evidence. We apply this approach to a number of datasets consisting of interacting and non-interacting protein pairs annotated with different types of evidence. We identify patterns that distinguish interacting and non-interacting protein pairs, and use these patterns to assign a confidence level to proposed interactions.
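As a rough illustration of how an evidence pattern could be turned into a confidence level, the sketch below computes the support and confidence of a rule of the form "pattern implies interacting". The evidence labels, the toy data, and the function name are hypothetical and are not taken from the paper.

```python
def rule_stats(pairs, pattern):
    """Support and confidence of the rule `pattern => interacting`.

    pairs   -- list of (evidence_set, is_interacting) tuples
    pattern -- iterable of evidence items that must all be present
    """
    pattern = frozenset(pattern)
    matching = [label for evidence, label in pairs if pattern <= evidence]
    support = len(matching) / len(pairs)
    confidence = sum(matching) / len(matching) if matching else 0.0
    return support, confidence


# Hypothetical annotated protein pairs (evidence items are made up).
toy_pairs = [
    ({"coexpressed", "shared_GO"}, True),
    ({"coexpressed", "shared_GO", "y2h"}, True),
    ({"coexpressed"}, False),
    ({"shared_GO"}, False),
    ({"coexpressed", "shared_GO"}, False),
]

support, confidence = rule_stats(toy_pairs, {"coexpressed", "shared_GO"})
```

Here the pattern matches three of the five pairs, two of which interact, so the rule would assign a confidence of two thirds to a new pair carrying both kinds of evidence.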
Both Ki-ras mutation and Hepatocyte Growth Factor (HGF) receptor Met overexpression occur at high frequency in colon cancer. This study investigated the transcriptional changes induced by Ki-ras oncogene and HGF-Met signaling activation in colon cancer cell lines in vitro and in vivo. The microarray global transcriptional profiling data demonstrate that changes induced by Met receptor activation overlap with those induced by Ki-ras oncogene. However, in the presence of Ki-ras mutation, the magnitude of transcriptional alterations in response to HGF-Met signaling in vitro and in vivo was attenuated. Overlapping genes between in vitro and in vivo microarray datasets were selected as a subset of HGF/Met and Ki-ras oncogene regulated targets, and were investigated further for validation. Using the Online Predicted Human Interaction Database (OPHID), we identified novel Met and Ki-ras regulated proteins and other functionally linked targets. The novel proteins comprised histone acetyltransferase 1 (HAT1), phosphoribosyl pyrophosphate synthetase 2 (PRPS2), chaperonin containing TCP1, subunit 8 (CCT8), CSE1 chromosome segregation 1-like (yeast)/cellular apoptosis susceptibility (mammals) (CSE1L/CAS) and Cyclin H. The results demonstrate a strategy that may reveal novel pathways or mechanisms by which HGF/Met and Ki-ras oncogene signaling affects the biology of colon cancer cells.
An effective tool for the global analysis of both DNA methylation status and protein-chromatin interactions is a microarray constructed with sequences containing regulatory elements. One type of array suited for this purpose takes advantage of the strong association between CpG Islands (CGIs) and gene regulatory regions. We have obtained 20,736 clones from a CGI Library and used these to construct CGI arrays. The utility of this library requires proper annotation and assessment of the clones, including CpG content, genomic origin and proximity to neighboring genes. Alignment of clone sequences to the human genome (UCSC hg17) identified 9595 distinct genomic loci; 64% were defined by a single clone while the remaining 36% were represented by multiple, redundant clones. Approximately 68% of the loci were located near a transcription start site. The distribution of these loci covered all 23 chromosomes, with 63% overlapping a bioinformatically identified CGI. The high representation of genomic CGI in this rich collection of clones supports the utilization of microarrays produced with this library for the study of global epigenetic mechanisms and protein-chromatin interactions. A browsable database is available on-line to facilitate exploration of the CGIs in this library and their association with annotated genes or promoter elements.
Acute myeloblastic leukemia (AML) may be classified in a number of ways. Using the French American British classification, the M3 form of the disease or acute promyelocytic leukemia (APL) has been found to be sensitive in vitro and in vivo to the retinoid all trans retinoic acid (ATRA). The mechanism for this is by restoration of normal gene expression through the release of histone deacetylase complexes (HDACs). In contrast to APL, other forms of AML are either nonresponsive or show blunted responses to ATRA. We evaluated if the inhibitor of HDAC activity, valproic acid (VPA), could mimic or enhance retinoid sensitivity in the AML cell line, OCI/AML-2, and clinical samples derived from patients with AML. An Affymetrix GeneChip experiment demonstrated that VPA modulated the expression of numerous genes in OCI/AML-2 cells that were not affected by ATRA including p21, a retinoid responsive gene in APL. VPA induced p21 expression in OCI/AML-2 cells and the majority of the AML samples tested; this was associated with cell cycle arrest and apoptosis not seen with ATRA alone. The addition of ATRA to VPA accentuated many of these responses, supporting the potential beneficial combination of these drugs in the treatment of AML. Leukemia advance online publication, 5 May 2005; doi:10.1038/sj.leu.2403773.
Oxygen plays a central role in human placental pathologies including preeclampsia, a leading cause of fetal and maternal death and morbidity. Insufficient utero-placental oxygenation in preeclampsia is believed to be responsible for the molecular events leading to the clinical manifestations of this disease. Using high-throughput functional genomics, we determined the global gene expression profiles of placentae from high altitude pregnancies, a natural in vivo model of chronic hypoxia, as well as that of first trimester explants under 3% and 20% oxygen, an in vitro organ culture model. We next compared the genomic profile from these two models to that obtained from pregnancies complicated by preeclampsia. Microarray data were analyzed using the Binary Tree-Structured Vector Quantization (BTSVQ) algorithm, which is capable of generating global gene expression maps. Our data highlight a striking global gene expression similarity between 3% O2-treated explants, high altitude placentae and, importantly, placentae from preeclamptic pregnancies. We demonstrate herein the utility of explant culture and high altitude placenta as biologically-relevant and powerful models for studying the oxygen-mediated events in preeclampsia. Our results provide the first molecular evidence that aberrant global placental gene expression changes in preeclampsia are due to reduced oxygenation and that these events can successfully be mimicked in in vivo and in vitro models of placental hypoxia.
Signaling pathways transmit information through protein interaction networks that are dynamically regulated by complex extracellular cues. We developed LUMIER (for luminescence-based mammalian interactome mapping), an automated high-throughput technology, to map protein-protein interaction networks systematically in mammalian cells and applied it to the transforming growth factor-β (TGFβ) pathway. Analysis using self-organizing maps and k-means clustering identified links of the TGFβ pathway to the p21-activated kinase (PAK) network, to the polarity complex, and to Occludin, a structural component of tight junctions. We show that Occludin regulates TGFβ type I receptor localization for efficient TGFβ-dependent dissolution of tight junctions during epithelial-to-mesenchymal transitions.
Case-based reasoning (CBR) is a suitable paradigm for class discovery in molecular biology, where the rules that define the domain knowledge are difficult to obtain, and the number and the complexity of the rules affecting the problem are too large for formal knowledge representation. To extend the capabilities of CBR, we propose mixture of experts for case-based reasoning (MOE4CBR), a method that combines an ensemble of CBR classifiers with spectral clustering and Logistic Regression. Our approach not only achieves higher prediction accuracy, but also leads to the selection of a subset of features that have meaningful relationships with their class labels.
We evaluate MOE4CBR by applying the method to a CBR system called TA3, a computational framework for CBR systems. For two mass spectrometry data sets, the prediction accuracy improves from 80% to 93% and from 90% to 98.4%, respectively. We also apply the method to leukemia and lung microarray data sets with prediction accuracy improving from 65% to 74% and from 60% to 70%, respectively. Finally, we compare our list of discovered biomarkers with the lists of selected biomarkers from other studies for the mass spectrometry data sets.
Conceptually, protein crystallization can be divided into two phases: search and optimization. Robotic protein crystallization screening can speed up the search phase, and has a potential to increase process quality.
Automated image classification helps to increase throughput and consistently generate objective results. Although the classification accuracy can always be improved, our image analysis system can classify images from 1536-well plates with high classification accuracy (85%) and ROC score (0.87), as evaluated on 127 human-classified protein screens containing 5600 crystal images and 189472 non-crystal images.
Data mining can integrate results from high-throughput screens with information about crystallizing conditions, intrinsic protein properties, and results from crystallization optimization. We apply association mining, a data mining approach that identifies frequently occurring patterns among variables and their values. This approach segregates proteins into groups based on how they react in a broad range of conditions, and clusters cocktails to reflect their potential to achieve crystallization. These results may lead to crystallization screen optimization, and reveal associations between protein properties and crystallization conditions. We also postulate that past experience may lead us to the identification of initial conditions favorable to crystallization for novel proteins.
Motivation: High-throughput experiments are being performed at an ever-increasing rate to systematically elucidate protein-protein interaction (PPI) networks for model organisms, while complexities of higher eukaryotes have prevented these experiments for humans.
Results: The Online Predicted Human Interaction Database (OPHID) is a web-based database of predicted interactions between human proteins. It combines the literature-derived human PPI from BIND, HPRD and MINT, with predictions made from S. cerevisiae, C. elegans, D. melanogaster, and M. musculus. The 23,889 predicted interactions currently listed in OPHID are evaluated using protein domains, gene co-expression and Gene Ontology terms. OPHID can be queried using single or multiple IDs, and results can be visualized using our custom graph visualization program.
Availability: Freely available to academic users at http://ophid.utoronto.ca, both in tab-delimited and PSI-MI formats. Commercial users, please contact I.J.
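One common way to transfer interactions from model organisms to human is to map both partners of a known interaction through ortholog tables (often called interolog mapping). Whether this matches the actual OPHID pipeline is an assumption; the sketch below only illustrates the general idea, and all protein identifiers in it are made up.

```python
def predict_interologs(model_ppis, orthologs):
    """Map model-organism interactions onto human protein pairs.

    model_ppis -- iterable of (protein_a, protein_b) in the model organism
    orthologs  -- dict from model-organism protein to a list of human orthologs
    """
    predicted = set()
    for a, b in model_ppis:
        for human_a in orthologs.get(a, ()):
            for human_b in orthologs.get(b, ()):
                if human_a != human_b:  # skip self-pairs in this sketch
                    predicted.add(tuple(sorted((human_a, human_b))))
    return predicted


# Hypothetical yeast interactions and ortholog assignments.
yeast_ppis = [("yA", "yB"), ("yB", "yC")]
yeast_to_human = {"yA": ["hA"], "yB": ["hB1", "hB2"]}  # yC has no known ortholog

predictions = predict_interologs(yeast_ppis, yeast_to_human)
```

An interaction is carried over only when both partners have at least one human ortholog, so the second yeast interaction above yields no prediction. Predictions produced this way would still need the kind of evaluation the abstract describes (protein domains, gene co-expression, Gene Ontology terms).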
One of the major concerns in microarray profiling studies of clinical samples is the effect of tissue sampling and RNA extraction on data. We analyzed gene expression in lung cancer specimens that were serially harvested from tumor mass and snap-frozen at several intervals up to 120 minutes after surgical resection. Global gene expression was profiled on cDNA microarrays, and selected stress and hypoxia-activated genes were evaluated using real-time reverse transcription polymerase chain reaction (RT-PCR). Remarkably, similar gene expression profiles were obtained for the majority of samples regardless of the time that had elapsed between resection and freezing. Real-time RT-PCR studies showed significant heterogeneity in the expression levels of stress and hypoxia-activated genes in samples obtained from different areas of a tumor specimen at one time point after resection. The variations between multiple samplings were significantly greater than those of elapsed time between sampling/freezing. Overall samples snap-frozen within 30 to 60 minutes of surgical resection are acceptable for gene expression studies, thus making sampling and snap-freezing of tumor samples in a routine surgical pathology laboratory setting feasible. However, sampling and pooling from multiple sites of each tumor may be necessary for expression profiling studies to overcome the molecular heterogeneity present in tumor specimens.
Motivation: Networks have been used to model many real-world phenomena to better understand the phenomena and to guide experiments in order to predict their behavior. Since incorrect models lead to incorrect predictions, it is vital to have an improved model. As a result, new techniques and models for analyzing and modeling real-world networks have recently been introduced.
Results: One example of large and complex networks involves protein-protein interaction (PPI) networks. We analyze PPI networks of yeast S. cerevisiae and fruitfly D. melanogaster using a newly introduced measure of local network structure as well as the standardly used measures of global network structure. We examine the fit of four different network models, including Erdős-Rényi, scale-free, and geometric random network models, to these PPI networks with respect to the measures of local and global network structure. We demonstrate that the currently popular scale-free model of PPI networks fails to fit the data in several respects and show that a random geometric model provides a much more accurate model of the PPI data. We hypothesize that only the noise in these networks is scale-free.
Conclusions: We systematically evaluate how well different network models fit the PPI networks. We show that the structure of PPI networks is better modeled by a geometric random graph than by a scale-free model.
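To give a feel for how two of these network models differ in local structure, the following sketch generates an Erdős-Rényi graph and a geometric random graph of matched density and compares their average clustering coefficients. The clustering coefficient is used here as a simple stand-in; the paper's own local measure is not reproduced, and the parameter values are arbitrary assumptions.

```python
import math
import random
from itertools import combinations


def erdos_renyi(n, p, rng):
    """Each of the n*(n-1)/2 possible edges is present independently with probability p."""
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def geometric_random(n, radius, rng):
    """Place n points uniformly in the unit square; connect pairs within `radius`."""
    points = [(rng.random(), rng.random()) for _ in range(n)]
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= radius:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def average_clustering(adj):
    """Mean fraction of connected neighbour pairs (nodes of degree < 2 count as 0)."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def compare_models(n=200, radius=0.1, seed=1):
    """Return (geometric, Erdős-Rényi) clustering for graphs of similar density."""
    rng = random.Random(seed)
    geo = geometric_random(n, radius, rng)
    n_edges = sum(len(nbrs) for nbrs in geo.values()) // 2
    p = n_edges / (n * (n - 1) / 2)  # match the expected edge density
    er = erdos_renyi(n, p, rng)
    return average_clustering(geo), average_clustering(er)


geo_cc, er_cc = compare_models()
```

At equal density, the geometric graph tends to show far more local clustering because neighbours of a node are themselves close in space, which is the kind of local signal that a purely degree-based comparison would miss.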
Supplementary data
Motivation: When studying the workings of a biological cell, it is useful to be able to detect known and predict still undiscovered protein complexes within the cell's protein-protein interaction (PPI) network. Such predictions may be used as an inexpensive tool to direct biological experiments. The increasing amount of available PPI data necessitates a fast, accurate approach to protein complex identification.
Results: We have developed the Restricted Neighbourhood Search Clustering Algorithm (RNSC) to efficiently partition networks into clusters using a cost function. We applied this cost-based clustering algorithm to PPI networks of S. cerevisiae, D. melanogaster, and C. elegans to identify and predict protein complexes. We also investigated functional and graph-theoretical properties of known complexes in the MIPS database, and by filtering clusters based on these properties, we attained a high matching rate between filtered clusters and true protein complexes.
Conclusions: Our application of the cost-based clustering algorithm provides a scalable, accurate, and efficient method of detecting and predicting protein complexes within a PPI network.
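The idea of partitioning a network by minimizing a cost function can be sketched as follows. The cost used here (cross-cluster edges plus missing edges inside clusters) and the greedy single-node moves are simplifying assumptions for illustration; they are not the published RNSC cost functions or its search strategy.

```python
from itertools import combinations


def clustering_cost(adj, cluster_of):
    """Count 'bad' pairs: edges across clusters plus non-adjacent pairs within a cluster."""
    cost = 0
    for u, v in combinations(sorted(adj), 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        linked = v in adj[u]
        if same_cluster != linked:
            cost += 1
    return cost


def greedy_improve(adj, cluster_of):
    """Move single nodes between existing clusters while the cost strictly decreases."""
    cluster_of = dict(cluster_of)
    improved = True
    while improved:
        improved = False
        best_cost = clustering_cost(adj, cluster_of)
        for node in sorted(adj):
            best_label = cluster_of[node]
            for label in sorted(set(cluster_of.values())):
                if label == best_label:
                    continue
                cluster_of[node] = label
                cost = clustering_cost(adj, cluster_of)
                if cost < best_cost:
                    best_cost, best_label, improved = cost, label, True
            cluster_of[node] = best_label  # keep the best assignment found
    return cluster_of


# Hypothetical toy network: two triangles joined by the edge 3-4.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
toy = {n: set() for n in range(1, 7)}
for u, v in toy_edges:
    toy[u].add(v)
    toy[v].add(u)

start = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1}  # node 3 starts in the wrong cluster
final = greedy_improve(toy, start)
```

On this toy network the search moves node 3 back to its triangle, leaving the bridging edge 3-4 as the only remaining cost. A purely greedy search like this can stall in local minima, which is why practical algorithms add diversification moves.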
Supplementary data
Endobronchial implantation of NCI-H460 cells into the nude rat generates a primary lung tumor with mediastinal lymph node spread, but rarely systemic metastases. We isolated tumor cells from mediastinal nodes, orthotopically reimplanted the cells into nude rats and repeated this four times to derive a cell line, designated H460SM, that spontaneously metastasizes to bone, kidney, brain, soft tissue and contralateral lung. H460SM cells demonstrated higher invasive activity in vitro than parental NCI-H460 cells. Spectral karyotyping revealed a new inversion within 17q and loss of an extra normal copy of chromosome 14 present in parental NCI-H460 cells. Expression profiling of orthotopic primary tumors revealed differential expression of 360 genes. Of these, 173 were represented in the probe set of a 19.2K OCI cDNA microarray previously used to profile the gene expression of surgically resected lung cancer specimens. We have computationally validated clinical importance of these genes by using in silico analysis of 18 cases of pulmonary adenocarcinoma, which were split into two patient groups with markedly different clinical outcome. The model identifies additional novel candidate genes for the progression of lung cancer to systemic metastases and poor prognosis.
We previously reported that our cDNA microarray analysis of primary non-small cell lung carcinoma (NSCLC) could predict for patients at increased risk of cancer recurrence. From the result of this analysis, we selected 11 genes that were considered candidate prognostic marker genes and used the realtime reverse transcription polymerase chain reaction (RT-PCR) to investigate their expression in the same set of NSCLC cases used in the microarray study. Cluster analysis of the realtime RT-PCR data separated these patients into two groups with significantly different disease-free survivals (log-rank test, [Formula: see text] ). In contrast, cluster analysis failed to confirm the prognostic significance of the realtime RT-PCR results for these 11 genes in a validation series of 92 NSCLC cases. In univariate analysis, hypoxia inducible factor 1alpha, Rho-GDP dissociation inhibitor (GDI) alpha (RhoGDI) and Citron/rho-interacting serine-threonine kinase 21 (Citron K21) were significant prognostic factors for disease-free survival in the entire cohort of 130 NSCLC patients, but none were significant in multivariate analysis. The results demonstrate that the prognostic significance of microarray (SAM) results can be partially validated using realtime RT-PCR, but secondary validation using larger and independent series of tumors is necessary to identify true prognostic marker genes.
Our purpose was to classify OSCCs based on their gene expression profiles, to identify differentially expressed genes in these cancers and to correlate genetic deregulation with clinical and histopathologic data and patient outcome. After conducting proof-of-principle experiments utilizing 6 HNSCC cell lines, the gene expression profiles of 20 OSCCs were determined using cDNA microarrays containing 19,200 sequences and the BTSVQ method of data analysis. We identified 2 sample clusters that correlated with the T3-T4 category of disease (p = 0.035) and nodal metastasis (p = 0.035). BTSVQ analysis identified a subset of 23 differentially expressed genes with the lowest QE scores in the cluster containing more advanced-stage tumors. Expression of 6 of these differentially expressed genes was validated by quantitative real-time RT-PCR. Statistical analysis of quantitative real-time RT-PCR data was performed and, after Bonferroni correction, CLDN1 overexpression was significantly correlated with the cluster containing more advanced-stage tumors (p = 0.007). Despite the clinical heterogeneity of OSCC, molecular subtyping by cDNA microarray analysis identified distinct patterns of gene expression associated with relevant clinical parameters. Application of this methodology represents an advance in the classification of oral cavity tumors and may ultimately aid in the development of more tailored therapies for oral carcinoma.
Mitochondria are cellular organelles regulating metabolism and cell death pathways. This study examined changes in mitochondrial membrane potential (ΔΨm) throughout the stages of preimplantation development in murine embryos conceived either in vivo or in vitro and human embryos donated to research from IVF. Embryos stained with the ΔΨm sensitive dye (JC-1) were quantified for the ratio of highly to lowly polarized mitochondria using a deconvolution microscope. Overall, murine zygotes and early embryos contain a subset of highly polarized mitochondria with a progressive increase in the ratio of highly to lowly polarized mitochondria observed with increasing cleavage. A transient increase in the ratio of high to low ΔΨm was observed in in vivo fertilized two-cell stage embryos, coincident with embryonic genome activation in the mouse, but not in two-cell embryos obtained through IVF. We further observed that arrested murine two-cell embryos possessed an increased ratio of highly to lowly polarized mitochondria compared to non-arrested embryos. In human eight cell embryos we observed an increased ratio of highly to lowly polarized mitochondria with increasing degrees of embryo fragmentation. We concluded that the pattern of ΔΨm progressively changes throughout preimplantation development, and that an aberrant shift in ΔΨm could contribute to or is associated with embryo abnormalities.
The building blocks of biological networks are individual protein-protein interactions (PPI). The cumulative PPI dataset in S. cerevisiae now exceeds 78,000. Studying the network of these interactions will provide valuable insight into the inner workings of cells.
Results: We performed a systematic graph theory based analysis of this PPI network to construct computational models for describing and predicting the properties of lethal mutations and proteins participating in genetic interactions, functional groups, protein complexes, and signaling pathways. Our analysis suggests that lethal mutations are not only highly connected within the network, but they also satisfy an additional property: their removal causes a disruption in network structure. We also provide evidence for the existence of alternate paths that bypass viable proteins in PPI networks, while such paths do not exist for lethal mutations. In addition, we show that distinct functional classes of proteins have differing network properties. We also demonstrate a way to extract and iteratively predict protein complexes and signaling pathways. We evaluate the power of predictions by comparing them to a random model, and assess accuracy of predictions by analyzing their overlap with the MIPS database.
Conclusions: Our models provide a means for understanding the complex wiring underlying cellular function, and enable us to predict essentiality, genetic interaction, function, protein complexes and cellular pathways. This analysis uncovers structure-function relationships observable in a large PPI network.
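The two network properties associated above with lethal mutations (high connectivity, and disruption of network structure on removal) can be illustrated with a minimal sketch. The toy graph and the protein names A to F are hypothetical, and the component-counting test is an assumed stand-in for "disruption", not the authors' model.

```python
from collections import deque

def build_graph(edges):
    """Build an undirected adjacency map from a list of (a, b) interactions."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def components(adj, removed=None):
    """Count connected components, ignoring the node `removed` if given."""
    seen = set()
    count = 0
    for start in adj:
        if start == removed or start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb != removed and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
    return count

def disruption_report(adj):
    """For each node, report its degree and whether removing it splits the network."""
    base = components(adj)
    return {
        node: {"degree": len(adj[node]),
               "disrupts": components(adj, removed=node) > base}
        for node in adj
    }

# Hypothetical toy network: a triangle A-B-C attached to a hub D with two leaves.
EDGES = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("D", "F")]
report = disruption_report(build_graph(EDGES))
```

In this toy network, A can be bypassed through the path B to C, so removing it leaves the network connected, whereas removing C or D separates the remaining nodes.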
Supplementary information
Supplementary data
A technique for automatically evaluating microbatch (400 nL) protein crystallization trials is described. This method addresses analysis problems introduced at the sub-microlitre scale, including non-uniform lighting and irregular droplet boundaries. The droplet is segmented from the well using a loopy probabilistic graphical model with a two-layered grid topology. A vector of 23 features is extracted from the droplet image using the Radon transform for straight-edge features and a bank of correlation filters for microcrystalline features. Image classification is achieved by linear discriminant analysis of its feature vector. The results of the automatic method are compared to those of a human expert on 32 1536-well plates. Using the human-labeled images as ground truth, this method classifies images with 85% accuracy and a ROC score of 0.84. This result compares well with the experimental repeatability rate assessed at 87%. Images falsely classified as crystal-positive variously contain speckled precipitate resembling microcrystals, skin effects, or genuine crystals falsely labeled by the human expert. Many images falsely classified as crystal-negative variously contain very fine crystal features or dendrites lacking straight edges. A characterization of these misclassifications suggests directions for improving the method.
The yeast pheromone/filamentous growth MAPK pathway mediates both mating and invasive-growth responses. The interface between this MAPK module and the transcriptional machinery consists of a network of two MAPKs, Fus3 and Kss1, two regulators, Rst1 and Rst2 (a.k.a. Dig1 and Dig2) and two transcription factors, Ste12 and Tec1. Of sixteen possible combinations of gene deletions in FUS3, KSS1, RST1, and RST2 in the S1278 background, ten exhibited constitutive invasive-growth. Rst1 was the primary negative regulator of invasive growth, while other components either attenuated or enhanced invasive growth, depending on the genetic context. Despite activation of the invasive response by lesions at the same level in the MAPK pathway, transcriptional profiles of different invasive mutant combinations did not exhibit a unified program of gene expression. The distal MAPK regulatory network is thus capable of generating phenotypically similar invasive-growth states (an attractor) from different molecular architectures (trajectories) that can functionally compensate for one another. This systems level robustness may also account for the observed diversity of signals that trigger invasive-growth.
Case-Based Reasoning (CBR) is a computational reasoning paradigm that involves the storage and retrieval of past experiences to solve novel problems. It is an approach that is particularly relevant in scientific domains, where there is a wealth of data, but often a lack of theories or general principles. This paper describes several CBR systems that have been developed to carry out planning, analysis and prediction in the domain of molecular biology.
Knowledge management research focuses on concepts, methods, and tools supporting the management of human knowledge. The main objective of this paper is to survey basic concepts that have been used in Computer Science for the representation of knowledge and summarize some of their advantages and drawbacks. A secondary objective is to relate these techniques to Information Science theory and practice.
The survey classifies the concepts used for knowledge representation into four broad ontological categories. Static ontologies describe static aspects of the world, i.e., what things exist, their attributes and relationships. A dynamic ontology, on the other hand, describes the changing aspects of the world in terms of states, state transitions and processes. Intentional ontologies encompass the world of things agents believe in, want, prove or disprove, and argue about. Finally, social ontologies cover social settings - agents, positions, roles, authority, permanent organizational structures or shifting networks of alliances and interdependencies.
Epidemiological studies have implicated androgens in the etiology/progression of epithelial ovarian cancer. Because normal and malignant ovarian epithelial cells are growth inhibited by transforming growth factor (TGF) beta, we tested the ability of 5alpha-dihydrotestosterone (DHT) to modulate this response and the expression of TGF-beta receptor types I and II. Cells derived from the ovarian surface epithelium of women undergoing oophorectomy (n = 7) for nonovarian indications or with a germ-line BRCA1 or 2 mutation (n = 9), and from the ascitic fluid of patients with primary ovarian cancer (n = 8) were cultured with and without DHT. Cell proliferation after TGF-beta1 or vehicle treatment was determined, and transcripts for TGF-beta receptors were measured by quantitative reverse transcription-PCR. As low levels of androgen receptor were observed in the cultures, we also measured transcript levels for steroid receptor coactivators SRC-1, ARA70, and AIB1. TGF-beta1 inhibited growth in 12 of 13 cultures tested, and DHT generally reversed this effect, demonstrating that androgens can block TGF-beta-induced growth inhibition in both malignant and nonmalignant ovarian epithelial cells. Transcripts for TGF-beta receptors, SRC-1, and ARA70 were found to be coordinately regulated by androgen in control cells, but not in either malignant or BRCA1/2-positive cell cultures. These findings raise the possibility that by modulating steroid receptor coactivator expression, androgen might affect other hormonal responses and contribute to the initiation of ovarian cancer.
A report on the Tenth International Conference on Intelligent Systems for Molecular Biology (ISMB), Edmonton, Canada, 3-7 August 2002.
Recent studies have suggested that information from gene expression profiles could be used to develop molecular classifications of cancer. We hypothesized that expression levels of specific genes in operative specimens could be correlated to recurrence risk in non-small cell lung cancer (NSCLC). We performed expression profiling using 19.2K cDNA microarrays on tumor specimens from a total of 39 NSCLC patients with known clinical follow-up information. Statistical analysis and clustering approaches were used to determine patterns of gene expression segregating with clinical outcome. The results provide evidence that molecular subtyping of NSCLC can identify distinct profiles of gene expression correlating with disease-free survival.
Supplementary Data
Motivation: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into meaningful groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified.
Results: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.
Availability: The BTSVQ system is implemented in Matlab R12 using the SOM toolbox for the visualization and preprocessing of the data. BTSVQ is available for non-commercial use (http://www.uhnres.utoronto.ca/ta3/BTSVQ).
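The core idea of tree-structured vector quantization driven by k-means can be sketched as a recursive two-way split. This is a minimal illustration in Python, not the Matlab BTSVQ implementation: the deterministic seeding, the stopping rule (`max_depth`, `min_size`), and the toy profiles are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))

def two_means(points, iterations=20):
    """Split points into two groups with Lloyd's k-means, k = 2."""
    # Assumed seeding: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: sq_dist(p, c0))
    left, right = [], []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c0) <= sq_dist(p, c1)]
        right = [p for p in points if sq_dist(p, c0) > sq_dist(p, c1)]
        if not left or not right:
            break
        new_c0, new_c1 = mean(left), mean(right)
        if (new_c0, new_c1) == (c0, c1):
            break  # assignments have converged
        c0, c1 = new_c0, new_c1
    return left, right

def build_tree(points, max_depth=2, min_size=2):
    """Recursively partition points into a binary tree of codebook centroids."""
    node = {"centroid": mean(points), "size": len(points), "children": []}
    if max_depth > 0 and len(points) >= 2 * min_size:
        left, right = two_means(points)
        if left and right:
            node["children"] = [build_tree(left, max_depth - 1, min_size),
                                build_tree(right, max_depth - 1, min_size)]
    return node

def leaves(node):
    """Return the leaf nodes (final clusters) of the tree, left to right."""
    if not node["children"]:
        return [node]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Hypothetical two-dimensional expression profiles forming two obvious groups.
PROFILES = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
tree = build_tree(PROFILES)
```

Each internal node stores a centroid, so the tree can be read at any depth as a coarser or finer clustering of the same data.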
Supplementary Data
Macromolecular crystallization efforts are frequently divided into a search phase, during which approximate conditions are sought, and an optimization phase, when the approximate conditions are optimized to yield crystals of sufficient quality for diffraction work. Faced with the possibility that, on a yearly basis, many hundreds of proteins might be generated, both in our laboratories and at the laboratories of our collaborators, we have recently designed and commissioned a high throughput robotics lab designed for the search phase. The lab is capable of setting up and photographically evaluating over 60,000 microbatch crystallization experiments per week. In the first four months of operation we have set up crystallization experiments for more than one hundred proteins.
This paper describes issues related to integrating image analysis techniques with knowledge discovery and case-based reasoning. Although the work is applicable to a number of problem domains, here we focus on the problem of analyzing and classifying outcomes of protein crystallization experiments in high-throughput structural genomics. We apply fast Fourier transform to analyze image content in order to extract important features of the spectrum. A combination of these features is used to classify crystallization experiments' outcomes. Although humans can analyze images more flexibly, a computational approach makes the process scalable and more objective. We evaluate the classification process and present results on how the automatically-extracted features can be combined to discover important crystallographic knowledge.
Microarrays of mouse genes are now available from several sources, and they have so far given new insights into gene expression in embryonic development, regions of the brain and during apoptosis. Microarray data posted on the internet can be reanalyzed to study a range of questions.
Genomic projects are producing hundreds of proteins a year for structural analysis. The challenge of the research described in this paper is to remove crystal growth experiments as a rate-limiting step in the enterprise of structure determination of proteins. We meet this challenge by combining a high-throughput crystallization setup and evaluation in the wet lab with a sophisticated algorithmic analysis of the outcomes in the computer lab. Furthermore, we apply techniques from knowledge management and artificial intelligence to develop an automated system that assists expert crystallographers in planning and evaluating novel crystal growth experiments. Fundamental to our computational approach to crystallization is a comprehensive information repository for crystal growth experiments. This stored information will be used to discover general rules or principles underlying the growth process for crystals, as well as to guide the reasoning algorithm for planning experiments.
The paper reports on the preliminary results in the wet lab and computation lab respectively. We define the problem, propose an architecture for intelligent decision support in the crystallization domain, and report on the status of the individual components of the architecture.
A case base is a repository of past experiences that can be used for problem solving. Given a new problem, expressed in the form of a query, the case base is browsed in search of "similar" or "relevant" cases. Conversational case-based reasoning (CBR) systems generally support user interaction during case retrieval and adaptation. Here we focus on case retrieval where users initiate problem solving by entering a partial problem description. During an interactive CBR session, a user may submit additional queries to provide a "focus of attention". These queries may be obtained by relaxing or restricting the constraints specified for a prior query. Thus, case retrieval involves the iterative evaluation of a series of queries against the case base, where each query in the series is obtained by restricting or relaxing the preceding query.
This paper considers alternative approaches for implementing iterative browsing in conversational CBR systems. First, we discuss a naive algorithm, which evaluates each query independent of earlier evaluations. Second, we introduce an incremental algorithm, which reuses the results of past query evaluations to minimize the computation required for subsequent queries. In particular, the paper proposes an efficient algorithm for case base browsing and retrieval using database techniques for incremental view maintenance. In addition, the paper evaluates the performance of the proposed algorithm with respect to alternative approaches considering two perspectives: (i) experimental efficiency evaluation using diverse application domains, and (ii) scalability evaluation using the performance model of the proposed system.
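The difference between the naive and incremental strategies can be sketched as follows. This is only an illustration of result reuse under restriction and relaxation, not the view-maintenance algorithm proposed in the paper; the `IncrementalBrowser` class, the attribute names, and the toy cases are hypothetical.

```python
def matches(case, query):
    """A case matches when every constrained attribute has an allowed value."""
    return all(case.get(attr) in allowed for attr, allowed in query.items())

def naive_eval(case_base, query):
    """Naive strategy: scan the whole case base for every query."""
    return [c for c in case_base if matches(c, query)]

class IncrementalBrowser:
    """Reuses the previous result when a query is restricted or relaxed."""

    def __init__(self, case_base):
        self.case_base = case_base
        self.query = {}
        self.result = list(range(len(case_base)))  # the empty query matches all
        self.checks = 0  # number of cases examined so far

    def cases(self):
        return [self.case_base[i] for i in self.result]

    def restrict(self, attr, allowed):
        """Add or tighten a constraint: only prior matches are re-checked."""
        allowed = set(allowed)
        if attr in self.query:
            allowed &= self.query[attr]
        self.query[attr] = allowed
        self.checks += len(self.result)
        self.result = [i for i in self.result
                       if self.case_base[i].get(attr) in allowed]
        return self.cases()

    def relax(self, attr):
        """Drop a constraint: prior matches stay valid, only the rest is checked."""
        self.query.pop(attr, None)
        kept = set(self.result)
        others = [i for i in range(len(self.case_base)) if i not in kept]
        self.checks += len(others)
        added = [i for i in others if matches(self.case_base[i], self.query)]
        self.result = sorted(kept | set(added))
        return self.cases()

# Hypothetical case base.
CASES = [
    {"id": 1, "domain": "ivf", "outcome": "success"},
    {"id": 2, "domain": "ivf", "outcome": "failure"},
    {"id": 3, "domain": "nephrology", "outcome": "success"},
    {"id": 4, "domain": "ivf", "outcome": "success"},
]
browser = IncrementalBrowser(CASES)
browser.restrict("domain", {"ivf"})
browser.restrict("outcome", {"success"})
final = browser.relax("domain")
```

For this three-query session the incremental browser examines 9 cases in total, while three independent scans of the four-case base would examine 12.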
In vitro fertilization (IVF) is a medically-assisted reproduction technique, enabling infertile couples to achieve successful pregnancy. Given the unpredictability of the task, we propose to use a case-based reasoning system that exploits past experiences to suggest possible modifications to an IVF treatment plan in order to improve overall success rates. Once the system's knowledge base is populated with a sufficient number of past cases, it can be used to explore and discover interesting relationships among data, thereby achieving a form of knowledge mining. The article describes the TA3IVF system -- a case-based reasoning system which relies on context-based relevance assessment to assist in knowledge visualization, interactive data exploration and discovery in this domain. The system can be used as an advisor to the physician during clinical work and during research to help determine what knowledge sources are relevant for a treatment plan.
Classification involves associating instances with particular classes by maximizing intra-class similarities and minimizing inter-class similarities. Thus, the way similarity among instances is measured is crucial for the success of the system. In case-based reasoning, it is assumed that similar problems have similar solutions. The case-based approach to classification is founded on retrieving cases from the case base that are similar to a given problem, and associating the problem with the class containing the most similar cases.
Similarity-based retrieval tools can advantageously be used in building flexible retrieval and classification systems. Case-based classification uses previously classified instances to label unknown instances with proper classes. Classification accuracy is affected by the retrieval process -- the more relevant the instances used for classification, the greater the accuracy.
The paper presents a novel approach to case-based classification. The algorithm is based on a notion of similarity assessment and was developed for supporting flexible retrieval of relevant information. Case similarity is assessed with respect to a given context that defines constraints for matching. Context relaxation and restriction is used for controlling the classification accuracy. The validity of the proposed approach is tested on real-world domains, and the system's performance, in terms of accuracy and scalability, is compared to that of other machine learning algorithms.
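Classification by retrieval within a context, with relaxation when too few cases match, can be sketched as below. The representation of a context as per-attribute numeric tolerances, the widening factor, and the majority vote are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def in_context(features, problem, context):
    """A case matches when every attribute lies within the context's tolerance."""
    return all(abs(features[attr] - problem[attr]) <= tol
               for attr, tol in context.items())

def classify(case_base, problem, context, min_cases=1, widen=2.0, max_rounds=5):
    """Return (label, relaxation rounds used), or (None, max_rounds) on failure."""
    ctx = dict(context)
    for round_no in range(max_rounds + 1):
        retrieved = [c for c in case_base
                     if in_context(c["features"], problem, ctx)]
        if len(retrieved) >= min_cases:
            votes = Counter(c["label"] for c in retrieved)
            return votes.most_common(1)[0][0], round_no
        # Relax the context: widen every tolerance and retry.
        ctx = {attr: tol * widen for attr, tol in ctx.items()}
    return None, max_rounds

# Hypothetical case base with two numeric attributes.
CASE_BASE = [
    {"features": {"x": 1.0, "y": 1.0}, "label": "A"},
    {"features": {"x": 1.2, "y": 0.9}, "label": "A"},
    {"features": {"x": 5.0, "y": 5.0}, "label": "B"},
]
CONTEXT = {"x": 0.5, "y": 0.5}

near = classify(CASE_BASE, {"x": 1.1, "y": 1.0}, CONTEXT)
far = classify(CASE_BASE, {"x": 4.0, "y": 4.0}, CONTEXT)
```

Restricting the context (smaller tolerances, or a larger `min_cases`) trades coverage for more relevant retrieved cases; relaxing it does the reverse.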
This paper describes issues related to integrating image analysis techniques into case-based reasoning. Although the approach is generic, a high-throughput protein crystallization problem is used as an example. Our solution to the crystallization problem is to store outcomes of experiments as images, extract important image features, and use them to automatically recognize different crystallization outcomes. Subsequently, we use the outcomes of image classification to perform case-based planning of crystallization experiments for new proteins. Knowledge-discovery techniques are used to extract general principles for crystallization. Such principles are applicable to the adaptation phase of case-based reasoning. The motivation for automated image-feature extraction is twofold: (1) the human interpretation/analysis of image content is subjective, and (2) many problem domains require reasoning with large databases of uninterpreted images. In this paper we present the design and implementation of our integrated system, as well as some preliminary experimental results.
Structural genomics projects promise to produce hundreds of proteins a year for structural analysis. The challenge to crystal growers is to make some other step in the structural biology enterprise rate-limiting. Our approach is to combine high throughput (HTP) crystallization setup and evaluation in the wet lab with sophisticated algorithmic analyses of the HTP outcomes in the computer lab for the purposes of recipe prediction.
In the wet lab we now have the capacity to prepare and evaluate the results of over sixty thousand (61.4K) crystallization experiments a workweek. Each is a microbatch experiment conducted under paraffin oil. Pipetting is performed with robots outfitted with 96 or 384 syringes and XYZ translation stages. High-density (1536-well) micro-assay plates hold the experiments. 1536 crystallization cocktails, covering a wide range of crystallizing agents, have been prepared. Current pipetting protocols allow us to deploy 200 nL droplets of protein solution and crystallization cocktails (total drop size 400 nL). Once a micro-assay plate is prepared with paraffin oil and crystallization cocktails, it is possible to set protein solution into the wells in less than five minutes, allowing us to work quickly with unstable proteins. Current total protein requirements are being assessed, but are likely to be in the 10 mg range. After setup, plates are placed on a computer-controlled XY table with micron positioning accuracy. The plates are translated under a megapixel digital camera where images are captured by a framegrabber. The XY table can accommodate 28 plates (43K experiments) at a time and the camera can record 43K images in approximately twelve hours.
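The quoted capacities can be cross-checked with a few lines of arithmetic. The figures are taken from the paragraph above; the derived values (seconds per image, plates per week) are inferences from those figures, not numbers reported by the authors.

```python
WELLS_PER_PLATE = 1536        # high-density micro-assay plate
PLATES_ON_TABLE = 28          # capacity of the XY table
IMAGING_HOURS = 12            # "approximately twelve hours" per full table
WEEKLY_EXPERIMENTS = 61_400   # "61.4K" experiments per workweek
PROTEIN_NL = 200              # protein solution per drop
COCKTAIL_NL = 200             # crystallization cocktail per drop

# One full table: 28 plates of 1536 wells, quoted as "43K experiments".
images_per_run = WELLS_PER_PLATE * PLATES_ON_TABLE

# Implied imaging rate (inferred): roughly one image per second.
seconds_per_image = IMAGING_HOURS * 3600 / images_per_run

# Implied weekly plate count (inferred): about 40 plates.
plates_per_week = WEEKLY_EXPERIMENTS / WELLS_PER_PLATE

# Total drop size, quoted as 400 nL.
drop_volume_nl = PROTEIN_NL + COCKTAIL_NL
```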
In the computer lab the images are analyzed automatically to determine the outcomes of the crystallization experiments. We are developing a standard vocabulary of outcomes that will describe the results: clear drop, amorphous precipitate, phase separation, microcrystals, crystals, and uncertain outcome. These outcomes, recorded as a function of time, are the cornerstone of a crystallization database that will contain physical information about individual proteins as well as results of crystallization experiments with those proteins. Using case-based reasoning algorithms we will identify patterns of similar properties and crystallization outcomes relating two or more proteins in the database. Our hypothesis is that, given a quantitative measure of similarity between proteins, recipes successfully employed for one protein will be useful starting points for crystallization experiments with similar proteins. Future work will center upon the most predictive measures of similarity.
The medical potential of the various genome projects now underway will be realized when we know not only the sequences of the amino acids coded in open reading frames (ORFs) but also what these ORFs represent, both structurally and functionally. Structural proteomics will challenge us to grow more and better crystals for diffraction studies. Our labs are involved in two major aspects of that work: getting the techniques and equipment in place to do large-scale, high-throughput crystallization experiments, and assembling the expertise to make sense of all the data that will come from those experiments.
We need to use dynamic knowledge organization approaches in order to facilitate effective access and use of domain knowledge. Although there are many approaches to knowledge organization available, it is a challenge to systematically organize evolving domains, because it is not feasible to rely only on humans to create relationships among individual knowledge sources. Additional problems arise because knowledge may not be consistently and completely described, and quality control may not always be in place in distributed knowledge environments. In this article we describe a generic approach to knowledge organization by using systematic knowledge management and applying knowledge-discovery techniques. We use a case-based reasoning system, called TA3, as a core component for knowledge management. Applying the symbolic knowledge-discovery component of TA3 supports three main tasks: system optimization, knowledge evolution and evidence creation. To explain the advantages of this approach, we use our experience from biomedical domains.
This paper describes the application of automated image analysis to evaluate morphology and developmental features of oocytes and embryos in the domain of in vitro fertilization (IVF). Although humans can analyze images more flexibly, computer vision techniques make the process more objective and precise. We propose to use computer-based morphometry to precisely and objectively identify developmental features of oocytes and embryos. Extracted morphological information can be linked with symbolic information to better predict pregnancy outcome and suggest further medical procedures. Recognized features can then be used to support case-based reasoning and knowledge discovery. The combination of image analysis techniques and case-based reasoning can thus serve as: (1) a feature extraction technique; (2) an indexing approach; and (3) an analysis tool. A combination of symbolic and image information can then be used to identify morphological features of oocytes and embryos that are vital for successful IVF. Extracting image features and analyzing them helps to perform knowledge discovery from images.
Knowledge management research focuses on the development of concepts, methods, and tools supporting the management of human knowledge. To further this objective, researchers are studying the way organizations, groups and individuals use knowledge in the performance of daily tasks. They are also developing computer-based tools and techniques to support the acquisition, representation, organization, retrieval, analysis and evolution of knowledge in its many forms. The main objective of this paper is to survey some of the primitive concepts that have been used in computer science for the representation of knowledge and summarize some of their advantages and drawbacks. A secondary objective is to relate these techniques to information sciences theory and practice.
Several research areas within computer science have developed techniques for representing knowledge so that it can be accessed and used by humans and software systems alike. In particular, Artificial Intelligence (AI) has developed techniques for representing knowledge so that it can be exploited by intelligent systems. Databases have focused on techniques, which allow the representation and management of large amounts of simple knowledge, using as vehicles relational databases and related technologies. Software Engineering and Information Systems have developed elaborate techniques for capturing knowledge that relates to the requirements, design decisions and rationale for a software system. We characterize all these techniques in terms of the primitive concepts they offer for representing knowledge within a given class of applications.
This paper presents some preliminary results on applying information retrieval and knowledge-mining techniques to reverse engineering of legacy systems. In order to support a dynamic environment, we take an approach of integrating lightweight tools. Instead of forcing a user to use a fixed environment, our approach provides a basic information repository, which manages information extracted from the documentation and source code. The system stores this information in a graph structure and supports navigation through the repository, as well as modification of its structure and annotation. Preliminary evaluation of the proposed approach on a small software system is encouraging.
The health care industry faces constant demands to improve quality, extend services, and reduce cost. Telemedicine satisfies these demands by supporting distant consultations. In addition, knowledge-based systems may augment current synchronous telemedicine applications by storing and managing medical experience over time. By providing timely and efficient access to the knowledge repository, knowledge-based systems help to distribute experience, standardize procedures, lower cost, and increase quality of health care services. This facilitates asynchronous telemedicine.
Our previous experience from using a case-based reasoning system to support specialists in the in vitro fertilization domain shows that this paradigm is suitable for building medical knowledge repositories for knowledge sharing. We propose to extend the system to support tele-consultations: (1) between specialists (rare medical cases); (2) between general practitioners and specialists (standard practices); and (3) between health care professionals and patients (generic medical information). This will help to standardize patient examination and treatment practices. In addition, physicians will be able to share experience via a remote knowledge repository.
This paper focuses on extensions for specialists. We show how case-based reasoning can support evidence-based medicine, remote consultations, and improve knowledge sharing and domain understanding.
This paper reviews several knowledge organization techniques used in Computer Science, in areas such as Artificial Intelligence, Databases and Software Engineering. Some of these computational mechanisms may assist in the organization and management of immense digital information resources. At the same time, the paper notes an increasing need for computer-based information systems to operate in open networked environments. This need requires knowledge organization principles that are flexible and can be used with informally expressed knowledge. We expect to find such knowledge organization techniques in Library and Information Sciences, and hope to integrate them with the computational techniques described in this paper.
A case base is a repository of past experiences that can be used for problem solving. Given a new problem, expressed in the form of a query, the case base is browsed in search of "similar" or "relevant" cases. One way to perform this search involves the iterative evaluation of a series of queries against the case base, where each query in the series is obtained by restricting or relaxing the preceding query.
The paper considers alternative approaches for implementing iterative browsing in case-based reasoning systems, including a naive algorithm, which evaluates each query independent of earlier evaluations, and an incremental algorithm, which reuses the results of past query evaluations to minimize the computation required for subsequent queries. In particular, the paper proposes an efficient algorithm for case base browsing and retrieval using database techniques for view maintenance. In addition, the paper evaluates the performance of the proposed algorithm with respect to alternative approaches considering two perspectives: (1) experimental efficiency evaluation using diverse application domains, and (2) scalability evaluation using the performance model of the proposed system.
Complex decision-support information systems for diverse domains need advanced facilities, such as knowledge repositories, reasoning systems, and modeling for processing interrelated information. System development must satisfy functional requirements, but must also systematically meet global quality factors, such as performance, confidentiality and accuracy, called non-functional requirements (NFRs).
Case-based reasoning (CBR) systems, an important class of decision support systems, require a design process that systematically produces high-quality applications. Beyond satisfying basic functional requirements for CBR, it is important to meet global quality factors, such as performance and confidentiality, called non-functional requirements (NFRs). This paper presents a goal-oriented, knowledge-based approach for aiding decision support system development and usage; namely, it proposes an approach for dealing with NFRs for CBR systems. We show how quality can be built into a CBR system, using the "QualityCBR" approach, which integrates existing work on CBR and NFRs. We illustrate the use of the approach in a complex medical domain, in vitro fertilization. In this domain, a CBR system is used for: (1) suggesting hormonal therapy for in vitro fertilization patients, (2) predicting the probability of successful pregnancy, and (3) interactively determining important patient characteristics that can improve pregnancy rate. The QualityCBR approach is used to address important NFRs, such as performance, accuracy and confidentiality.
The paper presents a similarity-based retrieval framework for a software repository that aids the process of maintaining, understanding, and migrating legacy software systems. Designing a software repository involves three issues: (1) information content; (2) information representation; and (3) strategies for accessing repository artifacts. Given the architecture of a Bookshelf software repository, we extend the retrieval system to support imprecise queries, iterative browsing, and diverse users. Because of repository size, complexity of queries and relations among artifacts, we take a performance approach to support a scalable implementation. We propose a retrieval system that uses numeric and semantically rich context-based similarity. Efficient iterative browsing is based on an incremental query evaluation algorithm from database management systems. Explicitly defined context supports various retrieval strategies and diverse user models.
This paper introduces a generic approach to knowledge-based decision-support in medicine. We review problems present in medical domains and introduce available solutions. We describe a case-based reasoning system called SpotLight and discuss its advantages when applied to complex medical domains, in vitro fertilization and nephrology.