Cathy’s Current Projects
- Protein family classification, functional annotation, and structure-function analysis – As a central approach to protein annotation for the UniProt Knowledgebase, we employ a classification-driven, rule-based method. The PIRSF system classifies proteins from the superfamily to the subfamily level to reflect the evolutionary relationships of proteins and their domain architecture, allowing comparative studies of protein function and evolution [Wu et al., 2004; Nikolskaya et al., 2006]. Coupled with manually curated, structure-guided rules, the system supports the standardized and accurate annotation of protein names, functions, and functional sites [Wu et al., 2006]. This systematic approach provides high-quality functional annotation while keeping pace with the exponential growth of molecular sequence data.
- Biological text mining – With an ever-increasing volume of scientific literature now available electronically, we have been collaborating with several Natural Language Processing research groups to develop algorithms for text mining and information extraction [Hirschman et al., 2002]. Several projects have led to tools directly accessible from the iProLINK text mining resource [Hu et al., 2004], including the BioThesaurus of gene/protein names, which allows the identification of synonymous and ambiguous names [Liu et al., 2006], and the RLIMS-P text mining system, which extracts phosphorylation information (kinase, protein substrate, and phosphorylation sites) from Medline abstracts [Hu et al., 2005]. We plan to develop a “configurable, intelligent and integrated” text mining system that bridges PubMed and databases for knowledge discovery. We also co-organize the BioCreative Challenge Evaluations, which bring together the text mining and biological research communities to evaluate and guide the future development of text mining systems.
- Biomedical ontology – As biomedical ontologies have emerged as critical tools in biological research for the semantic integration of complex data in disparate resources, we have developed a Protein Ontology (PRO) in the OBO (Open Biomedical Ontologies) Foundry framework [Natale et al., 2007]. Extending from the evolutionary relationships of protein classes to the representation of multiple protein forms of genes (e.g., isoforms, post-translational modifications), PRO allows precise definition of protein objects in biological context (e.g., pathways, networks, complexes) and specification of relationships with other ontologies (such as Gene Ontology) [Arighi et al., 2009]. The project aims to capture and represent the knowledge of protein biology embedded in the scientific literature to facilitate pathway, network and disease modeling.
- Omics data integration and pathway/network analysis – Designed for data integration in a distributed environment, the iProClass database provides rich protein annotation with data from over 100 molecular databases [Wu et al., 2004]. It is also the underlying data warehouse for gene/protein ID and name mapping. Building upon iProClass and UniProt, we have developed the iProXpress system for functional profiling and pathway analysis of large-scale gene expression and proteomic data [Huang et al., 2007]. iProXpress has been applied to several studies, including proteomic profiling of melanosomes and lysosome-related organelles, identification of signaling pathways and networks underlying estrogen-induced apoptosis of breast cancer cells, and analysis of cellular pathways in radiation-resistant cells [Chi et al., 2006; Hu et al., 2007; 2008]. As part of the NIAID biodefense proteomics program, we have integrated various omics data on pathogens and their hosts, allowing biologists to query and analyze data about pathogen-host relationships from multiple disparate proteomic centers. We have also conducted integrative bioinformatics analysis of protein structure, function and evolution to identify potential targets for hemorrhagic viruses [Mazumder et al., 2007]. We plan to further develop network mining, visualization and prediction methods and to couple them with the integrative bioinformatics approach to facilitate data-driven hypothesis generation.