Blog post with Astrid Lægreid (not on the photo) from The Department of Cancer Research and Molecular Medicine:
It is often overlooked that publishing your research results does not necessarily mean you have provided your new knowledge to your colleagues in the best possible way. Today’s biomedical science depends heavily on computers to analyse and integrate the various types of data and facts that you and your fellow scientists have produced. And whereas a computer can do many things, it has difficulty understanding what we grasp so easily when we read a scientific publication.
Although much research is done to improve the way computers can analyse text (the field of text mining), we excel in hiding facts and new knowledge in our publications. We use for instance words that can have multiple meanings, names that seem funny (sonic hedgehog) but do not mean anything to a computer, or we mention some facts in a context that greatly changes the meaning of a sentence (for instance by using the simple word ‘not’). We therefore need to reach out to computers and help them a bit with understanding the real knowledge that we have hidden so well in text. This is even more interesting if one wants to pursue the main goal we have set for our research at NTNU: using a systems biology approach for making new biological discoveries.
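To make the point about the word ‘not’ concrete, here is a minimal sketch in Python. It is not taken from any real text-mining tool: the sentences, the placeholder names TF_A, GENE_X and GENE_Y, and both functions are invented for illustration only. The first function reports a relation whenever a factor and a gene appear around the word ‘bind’; the second also checks for a negation.

```python
import re

# Hypothetical example sentences, invented for illustration only.
SENTENCES = [
    "TF_A binds the promoter of GENE_X.",
    "TF_A does not bind the promoter of GENE_Y.",
]

# One rigid pattern; real publications phrase things in far more varied ways.
PATTERN = re.compile(
    r"(?P<tf>\w+) (?P<neg>does not )?binds? the promoter of (?P<gene>\w+)"
)


def naive_extract(sentence):
    """Return (factor, gene) whenever the pattern matches, ignoring negation."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("tf"), match.group("gene"))


def negation_aware_extract(sentence):
    """Return (factor, gene) only if the sentence does not negate the binding."""
    match = PATTERN.search(sentence)
    if match is None or match.group("neg"):
        return None
    return (match.group("tf"), match.group("gene"))


if __name__ == "__main__":
    for sentence in SENTENCES:
        print(sentence)
        print("  naive:         ", naive_extract(sentence))
        print("  negation-aware:", negation_aware_extract(sentence))
```

For the second sentence the naive version reports a binding relation that the text explicitly denies, while the negation-aware version reports nothing. Even this small fix only handles one fixed phrasing, which is part of why hand-curated, computer-readable knowledge remains valuable.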
Systems biology is based on a computer dealing with knowledge about biological systems or processes (like cell division; or regulation of the activity levels of genes). It is widely believed that a systems biology-based understanding of the human will allow great discoveries for improved health care. Systems biology has been made possible by the tremendous advancements in laboratory technologies that are now available to get massive amounts of data about the processes, cells and organs of our bodies. Once these data have been interpreted and published, system scale biomedical knowledge can be integrated into computer models in order to enable improved disease management and higher precision medicine. However, in order to succeed, we need to take proper care of this knowledge.
In our daily work we have developed various computer models of the cell lines we use in laboratory experiments. Each time, we had to gather the information for these models by reading many papers, because only a very small amount of it was available through databases. This made us think that it would be great if at least one part of the information for these models were readily available to computers: information from the area of gene regulation. One small but very important part of this is the knowledge about the system that connects the information in a particular class of proteins (transcription factors, TFs) with the particular DNA sequences in the genome in the vicinity of genes (recognition sequences, or transcription factor binding sites). This system essentially links the protein world with the DNA world and dictates which genes are active and which genes remain silent. We have recently launched a large effort to build a resource for this that covers three of the most important biological systems: human, mouse and rat (1, 2).
Of course, we know that the DNA-binding TFs are only a very small part of the very complex system of gene regulation, and it will take a very large group of scientists to take care of all the diverse forms of knowledge in the literature. Here we are lucky not to be alone in realizing the importance of this. We have identified many researchers worldwide who are willing to join us in a global consortium for taking care of, or ‘curating’, gene regulation knowledge, and we are now discussing with them how best to structure existing efforts and launch new ones to jointly build a series of resources covering the complete domain of gene regulation in all organisms.
Existing databases and knowledge sources within our consortium include, amongst others, the Gene Ontology databases, PAZAR, TFCat, TFactS and RegulonDB, as well as DBD and IntAct at the European Bioinformatics Institute (EBI). Existing and new resources are designed in such a way that the information can be easily integrated into computer models. The consortium is named the ‘Gene Regulation Consortium’ (GRECO for short), and is led by us.
Our basic objective is to extend what we currently do only for the DNA-binding transcription factors from mouse, human and rat to the full field of gene regulation, and to do so for all organisms. That field includes many particular types of regulatory proteins, many types of regulatory RNAs, and many different structural and functional elements encoded in the DNA, which together allow the gene regulation system to fine-tune the activity of genes as appropriate for a specific cellular function.
The aims of GRECO are to:
Foster communication across the field of gene regulation
Assess the state of the art in annotating components and relationships important to describe gene regulation events
Identify common initiatives, avoid redundancy, fill knowledge gaps
Extend and align ontologies and controlled vocabularies
Promote common data exchange formats
Promote common curation quality guidelines
Attract funding to support communication and initiate new curation initiatives
We were fortunate to receive some financial support from NTNU to organize the first GRECO workshop on April 5 at the University of Toronto campus, as a satellite meeting of the Seventh Conference of the International Society for Biocuration (ISB2014). We met with partners from the UK, Switzerland, Germany, the USA, Mexico, Brazil and Saudi Arabia, presented our ideas for this initiative, and laid the foundation for a joint strategy for acquiring additional project support from international funding organisations like the National Institutes of Health in the USA and the Horizon 2020 programme of the European Union, or national funding agencies like NFR.
We hope to present some of our work at the Virtual Physiological Human (VPH) Conference 2014 in Trondheim in September 2014. The VPH mission is to contribute to developing a real predictive, preventive and participatory medicine by enabling the building of stronger transdisciplinary ties between the life sciences, the mathematical sciences and engineering throughout the whole spectrum of basic, translational and applied research.
References
1) Tripathi S, Christie KR, Balakrishnan R, Huntley R, Hill DP, Thommesen L, Blake JA, Kuiper M, Lægreid A. Gene Ontology Annotation of Sequence-specific DNA-binding Transcription Factors: Setting the Stage for a Large Scale Curation Effort. Database 2013 Aug 27; bat062.
2) Chawla K, Tripathi S, Thommesen L, Lægreid A, Kuiper M. TFcheckpoint: a curated compendium of specific DNA-binding RNA polymerase II transcription factors. Bioinformatics 2013; 29(19): 2519-2520.