HealthEcco is a non-profit organization committed to open-source software and open access to knowledge, in order to improve the dissemination and use of guidelines and to foster innovative, collaborative global research.
Our objective is public access to all data sources, as far as the licenses of the original sources allow. Following the FAIR principles (Findable, Accessible, Interoperable, Reusable), our knowledge graph not only reveals hidden connections but is publicly available and globally accessible. By incorporating ontologies, we make the data interoperable so that it can be reused and repurposed. All APIs on top of the knowledge graph are public, allowing the community to develop applications. All code is published on GitHub, where the community can develop it further. All tools developed during the project, such as the data loading and data enrichment pipelines, are and will remain open-source.
We are building a unique solution to combine, annotate and organize the world's health knowledge and get it into the hands of the right people at the right time.
The beating heart of our platform is a knowledge graph (built using Neo4j) that integrates a growing number of different but related data sets. Our data loading pipelines process each of the data sets, indexing nodes, creating connections to other data sets, and annotating text in the data using natural language processing. The connections in the graph make the data findable. These pipelines are portable and repeatable, making our data reusable. Incorporating ontologies makes the data interoperable, so it can be reused and repurposed.
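To make the pipeline idea concrete, here is a minimal sketch of one loading step: index a node per record and link it to nodes loaded from other data sets. This is an in-memory stand-in, not the actual CovidGraph schema; the labels, relationship types and property names are illustrative assumptions.

```python
# Minimal in-memory stand-in for one data-loading step.
# Node labels, relationship types and properties are illustrative,
# not the real CovidGraph schema.

nodes = {}   # id -> {"label": ..., "props": ...}
edges = []   # (source_id, relationship_type, target_id)

def load_node(node_id, label, **props):
    """Upsert a node, so repeated pipeline runs stay idempotent."""
    nodes[node_id] = {"label": label, "props": props}

def link(src, rel, dst):
    """Create a relationship only once both endpoints are indexed."""
    if src in nodes and dst in nodes and (src, rel, dst) not in edges:
        edges.append((src, rel, dst))

# Load a gene, one of its transcripts, and a paper mentioning the gene,
# then connect them across data sets.
load_node("HGNC:11998", "Gene", symbol="TP53")
load_node("ENST00000269305", "Transcript", source="ENSEMBL")
load_node("paper-001", "Paper", source="PubMed")  # placeholder id
link("HGNC:11998", "CODES", "ENST00000269305")
link("paper-001", "MENTIONS", "HGNC:11998")
```

Because the loaders are upserts keyed by stable identifiers, re-running a pipeline over an updated data dump refreshes the graph without duplicating nodes.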
We provide a free, public GraphQL API on top of the knowledge graph to support the development of additional applications and the integration of third-party tools, making the data interoperable.
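A client of such an API would post a GraphQL query as JSON. The sketch below shows the general shape of that request; the endpoint URL, query fields and filter names are hypothetical assumptions, not the published CovidGraph schema.

```python
import json
import urllib.request

# Hypothetical query: field and filter names are illustrative
# assumptions, not the actual CovidGraph GraphQL schema.
QUERY = """
query PapersMentioningGene($symbol: String!) {
  genes(filter: { symbol: $symbol }) {
    symbol
    papers { title source }
  }
}
"""

def build_request(symbol, endpoint="https://example.org/graphql"):
    """Build a standard GraphQL-over-HTTP POST request (endpoint is a placeholder)."""
    payload = json.dumps({"query": QUERY, "variables": {"symbol": symbol}}).encode()
    return urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )

req = build_request("ACE2")
```

Any GraphQL-capable tool can issue the same request, which is what makes a public API a convenient integration point for third-party applications.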
A growing ecosystem of applications focused on specific use cases continues to be developed. These are free and globally available, and all code behind our entire platform is, and always will be, published as open-source on GitHub, making both our graph and the tools used to build it accessible.
We collect and connect unstructured data from various data sources:
1. literature from the pre-print servers bioRxiv and medRxiv and from PubMed, including authors and their affiliations
2. the complete set of biomolecules (genes, transcripts, proteins, metabolites) from all major genome databases (ENSEMBL, NCBI Gene, RefSeq, UniProt, HGNC)
3. clinical trial data and metadata across diseases (ClinicalTrials.gov), plus related press releases (via Meltwater)
4. patents related to COVID-19, including inventors and patent owners (lens.org)
5. published experimental data from biomedical research databases (GTEx, GEO) and functional annotations such as the Gene Ontology (GO)
6. case numbers from Johns Hopkins University and the Robert Koch Institute
A data integration system of this scope would require an inordinate amount of effort if built using traditional database technologies; it simply would not be feasible to implement in a research setting. Only with graph technologies are we able to build, maintain and extend such a system with the limited resources we have had to date. The key advantages that enable this are the flexibility, extensibility and scalability of graph technologies. While the concepts behind data integration are not new, graph technology now makes possible what previous attempts failed to achieve.
Graph in action
Imagine a researcher who, on a daily basis, needs to look up information related to a set of genes, transcripts (RNAs) and proteins, their molecular functions and the publications that mention them.
The steps they need to follow:
1. Enter the “gene of interest” in a gene database search (e.g. Ensembl, NCBI), collecting the database identifiers, gene name, synonyms and coding transcripts. They store the identifiers, name, synonyms and transcript identifiers in an Excel sheet.
2. For each transcript identifier (from step 1), they enter the id in a transcript database (e.g. RefSeq) and collect the name of the transcript, its synonyms with their transcript identifiers, and the coding proteins. They store the information in another Excel sheet.
3. For each protein identifier (from step 2), they enter the id in a protein database (e.g. UniProt) and collect the name, the synonyms and the canonical protein identifier. They store the information in another Excel sheet.
4. For each protein identifier (from step 3) they enter the id in a functional database (e.g. Gene Ontology) and collect the names of the different functions. They store the information in another Excel sheet.
5. Finally, for each gene name, they search a literature database (e.g. PubMed) to get a list of publications mentioning the gene. They store the information in another Excel sheet.
Storing the information in an Excel sheet takes approximately 30 seconds, and manually searching the respective databases takes between 25 seconds and 2 minutes. This quickly adds up to approximately 14 minutes per gene of interest, and researchers must follow the same process for every single gene on their list.
Usually, the set of genes of interest (from a single experiment) contains between 20 and 100 genes. This process is manual, tedious and very error-prone.
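A back-of-envelope calculation reproduces the per-gene estimate, assuming a typical fan-out of two transcripts and two proteins per gene (an assumption for illustration; real fan-out varies) and an average search time midway between 25 seconds and 2 minutes:

```python
STORE_S = 30                 # ~30 s to record results in a spreadsheet
SEARCH_S = (25 + 120) / 2    # average manual search: 25 s to 2 min

# Assumed fan-out per gene: 2 transcripts, 2 proteins.
# Steps 1-5: 1 gene lookup + 2 transcript + 2 protein + 2 GO + 1 PubMed.
lookups = 1 + 2 + 2 + 2 + 1

per_gene_min = lookups * (STORE_S + SEARCH_S) / 60
print(round(per_gene_min, 1))   # roughly 14 minutes per gene
```

At that rate, a typical experiment with 20 to 100 genes of interest costs somewhere between roughly 4.5 and 23 hours of repetitive manual work.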
In contrast, the same search on CovidGraph takes 4 seconds per gene, including loading visualisations.