Graphs in Action
At HealthECCO we have based our solution on graph technology, which brings many advantages, particularly for connected data. In this post we want to showcase how these advantages can benefit users on a daily basis.
Below we outline what the typical journey of a researcher looking for information looks like with traditional tools, compared to a graph-based approach.
Imagine this researcher wants to learn more about a set of genes, their transcripts (RNAs) and proteins, their molecular functions and the publications that mention them. With traditional tools, these are the steps they need to follow:
1. They enter the “gene of interest” in a gene database search (e.g. Ensembl, NCBI) and collect the database identifiers, gene name, synonyms and coding transcripts. They store the gene identifiers, name, synonyms and transcript identifiers in an Excel sheet.
2. For each transcript identifier (from step 1), they enter the id in a transcript database (e.g. RefSeq) and collect the name of the transcript, its copies (synonyms) with their transcript identifiers, and the proteins it codes for. They store the information in another Excel sheet.
3. For each protein identifier (from step 2), they enter the id in a protein database (e.g. UniProt) and collect the name, the synonyms and the protein identifier. They store the information in another Excel sheet.
4. For each protein identifier (from step 3), they enter the id in a functional database (e.g. Gene Ontology) and collect the names of the different functions. They store the information in another Excel sheet.
5. Finally, they enter each gene name in a literature database (e.g. PubMed) to get a list of publications mentioning the gene. They store the information in another Excel sheet.
Storing the information in an Excel sheet takes approximately 30 seconds each time, and manually searching the respective databases takes between 25 seconds and 2 minutes per search. This quickly adds up to approximately 14 minutes per gene of interest, and researchers need to repeat the same process for every single gene on their list.
Usually, the set of genes of interest from a single experiment contains between 20 and 100 genes. This process is manual, tedious and very error-prone.
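To make the scale concrete, here is a small back-of-the-envelope calculation in Python. It only uses the figures quoted above (roughly 14 minutes per gene and lists of 20 to 100 genes); the function name `manual_hours` is ours and purely illustrative.

```python
# Approximate manual effort per gene, taken from the estimate in the text.
MINUTES_PER_GENE = 14


def manual_hours(gene_count: int, minutes_per_gene: float = MINUTES_PER_GENE) -> float:
    """Return the total manual lookup time in hours for a list of genes."""
    return gene_count * minutes_per_gene / 60


# Lower and upper end of a typical gene list from a single experiment.
for gene_count in (20, 100):
    print(f"{gene_count} genes: about {manual_hours(gene_count):.1f} hours")
# prints:
# 20 genes: about 4.7 hours
# 100 genes: about 23.3 hours
```

In other words, even a small gene list costs most of a working day, and a large one costs several days of pure lookup and copy-paste work.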
In contrast, the same search on CovidGraph takes 4 seconds per gene including loading visualisations.
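Why is the graph so much faster? The five manual steps are really one walk along connected data: gene to transcripts, transcripts to proteins, proteins to functions, plus gene to publications. The following Python sketch illustrates that idea with a tiny in-memory graph. It is not CovidGraph's actual data model or query: the identifiers, the relationship names (`CODES`, `HAS_FUNCTION`, `MENTIONED_IN`) and the helper functions are all made up for illustration, and a real graph database would typically express the same walk as a single query.

```python
from collections import defaultdict

# Toy graph as (source node, relationship, target node) triples.
# All identifiers and relationship names are hypothetical.
EDGES = [
    ("gene:G1", "CODES", "transcript:T1"),
    ("gene:G1", "CODES", "transcript:T2"),
    ("transcript:T1", "CODES", "protein:P1"),
    ("transcript:T2", "CODES", "protein:P2"),
    ("protein:P1", "HAS_FUNCTION", "function:F1"),
    ("protein:P2", "HAS_FUNCTION", "function:F1"),
    ("protein:P2", "HAS_FUNCTION", "function:F2"),
    ("gene:G1", "MENTIONED_IN", "publication:PUB1"),
]


def build_index(edges):
    """Index edges by (source, relationship) so each hop is a dictionary lookup."""
    index = defaultdict(list)
    for source, relationship, target in edges:
        index[(source, relationship)].append(target)
    return index


def follow(index, starts, relationship):
    """Return the distinct nodes reached from `starts` via `relationship`, in first-seen order."""
    seen = {}
    for node in starts:
        for target in index.get((node, relationship), []):
            seen.setdefault(target, None)
    return list(seen)


def gene_report(index, gene):
    """Collect everything the five manual steps gather, in one traversal."""
    transcripts = follow(index, [gene], "CODES")
    proteins = follow(index, transcripts, "CODES")
    functions = follow(index, proteins, "HAS_FUNCTION")
    publications = follow(index, [gene], "MENTIONED_IN")
    return {
        "transcripts": transcripts,
        "proteins": proteins,
        "functions": functions,
        "publications": publications,
    }


INDEX = build_index(EDGES)
report = gene_report(INDEX, "gene:G1")
for kind, nodes in report.items():
    print(kind, nodes)
```

Because the relationships are stored alongside the data, there is no copying of identifiers from one database into the next, which is exactly where the manual process loses time and picks up errors.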
If you want to learn more about the technology we use, you can read about this example and others on our Technology page.