BERT NER Models & PubMedSucker 2.0
It has been quiet around HealthECCO for a few weeks, not because nothing is going on, but because we have been busy. Today we want to give you a quick update on what we are currently working on:
- BERT NER models
We have started exploring three BERT NER models, specialized in recognizing genes, chemicals and diseases, to extract these entities from the abstracts in our graph. Artem is currently testing this on 150,000 abstracts, which he matched with terms that already have relationships.
As a next step, he extracted BERT embeddings of the texts for a cluster analysis, reduced the dimensionality of the abstract embeddings and visualized them in a 3D scatter plot. First results suggest that abstracts tagged “Disease” and “Gene” tend to diverge from each other, while abstracts tagged “Chemical” are spread evenly.
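To illustrate the dimensionality reduction step, here is a minimal sketch that projects embedding vectors onto three dimensions for a 3D scatter plot. It is not our actual pipeline: PCA is used here only as an example technique, `pca_reduce` is a hypothetical helper, and the random vectors merely stand in for real BERT embeddings (768 is the hidden size of BERT-base models).

```python
import numpy as np

def pca_reduce(embeddings, n_components=3):
    """Project row vectors onto their top principal components (plain PCA)."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ vt[:n_components].T

# Stand-in data (assumption): 100 random vectors instead of real embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(100, 768))

coords = pca_reduce(fake_embeddings)  # one (x, y, z) row per abstract
print(coords.shape)
```

Each row of `coords` can then be plotted as one point and colored by its tag (“Disease”, “Gene” or “Chemical”).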
The next steps are to create a memory graph of abstracts and terms and to calculate how often each gene and disease term pair occurs, in order to discover connections between certain genes and diseases. A weakly connected components analysis will then be performed to try to obtain gene-disease clusters.
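The planned analysis can be sketched roughly as follows. This is an illustration under assumptions, not the HealthECCO implementation: the function names, the `(kind, term)` input format and the placeholder terms are all made up, and the components are computed with a simple union-find. In a graph where edge direction is ignored, weakly connected components are simply the connected components.

```python
from collections import Counter
from itertools import product

def pair_counts(abstract_terms):
    """Count in how many abstracts each (gene, disease) pair co-occurs."""
    counts = Counter()
    for terms in abstract_terms:
        genes = {term for kind, term in terms if kind == "Gene"}
        diseases = {term for kind, term in terms if kind == "Disease"}
        for gene, disease in product(genes, diseases):
            counts[(gene, disease)] += 1
    return counts

def connected_components(pairs):
    """Group gene and disease nodes linked by at least one pair (union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for gene, disease in pairs:
        parent[find(("Gene", gene))] = find(("Disease", disease))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())

# Made-up placeholder data (assumption): tagged terms per abstract.
abstracts = [
    [("Gene", "GENE_A"), ("Disease", "DISEASE_X")],
    [("Gene", "GENE_A"), ("Disease", "DISEASE_X"), ("Chemical", "CHEM_1")],
    [("Gene", "GENE_B"), ("Disease", "DISEASE_X")],
    [("Gene", "GENE_C"), ("Disease", "DISEASE_Y")],
]

counts = pair_counts(abstracts)
min_count = 1  # hypothetical threshold for keeping a pair as an edge
edges = [pair for pair, n in counts.items() if n >= min_count]
components = connected_components(edges)
print(counts.most_common())
print(components)
```

Raising `min_count` drops rarely co-occurring pairs before clustering, which splits the graph into smaller components.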
- PubMedSucker 2.0
Last week, Tim's PubMedSucker 2.0 (see image of schema above) successfully passed DZD-internal test runs and is now ready to be integrated into the HealthECCO pipeline.
The new PMS is a tool developed in-house at DZD to load all abstracts of papers in the National Library of Medicine MEDLINE/PubMed database into the HealthECCO graph. This will be the new base data for the next iteration of the HealthECCO graph, coming soon 🙂
We are always open to new collaborators. If you want to join the project and get involved, reach out to us @healthecco
Besides these two projects, we are also very excited to share some big news about HealthECCO in general soon! So stay tuned!