yjcyxky/text2knowledge

Extract entities and relationships from biomedical text and build a knowledge graph.

Installation

Requirements

Install Ollama to manage the LLM models. You can follow the instructions on the Ollama website (https://ollama.ai/) to install it.

Install Text2Knowledge

conda create -n text2knowledge python=3.10 openjdk=11

git clone https://github.com/yjcyxky/text2knowledge.git

cd text2knowledge
pip install -r requirements.txt

PDF to Text

Step 1: Launch the grobid server

If you have any questions about how to launch the grobid server, please refer to https://grobid.readthedocs.io/en/latest/Grobid-service/.

If the automatic download of grobid fails when launching the grobid server, please download grobid manually, put it in the pdf2json folder, and rename it to grobid-0.8.0.zip (the grobid version we use is 0.8.0).

The download link of grobid is grobid-0.8.0.zip. After the download finishes, please run bash launch_grobid.sh again.

If you cannot run the grobid server successfully, please use Docker to run it. If you only want to try the extraction function, you can skip Step 1 and use the public grobid server (https://kermitt2-grobid.hf.space) in Step 2 instead.

cd pdf2json
bash launch_grobid.sh

# or

docker run --rm --init --ulimit core=0 -p 8070:8070 grobid/grobid:0.8.0
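
Before moving on, you can verify that the grobid server is reachable. A minimal Python sketch, assuming the requests package is available and using the standard /api/isalive route exposed by the grobid service:

import requests

# Check that the grobid service answers on its isalive endpoint.
# Use https://kermitt2-grobid.hf.space instead if you rely on the public server.
GROBID_URL = "http://localhost:8070"

resp = requests.get(f"{GROBID_URL}/api/isalive", timeout=10)
print("grobid is alive:", resp.ok and resp.text.strip().lower() == "true")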

Step 2: Convert PDF to JSON/figures/tables/text

We use grobid and scipdf_parser to convert PDFs to JSON, figures, tables, and text. For more details on the conversion, please refer to the grobid and scipdf_parser documentation. If you want to convert a large number of PDFs, please use a local grobid server instead of the public one (https://kermitt2-grobid.hf.space).

python3 extract.py pdf2text --grobid-url https://kermitt2-grobid.hf.space --pdf-dir ./examples/antibody/pdfs --output-dir ./examples/antibody/extracted_pdfs/

# or 

python3 extract.py pdf2text --grobid-url http://localhost:8070 --pdf-dir ./examples/antibody/pdfs --output-dir ./examples/antibody/extracted_pdfs/

After running the above command, you will get the following files:

examples/antibody/extracted_pdfs/16451124
    |-- 16451124.json               # Abstract and body text
    |-- pdf                         # Original pdf, just for convenience
    |   |-- 16451124.pdf
    |-- figures
    |   |-- 16451124-Figure1-1.png  # Figure 1 in the paper
    |   |-- 16451124-Figure2-1.png
    |   |-- 16451124-Figure3-1.png
    |   |-- 16451124-Figure4-1.png
    |   |-- 16451124-Figure5-1.png
    |-- data
    |   |-- 16451124.json           # Abstract and body text, same as 16451124.json, just for convenience
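
A minimal Python sketch for inspecting one extracted paper. It assumes the JSON follows the scipdf_parser-style layout with abstract and sections keys; check the file and adjust the keys if extract.py writes a different structure:

import json
from pathlib import Path

# Load the extracted paper; the "abstract"/"sections" keys below follow the
# scipdf_parser output layout and are an assumption, not a guaranteed schema.
paper_path = Path("examples/antibody/extracted_pdfs/16451124/16451124.json")
paper = json.loads(paper_path.read_text())

print(paper.get("abstract", "")[:200])
for section in paper.get("sections", []):
    print(section.get("heading", ""), "-", len(section.get("text", "")), "characters")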

Step 3: Extract text chunks into a JSON file

python3 extract.py text-chunks ./examples/antibody/extracted_pdfs ./examples/antibody/antibody.json
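
To sanity-check the result, you can load the chunk file in Python. This is only a sketch, assuming text-chunks writes a JSON document containing a list of text chunks; adjust it to the actual structure produced by extract.py:

import json

# Inspect the chunk file; the assumption that it holds a list of chunks is
# illustrative only -- verify the structure extract.py actually writes.
with open("examples/antibody/antibody.json") as f:
    chunks = json.load(f)

print(type(chunks).__name__)
if isinstance(chunks, list):
    print("number of chunks:", len(chunks))
    print("first chunk preview:", str(chunks[0])[:200])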

Text to Knowledge Graph

Article Classification

python3 text2knowledge.py classify-article --input-file ./classfication/example.json --output-file ./classfication/results/mixtral_8x22b.json -m mixtral:8x22b

Strategy 1: Employ an LLM to extract entities and relations directly

Please refer to Prompts for more details.

If you want to extract all entities from the text, you can use the following command.

python3 text2knowledge.py extract-entities --text-file examples/text2knowledge/abstract.txt --output-file examples/text2knowledge/entities.json --model-name mistral:latest

If you want to extract all relations from the text, you can use the following command.

python3 text2knowledge.py extract-relations --text-file examples/text2knowledge/abstract.txt --output-file examples/text2knowledge/relationships.json --model-name mistral:latest
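
These commands prompt a local Ollama model behind the scenes. If you want to experiment with your own prompts, here is a minimal Python sketch using Ollama's HTTP generate endpoint (assuming the requests package); the prompt shown is illustrative and is not the prompt shipped with text2knowledge:

import requests

# Illustrative prompt only -- text2knowledge ships its own prompts.
text = open("examples/text2knowledge/abstract.txt").read()
prompt = (
    "Extract all biomedical entities (genes, diseases, chemicals, drugs) from "
    "the following text and return them as a JSON list of strings.\n\n" + text
)

# Ollama's generate endpoint; stream=False returns a single JSON object
# whose "response" field holds the model output.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "mistral:latest", "prompt": prompt, "stream": False},
    timeout=600,
)
print(resp.json()["response"])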

Issues

  • How to improve the accuracy of the entity extraction?

  • How to align the extracted entities and relations? In the current version, we extract entities and relations separately.

  • How to align all entities to ontology items, such as Hepatocellular carcinoma --> MONDO:0007256? You can visit BioPortal to learn more about the ontology items.

Strategy 2: Employ an LLM to extract entities and relations by asking choice questions [Not Ready Yet]

Introduction

A new solution for converting text to a knowledge graph:

  1. Extract all biomedical entities from the text by using a large language model (e.g. ChatGPT4, Vicuna, etc.)
  2. Convert all preset ontology items to embeddings
  3. Map all extracted entities to the ontology items by computing the similarity between their embeddings, and then pick the top N most similar ontology items for each entity (see the sketch after this list)
  4. Use a more precise method to re-rank the top N similar ontology items for each entity and pick the top 1
  5. Generate questions from the mapped ontology items. If we have ten entities, we can generate C(10, 2) = 10! / [2!(10-2)!] = (10 × 9) / (2 × 1) = 45 questions. We can reduce the number of questions based on our needs, for example by only considering the specific entities we care about.
  6. Pick the answer for each question from the text by using a large language model (e.g. ChatGPT4, Vicuna, etc.)
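
A minimal sketch of steps 2-4 (embedding, similarity ranking, top-N selection). It assumes the sentence-transformers package and uses a general-purpose embedding model for illustration; a biomedical embedding model and the full ontology vocabulary would be needed in practice:

import numpy as np
from sentence_transformers import SentenceTransformer

# The model name and the tiny ontology list are illustrative assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

ontology_items = ["hepatocellular carcinoma", "lung adenocarcinoma", "hepatitis B"]
entities = ["Hepatocellular carcinoma", "HBV infection"]

# Embed both vocabularies; normalized embeddings make the dot product a cosine similarity.
onto_emb = model.encode(ontology_items, normalize_embeddings=True)
ent_emb = model.encode(entities, normalize_embeddings=True)

similarity = ent_emb @ onto_emb.T
top_n = np.argsort(-similarity, axis=1)[:, :2]   # top 2 candidates per entity

for entity, idx in zip(entities, top_n):
    print(entity, "->", [ontology_items[i] for i in idx])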

Improvement plan

  1. Fine-tune embedding algorithm for biomedical entities
  2. Select the most suitable similarity algorithm
  3. Select a suitable re-ranking algorithm
  4. Improve the prompts for generating questions based on the characteristics of the large language model

Launch a Chatbot Server for Text2Knowledge

NOTE: Read the README.md in the chatbot folder for more details [Not Ready Yet]. Alternatively, you can use the open source project Ollama (see its website or GitHub repository) instead of our chatbot.

After you install Ollama, you can run the following commands to pull the models and launch the Ollama server.

Pull the models

ollama pull mistral-openorca:latest

# or
bash pull_models.sh

Launch the Ollama server

This step may not be required for you. If you installed Ollama on macOS, you can also launch the Ollama server by clicking the Ollama icon in the Applications folder.

ollama serve

After you launch the Ollama server, you can open the following link in your browser to list all the available models.

http://127.0.0.1:11434/api/tags
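
You can also query the same endpoint from Python instead of a browser; a minimal sketch assuming the requests package:

import requests

# /api/tags returns a JSON object with a "models" list describing each pulled model.
resp = requests.get("http://127.0.0.1:11434/api/tags", timeout=10)
for model in resp.json().get("models", []):
    print(model["name"])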

[Optional] Change the storage path

If you have limited storage space on your computer, you can change the model storage path to another disk. For more details on how to change the storage path, please refer to the Ollama FAQ.

echo 'export OLLAMA_MODELS=/path/to/your/disk' >> ~/.bashrc
source ~/.bashrc

Benchmarking

Datasets

Benchmarking Datasets and Tools for Biomedical NLP

  1. Biomedical Datasets: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04688-w/tables/2
  2. N2C2 NLP Dataset: https://portal.dbmi.hms.harvard.edu
  3. BC5CDR (BioCreative V CDR corpus): https://paperswithcode.com/dataset/bc5cdr
  4. BC4CHEMD (BioCreative IV Chemical compound and drug name recognition): https://paperswithcode.com/dataset/bc4chemd
  5. BioNLP: https://aclanthology.org/venues/bionlp/
  6. PubTator: https://www.ncbi.nlm.nih.gov/research/pubtator3/
  7. BioNLP-Corpus: https://github.com/bionlp-hzau/BioNLP-Corpus
  8. BioBERT & Bern: https://github.com/dmis-lab/bern
  9. BioRED: https://academic.oup.com/bib/article/23/5/bbac282/6645993

References

You can refer to these papers/models/companies for more details.

Contribution Guidelines

We welcome and appreciate any contributions from the community members. If you wish to contribute to Text2Knowledge, please follow these steps:

  1. Fork the repository and create your branch.
  2. Make changes in your branch.
  3. Submit a Pull Request.

Please ensure that your code adheres to the project's coding style and quality standards before submitting your contribution.

License

Text2Knowledge is released under the MIT License. For more details, please refer to the LICENSE.md file in the repository.
