Meditron X Retrieval Augmented Generation

development
Author

Gigi Sung

Published

January 14, 2025

Meditron 70B

This project leverages a specialized Large Language Model (LLM)—built on Llama-2 and adapted to the medical domain—to help WHO EMRO staff quickly find and interpret vital health information. The model functions as both a dynamic search engine and an automated knowledge assistant. It can pull from WHO’s internal/historical documents (e.g., situation reports, response plans, rapid risk assessments) to generate targeted outputs, such as a “Response Plan for [disease]” or a “Public Health Support Assessment.” When opted in, the system can also draw upon external web-based resources to enrich its responses.

Ultimately, the project aims to reduce the time and effort required for WHO teams to locate and analyze critical regional health data. By storing vector representations of documents and retrieving them based on user queries, the LLM surfaces the most relevant content, then synthesizes an answer or formatted report. This approach ensures quicker, more accurate decision-making for outbreak response, risk assessment, and public health planning.


Below is a consolidated workflow that covers both vector-store creation (via FAISS + Sentence Transformers) and querying Meditron 7B through the local LM Studio API endpoint. This lets you practice Retrieval-Augmented Generation (RAG) on your own machine until you’re ready for the 70B server-based solution.

1. Environment Setup

  1. Install Dependencies
     • Python 3.9+
     • pip install langchain sentence-transformers faiss-cpu PyPDF2 transformers torch requests
  2. (Optional) Create a Virtual Environment
     • Helps keep dependencies isolated.
     • Example:
python -m venv .venv
source .venv/bin/activate
pip install ...
  3. Collect WHO PDFs
     • Download or gather small PDF files to test your pipeline (e.g., monthly or weekly situation reports).

2. Build and Test Your Vector Store

Use a script similar to VSdatabase-test.py:

  1. Text Extraction
     • Parse PDFs with PyPDF2 to extract text.
  2. Chunking
     • Split the raw text into segments (e.g., ~500 tokens or words).
  3. Embedding
     • Convert each chunk into a vector using SentenceTransformers (e.g., all-MiniLM-L6-v2).
  4. FAISS Index
     • Store chunk embeddings in a FAISS index for fast similarity searches.
     • Confirm the index is built by printing index.ntotal.

Example snippet for building the index:

import faiss

# Helper functions (extract_text_from_pdf, chunk_text, text_to_vectors,
# build_faiss_index) come from your VSdatabase-test.py script.
pdf_path = "path_to_who_report.pdf"
document_text = extract_text_from_pdf(pdf_path)
chunks = chunk_text(document_text, chunk_size=500)
vectors = text_to_vectors(chunks)  # uses SentenceTransformers
index = build_faiss_index(vectors)
faiss.write_index(index, "faiss_index.bin")
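
If you don’t already have those helpers from VSdatabase-test.py, here is a minimal sketch of what they might look like. The function names match the snippet above; the word-based chunking and the all-MiniLM-L6-v2 model are assumptions consistent with the steps listed earlier, not the exact implementation.

import faiss
from PyPDF2 import PdfReader  # requires a recent PyPDF2 (2.x+)
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

def extract_text_from_pdf(pdf_path):
    # Concatenate the extracted text of every page in the PDF.
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

def chunk_text(text, chunk_size=500):
    # Naive word-based chunking: roughly chunk_size words per segment.
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def text_to_vectors(chunks):
    # Encode each chunk into a dense vector; FAISS expects float32.
    return embedding_model.encode(chunks, convert_to_numpy=True).astype("float32")

def build_faiss_index(vectors):
    # Flat L2 index: exact nearest-neighbour search, fine for a small collection.
    index = faiss.IndexFlatL2(vectors.shape[1])
    index.add(vectors)
    return index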

3. Start the Local Server in LM Studio

  1. Open LM Studio, load “meditron-7b”.
  2. Click Start to launch the local server.
  3. By default, it listens at http://localhost:1234.
  4. Under “API Usage,” verify the endpoints and model name (e.g., meditron-7b).
  5. Confirm you see endpoints like:
     • POST /v1/completions
     • POST /v1/chat/completions
     • POST /v1/embeddings (if supported)
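
Before wiring up the full pipeline, it can help to confirm the server is actually reachable. A quick sanity check against the OpenAI-style models endpoint, assuming the default port shown above:

import requests

# List the models the local LM Studio server is currently serving.
resp = requests.get("http://localhost:1234/v1/models")
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model.get("id"))  # expect something like "meditron-7b"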

4. Querying the Vector Store + LLM in One Flow

Below is a conceptual combined workflow:

flowchart LR
    A[WHO PDFs] --> B[Extract & Chunk Text]
    B --> C[Embed Chunks with SentenceTransformers]
    C --> D[Store in FAISS Index]
    E[User Query] --> F[Vector Store Query]
    F --> G[Retrieve Top-k Chunks]
    G --> H[Create Prompt with Chunks + Question]
    H --> I[POST to LM Studio LLM API]
    I --> J[LLM Generates Answer]
    J --> K[Answer Returned to User]

Steps Explained

  1. User Query
     • You have a question, e.g., “What does the WHO say about outbreak detection in 2019?”
  2. Vector Store Query
     • Use the same embedding method on the user query and search the FAISS index.
  3. Retrieve Top-k Chunks
     • FAISS returns, say, the top 3–5 most relevant text segments from the PDF(s).
  4. Prompt Assembly
     • Concatenate these chunks into a short “Context” string.
     • Form a prompt that looks like:

CONTEXT:
<chunk 1>
<chunk 2>
<chunk 3>

USER QUESTION:
<your question here>

PLEASE ANSWER BASED ON THE ABOVE CONTEXT:

  5. Send Prompt to the LLM
     • Make a POST request to LM Studio’s /v1/completions or /v1/chat/completions.
  6. Receive Model Answer
     • Display or store the final answer from the model.

5. Example: Python Integration

Below is a minimal example showing how you might tie it together:

import requests
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# 1) Load the FAISS index
index = faiss.read_index("faiss_index.bin")

# 2) Function to query FAISS
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')  # same model as used for storing

def query_vector_store(user_query, index, k=3):
    # Embed the query with the same model used at index-build time,
    # then return (chunk index, distance) pairs for the top-k matches.
    query_vector = embedding_model.encode(user_query)
    distances, faiss_indices = index.search(np.array([query_vector]), k)
    return [(faiss_indices[0][i], distances[0][i]) for i in range(k)]

# 3) Retrieve top-k chunks
chunks = [...]  # The list of text chunks you embedded during index creation
user_query = "What does WHO say about cholera outbreaks in 2019?"

results = query_vector_store(user_query, index, k=3)
retrieved_texts = []
for idx, dist in results:
    retrieved_texts.append(chunks[idx])

# 4) Build the final prompt
context = "\n\n".join(retrieved_texts)
prompt_text = f"Context:\n{context}\n\nUser Question:\n{user_query}\n\nAnswer:"

# 5) Call the LM Studio local server
url = "http://localhost:1234/v1/completions"
headers = {"Content-Type": "application/json"}

data = {
    "model": "meditron-7b",   # Replace with the model name from LM Studio
    "prompt": prompt_text,
    "max_tokens": 300,
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=data)

# 6) Parse and print the response
json_resp = response.json()
if "choices" in json_resp:
    print("Model answer:", json_resp["choices"][0]["text"])
else:
    print("Error:", json_resp)
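
One practical detail: the FAISS index stores only the vectors, not the chunk texts, so the chunks list above has to be saved at index-build time and reloaded here. A minimal sketch using JSON (the chunks.json filename is just an assumption):

import json

# At index-build time, right after faiss.write_index(...):
with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False)

# At query time, before calling query_vector_store(...):
with open("chunks.json", "r", encoding="utf-8") as f:
    chunks = json.load(f)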

6. Expanding to a Chat Endpoint

If you prefer the Chat style with roles (system, user, assistant):

  1. Change the endpoint to /v1/chat/completions.
  2. Format your data as:

data = {
  "model": "meditron-7b",
  "messages": [
    {"role": "system", "content": "You are a helpful epidemiology assistant."},
    {"role": "user", "content": f"Context:\n{context}\n\nQuestion:\n{user_query}"}
  ],
  "max_tokens": 300,
  "temperature": 0.7
}
  3. Parse the response.json() similarly, as shown below.
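
For chat completions, the answer is nested under a message object rather than a text field. A short sketch of the request and parsing step, reusing the headers and data pattern from Section 5 and assuming the same OpenAI-compatible response shape:

chat_url = "http://localhost:1234/v1/chat/completions"
response = requests.post(chat_url, headers=headers, json=data)
json_resp = response.json()

if "choices" in json_resp:
    # Chat responses return the text under choices[0]["message"]["content"].
    print("Model answer:", json_resp["choices"][0]["message"]["content"])
else:
    print("Error:", json_resp)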

7. Next Steps

  1. Scale Up
     • Once your server can handle 70B, replicate the same approach; only the model name in LM Studio changes.
  2. User Interface
     • Build a Flask or Streamlit app (see the sketch below) that:
       • Accepts user queries,
       • Pulls relevant text from FAISS,
       • Sends a combined prompt to the LLM,
       • Displays the final response.
  3. Ongoing Updates
     • As WHO releases new PDFs, embed them, add them to the FAISS index, and keep your pipeline fresh without retraining the model.
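
As a starting point for the user-interface idea above, here is a minimal Streamlit sketch that ties the earlier pieces together. It assumes query_vector_store, index, and chunks are importable from a module named rag_pipeline (a hypothetical name for the Section 5 code saved as a file); the rest mirrors the request shown there.

import requests
import streamlit as st

# Hypothetical module wrapping the Section 5 code (index, chunks, query_vector_store).
from rag_pipeline import query_vector_store, index, chunks

st.title("WHO Document Q&A (Meditron RAG)")
user_query = st.text_input("Ask a question about the indexed WHO documents:")

if user_query:
    # Retrieve the top-k chunks and assemble the prompt.
    results = query_vector_store(user_query, index, k=3)
    context = "\n\n".join(chunks[idx] for idx, _ in results)
    prompt_text = f"Context:\n{context}\n\nUser Question:\n{user_query}\n\nAnswer:"

    # Send the prompt to the local LM Studio server.
    resp = requests.post(
        "http://localhost:1234/v1/completions",
        headers={"Content-Type": "application/json"},
        json={"model": "meditron-7b", "prompt": prompt_text,
              "max_tokens": 300, "temperature": 0.7},
    )
    json_resp = resp.json()
    if "choices" in json_resp:
        st.write(json_resp["choices"][0]["text"])
    else:
        st.error(f"Error: {json_resp}")

Run it with streamlit run app.py after installing Streamlit (pip install streamlit).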

Final Summary

By combining your vector-store retrieval (FAISS + SentenceTransformers) with the LM Studio local server (OpenAI-like API), you can implement a practical end-to-end RAG system on your personal Mac. This will allow you to test question-answering over WHO documents using the Meditron 7B model—setting the stage for seamless migration to the Meditron 70B once your dedicated server is ready.