Luis Chaves committed on
Commit
ca919d4
·
1 Parent(s): 70b960d

added answers and improved how context is fetched from the chunks

README.md CHANGED
@@ -9,8 +9,15 @@ pinned: false
9
 
10
  ## local dev
11
12
  ```
13
- uv run uvicorn app:app --reload --port 8000
14
  ```
15
 
16
  if your pdfs are in a folder called `pdfs/` run:
@@ -19,4 +26,17 @@ if your pdfs are in a folder called `pdfs/` run:
19
  curl -v -X POST -F "file=@pdfs/MECFS systematic review.pdf" http://localhost:8000/api/v1/extract
20
  ```
21
 
22
- Or use the automatic Swagger documentation at `http://localhost:8000/docs`
9
 
10
  ## local dev
11
 
12
+ install dependencies:
13
+
14
+ ```sh
15
+ uv venv
16
+ UV_PYTHON=3.12 uv pip install -r pyproject.toml
17
  ```
18
+
19
+ ```sh
20
+ uv run src/everycure/app.py
21
  ```
22
 
23
  if your pdfs are in a folder called `pdfs/` run:
 
26
  curl -v -X POST -F "file=@pdfs/MECFS systematic review.pdf" http://localhost:8000/api/v1/extract
27
  ```
28
 
29
+ to test the remote endpoint (unfortunately quite slow):
30
+
31
+ ```sh
32
+ curl -X POST -F "file=@pdfs/MECFS systematic review.pdf" https://lucharo-everycure-ner-pdf.hf.space/api/v1/extract
33
+ ```
34
+
35
+ check API docs at <https://lucharo-everycure-ner-pdf.hf.space/docs>
36
+
37
+ ### run tests
38
+
39
+ ```sh
40
+ uv pip install -e .[dev]
41
+ uv run pytest
42
+ ```
learning.md ADDED
@@ -0,0 +1,79 @@
1
+ # Every Cure Take Home
2
+
3
+ ## How to create an API endpoint that adheres to an OpenAPI spec?
4
+
5
+ ## How to host publicly and for free an API?
6
+
7
+ can use docker + hugging face
8
+
9
+ ## What type of hugging face models do entity type extraction?
10
+
11
+ NER models; some are fine-tuned on medical terminology, such as d4data/biomedical-ner-all, BioBERT or ClinicalBERT.
12
+ Could also use LLM calls, but it's hard to judge which would perform better without benchmarking (a potential improvement), and they might be more expensive than a simpler fine-tuned BERT model.
13
+
14
+ BioBERT was trained in 2020; not many docs on HF, but it's the most popular, with 700k downloads last month
15
+ ClinicalBERT: 47k downloads last month (2023)
16
+
17
+ Bio+Clinical BERT: 3M downloads (2019)
18
+
19
+ The clinical NER leaderboard is useful: <https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard>.
20
+
21
+ indeed LLMs are up there
22
+
23
+ ## What do entities mean in the context of this challenge?
24
+
25
+ In this context, entities refer to [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
26
+ and in particular medical entities (diseases, names of molecules, proteins, medical procedures, etc.)
27
+
28
+ There are models specifically trained to do NER on text; we'll leverage those.
29
+
30
+ ## how to extract text out of a pdf?
31
+
32
+ pdfplumber works pretty well. As stated below, we'll keep images and tables out of scope; pdfplumber does extract text from tables, but without time to assess the extraction quality we don't know how reliable that is
33
+
34
+ ## how to extract meaningful context that's not just related to the text content? words around it?
35
+
36
+ attention mechanism comes to mind
37
+
38
+ ## caveats of pdfplumber
39
+
40
+ we shouldn't include appendix and references into the mix
41
+
42
+ ## torch and uv
43
+
44
+ torch doesn't work on Python 3.13 yet, so pin to 3.12
45
+
46
+ UV_PYTHON=3.12 uv init
47
+ uv add transformers torch pdfplumber marimo gliner
48
+
49
+ ##  separate model and app -> probs cleaner but don't have the time
50
+ to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
51
+
52
+ ## what's the context size of these bert models? do i need to chunk the output
53
+
54
+ ## test the fast api
55
+
56
+ it's got a nice test module
57
+
58
+ ## looks good
59
+
60
+ <https://huggingface.co/blaze999/Medical-NER>
61
+
62
+ <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
63
+
64
+ nice: <https://huggingface.co/urchade/gliner_base>
65
+
66
+ ## what's the max length that gliner accepts?
67
+
68
+ haven't been able to find it
69
+
70
+ ## Parts to the problem
71
+
72
+ - Check how good pdfplumber or PyMuPDF is at extracting text without butchering it.
73
+ - I think for now I could focus on text and list image or table parsing as an improvement.
74
+ - Identify suitable model for tasks
75
+ - write out fastapi endpoint matching openapi spec
76
+ - write out caching based on filename/content (sha)
77
+ - write out effective logging in API backend
78
+ - write out testing of endpoint
79
+ - deploy
mds/answers.md CHANGED
@@ -1,13 +1,55 @@
1
  1. **Technical Choices:** What influenced your decision on the specific tools and models used?
2
3
 
4
  2. **Entity Contextualization:** How did you approach the problem of providing context for each identified entity?
5
6
 
7
  3. **Error Handling:** Can you describe how your API handles potential errors?
8
9
 
10
  4. **Challenges and Learnings:** What were the top challenges faced, and what did you learn from them?
11
12
 
13
- 5. **Improvement Propositions:** Given more time, what improvements or additional features would you consider adding?
 
1
  1. **Technical Choices:** What influenced your decision on the specific tools and models used?
2
 
3
+ I chose the tools I did based on familiarity, efficiency and simplicity.
4
+
5
+ In order, I chose:
6
+ Development:
7
+
8
+ - uv for package/project management, for its speed and convenience, especially when installing packages such as torch
9
+ - pytest for basic testing of the API
10
+
11
+ Project:
12
+
13
+ - pdfplumber for easy extraction of text (including text from tables) with a very Pythonic syntax
14
+ - GLiNER models over traditional NER-BERT models for their flexibility in defining custom entity lists, which let me map out these categories: <https://biolink.github.io/biolink-model/categories.json>. GLiNER over LLMs because I was concerned about handling LLMs' structured output, their stochasticity, and their potential inability to locate entity start/end positions without hallucinating chunks of text. Though not relevant to this assignment, GLiNER also comes at a fraction of the cost of LLMs, albeit with less flexibility. I chose this particular GLiNER model because it ranked at the top of the clinical NER leaderboard (<https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard>)
15
+
16
+ Hosting:
17
+
18
+ - Hugging Face: again for familiarity, having hosted apps there for free in the past, and for cheap GPU availability.
19
+ - Docker and FastAPI: familiarity with Docker for local development and hosting apps. FastAPI for its wide adoption, the convenience of enforced typing (for entities), easy testing, and overall good integration with Docker/Hugging Face
20
 
21
  2. **Entity Contextualization:** How did you approach the problem of providing context for each identified entity?
22
 
23
+ I break the PDF text into chunks to fit the GLiNER model's context window (786 tokens); I use 700 as the chunk size to accommodate special tokens. I collect the GLiNER output for every chunk, then do the contextualization by looking 50 characters back and 50 characters forward in the full text, after joining all the chunks back together.
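A minimal sketch of that flow (character-based chunking and hypothetical helper names for illustration; the real pipeline uses a 700-unit chunk size against GLiNER's 786 limit):

```python
def chunk_text(text: str, chunk_size: int = 700) -> list[str]:
    # Fixed-size chunks so each one fits the model's context window
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def contextualize(text: str, start: int, end: int, window: int = 50) -> str:
    # Slice the FULL text, not the chunk, so context is never cut off
    # at a chunk boundary
    return text[max(0, start - window):min(len(text), end + window)]

text = "x" * 690 + " anaemia " + "y" * 100
start = text.find("anaemia")  # entity sitting near a chunk boundary
context = contextualize(text, start, start + len("anaemia"))
```

Because the context window is taken from the joined text, an entity detected at the edge of one chunk still gets characters from the neighbouring chunk in its context.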
24
 
25
  3. **Error Handling:** Can you describe how your API handles potential errors?
26
 
27
+ Basic error handling is implemented: if a file is not a PDF, a 415 error is returned along with the corresponding description. When no file is uploaded we get a 422 error; unfortunately I couldn't make the FastAPI implementation match the spec by returning a 400 error in that case.
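The check behind the 415 response boils down to a case-insensitive extension test (a sketch; the optional magic-byte check is extra hardening not present in the original):

```python
def looks_like_pdf(filename: str, head: bytes = b"") -> bool:
    # Extension check mirrors the API's 415 guard;
    # checking the %PDF magic bytes is optional hardening
    if not filename.lower().endswith(".pdf"):
        return False
    return head.startswith(b"%PDF") if head else True

ok = looks_like_pdf("MECFS systematic review.PDF", b"%PDF-1.5")
bad = looks_like_pdf("notes.txt")
```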
28
 
29
  4. **Challenges and Learnings:** What were the top challenges faced, and what did you learn from them?
30
 
31
+ I think one of the main challenges for me was not going down rabbit holes I could have lost a lot of time in, such as model selection (BERT or GLiNER), whether to decouple application and model serving, or whether to use LLMs with structured outputs. Then there was the choice of hosting with or without CUDA, and the technical challenges that came with not being able to test Docker images with CUDA locally. The scope of the challenge felt quite broad: trying to squeeze plenty of optimizations around infrastructure, model choice, and performance into three and a half to four hours.
32
+
33
+ The main challenge and learning was to stay focused on delivering an implementation that works as specified: getting a proof of concept out the door rather than chasing a perfect solution in under 4 hours. I learned that familiarity with tools, infrastructure and deployment strategies can really speed things up (Docker GPU instances, for example), and that it ultimately helps you avoid futile optimizations and keep focus on the project requirements.
34
+
35
+ 5. **Improvement Propositions:** Given more time, what improvements or additional features would you consider adding?
36
+
37
+ The main improvements would revolve around performance and scalability:
38
+
39
+ To illustrate: when I was developing the project on an M2 MacBook Pro, extraction was quite fast, under 30 seconds.
40
+
41
+ Whereas hitting the remote endpoint with the MECFS systematic review file (14 pages) took around 3 minutes 27 seconds:
42
+
43
+ ```bash
44
+ time curl -X POST -F "file=@pdfs/MECFS systematic review.pdf" https://lucharo-everycure-ner-pdf.hf.space/api/v1/extract
45
+ ```
46
+
47
+ I failed to deploy a working Docker image with CUDA, hence couldn't reap the benefits of GPU acceleration in the HF Space. For the purposes of a demo it's fine, but massively optimizing the pipeline's performance would be my top priority if I worked on this for more hours.
48
+
49
+ I would have liked to spend more time investigating the deployed model's actual output quality and comparing it against LLM structured outputs.
50
+
51
+ I would also put thought into decoupling app and model hosting, with a dedicated GPU instance for the model, or even delegating to an LLM provider if that approach proved reliable.
52
+
53
+ Other important but smaller things I didn't get to implement: concurrency and load testing of the endpoint, and caching of files that have already been processed.
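The content-based caching mentioned above could be sketched like this (a hypothetical in-memory version; a real one would persist entries and live in the API layer):

```python
import hashlib

_cache: dict[str, list] = {}

def cache_key(file_bytes: bytes) -> str:
    # Hash the content rather than the filename, so a renamed
    # duplicate upload still hits the cache
    return hashlib.sha256(file_bytes).hexdigest()

def extract_with_cache(file_bytes: bytes, extract) -> list:
    key = cache_key(file_bytes)
    if key not in _cache:
        _cache[key] = extract(file_bytes)
    return _cache[key]

calls = []
def fake_extract(data: bytes) -> list:
    calls.append(data)  # track how often real extraction runs
    return ["entity"]

first = extract_with_cache(b"%PDF-1.5 same bytes", fake_extract)
second = extract_with_cache(b"%PDF-1.5 same bytes", fake_extract)
```

The second call returns the cached result without re-running extraction, which is the whole point for an endpoint where processing takes minutes.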
54
 
55
+ And if I really had a lot of time, I would put effort into incorporating table and image processing from PDFs, alongside better dev tooling.
mds/learning.md CHANGED
@@ -46,26 +46,27 @@ torch only works with python 3.12
46
  UV_PYTHON=3.12 uv init
47
  uv add transformers torch pdfplumber marimo gliner
48
 
49
- ## separate model and app -> probs cleaner but don't have the time
 
50
 to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
51
 
52
  ## what's the context size of these bert models? do i need to chunk the output
53
 
54
- ## test the fast api
55
 
56
  it's got a nice test module
57
 
58
  ## looks good
59
 
60
- https://huggingface.co/blaze999/Medical-NER
61
 
62
  <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
63
 
64
- nice: https://huggingface.co/urchade/gliner_base
65
 
66
- ## what's the max length that gliner accepts?
67
 
68
- havent been able to find it
69
 
70
  ## Parts to the problem
71
 
 
46
  UV_PYTHON=3.12 uv init
47
  uv add transformers torch pdfplumber marimo gliner
48
 
49
+ ##  separate model and app -> probs cleaner but don't have the time
50
+
51
 to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
52
 
53
  ## what's the context size of these bert models? do i need to chunk the output
54
 
55
+ ## test the fast api
56
 
57
  it's got a nice test module
58
 
59
  ## looks good
60
 
61
+ <https://huggingface.co/blaze999/Medical-NER>
62
 
63
  <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
64
 
65
+ nice: <https://huggingface.co/urchade/gliner_base>
66
 
67
+ ## what's the max length that gliner accepts?
68
 
69
+ haven't been able to find it -> 786, found through warnings
70
 
71
  ## Parts to the problem
72
 
openapi.yaml CHANGED
@@ -33,8 +33,8 @@ paths:
33
  type: array
34
  items:
35
  $ref: '#/components/schemas/Entity'
36
- '400':
37
- description: Bad request, file not included or empty filename.
38
  '415':
39
  description: Unsupported file type.
40
  '500':
@@ -63,4 +63,3 @@ components:
63
  format: int32
64
  example: 34
65
  description: The end position of the entity in the context with respect to the original text.
66
-
 
33
  type: array
34
  items:
35
  $ref: '#/components/schemas/Entity'
36
+ '422':
37
+ description: Validation error, file not included or empty filename.
38
  '415':
39
  description: Unsupported file type.
40
  '500':
 
63
  format: int32
64
  example: 34
65
  description: The end position of the entity in the context with respect to the original text.
 
src/everycure/app.py CHANGED
@@ -31,11 +31,11 @@ async def extract_entities(file: UploadFile):
31
 
32
  if not file:
33
  logger.error("No file provided")
34
- raise HTTPException(status_code=400, detail="No file provided")
35
 
36
  if not file.filename.lower().endswith('.pdf'):
37
  logger.error(f"Invalid file type: {file.filename}")
38
- raise HTTPException(status_code=415, detail="File must be a PDF")
39
 
40
  try:
41
  logger.info("Starting entity extraction")
 
31
 
32
  if not file:
33
  logger.error("No file provided")
34
+ raise HTTPException(status_code=400, detail="Bad request, file not included or empty filename")
35
 
36
  if not file.filename.lower().endswith('.pdf'):
37
  logger.error(f"Invalid file type: {file.filename}")
38
+ raise HTTPException(status_code=415, detail="Unsupported file type")
39
 
40
  try:
41
  logger.info("Starting entity extraction")
src/everycure/extractor.py CHANGED
@@ -119,25 +119,35 @@ def extract_entities_from_pdf(file: UploadFile) -> List[Entity]:
119
  if len(ent["text"]) <= 2: # Skip very short entities
120
  continue
121
 
122
- # Find the context (text surrounding the entity)
123
  start_idx = chunk.find(ent["text"])
124
  if start_idx != -1:
125
- # Get surrounding context (50 chars before and after)
126
- context_start = max(0, start_idx - 50)
127
- context_end = min(len(chunk), start_idx + len(ent["text"]) + 50)
128
- context = chunk[context_start:context_end]
129
-
130
  all_entities.append(Entity(
131
  entity=ent["text"],
132
- context=context,
133
- start=base_offset + start_idx, # Use absolute position in original text
134
  end=base_offset + start_idx + len(ent["text"])
135
  ))
136
 
137
  base_offset += len(chunk) + 1 # +1 for the space between chunks
138
 
139
- logger.info(f"Returning {len(all_entities)} processed entities")
140
- return all_entities
141
 
142
  except Exception as e:
143
  logger.error(f"Error during extraction: {str(e)}", exc_info=True)
 
119
  if len(ent["text"]) <= 2: # Skip very short entities
120
  continue
121
 
122
+ # Just store the entity and its position for now
123
  start_idx = chunk.find(ent["text"])
124
  if start_idx != -1:
125
  all_entities.append(Entity(
126
  entity=ent["text"],
127
+ context="", # Will be filled later
128
+ start=base_offset + start_idx,
129
  end=base_offset + start_idx + len(ent["text"])
130
  ))
131
 
132
  base_offset += len(chunk) + 1 # +1 for the space between chunks
133
 
134
+ # Now get context for all entities using the complete original text
135
+ final_entities = []
136
+ for ent in all_entities:
137
+ # Get surrounding context from the complete text
138
+ context_start = max(0, ent.start - 50)
139
+ context_end = min(len(pdf_text), ent.end + 50)
140
+ context = pdf_text[context_start:context_end]
141
+
142
+ final_entities.append(Entity(
143
+ entity=ent.entity,
144
+ context=context,
145
+ start=ent.start,
146
+ end=ent.end
147
+ ))
148
+
149
+ logger.info(f"Returning {len(final_entities)} processed entities")
150
+ return final_entities
151
 
152
  except Exception as e:
153
  logger.error(f"Error during extraction: {str(e)}", exc_info=True)
tests/test_api.py CHANGED
@@ -31,7 +31,7 @@ def test_extract_entities_invalid_file():
31
  )
32
 
33
  assert response.status_code == 415
34
- assert "Unsupported file type." in response.json()["detail"]
35
 
36
  def test_extract_entities_empty_file(test_pdf):
37
  with open(test_pdf, "rb") as f:
@@ -40,4 +40,24 @@ def test_extract_entities_empty_file(test_pdf):
40
  files={} # No file provided
41
  )
42
 
43
- assert response.status_code == 400 # FastAPI's validation error
31
  )
32
 
33
  assert response.status_code == 415
34
+ assert "Unsupported file type" in response.json()["detail"]
35
 
36
  def test_extract_entities_empty_file(test_pdf):
37
  with open(test_pdf, "rb") as f:
 
40
  files={} # No file provided
41
  )
42
 
43
+ assert response.status_code == 400 # Bad request error
44
+ assert "Bad request, file not included or empty filename" in response.json()["detail"]
45
+
46
+ def test_extract_entities_server_error(monkeypatch):
47
+ # Mock extract_entities_from_pdf to raise an exception
48
+ def mock_extract(*args, **kwargs):
49
+ raise Exception("Internal server error")
50
+
51
+ monkeypatch.setattr("everycure.app.extract_entities_from_pdf", mock_extract)
52
+
53
+ # Create a valid PDF file but force a server error
54
+ with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
55
+ tmp.write(b"%PDF-1.5\nTest PDF content")
56
+ tmp.seek(0)
57
+ response = client.post(
58
+ "/api/v1/extract",
59
+ files={"file": ("test.pdf", tmp, "application/pdf")}
60
+ )
61
+
62
+ assert response.status_code == 500
63
+ assert "Internal server error" in response.json()["detail"]