Luis Chaves committed on
Commit
ca919d4
·
1 Parent(s): 70b960d

added answers and improved how context is fetched from the chunks

README.md CHANGED
@@ -9,8 +9,15 @@ pinned: false
9
 
10
  ## local dev
11
12
  ```
13
- uv run uvicorn app:app --reload --port 8000
14
  ```
15
 
16
  if your pdfs are in a folder called `pdfs/` run:
@@ -19,4 +26,17 @@ if your pdfs are in a folder called `pdfs/` run:
19
  curl -v -X POST -F "file=@pdfs/MECFS systematic review.pdf" http://localhost:8000/api/v1/extract
20
  ```
21
 
22
- Or use the automatic Swagger documentation at `http://localhost:8000/docs`
9
 
10
  ## local dev
11
 
12
+ install dependencies:
13
+
14
+ ```sh
15
+ uv venv
16
+ UV_PYTHON=3.12 uv pip install -r pyproject.toml
17
  ```
18
+
19
+ ```sh
20
+ uv run src/everycure/app.py
21
  ```
22
 
23
  if your pdfs are in a folder called `pdfs/` run:
 
26
  curl -v -X POST -F "file=@pdfs/MECFS systematic review.pdf" http://localhost:8000/api/v1/extract
27
  ```
28
 
29
+ to test the remote endpoint (unfortunately quite slow):
30
+
31
+ ```sh
32
+ curl -X POST -F "file=@pdfs/MECFS systematic review.pdf" https://lucharo-everycure-ner-pdf.hf.space/api/v1/extract
33
+ ```
34
+
35
+ check API docs at <https://lucharo-everycure-ner-pdf.hf.space/docs>
36
+
37
+ ### run tests
38
+
39
+ ```sh
40
+ uv pip install -e .[dev]
41
+ uv run pytest
42
+ ```
learning.md ADDED
@@ -0,0 +1,79 @@
1
+ # Every Cure Take Home
2
+
3
+ ## How to create an API endpoint that adheres to an OpenAPI spec?
4
+
5
+ ## How to host publicly and for free an API?
6
+
7
+ can use docker + hugging face
8
+
9
+ ## What type of hugging face models do entity type extraction?
10
+
11
+ NER models; some are fine-tuned on medical terminology, such as d4data/biomedical-ner-all, BioBERT or ClinicalBERT.
12
+ Could also use LLM calls, but it's hard to judge which would perform better without benchmarking (a potential improvement), and they might be more expensive than a simpler fine-tuned BERT model.
13
+
14
+ BioBERT was trained in 2020; not many docs on HF, but it's the most popular, with 700k downloads last month
15
+ ClinicalBERT: 47k downloads last month (2023)
16
+
17
+ Bio+Clinical BERT: 3M downloads (2019)
18
+
19
+ The clinical NER leaderboard is useful: <https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard>.
20
+
21
+ indeed LLMs are up there
22
+
23
+ ## What do entities mean in the context of this challenge?
24
+
25
+ In this context, entities refer to [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)
26
+ and in particular medical entities (diseases, names of molecules, proteins, medical procedures, etc.)
27
+
28
+ There are models specifically trained to do NER on text; we'll leverage those.
29
+
30
+ ## how to extract text out of a pdf?
31
+
32
+ pdfplumber works pretty well. As stated below, we'll keep images and tables out of scope; pdfplumber does extract text from tables, but without time to assess the extraction quality we don't know how reliable that is
33
+
34
+ ## how to extract meaningful context that's not just related to the text content? words around it?
35
+
36
+ attention mechanism comes to mind
37
+
38
+ ## caveats of pdfplumber
39
+
40
+ we shouldn't include appendix and references into the mix
41
+
42
+ ## torch and uv
43
+
44
+ torch doesn't work on Python 3.13 yet, so pin to 3.12
45
+
46
+ UV_PYTHON=3.12 uv init
47
+ uv add transformers torch pdfplumber marimo gliner
48
+
49
+ ##  separate model and app -> probs cleaner but don't have the time
50
+ to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
51
+
52
+ ## what's the context size of these bert models? do i need to chunk the output
53
+
54
+ ## test the fast api
55
+
56
+ it's got a nice test module
57
+
58
+ ## looks good
59
+
60
+ <https://huggingface.co/blaze999/Medical-NER>
61
+
62
+ <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
63
+
64
+ nice: <https://huggingface.co/urchade/gliner_base>
65
+
66
+ ## what's the max length that gliner accepts?
67
+
68
+ haven't been able to find it
69
+
70
+ ## Parts to the problem
71
+
72
+ - Check how good pdfplumber or PyMuPDF is at extracting text without butchering it.
73
+ - I think for now I could focus on text and list image or table parsing as an improvement.
74
+ - Identify suitable model for tasks
75
+ - write out fastapi endpoint matching openapi spec
76
+ - write out caching based on filename/content (sha)
77
+ - write out effective logging in API backend
78
+ - write out testing of endpoint
79
+ - deploy
mds/answers.md CHANGED
@@ -1,13 +1,55 @@
1
  1. **Technical Choices:** What influenced your decision on the specific tools and models used?
2
3
 
4
  2. **Entity Contextualization:** How did you approach the problem of providing context for each identified entity?
5
6
 
7
  3. **Error Handling:** Can you describe how your API handles potential errors?
8
9
 
10
  4. **Challenges and Learnings:** What were the top challenges faced, and what did you learn from them?
11
12
 
13
- 5. **Improvement Propositions:** Given more time, what improvements or additional features would you consider adding?
 
1
  1. **Technical Choices:** What influenced your decision on the specific tools and models used?
2
 
3
+ I chose the tools I did based on familiarity, efficiency and simplicity.
4
+
5
+ In order, I chose:
6
+ Development:
7
+
8
+ - uv for package/project management, for its speed and convenience, especially when installing packages such as torch
9
+ - pytest for basic testing of the API
10
+
11
+ Project:
12
+
13
+ - pdfplumber for easy extraction of text (including text from tables) with a very Pythonic syntax
14
+ - GLiNER models over traditional NER-BERT models for their flexibility in defining custom entity lists, which let me map out these categories: <https://biolink.github.io/biolink-model/categories.json>. GLiNER over LLMs because I was concerned about handling LLMs' structured output, their stochasticity, and their potential inability to locate entity start/end positions without hallucinating chunks of text. Though not relevant to this assignment, GLiNER also comes at a fraction of the cost of LLMs, albeit with less flexibility. I chose this particular GLiNER model because it ranked at the top of the clinical NER leaderboard (<https://huggingface.co/spaces/m42-health/clinical_ner_leaderboard>)
15
+
16
+ Hosting:
17
+
18
+ - Hugging Face: again for familiarity, having hosted apps there for free in the past, and for cheap GPU availability.
19
+ - Docker and FastAPI: familiarity with Docker for local development and hosting apps. FastAPI for its wide adoption, the convenience of enforced typing (for entities), easy testing, and overall good integration with Docker/Hugging Face
20
 
21
  2. **Entity Contextualization:** How did you approach the problem of providing context for each identified entity?
22
 
23
+ I break the PDF text into chunks to fit the GLiNER model's context window (786 tokens); I use 700 as the chunk size to accommodate special tokens. I collect the GLiNER output for every chunk, then do the contextualization by looking 50 characters back and 50 characters forward in the full text, after joining all the chunks back together.
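A minimal sketch of that flow (character-based chunking and hypothetical helper names for illustration; the real pipeline uses a 700-unit chunk size against GLiNER's 786 limit):

```python
def chunk_text(text: str, chunk_size: int = 700) -> list[str]:
    # Fixed-size chunks so each one fits the model's context window
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def contextualize(text: str, start: int, end: int, window: int = 50) -> str:
    # Slice the FULL text, not the chunk, so context is never cut off
    # at a chunk boundary
    return text[max(0, start - window):min(len(text), end + window)]

text = "x" * 690 + " anaemia " + "y" * 100
start = text.find("anaemia")  # entity sitting near a chunk boundary
context = contextualize(text, start, start + len("anaemia"))
```

Because the context window is taken from the joined text, an entity detected at the edge of one chunk still gets characters from the neighbouring chunk in its context.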
24
 
25
  3. **Error Handling:** Can you describe how your API handles potential errors?
26
 
27
+ Basic error handling is implemented: if a file is not a PDF, a 415 error is returned along with the corresponding description. When no file is uploaded we get a 422 error; unfortunately I couldn't make the FastAPI implementation match the spec by returning a 400 error in that case.
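The check behind the 415 response boils down to a case-insensitive extension test (a sketch; the optional magic-byte check is extra hardening not present in the original):

```python
def looks_like_pdf(filename: str, head: bytes = b"") -> bool:
    # Extension check mirrors the API's 415 guard;
    # checking the %PDF magic bytes is optional hardening
    if not filename.lower().endswith(".pdf"):
        return False
    return head.startswith(b"%PDF") if head else True

ok = looks_like_pdf("MECFS systematic review.PDF", b"%PDF-1.5")
bad = looks_like_pdf("notes.txt")
```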
28
 
29
  4. **Challenges and Learnings:** What were the top challenges faced, and what did you learn from them?
30
 
31
+ I think one of the main challenges for me was not going down rabbit holes I could have lost a lot of time in, such as model selection (BERT or GLiNER), whether to decouple application and model serving, or whether to use LLMs with structured outputs. Then there was the choice of hosting with or without CUDA, and the technical challenges that came with not being able to test Docker images with CUDA locally. The scope of the challenge felt quite broad: trying to squeeze plenty of optimizations around infrastructure, model choice, and performance into three and a half to four hours.
32
+
33
+ The main challenge and learning was to stay focused on delivering an implementation that works as specified: getting a proof of concept out the door rather than chasing a perfect solution in under 4 hours. I learned that familiarity with tools, infrastructure and deployment strategies can really speed things up (Docker GPU instances, for example), and that it ultimately helps you avoid futile optimizations and keep focus on the project requirements.
34
+
35
+ 5. **Improvement Propositions:** Given more time, what improvements or additional features would you consider adding?
36
+
37
+ The main improvements would revolve around performance and scalability:
38
+
39
+ To illustrate: when I was developing the project on an M2 MacBook Pro, extraction was quite fast, under 30 seconds.
40
+
41
+ Whereas hitting the remote endpoint with the MECFS systematic review file (14 pages) took around 3 minutes 27 seconds:
42
+
43
+ ```bash
44
+ time curl -X POST -F "file=@pdfs/MECFS systematic review.pdf" https://lucharo-everycure-ner-pdf.hf.space/api/v1/extract
45
+ ```
46
+
47
+ I failed to deploy a working Docker image with CUDA, hence couldn't reap the benefits of GPU acceleration in the HF Space. For the purposes of a demo it's fine, but massively optimizing the pipeline's performance would be my top priority if I worked on this for more hours.
48
+
49
+ I would have liked to spend more time investigating the deployed model's actual output quality and comparing it against LLM structured outputs.
50
+
51
+ I would also put thought into decoupling app and model hosting, with a dedicated GPU instance for the model, or even delegating to an LLM provider if that approach proved reliable.
52
+
53
+ Other important but smaller things I didn't get to implement: concurrency and load testing of the endpoint, and caching of files that have already been processed.
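The content-based caching mentioned above could be sketched like this (a hypothetical in-memory version; a real one would persist entries and live in the API layer):

```python
import hashlib

_cache: dict[str, list] = {}

def cache_key(file_bytes: bytes) -> str:
    # Hash the content rather than the filename, so a renamed
    # duplicate upload still hits the cache
    return hashlib.sha256(file_bytes).hexdigest()

def extract_with_cache(file_bytes: bytes, extract) -> list:
    key = cache_key(file_bytes)
    if key not in _cache:
        _cache[key] = extract(file_bytes)
    return _cache[key]

calls = []
def fake_extract(data: bytes) -> list:
    calls.append(data)  # track how often real extraction runs
    return ["entity"]

first = extract_with_cache(b"%PDF-1.5 same bytes", fake_extract)
second = extract_with_cache(b"%PDF-1.5 same bytes", fake_extract)
```

The second call returns the cached result without re-running extraction, which is the whole point for an endpoint where processing takes minutes.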
54
 
55
+ And if I really had a lot of time, I would put effort into incorporating table and image processing from PDFs, alongside better dev tooling.
mds/learning.md CHANGED
@@ -46,26 +46,27 @@ torch only works with python 3.12
46
  UV_PYTHON=3.12 uv init
47
  uv add transformers torch pdfplumber marimo gliner
48
 
49
- ## separate model and app -> probs cleaner but don't have the time
 
50
 to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
51
 
52
  ## what's the context size of these bert models? do i need to chunk the output
53
 
54
- ## test the fast api
55
 
56
  it's got a nice test module
57
 
58
  ## looks good
59
 
60
- https://huggingface.co/blaze999/Medical-NER
61
 
62
  <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
63
 
64
- nice: https://huggingface.co/urchade/gliner_base
65
 
66
- ## what's the max length that gliner accepts?
67
 
68
- havent been able to find it
69
 
70
  ## Parts to the problem
71
 
 
46
  UV_PYTHON=3.12 uv init
47
  uv add transformers torch pdfplumber marimo gliner
48
 
49
+ ##  separate model and app -> probs cleaner but don't have the time
50
+
51
 to do separate model/app deployments (two APIs, etc.); for now the model in HF with a GPU should run fine
52
 
53
  ## what's the context size of these bert models? do i need to chunk the output
54
 
55
+ ## test the fast api
56
 
57
  it's got a nice test module
58
 
59
  ## looks good
60
 
61
+ <https://huggingface.co/blaze999/Medical-NER>
62
 
63
  <https://docs.astral.sh/uv/guides/integration/pytorch/#installing-pytorch>
64
 
65
+ nice: <https://huggingface.co/urchade/gliner_base>
66
 
67
+ ## what's the max length that gliner accepts?
68
 
69
+ haven't been able to find it -> 786, found through warnings
70
 
71
  ## Parts to the problem
72
 
openapi.yaml CHANGED
@@ -33,8 +33,8 @@ paths:
33
  type: array
34
  items:
35
  $ref: '#/components/schemas/Entity'
36
- '400':
37
- description: Bad request, file not included or empty filename.
38
  '415':
39
  description: Unsupported file type.
40
  '500':
@@ -63,4 +63,3 @@ components:
63
  format: int32
64
  example: 34
65
  description: The end position of the entity in the context with respect to the original text.
66
-
 
33
  type: array
34
  items:
35
  $ref: '#/components/schemas/Entity'
36
+ '422':
37
+ description: Validation error, file not included or empty filename.
38
  '415':
39
  description: Unsupported file type.
40
  '500':
 
63
  format: int32
64
  example: 34
65
  description: The end position of the entity in the context with respect to the original text.
 
src/everycure/app.py CHANGED
@@ -31,11 +31,11 @@ async def extract_entities(file: UploadFile):
31
 
32
  if not file:
33
  logger.error("No file provided")
34
- raise HTTPException(status_code=400, detail="No file provided")
35
 
36
  if not file.filename.lower().endswith('.pdf'):
37
  logger.error(f"Invalid file type: {file.filename}")
38
- raise HTTPException(status_code=415, detail="File must be a PDF")
39
 
40
  try:
41
  logger.info("Starting entity extraction")
 
31
 
32
  if not file:
33
  logger.error("No file provided")
34
+ raise HTTPException(status_code=400, detail="Bad request, file not included or empty filename")
35
 
36
  if not file.filename.lower().endswith('.pdf'):
37
  logger.error(f"Invalid file type: {file.filename}")
38
+ raise HTTPException(status_code=415, detail="Unsupported file type")
39
 
40
  try:
41
  logger.info("Starting entity extraction")
src/everycure/extractor.py CHANGED
@@ -119,25 +119,35 @@ def extract_entities_from_pdf(file: UploadFile) -> List[Entity]:
119
  if len(ent["text"]) <= 2: # Skip very short entities
120
  continue
121
 
122
- # Find the context (text surrounding the entity)
123
  start_idx = chunk.find(ent["text"])
124
  if start_idx != -1:
125
- # Get surrounding context (50 chars before and after)
126
- context_start = max(0, start_idx - 50)
127
- context_end = min(len(chunk), start_idx + len(ent["text"]) + 50)
128
- context = chunk[context_start:context_end]
129
-
130
  all_entities.append(Entity(
131
  entity=ent["text"],
132
- context=context,
133
- start=base_offset + start_idx, # Use absolute position in original text
134
  end=base_offset + start_idx + len(ent["text"])
135
  ))
136
 
137
  base_offset += len(chunk) + 1 # +1 for the space between chunks
138
 
139
- logger.info(f"Returning {len(all_entities)} processed entities")
140
- return all_entities
141
 
142
  except Exception as e:
143
  logger.error(f"Error during extraction: {str(e)}", exc_info=True)
 
119
  if len(ent["text"]) <= 2: # Skip very short entities
120
  continue
121
 
122
+ # Just store the entity and its position for now
123
  start_idx = chunk.find(ent["text"])
124
  if start_idx != -1:
125
  all_entities.append(Entity(
126
  entity=ent["text"],
127
+ context="", # Will be filled later
128
+ start=base_offset + start_idx,
129
  end=base_offset + start_idx + len(ent["text"])
130
  ))
131
 
132
  base_offset += len(chunk) + 1 # +1 for the space between chunks
133
 
134
+ # Now get context for all entities using the complete original text
135
+ final_entities = []
136
+ for ent in all_entities:
137
+ # Get surrounding context from the complete text
138
+ context_start = max(0, ent.start - 50)
139
+ context_end = min(len(pdf_text), ent.end + 50)
140
+ context = pdf_text[context_start:context_end]
141
+
142
+ final_entities.append(Entity(
143
+ entity=ent.entity,
144
+ context=context,
145
+ start=ent.start,
146
+ end=ent.end
147
+ ))
148
+
149
+ logger.info(f"Returning {len(final_entities)} processed entities")
150
+ return final_entities
151
 
152
  except Exception as e:
153
  logger.error(f"Error during extraction: {str(e)}", exc_info=True)
tests/test_api.py CHANGED
@@ -31,7 +31,7 @@ def test_extract_entities_invalid_file():
31
  )
32
 
33
  assert response.status_code == 415
34
- assert "Unsupported file type." in response.json()["detail"]
35
 
36
  def test_extract_entities_empty_file(test_pdf):
37
  with open(test_pdf, "rb") as f:
@@ -40,4 +40,24 @@ def test_extract_entities_empty_file(test_pdf):
40
  files={} # No file provided
41
  )
42
 
43
- assert response.status_code == 400 # FastAPI's validation error
31
  )
32
 
33
  assert response.status_code == 415
34
+ assert "Unsupported file type" in response.json()["detail"]
35
 
36
  def test_extract_entities_empty_file(test_pdf):
37
  with open(test_pdf, "rb") as f:
 
40
  files={} # No file provided
41
  )
42
 
43
+ assert response.status_code == 400 # Bad request error
44
+ assert "Bad request, file not included or empty filename" in response.json()["detail"]
45
+
46
+ def test_extract_entities_server_error(monkeypatch):
47
+ # Mock extract_entities_from_pdf to raise an exception
48
+ def mock_extract(*args, **kwargs):
49
+ raise Exception("Internal server error")
50
+
51
+ monkeypatch.setattr("everycure.app.extract_entities_from_pdf", mock_extract)
52
+
53
+ # Create a valid PDF file but force a server error
54
+ with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
55
+ tmp.write(b"%PDF-1.5\nTest PDF content")
56
+ tmp.seek(0)
57
+ response = client.post(
58
+ "/api/v1/extract",
59
+ files={"file": ("test.pdf", tmp, "application/pdf")}
60
+ )
61
+
62
+ assert response.status_code == 500
63
+ assert "Internal server error" in response.json()["detail"]