1. **Technical Choices:** What influenced your decision on the specific tools and models used?

I chose the tools I did based on familiarity, efficiency and simplicity.

Development:

- uv for package/project management, for its speed and convenience, especially when installing heavy packages such as torch.
- pytest for basic testing of the API.

Project:

- pdfplumber for easy extraction of text, including text in tables, with a very Pythonic syntax.
- GLiNER models over traditional BERT-based NER models, for the flexibility of defining custom entity lists so that I could map out the target categories. I chose GLiNER over LLMs because I was concerned about handling LLMs' structured output, their stochasticity, their potential inability to locate exact entity start and end offsets, and the risk of hallucinating chunks of text. Though not relevant to this assignment, GLiNER also comes at a fraction of the cost of LLMs, albeit with less flexibility in terms of model choice. I picked this particular GLiNER model because it ranked at the top of a clinical NER leaderboard.

Hosting:

- Hugging Face: again for familiarity, having hosted apps there for free in the past, and for the cheap availability of GPUs.
- Docker and FastAPI: familiarity with Docker for local development and for hosting apps. FastAPI because of its wide adoption, the convenience of enforcing typing (for the Entities), straightforward testing, and overall good integration with Docker/Hugging Face.

2. **Entity Contextualization:** How did you approach the problem of providing context for each identified entity?

I break the PDF text into chunks to fit the GLiNER model's context window (786); we use 700 as the chunk size to leave room for special tokens. We collect the GLiNER output on every chunk, join the chunks back together, and then contextualize each entity by looking back 50 characters and forward 50 characters in the full text. A minimal sketch of this approach is shown below.
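The sketch below illustrates the chunk-then-recombine logic described above, assuming the `gliner` Python package; the model name, entity labels and chunking details are placeholders, not necessarily the ones used in the project.

```python
from gliner import GLiNER

CHUNK_SIZE = 700     # stays under the model's context window
CONTEXT_CHARS = 50   # characters of context kept before/after each entity

# Placeholder model and labels; the deployed clinical model and categories differ.
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
labels = ["disease", "drug", "symptom"]

def extract_with_context(text: str) -> list[dict]:
    entities = []
    # Run the model chunk by chunk, tracking each chunk's offset so that
    # entity positions can be mapped back onto the full text.
    for offset in range(0, len(text), CHUNK_SIZE):
        chunk = text[offset:offset + CHUNK_SIZE]
        for ent in model.predict_entities(chunk, labels):
            start = ent["start"] + offset
            end = ent["end"] + offset
            entities.append({
                "text": ent["text"],
                "label": ent["label"],
                "start": start,
                "end": end,
                # Contextualization: 50 characters before and after the entity.
                "context": text[max(0, start - CONTEXT_CHARS):end + CONTEXT_CHARS],
            })
    return entities
```

One known limitation of this simple scheme is that an entity falling exactly on a chunk boundary can be split; overlapping chunks would mitigate that.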
3. **Error Handling:** Can you describe how your API handles potential errors?

Basic error handling is implemented: if the uploaded file is not a PDF, a 415 error is returned together with a corresponding description. When no file is uploaded at all, FastAPI returns a 422; unfortunately I couldn't make the FastAPI implementation match the spec, which asks for a 400 in that case.

4. **Challenges and Learnings:** What were the top challenges faced, and what did you learn from them?

One of the main challenges for me was not going down the several rabbit holes I could have lost a lot of time in, such as model selection (BERT vs. GLiNER), whether to decouple the application from model serving, and whether to use LLMs with structured outputs. Then there was the choice of hosting with or without CUDA and the technical challenges that came with it, such as not being able to test Docker images with CUDA locally. The scope of the challenge felt quite broad, trying to squeeze plenty of optimizations around infrastructure, model choice and performance into three and a half or four hours. The main challenge, and the main learning, was to stay focused on delivering an implementation that works as specified: getting a proof of concept out the door rather than chasing a perfect solution in under four hours. I also learned that familiarity with tools, infrastructure and deployment strategies (Docker GPU instances, for example) can really speed things up, and ultimately helps avoid futile optimizations and keep the focus on the project requirements.

5. **Improvement Propositions:** Given more time, what improvements or additional features would you consider adding?

The main challenges revolved around optimizing performance and scalability. To illustrate: when I was developing the project on an M2 MacBook Pro, extraction was quite fast, under 30 seconds, whereas the remote endpoint took around 3 minutes 27 seconds on the MECFS systematic review file (14 pages), measured with:

```bash
time curl -X POST -F "file=@pdfs/MECFS systematic review.pdf" https://lucharo-everycure-ner-pdf.hf.space/api/v1/extract
```

I failed to deploy a working Docker image with CUDA, so I couldn't reap the benefits of GPU acceleration in the HF Space. For the purposes of a demo that is fine, but massively optimizing the performance of the pipeline would be my top priority if I worked on this for more hours. I would also have liked to spend more time investigating the actual output quality of the deployed model and comparing it against LLM structured outputs. I would also put thought into decoupling the app from model hosting, having a dedicated GPU instance for the model, or even delegating to an LLM provider if that approach proved reliable. Other important but smaller things I didn't get to implement are concurrency and stress testing of the endpoint, and caching of files that have already been processed (see the sketch at the end of this answer). And if I really had a lot of time, I would put effort into incorporating table and image processing from PDFs, alongside better dev tooling.
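To make the caching idea concrete, here is a minimal sketch of caching already-processed files by content hash, assuming a FastAPI endpoint; `extract_entities` is a hypothetical placeholder for the actual pdfplumber + GLiNER pipeline, not the project code.

```python
import hashlib

from fastapi import FastAPI, HTTPException, UploadFile

app = FastAPI()

# In-memory cache keyed by the SHA-256 of the uploaded file's bytes.
# A real deployment would use a persistent store such as Redis or disk.
_cache: dict[str, list[dict]] = {}

def extract_entities(pdf_bytes: bytes) -> list[dict]:
    """Hypothetical placeholder for the pdfplumber + GLiNER pipeline."""
    raise NotImplementedError

@app.post("/api/v1/extract")
async def extract(file: UploadFile):
    if file.content_type != "application/pdf":
        # Reject non-PDF uploads with a 415, as the current implementation does.
        raise HTTPException(status_code=415, detail="Only PDF files are supported")

    pdf_bytes = await file.read()
    key = hashlib.sha256(pdf_bytes).hexdigest()

    # Reuse the cached result if this exact file was processed before.
    if key not in _cache:
        _cache[key] = extract_entities(pdf_bytes)
    return _cache[key]
```

With a content-addressed key, repeated uploads of the same PDF skip the expensive extraction step entirely, which would also help under concurrent load.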