---
title: Test
emoji: 🔥
colorFrom: red
colorTo: yellow
sdk: gradio
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference


This is a test ...

LAST REVELATION: IT WORKS, but on Hugging Face it's PAINSTAKINGLY SLOW. Probably the reason why it's not done yet. It's about 0.5 tok/s on the smallest quant for 7B Mistral.
Idea: fix it with the Intel-specific https://github.com/intel/intel-extension-for-transformers and check whether it changes anything. If the integration is not trivial, first check whether the container exposes a way to determine the CPU type (or just make it trivial). A rough detection sketch follows.
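A minimal sketch of that CPU check, assuming a Linux container with `/proc/cpuinfo` readable (which is the case on Hugging Face Spaces); the listed flags are simply the features the Intel-optimized kernels tend to care about:

```python
# Sketch: report which SIMD/AMX features the Space's CPU exposes, to judge
# whether intel-extension-for-transformers could help at all.
import platform
from pathlib import Path

def cpu_flags() -> set[str]:
    """Return the feature flags listed in /proc/cpuinfo (Linux only)."""
    if platform.system() != "Linux":
        return set()
    for line in Path("/proc/cpuinfo").read_text().splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()

if __name__ == "__main__":
    flags = cpu_flags()
    for feature in ("avx2", "avx512f", "avx512_vnni", "amx_bf16"):
        print(f"{feature}: {'present' if feature in flags else 'missing'}")
```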


TASKS:
- rewrite generation from scratch or reuse the one from the Mistral Space if possible; alternatively use https://github.com/abetlen/llama-cpp-python#chat-completion or https://huggingface.co/spaces/deepseek-ai/deepseek-coder-7b-instruct/blob/main/app.py (see the sketch after this list)
- state IN LARGE LETTERS that this is not the original model but a quantized one that can run on free CPU inference
- test multimodal with llama?
- proper token handling: make it a real chat (if not handled automatically by the chat-completion interface ...)
- check how much parallel generation is possible, or whether there is only one queue, and configure accordingly
- move the model download URL into an env var, with proper error handling (also covered in the sketch below)
- chore: clean up ignore files, etc.
- update all deps to current versions, then PIN them!
- write a short note on how to clone and run custom 7B models in separate Spaces
- make PRs for popular repos to include this in their READMEs, etc.
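
A minimal sketch of what the chat-completion rewrite plus the env-var model handling could look like, assuming llama-cpp-python and a GGUF quant; `MODEL_URL` and `MODEL_PATH` are hypothetical names used here only for illustration:

```python
# Sketch only: chat completion via llama-cpp-python, with the model location
# taken from env vars (MODEL_URL / MODEL_PATH are assumed names, not fixed yet).
import os
import sys
import urllib.request

from llama_cpp import Llama

MODEL_URL = os.environ.get("MODEL_URL")                  # direct link to a GGUF quant
MODEL_PATH = os.environ.get("MODEL_PATH", "model.gguf")  # local path inside the Space

if not os.path.exists(MODEL_PATH):
    if not MODEL_URL:
        sys.exit("Set MODEL_URL to a GGUF download link or provide MODEL_PATH.")
    try:
        urllib.request.urlretrieve(MODEL_URL, MODEL_PATH)
    except Exception as exc:
        sys.exit(f"Model download failed: {exc}")

llm = Llama(model_path=MODEL_PATH, n_ctx=2048)

def chat(history: list[dict], user_msg: str) -> str:
    """Append the user turn, run chat completion, and return the assistant reply."""
    history.append({"role": "user", "content": user_msg})
    out = llm.create_chat_completion(messages=history, max_tokens=256)
    reply = out["choices"][0]["message"]["content"]
    history.append({"role": "assistant", "content": reply})
    return reply

if __name__ == "__main__":
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    print(chat(messages, "Say hello in one sentence."))
```

Wiring this into the Gradio UI (and streaming the reply) would come on top; the point here is just the chat-completion call and the env-var handling.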