---
title: Test
emoji: 🔥
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: apache-2.0
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

This is a test ...

TASKS:

- rewrite the generation code from scratch, or reuse the one from the Mistral Space if possible; alternatively use https://github.com/abetlen/llama-cpp-python#chat-completion (see the chat-completion sketch below)
- state IN LARGE LETTERS that this is not the original model but a quantized one that can run on free CPU inference
- check how much parallel generation is possible, or whether there can only be a single queue, and set the max context size etc. accordingly. Maybe live-log free RAM etc. to the interface in a "system health" graph if that is not too resource-hungry on its own ... -> Gradio for display?? (see the RAM readout sketch below)
- live-stream the response (see the Mistral Space!! and the streaming sketch below)
- log memory usage to the console? Maybe auto-reboot if free memory gets too low (see the watchdog sketch below)
- re-add the system prompt? Maybe check out how LM Studio sets it up - could be a dropdown, with an option to set a fixed one via env var when only one model is available ... (see the system-prompt sketch below)
- move the model download URL into an env var, with proper error handling (see the download sketch below)
- chore: clean up the ignore files, Dockerfile, etc.
- update all dependencies to a current version, then PIN them!
- write a short note on how to clone this Space and run custom 7B models in separate Spaces
- open PRs against popular repos so they can link to this in their READMEs etc.
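
Some rough sketches for the tasks above follow; all paths, names, and thresholds in them are assumptions, not this Space's actual config.

Chat-completion sketch with llama-cpp-python, as referenced in the first task. The model path, context size, and thread count are placeholders:

```python
from llama_cpp import Llama

# Placeholder values; tune n_ctx and n_threads to what a free CPU Space allows.
llm = Llama(
    model_path="model.gguf",  # assumed path to the quantized GGUF file
    n_ctx=2048,
    n_threads=2,
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```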
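
Streaming sketch: llama-cpp-python accepts stream=True and yields OpenAI-style chunks, and gr.ChatInterface can consume a generator that yields the growing partial response. The tuple-style history handling assumed here depends on the Gradio version:

```python
import gradio as gr
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", n_ctx=2048, n_threads=2)  # placeholder config

def chat_fn(message, history):
    # Rebuild an OpenAI-style message list from Gradio's (user, bot) history tuples.
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    for user_msg, bot_msg in history:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": bot_msg})
    messages.append({"role": "user", "content": message})

    partial = ""
    for chunk in llm.create_chat_completion(messages=messages, stream=True):
        delta = chunk["choices"][0]["delta"]
        partial += delta.get("content", "")
        yield partial  # Gradio renders each growing partial string live

demo = gr.ChatInterface(chat_fn)
demo.launch()
```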
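
Free-RAM readout sketch for the "system health" idea, using psutil; how it gets wired into the Gradio UI (periodic refresh vs. a one-off display) is left open, since that depends on the Gradio version:

```python
import psutil

def system_health() -> str:
    # Snapshot of free vs. total RAM, for display in the UI or for logging.
    mem = psutil.virtual_memory()
    return (
        f"RAM: {mem.available / 1024**2:.0f} MiB free of "
        f"{mem.total / 1024**2:.0f} MiB ({mem.percent:.0f}% used)"
    )

if __name__ == "__main__":
    print(system_health())
```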
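
One hedged option for the auto-reboot idea: log memory usage to the console from a background thread and hard-exit the process when free RAM drops below a threshold, relying on the container/Space to restart it. Threshold and interval are arbitrary examples:

```python
import os
import threading
import time

import psutil

LOW_MEM_BYTES = 200 * 1024**2  # assumed threshold: 200 MiB free

def watchdog(interval: float = 30.0) -> None:
    while True:
        mem = psutil.virtual_memory()
        print(f"[health] {mem.available / 1024**2:.0f} MiB free", flush=True)
        if mem.available < LOW_MEM_BYTES:
            print("[health] memory too low, exiting so the container restarts", flush=True)
            os._exit(1)  # hard exit; sys.exit() would only stop this thread
        time.sleep(interval)

threading.Thread(target=watchdog, daemon=True).start()
```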
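
System-prompt-via-env-var sketch; SYSTEM_PROMPT is an assumed variable name and the default text is a placeholder. A Gradio dropdown could then offer this value as one preset among several:

```python
import os

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant."  # placeholder default

def get_system_prompt() -> str:
    # Fall back to the default when the env var is unset or empty.
    return os.environ.get("SYSTEM_PROMPT", "").strip() or DEFAULT_SYSTEM_PROMPT
```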
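
Model-download sketch reading the URL from an env var with basic error handling; MODEL_URL and MODEL_PATH are assumed names, and the download uses urllib from the standard library rather than any project-specific helper:

```python
import os
import sys
import urllib.error
import urllib.request

def download_model() -> str:
    url = os.environ.get("MODEL_URL")
    if not url:
        sys.exit("MODEL_URL is not set; point it at a quantized GGUF file.")
    dest = os.environ.get("MODEL_PATH", "model.gguf")
    if os.path.exists(dest):
        return dest  # already present, e.g. after a warm restart
    try:
        urllib.request.urlretrieve(url, dest)
    except (urllib.error.URLError, OSError) as exc:
        sys.exit(f"Failed to download the model from {url}: {exc}")
    return dest
```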