Who tried it? Does it work?
#2, opened by bdmitriy
Hi there. Who has used this model? How did you do it? Only local deployment, or can I use it on a website?
Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (org):
Hello, thank you for your interest in the model.
You can run KazLLM with vLLM as follows.
Cell 1:
```
# Set up the environment
!conda create -n vllm_test python=3.10 -y
!pip install vllm==0.6.3
!pip install ipykernel
!python -m ipykernel install --user --name vllm_test
```
Cell 2:
```python
# Load the model
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
          gpu_memory_utilization=0.95, max_model_len=32000)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
```
Cell 3:
```python
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Alternatively, you can run it with llama.cpp, since vLLM is not yet fully optimized for GGUF.
You will need about 40 GB of VRAM, or the same amount of plain RAM if you run on CPU.
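If you go the llama.cpp route, a minimal sketch using the llama-cpp-python bindings could look like the following (this is my own assumption, not an official recipe; the `n_ctx` and `n_gpu_layers` values are placeholders you should tune to your hardware):

```python
# Minimal llama.cpp sketch via llama-cpp-python (pip install llama-cpp-python).
# Parameter values below are illustrative placeholders, not official settings.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
    n_ctx=8192,       # context window; raise it if you have the memory
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Write an essay about the importance of higher education."},
    ],
    temperature=0.8,
    top_p=0.95,
)
print(response["choices"][0]["message"]["content"])
```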
Is it not accessible in the cloud? If not, is that planned?
> Hi there. Who has used this model? How did you do it? Only local deployment, or can I use it on a website?
I tried it and it works fine. The answers are good as well, much better than the 8B and 8B-GGUF4 versions.
I tried it on both CPU and GPU, both in the cloud, via llama.cpp.
CPU is slow; GPU is super fast and shows good performance.
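On the website/cloud question: one common pattern (sketched here under the assumption that you use llama.cpp's bundled llama-server binary; the host, port, flags, and client model name are placeholders) is to expose the GGUF as an OpenAI-compatible HTTP endpoint and call it from your site's backend:

```python
# Sketch: query a llama.cpp server that exposes an OpenAI-compatible API.
# Assumed server launch (shell); flags and port are placeholders:
#   llama-server -m Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf \
#       -c 8192 -ngl 99 --host 0.0.0.0 --port 8080
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local-no-key")

completion = client.chat.completions.create(
    model="kazllm",  # placeholder; a single-model llama-server does not route by this name
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.8,
)
print(completion.choices[0].message.content)
```

Anything that can speak HTTP (a Flask backend, a Node service, etc.) can then call that endpoint, so the model does not have to run on the website itself.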