Who tried it? Does it work?
#2, opened by bdmitriy
Hi there. Who has used this model? How did you do it? Only local deployment, or can I use it on a website?
Institute of Smart Systems and Artificial Intelligence, Nazarbayev University (org):
Hello, thank you for your interest in the model.
You can run KazLLM with vLLM as follows.
Cell 1:
```
# Set up the environment
!conda create -n vllm_test python=3.10 -y
!pip install vllm==0.6.3
!pip install ipykernel
!python -m ipykernel install --user --name vllm_test
```
Cell 2:
```python
# Load the model
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

from vllm import LLM, SamplingParams

# In this script, we demonstrate how to pass input to the chat method:
conversation = [
    {
        "role": "system",
        "content": "You are a helpful assistant"
    },
    {
        "role": "user",
        "content": "Hello"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I assist you today?"
    },
    {
        "role": "user",
        "content": "Write an essay about the importance of higher education.",
    },
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
          gpu_memory_utilization=0.95, max_model_len=32000)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.chat(conversation, sampling_params)
```
Cell 3:
```python
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
Alternatively, you can run it with llama.cpp, since vLLM is not yet fully optimized for GGUF.
You will need about 40 GB of VRAM, or the same amount of plain RAM if you run on CPU.
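If you go the llama.cpp route, a minimal sketch using the llama-cpp-python bindings could look like the following (this is my own assumption, not an official recipe; the `n_ctx` and `n_gpu_layers` values are placeholders you should tune to your hardware):

```python
# Minimal llama.cpp sketch via llama-cpp-python (pip install llama-cpp-python).
# Parameter values below are illustrative placeholders, not official settings.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf",
    n_ctx=8192,       # context window; raise it if you have the memory
    n_gpu_layers=-1,  # offload all layers to the GPU; use 0 for CPU-only
)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Write an essay about the importance of higher education."},
    ],
    temperature=0.8,
    top_p=0.95,
)
print(response["choices"][0]["message"]["content"])
```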
Is it not accessible in the cloud? If not, is that planned?
> Hi there. Who has used this model? How did you do it? Only local deployment, or can I use it on a website?
I tried it and it works fine. The answers are good as well, much better than the 8B and 8B-GGUF4 versions.
I tried it on both CPU and GPU, both in the cloud, via llama.cpp.
CPU is slow; GPU is super fast and shows good performance.
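On the website/cloud question: one common pattern (sketched here under the assumption that you use llama.cpp's bundled llama-server binary; the host, port, flags, and client model name are placeholders) is to expose the GGUF as an OpenAI-compatible HTTP endpoint and call it from your site's backend:

```python
# Sketch: query a llama.cpp server that exposes an OpenAI-compatible API.
# Assumed server launch (shell); flags and port are placeholders:
#   llama-server -m Nemotron_70B_instruct_corex5_mcq_cleaned_old_torchtune_cabinet_28112024_18000-Q4_K_M.gguf \
#       -c 8192 -ngl 99 --host 0.0.0.0 --port 8080
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local-no-key")

completion = client.chat.completions.create(
    model="kazllm",  # placeholder; a single-model llama-server does not route by this name
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.8,
)
print(completion.choices[0].message.content)
```

Anything that can speak HTTP (a Flask backend, a Node service, etc.) can then call that endpoint, so the model does not have to run on the website itself.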