Change example run commands to use new repo name
README.md CHANGED
@@ -171,7 +171,7 @@ docker run \
     --volume /etc/localtime:/etc/localtime:ro \
     -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
     python3 -m sglang.launch_server \
-    --model-path root-signals/
+    --model-path root-signals/RootSignals-Judge-Llama-70B \
     --host 0.0.0.0 \
     --port 8000 \
     --mem-fraction-static 0.89 \
@@ -180,7 +180,7 @@ docker run \
     --disable-cuda-graph
 ```
 
-We validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on Nvidia GH200 as well with max outputs up to
+We validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on Nvidia GH200 as well with max outputs up to 64k tokens:
 ```
 docker run \
     --gpus all \
@@ -189,10 +189,11 @@ docker run \
     -v huggingface:/root/.cache/huggingface \
     --volume /etc/localtime:/etc/localtime:ro \
     -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
-    --model root-signals/
-    --gpu-memory-utilization 0.
-    --max-model-len
+    --model root-signals/RootSignals-Judge-Llama-70B \
+    --gpu-memory-utilization 0.95 \
+    --max-model-len 64k \
     --block_size 16 \
+    --enable_prefix_caching
 ```
 
 # 4. Model Details
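Both launch commands expose an OpenAI-compatible HTTP API, so the rename can be verified against a running container. A minimal smoke-test sketch, assuming the elided parts of each `docker run` publish port 8000 on localhost:

```
# health check, then confirm the served model ID matches the new repo name
curl -s http://localhost:8000/health
curl -s http://localhost:8000/v1/models
```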
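An end-to-end request works the same way against either server; the `model` field must match the new repo name exactly, and the prompt below is only a placeholder:

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "root-signals/RootSignals-Judge-Llama-70B",
        "messages": [{"role": "user", "content": "Respond with a single word: ready"}],
        "max_tokens": 8
      }'
```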