Commit ae1084a
Parent(s): d67f72a
Update README.md

README.md CHANGED
````diff
@@ -10,8 +10,8 @@ MistralLite is a fine-tuned [Mistral-7B-v0.1](https://huggingface.co/mistralai/M
 MistralLite evolves from [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1), and their similarities and differences are summarized below:
 |Model|Fine-tuned on long contexts| Max context length| RotaryEmbedding adaptation| Sliding Window Size|
 |----------|-------------:|------------:|-----------:|-----------:|
-| Mistral-7B-v0.1 |
-| MistralLite |
+| Mistral-7B-v0.1 | up to 8K tokens | 32K | rope_theta = 10000 | 4096 |
+| MistralLite | up to 16K tokens | 32K | **rope_theta = 1000000** | **16384** |
 
 ## Motivation of Developing MistralLite
 
````
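As a side note (not part of the diff), the `rope_theta` and sliding-window values in the table can be checked directly from each model's configuration. A minimal sketch, assuming a recent `transformers` release and that both repos are reachable on the Hugging Face Hub:

```python
from transformers import AutoConfig

# Compare the rotary-embedding base and sliding-window size of the two models.
base = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
lite = AutoConfig.from_pretrained("amazon/MistralLite")

print("Mistral-7B-v0.1:", base.rope_theta, base.sliding_window)  # expected per the table: 10000, 4096
print("MistralLite:    ", lite.rope_theta, lite.sliding_window)  # expected per the table: 1000000, 16384
```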
````diff
@@ -160,7 +160,6 @@ hub = {
     'HF_MODEL_ID':'amazon/MistralLite',
     'HF_TASK':'text-generation',
     'SM_NUM_GPUS':'1',
-    'HF_MODEL_QUANTIZE':'true'
 }
 
 model = HuggingFaceModel(
````
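For readers following along (this is an illustration, not part of the diff): a minimal sketch of how a `hub` environment like the one above is typically passed to `HuggingFaceModel` and deployed with the `sagemaker` Python SDK. The TGI image version and instance type below are assumptions rather than values taken from this README; the endpoint name simply matches the boto3 example later in the diff.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

hub = {
    'HF_MODEL_ID': 'amazon/MistralLite',
    'HF_TASK': 'text-generation',
    'SM_NUM_GPUS': '1',
}

# Text Generation Inference (TGI) container; the version here is an assumption.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(env=hub, role=role, image_uri=image_uri)

# Instance type is an assumption; pick a GPU instance with enough memory for a 7B model.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",
    endpoint_name="MistralLite-2023-10-16-09-45-58",
)
```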
````diff
@@ -184,10 +183,16 @@ input_data = {
     "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
     "parameters": {
         "do_sample": False,
-        "max_new_tokens":
+        "max_new_tokens": 400,
+        "return_full_text": False,
+        "typical_p": 0.2,
+        "temperature": None,
+        "truncate": None,
+        "seed": 1,
     }
 }
-predictor.predict(input_data)
+result = predictor.predict(input_data)[0]["generated_text"]
+print(result)
 ```
 or via [boto3](https://pypi.org/project/boto3/), and the example code is shown below:
 
````
````diff
@@ -207,15 +212,17 @@ def call_endpoint(client, prompt, endpoint_name, paramters):
 
 client = boto3.client("sagemaker-runtime")
 parameters = {
-
-
-
-
-
-
-
+    "do_sample": False,
+    "max_new_tokens": 400,
+    "return_full_text": False,
+    "typical_p": 0.2,
+    "temperature": None,
+    "truncate": None,
+    "seed": 1,
+}
+endpoint_name = "MistralLite-2023-10-16-09-45-58"
 prompt = "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>"
-result = call_endpoint(client, prompt, endpoint_name,
+result = call_endpoint(client, prompt, endpoint_name, parameters)
 print(result)
 ```
 
````
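The diff does not show the body of `call_endpoint`, so the following is a hedged sketch of what such a helper usually looks like with `boto3`'s `sagemaker-runtime` client. The payload shape mirrors the `parameters` dict above; the actual implementation in the README may differ.

```python
import json
import boto3

def call_endpoint(client, prompt, endpoint_name, parameters):
    # Package the prompt and generation parameters the way the TGI container expects.
    payload = {"inputs": prompt, "parameters": parameters}
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
    )
    # The container returns a JSON list with one item containing "generated_text".
    output = json.loads(response["Body"].read().decode("utf-8"))
    return output[0]["generated_text"]
```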
````diff
@@ -227,11 +234,12 @@ Use TGI version 1.1.0 or later. The official Docker container is: `ghcr.io/huggi
 Example Docker parameters:
 
 ```shell
-docker run -d --gpus all --shm-size 1g -p 443:80 ghcr.io/huggingface/text-generation-inference:1.1.0 \
+docker run -d --gpus all --shm-size 1g -p 443:80 -v $(pwd)/models:/data ghcr.io/huggingface/text-generation-inference:1.1.0 \
     --model-id amazon/MistralLite \
     --max-input-length 8192 \
     --max-total-tokens 16384 \
-    --max-batch-prefill-tokens 16384
+    --max-batch-prefill-tokens 16384 \
+    --trust-remote-code
 ```
 
 ### Perform Inference ###
````
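As an aside (not part of the diff): once the container above is running, the endpoint can be smoke-tested over plain HTTP before switching to the Python client used in the next hunk. A minimal sketch, assuming the `requests` package and the `-p 443:80` port mapping above, so the server listens on host port 443:

```python
import requests

# TGI exposes a /generate endpoint that accepts the same "inputs"/"parameters"
# payload shape used in the SageMaker examples above.
url = "http://localhost:443/generate"
payload = {
    "inputs": "<|prompter|>What are the main challenges to support a long context for LLM?</s><|assistant|>",
    "parameters": {"do_sample": False, "max_new_tokens": 400},
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
```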
````diff
@@ -249,9 +257,9 @@ SERVER_HOST = "localhost"
 SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
 tgi_client = Client(f"http://{SERVER_URL}", timeout=60)
 
-def invoke_falconlite(prompt,
+def invoke_tgi(prompt,
                random_seed=1,
-               max_new_tokens=
+               max_new_tokens=400,
                print_stream=True,
                assist_role=True):
     if (assist_role):
````
````diff
@@ -261,10 +269,10 @@ def invoke_falconlite(prompt,
         prompt,
         do_sample=False,
         max_new_tokens=max_new_tokens,
-        typical_p=0.2,
         temperature=None,
         truncate=None,
         seed=random_seed,
+        typical_p=0.2,
     ):
         if hasattr(response, "token"):
             if not response.token.special:
````
````diff
@@ -275,7 +283,7 @@ def invoke_falconlite(prompt,
     return output
 
 prompt = "What are the main challenges to support a long context for LLM?"
-result = invoke_falconlite(prompt)
+result = invoke_tgi(prompt)
 ```
 
 **Important** - When using MistralLite for inference for the first time, it may require a brief 'warm-up' period that can take 10s of seconds. However, subsequent inferences should be faster and return results in a more timely manner. This warm-up period is normal and should not affect the overall performance of the system once the initialisation period has been completed.
````
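Because the diff only shows fragments of the streaming helper, here is a hedged reconstruction of a complete `invoke_tgi` built on the `text_generation` client. The argument order and token handling follow the fragments above; `SERVER_PORT = 443` and the prompt-template wrapping are assumptions based on the Docker port mapping and the `<|prompter|>`/`<|assistant|>` format used elsewhere in the README.

```python
from text_generation import Client

SERVER_PORT = 443          # assumption: matches the -p 443:80 mapping above
SERVER_HOST = "localhost"
SERVER_URL = f"{SERVER_HOST}:{SERVER_PORT}"
tgi_client = Client(f"http://{SERVER_URL}", timeout=60)

def invoke_tgi(prompt,
               random_seed=1,
               max_new_tokens=400,
               print_stream=True,
               assist_role=True):
    if (assist_role):
        # Wrap the raw question in the prompt template MistralLite was fine-tuned with.
        prompt = f"<|prompter|>{prompt}</s><|assistant|>"
    output = ""
    for response in tgi_client.generate_stream(
        prompt,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        temperature=None,
        truncate=None,
        seed=random_seed,
        typical_p=0.2,
    ):
        if hasattr(response, "token"):
            if not response.token.special:
                # Accumulate (and optionally echo) each generated token as it streams in.
                output += response.token.text
                if print_stream:
                    print(response.token.text, end="", flush=True)
    return output

prompt = "What are the main challenges to support a long context for LLM?"
result = invoke_tgi(prompt)
```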