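import requests


def get_public_ip():
    """Return this Space's public IP address, or an error string on failure."""
    try:
        # The timeout keeps a slow lookup from blocking the import of this module.
        response = requests.get('https://api.ipify.org', timeout=10)
        # Treat HTTP error statuses as failures instead of returning the error body.
        response.raise_for_status()
        return response.text
    except Exception as e:
        return f"Error: {str(e)}"


public_ip = get_public_ip()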
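
# A minimal sketch, not part of the original Space: the few-shot settings named in the
# ABOUT text, kept as data so labels such as "HellaSwag (10-shot)" can be generated
# rather than retyped. BENCHMARK_SHOTS, shot_label, and missing_benchmarks are
# hypothetical names introduced here for illustration only.
BENCHMARK_SHOTS = {
    "AI2 Reasoning Challenge": 25,
    "HellaSwag": 10,
    "MMLU": 5,
    "TruthfulQA": 0,  # per the ABOUT text, the Harness still prepends 6 Q/A pairs
    "Winogrande": 5,
    "GSM8k": 5,
}


def shot_label(name):
    """Format a benchmark name the way the ABOUT text does, e.g. 'MMLU (5-shot)'."""
    return f"{name} ({BENCHMARK_SHOTS[name]}-shot)"


def missing_benchmarks(results):
    """Return the listed benchmarks that have no score in `results`, sorted by name."""
    return sorted(set(BENCHMARK_SHOTS) - set(results))
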
ABOUT = f"""
# ❓ About
At the Powered-by-Intel LLM Leaderboard, we run the same benchmarks as the Open LLM Leaderboard and plan to add
domain-specific benchmarks in the future. We use the <a href="https://github.com/EleutherAI/lm-evaluation-harness" target="_blank">
EleutherAI Language Model Evaluation Harness</a>, a unified framework for testing generative language models on a large number of
different evaluation tasks.
Our current benchmarks include:
- <a href="https://arxiv.org/abs/1803.05457" target="_blank">AI2 Reasoning Challenge (25-shot)</a> - a set of grade-school science questions.
- <a href="https://arxiv.org/abs/1905.07830" target="_blank">HellaSwag (10-shot)</a> - a test of commonsense inference that is easy for humans (~95%) but challenging for state-of-the-art models.
- <a href="https://arxiv.org/abs/2009.03300" target="_blank">MMLU (5-shot)</a> - a test measuring a text model's multitask accuracy, covering 57 tasks in fields like elementary mathematics, US history, computer science, law, and more.
- <a href="https://arxiv.org/abs/2109.07958" target="_blank">TruthfulQA (0-shot)</a> - a test measuring a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- <a href="https://arxiv.org/abs/1907.10641" target="_blank">Winogrande (5-shot)</a> - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- <a href="https://arxiv.org/abs/2110.14168" target="_blank">GSM8k (5-shot)</a> - diverse grade-school math word problems measuring a model's ability to solve multi-step mathematical reasoning problems.

For all these evaluations, a higher score is better. We chose these benchmarks because they test reasoning and general knowledge across a wide range of fields in 0-shot and few-shot settings. In the future, we plan to add domain-specific benchmarks to further evaluate our models.
We run an adapted version of the benchmark code, designed specifically to run the EleutherAI Harness benchmarks on Gaudi processors.
This adapted evaluation harness is built into the Hugging Face Optimum Habana library. See the [documentation](https://github.com/huggingface/optimum-habana/tree/main/examples/text-generation).
## Support and Community
Join 5000+ developers on the [Intel DevHub Discord](https://discord.gg/yNYNxK2k) to get support with your submission
and talk about everything from GenAI and HPC to quantum computing.
## "Chat with Top Models on the Leaderboard Here 💬" Functionality
This on-leaderboard chat feature provides a quick, fun way to test the top LLMs on the leaderboard.
As the leaderboard matures and users submit models, we will rotate the models available for chat. Who knows? You might find
your model featured here soon! ⭐
### Chat Functionality Notice
- All the models in this demo run on 4th Generation Intel® Xeon® (Sapphire Rapids) processors, using AMX operations and quantized inference optimizations.
- Terms of use: By using the chat functionality, users agree to the following terms: The service is a research preview intended for non-commercial
use only. It can produce factually incorrect output and should not be relied on for factually accurate information.
The service provides only limited safety measures and may generate lewd, biased, or otherwise offensive content. It must not be
used for any illegal, harmful, violent, racist, or sexual purposes. The service may collect user dialogue data for future research.
- License: The chat functionality is a research preview intended for non-commercial use only.

space ip: {public_ip}
"""