---
license: llama3
---
|
This model is a Llama 3 model fine-tuned on the training set of PromptEvals (https://huggingface.co/datasets/reyavir/PromptEvals) to generate high-quality assertion criteria for prompt templates.
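
A minimal inference sketch is shown below. The repository id is a placeholder, and the instruction wording is an assumption for illustration only; it is not necessarily the exact prompt format used during fine-tuning.

```python
# Hedged usage sketch: the model id below is a placeholder, and the instruction
# wording is an assumption, not the documented fine-tuning prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # placeholder: replace with this repository's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = "Summarize the following support ticket in three bullet points: {ticket}"
messages = [{
    "role": "user",
    "content": f"Generate assertion criteria for this prompt template:\n{prompt_template}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```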
|
|
|
Model Card: |
|
Model Details |
|
- Person or organization developing model: Meta (base model); fine-tuned by the [authors](https://openreview.net/forum?id=uUW8jYai6K)

- Model date: The base model was released on April 18, 2024; the fine-tuned version was trained in July 2024

- Model version: 3.1

- Model type: decoder-only Transformer

- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 8 billion parameters; fine-tuned by the authors using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)

- Paper or other resource for more information: [Llama 3](https://arxiv.org/abs/2407.21783), [PromptEvals](https://openreview.net/forum?id=uUW8jYai6K)

- Citation details:
|
```bibtex
@inproceedings{
  anonymous2024promptevals,
  title={{PROMPTEVALS}: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines},
  author={Anonymous},
  booktitle={Submitted to ACL Rolling Review - August 2024},
  year={2024},
  url={https://openreview.net/forum?id=uUW8jYai6K},
  note={under review}
}
```
|
- License: Meta Llama 3 Community License

- Where to send questions or comments about the model: https://openreview.net/forum?id=uUW8jYai6K
|
Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases) |
|
This model is intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate such assertion criteria.
|
Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3. |
|
We do not collect any demographic, phenotypic, or other such data (as listed in Section 4.3) in our dataset.
|
Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)
|
| | **Base Mistral** | **Mistral (FT)** | **Base Llama** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|------------------|----------------|----------------|------------|
| **p25** | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| **p50** | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| **Mean** | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| **p75** | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |
|
|
|
*Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.* |
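
As a point of reference only, the sketch below shows one way an embedding-based semantic F1 between a generated and a reference list of criteria could be computed; the encoder choice and the matching procedure are assumptions and may differ from the procedure used in the paper.

```python
# Illustrative semantic-F1 sketch (assumptions: a generic sentence encoder and
# best-match cosine similarity; the paper's exact procedure may differ).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder choice

def semantic_f1(generated: list[str], reference: list[str]) -> float:
    gen_emb = encoder.encode(generated, convert_to_tensor=True)
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)              # [len(generated), len(reference)]
    precision = sims.max(dim=1).values.mean().item()   # each generated criterion vs. best reference
    recall = sims.max(dim=0).values.mean().item()      # each reference criterion vs. best generated
    return 2 * precision * recall / (precision + recall)
```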
|
|
|
|
|
| | **Mistral (FT)** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|----------------|-------------|
| **p25** | **1.8717** | 2.3962 | 6.5596 |
| **p50** | **2.3106** | 3.0748 | 8.2542 |
| **Mean** | **2.5915** | 3.6057 | 8.7041 |
| **p75** | **2.9839** | 4.2716 | 10.1905 |
|
|
|
*Latency for criteria generation. We compared runtimes for all three models (in seconds) and report the 25th, 50th, and 75th percentiles along with the mean. Our fine-tuned Mistral model had the lowest runtime on all metrics.*
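
The sketch below illustrates how per-prompt generation latency and the reported percentiles could be measured; the hardware, serving stack, and batching behind the numbers above are not specified in this card.

```python
# Latency measurement sketch: generate_fn is a stand-in for a call to the model
# (e.g., a wrapper around model.generate); hardware/serving details are assumptions.
import time
import numpy as np

def measure_latencies(generate_fn, prompts):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
    p25, p50, p75 = np.percentile(latencies, [25, 50, 75])
    return {"p25": p25, "p50": p50, "mean": float(np.mean(latencies)), "p75": p75}
```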
|
|
|
| | **Average** | **Median** | **75th percentile** | **90th percentile** |
|--------------------|--------------|------------|---------------------|---------------------|
| **Base Mistral** | 14.5012 | 14 | 18.5 | 23 |
| **Mistral (FT)** | **6.28640** | **5** | **8** | **10** |
| **Base Llama** | 28.2458 | 26 | 33.5 | 46 |
| **Llama (FT)** | 5.47255 | **5** | **6** | 9 |
| **GPT-4o** | 7.59189 | 6 | 10 | 14.2 |
| *Ground Truth* | *5.98568* | *5* | *7* | *10* |
|
|
|
*Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.* |
|
|
|
Evaluation Data: Evaluated on the PromptEvals test set

Training Data: Fine-tuned on the PromptEvals train set
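
A minimal loading sketch for the dataset is below; the split and column names are not documented in this card and should be checked against the dataset card.

```python
# Load PromptEvals from the Hugging Face Hub; inspect splits and fields before use,
# since the exact split/column names are not documented in this model card.
from datasets import load_dataset

promptevals = load_dataset("reyavir/PromptEvals")
print(promptevals)  # shows the available splits and their columns
```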
|
|
|
Quantitative Analyses (Unitary results, Intersectional results): |
|
| **Domain** | **Similarity** | **Precision** | **Recall** |
|----------------------------- |----------------|---------------|------------|
| General-Purpose Chatbots | 0.8140 | 0.8070 | 0.8221 |
| Question-Answering | 0.8104 | 0.8018 | 0.8199 |
| Text Summarization | 0.8601 | 0.8733 | 0.8479 |
| Database Querying | 0.8362 | 0.8509 | 0.8228 |
| Education | 0.8388 | 0.8498 | 0.8282 |
| Content Creation | 0.8417 | 0.8480 | 0.8358 |
| Workflow Automation | 0.8389 | 0.8477 | 0.8304 |
| Horse Racing Analytics | 0.8249 | 0.8259 | 0.8245 |
| Data Analysis | 0.7881 | 0.7940 | 0.7851 |
| Prompt Engineering | 0.8441 | 0.8387 | 0.8496 |
|
|
|
*Fine-Tuned Llama Score Averages per Domain (for the 10 most represented domains in our test set).*
|
|
|
Ethical Considerations: |
|
PromptEvals is open-source and is intended to be used as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. However, because it is open-source, it may be included in model pre-training data, which could contaminate the benchmark and reduce its effectiveness.
|
Additionally, PromptEvals uses prompts contributed by a variety of users, and the prompts may not represent all domains equally. |
|
Despite these limitations, we believe our benchmark still provides value and can be useful for evaluating models on generating assertion criteria.
|
Caveats and Recommendations: None |
|
|