---
license: llama3
---
|
This model is a Llama 3 model fine-tuned on the training set of PromptEvals (https://huggingface.co/datasets/reyavir/PromptEvals) to generate high-quality assertion criteria for prompt templates.
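
A minimal inference sketch is shown below. The repository id is a placeholder, and the instruction wording is an assumption for illustration only; it is not necessarily the exact prompt format used during fine-tuning.

```python
# Hedged usage sketch: the model id below is a placeholder, and the instruction
# wording is an assumption, not the documented fine-tuning prompt format.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "path/to/this-model"  # placeholder: replace with this repository's id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt_template = "Summarize the following support ticket in three bullet points: {ticket}"
messages = [{
    "role": "user",
    "content": f"Generate assertion criteria for this prompt template:\n{prompt_template}",
}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```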
|
|
|
Model Card: |
|
Model Details |
|
- Person or organization developing model: Meta (base model); fine-tuned by the [authors](https://openreview.net/forum?id=uUW8jYai6K)

- Model date: The base model was released on April 18, 2024; the fine-tuned version was trained in July 2024

- Model version: 3.1

- Model type: decoder-only Transformer

- Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: 8 billion parameters; fine-tuned by the authors using Axolotl (https://github.com/axolotl-ai-cloud/axolotl)

- Paper or other resource for more information: [Llama 3](https://arxiv.org/abs/2407.21783), [PromptEvals](https://openreview.net/forum?id=uUW8jYai6K)

- Citation details:
|
```bibtex
@inproceedings{
  anonymous2024promptevals,
  title={{PROMPTEVALS}: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines},
  author={Anonymous},
  booktitle={Submitted to ACL Rolling Review - August 2024},
  year={2024},
  url={https://openreview.net/forum?id=uUW8jYai6K},
  note={under review}
}
```
|
- License: Meta Llama 3 Community License

- Where to send questions or comments about the model: https://openreview.net/forum?id=uUW8jYai6K
|
Intended Use. Use cases that were envisioned during development. (Primary intended uses, Primary intended users, Out-of-scope use cases) |
|
This model is intended to be used by developers to generate high-quality assertion criteria for LLM outputs, or to benchmark the ability of LLMs to generate such assertion criteria.
|
Factors. Factors could include demographic or phenotypic groups, environmental conditions, technical attributes, or others listed in Section 4.3. |
|
We do not collect any demographic, phenotypic, or other such data (as listed in Section 4.3) in our dataset.
|
Metrics. Metrics should be chosen to reflect potential real-world impacts of the model. (Model performance measures, Decision thresholds, Variation approaches)
|
| | **Base Mistral** | **Mistral (FT)** | **Base Llama** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|------------------|----------------|----------------|------------|
| **p25** | 0.3608 | 0.7919 | 0.3211 | **0.7922** | 0.6296 |
| **p50** | 0.4100 | 0.8231 | 0.3577 | **0.8233** | 0.6830 |
| **Mean** | 0.4093 | 0.8199 | 0.3607 | **0.8240** | 0.6808 |
| **p75** | 0.4561 | 0.8553 | 0.3978 | **0.8554** | 0.7351 |
|
|
|
*Semantic F1 scores for generated assertion criteria. Percentiles and mean values are shown for base models, fine-tuned (FT) versions, and GPT-4o. Bold indicates highest scores.* |
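
As a point of reference only, the sketch below shows one way an embedding-based semantic F1 between a generated and a reference list of criteria could be computed; the encoder choice and the matching procedure are assumptions and may differ from the procedure used in the paper.

```python
# Illustrative semantic-F1 sketch (assumptions: a generic sentence encoder and
# best-match cosine similarity; the paper's exact procedure may differ).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder choice

def semantic_f1(generated: list[str], reference: list[str]) -> float:
    gen_emb = encoder.encode(generated, convert_to_tensor=True)
    ref_emb = encoder.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_emb)              # [len(generated), len(reference)]
    precision = sims.max(dim=1).values.mean().item()   # each generated criterion vs. best reference
    recall = sims.max(dim=0).values.mean().item()      # each reference criterion vs. best generated
    return 2 * precision * recall / (precision + recall)
```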
|
|
|
|
|
| | **Mistral (FT)** | **Llama (FT)** | **GPT-4o** |
|----------------|------------------|----------------|-------------|
| **p25** | **1.8717** | 2.3962 | 6.5596 |
| **p50** | **2.3106** | 3.0748 | 8.2542 |
| **Mean** | **2.5915** | 3.6057 | 8.7041 |
| **p75** | **2.9839** | 4.2716 | 10.1905 |
|
|
|
*Latency for criteria generation. We compared runtimes for all three models (in seconds) and report the 25th, 50th, and 75th percentiles along with the mean. Our fine-tuned Mistral model had the lowest runtime on all metrics.*
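
The sketch below illustrates how per-prompt generation latency and the reported percentiles could be measured; the hardware, serving stack, and batching behind the numbers above are not specified in this card.

```python
# Latency measurement sketch: generate_fn is a stand-in for a call to the model
# (e.g., a wrapper around model.generate); hardware/serving details are assumptions.
import time
import numpy as np

def measure_latencies(generate_fn, prompts):
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
    p25, p50, p75 = np.percentile(latencies, [25, 50, 75])
    return {"p25": p25, "p50": p50, "mean": float(np.mean(latencies)), "p75": p75}
```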
|
|
|
| | **Average** | **Median** | **75th percentile** | **90th percentile** |
|--------------------|--------------|------------|---------------------|---------------------|
| **Base Mistral** | 14.5012 | 14 | 18.5 | 23 |
| **Mistral (FT)** | **6.28640** | **5** | **8** | **10** |
| **Base Llama** | 28.2458 | 26 | 33.5 | 46 |
| **Llama (FT)** | 5.47255 | **5** | **6** | 9 |
| **GPT-4o** | 7.59189 | 6 | 10 | 14.2 |
| *Ground Truth* | *5.98568* | *5* | *7* | *10* |
|
|
|
*Number of Criteria Generated by Models. Metrics show average, median, and percentile values. Bold indicates closest to ground truth.* |
|
|
|
Evaluation Data: Evaluated on the PromptEvals test set

Training Data: Fine-tuned on the PromptEvals train set
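
A minimal loading sketch for the dataset is below; the split and column names are not documented in this card and should be checked against the dataset card.

```python
# Load PromptEvals from the Hugging Face Hub; inspect splits and fields before use,
# since the exact split/column names are not documented in this model card.
from datasets import load_dataset

promptevals = load_dataset("reyavir/PromptEvals")
print(promptevals)  # shows the available splits and their columns
```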
|
|
|
Quantitative Analyses (Unitary results, Intersectional results): |
|
| **Domain** | **Similarity** | **Precision** | **Recall** |
|----------------------------- |----------------|---------------|------------|
| General-Purpose Chatbots | 0.8140 | 0.8070 | 0.8221 |
| Question-Answering | 0.8104 | 0.8018 | 0.8199 |
| Text Summarization | 0.8601 | 0.8733 | 0.8479 |
| Database Querying | 0.8362 | 0.8509 | 0.8228 |
| Education | 0.8388 | 0.8498 | 0.8282 |
| Content Creation | 0.8417 | 0.8480 | 0.8358 |
| Workflow Automation | 0.8389 | 0.8477 | 0.8304 |
| Horse Racing Analytics | 0.8249 | 0.8259 | 0.8245 |
| Data Analysis | 0.7881 | 0.7940 | 0.7851 |
| Prompt Engineering | 0.8441 | 0.8387 | 0.8496 |
|
|
|
*Fine-Tuned Llama Score Averages per Domain (for the 10 most represented domains in our test set).*
|
|
|
Ethical Considerations: |
|
PromptEvals is open-source and is intended to be used as a benchmark for evaluating models' ability to identify and generate assertion criteria for prompts. However, because it is open-source, it may be included in model pre-training data, which could contaminate the benchmark and reduce its effectiveness.
|
Additionally, PromptEvals uses prompts contributed by a variety of users, and the prompts may not represent all domains equally. |
|
Despite these limitations, we believe our benchmark still provides value and can be useful for evaluating models on generating assertion criteria.
|
Caveats and Recommendations: None |
|
|