	---
	license: llama2
	language:
	- en
	base_model:
	- meta-llama/CodeLlama-13b-hf
	pipeline_tag: text-generation
	tags:
	- code
	- gguf
	- llama.cpp
	- llmstudio
	---
	# Model Card for TestGen-Dart v0.2 (GGUF Version)

	This model card describes TestGen-Dart v0.2 (GGUF version), a fine-tuned version of Meta's Code Llama 13B that generates unit tests in Dart for mobile applications. The GGUF-quantized model runs with tools such as LM Studio and llama.cpp, which makes deployment on resource-constrained hardware practical.

	---

	## Model Details

	### Model Description

	TestGen-Dart v0.2 is a fine-tuned version of Code Llama 13B, specifically adapted for generating unit test cases for Dart code. The GGUF quantization enables its use on lightweight, consumer-grade systems without significant performance loss.

	- Developed by: Jacob Hoffmann, Demian Frister (Karlsruhe Institute of Technology - KIT, AIFB-BIS)
	- Funded by: Helmholtz Association's Initiative and Networking Fund on the HAICORE@FZJ partition
	- Shared by: Jacob Hoffmann, Demian Frister
	- Model type: Fine-tuned Code Llama 13B for test generation in Dart
	- Language(s): English
	- License: LLaMA 2 Community License
	- Finetuned from model: Meta's Code Llama 13B

	### Model Sources

	- Repository: [GitHub Repository](https://github.com/example/repo) (placeholder)
	- Paper: ["Generating Software Tests for Mobile Applications Using Fine-Tuned Large Language Models"](https://doi.org/10.1145/3644032.3644454) (published in AST '24)
	- Demo: Coming soon

	---

	## Uses

	### Direct Use

	The model can be used zero-shot with llama.cpp or LM Studio to generate unit tests in Dart. Provide the source code of a class as input; the model outputs unit tests written against Dart's `test` package.
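As a minimal sketch of preparing that input, the shell function below prepends an instruction sentence to the contents of a Dart file. The instruction wording is taken from the llama.cpp example in this card; the function name `build_prompt` and the file `calculator.dart` are hypothetical, and this is not claimed to be the authors' own tooling.

```shell
# Build a zero-shot prompt: one instruction line followed by the Dart source.
build_prompt() {
  printf 'Generate unit tests in Dart for the following class:\n'
  cat "$1"
}

# Illustrative input (hypothetical file), written to a temporary directory.
workdir="$(mktemp -d)"
printf 'class Calculator { int add(int a, int b) { return a + b; } }\n' > "$workdir/calculator.dart"

# Save the prompt to a file so it can be handed to an inference tool.
build_prompt "$workdir/calculator.dart" > "$workdir/prompt.txt"
cat "$workdir/prompt.txt"
```

The resulting file can be passed to llama.cpp with its `-f` (prompt file) option instead of an inline `--prompt` string.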

	### Downstream Use

	This model is suitable for integration into developer tools, IDE extensions, or continuous integration pipelines to automate test generation for Dart-based applications.
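One part of a CI integration that does not depend on the model is deciding where each generated test file should go. `dart test` picks up files ending in `_test.dart` under `test/`; mirroring the `lib/` layout is a common convention, not something this card prescribes. The helper name `test_path_for` and the example paths below are hypothetical.

```shell
# Map a source file under lib/ to a test path under test/.
# Example: lib/src/cart.dart -> test/src/cart_test.dart
test_path_for() {
  rel="${1#lib/}"                              # strip the leading lib/
  printf 'test/%s_test.dart\n' "${rel%.dart}"  # swap .dart for _test.dart
}

# Hypothetical pipeline step: show the target path for each source file.
for src in lib/calculator.dart lib/src/cart.dart; do
  printf '%s -> %s\n' "$src" "$(test_path_for "$src")"
done
```

A pipeline could write the model output for each source file to the mapped path and then run `dart test` to see which generated tests compile and pass.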

	### Out-of-Scope Use

	- Do not use this model for tasks unrelated to Dart test generation.
	- Do not use this model or its outputs to improve or train other LLMs that are not based on Llama 2 or its derivatives; the LLaMA 2 Community License prohibits this.
	- Do not use the model for malicious purposes, such as deliberately generating misleading or harmful test cases.

	---

	## Running the GGUF Model with llama.cpp

	To use this GGUF quantized model with llama.cpp:

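	1. Clone the llama.cpp repository and build the binaries. Recent versions build with CMake (older checkouts used a plain `make`):
	```bash
	git clone https://github.com/ggerganov/llama.cpp
	cd llama.cpp
	cmake -B build
	cmake --build build --config Release
	```
	2. Place the GGUF file in the models directory:
	```bash
	mkdir -p models/testgen-dart-v0.2
	mv /path/to/CodeLlama-13B-TestGen-Dart_v0.2.gguf models/testgen-dart-v0.2/
	```
	3. Run the model. Recent builds name the CLI binary `build/bin/llama-cli` (older ones produced `./main`); the `-e` flag makes it interpret the `\n` escape in the prompt:
	```bash
	./build/bin/llama-cli -m models/testgen-dart-v0.2/CodeLlama-13B-TestGen-Dart_v0.2.gguf -e --prompt "Generate unit tests in Dart for the following class:\nclass Calculator { int add(int a, int b) { return a + b; } }"
	```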

	## Training Details

	### Training Data

	The fine-tuning dataset consists of 16,252 Dart code-test pairs extracted from open-source GitHub repositories using Google BigQuery. The data was subjected to quality filtering and deduplication to ensure high relevance and consistency.
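The card does not say how deduplication was performed. One simple approach, shown here purely as an illustration and not as the authors' pipeline, is exact-match deduplication: hash each file's contents and keep the first file seen for each hash. The function name and sample files are hypothetical; `sha256sum` is the GNU coreutils tool (on macOS, `shasum -a 256` plays the same role).

```shell
# Print the paths of .dart files under a directory, skipping exact duplicates.
dedup_by_hash() {
  find "$1" -type f -name '*.dart' | sort | while IFS= read -r file; do
    # Hash the contents only, so identical files get identical keys.
    hash="$(sha256sum < "$file" | cut -d ' ' -f 1)"
    printf '%s\t%s\n' "$hash" "$file"
  done | awk -F '\t' '!seen[$1]++ { print $2 }'   # keep the first path per hash
}

# Illustrative data: b.dart is an exact copy of a.dart.
workdir="$(mktemp -d)"
printf 'void main() {}\n' > "$workdir/a.dart"
printf 'void main() {}\n' > "$workdir/b.dart"
printf 'int one() => 1;\n' > "$workdir/c.dart"

dedup_by_hash "$workdir"   # lists a.dart and c.dart
```

Exact hashing only catches byte-identical files; near-duplicates that differ in whitespace or comments would need normalization first.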

	### Training Procedure

	- Fine-tuning Approach: Supervised Fine-Tuning (SFT) with QLoRA for memory efficiency.
	- Hardware: Training was conducted on a single NVIDIA A100 GPU.
	- Optimization: Flash Attention 2 was used to speed up training and reduce memory use.
	- Duration: The training process ran for up to 32 hours.

	### Training Hyperparameters

	- Mixed Precision: FP16
	- Optimizer: AdamW
	- Learning Rate: 5e-5
	- Epochs: 3

	### Environmental Impact

	- Hardware Type: NVIDIA A100 GPU
	- Hours Used: 32 hours
	- Carbon Emitted: 13.099 kgCO2eq

	---

	## Evaluation

	### Testing Data, Factors & Metrics

	- Testing Data: A subset of 42 Dart files from the training dataset, evaluated in a zero-shot setting.
	- Factors: Syntax correctness, functional correctness.
	- Metrics: pass@1, syntax error rate, functional correctness rate.
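With a single generated test file per input, pass@1 reduces to the share of inputs whose generated tests pass. The sketch below computes that ratio from a results list; the `<file> <pass|fail>` line format and the sample data are made up for illustration and are not the paper's evaluation script.

```shell
# Read "<file> <pass|fail>" lines from stdin and print pass@1,
# assuming exactly one generated sample per file.
pass_at_1() {
  awk '$2 == "pass" { passed++ }
       END { if (NR == 0) print "0.0000"; else printf "%.4f\n", passed / NR }'
}

# Illustrative results for four files, two of which pass.
printf '%s\n' 'a.dart pass' 'b.dart fail' 'c.dart pass' 'd.dart fail' | pass_at_1
```

The syntax error rate can be computed the same way by recording a syntax-check outcome per file instead of a test outcome.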

	### Results

	- Syntax Correctness: 76% improvement over the base model.
	- Functional Correctness: 16.67% improvement over the base model.

	---

	## Citation

	If you use this model in your research, please cite:

	BibTeX:
	```bibtex
	@inproceedings{hoffmann2024testgen,
	title={Generating Software Tests for Mobile Applications Using Fine-Tuned Large Language Models},
	author={Hoffmann, Jacob and Frister, Demian},
	booktitle={Proceedings of the 5th ACM/IEEE International Conference on Automation of Software Test (AST 2024)},
	year={2024},
	doi={10.1145/3644032.3644454}
	}
	```

	## Model Card Contact

	- Jacob Hoffmann
	- Demian Frister