---
license: apache-2.0
language:
- en
- zh
base_model:
- meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
tags:
- CoT
- LongCoT
- o1
pipeline_tag: text-generation
---
# Llama-3.2-3B-LongCoT

A small model with **LongCoT** (long chain-of-thought) capability.



## Features

- Fine-tuned on high-quality synthetic data.
- Adapts its use of LongCoT to the complexity of the question (see the sketch below).
- Strong at mathematics and reasoning.

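A minimal, non-streaming sketch of the second point is shown below. It uses the `transformers` text-generation pipeline with the model ID from this card; the example question and sampling settings are illustrative assumptions, and whether a long reasoning trace appears depends on the prompt.

```python
# Minimal sketch (the prompt and sampling settings are illustrative, not prescriptive).
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="Kadins/Llama-3.2-3B-LongCoT",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a reasoning expert and helpful assistant."},
    # A short factual question tends to get a direct answer, while a harder math
    # problem like this one is more likely to trigger a long chain-of-thought.
    {"role": "user", "content": "If x + 1/x = 3, what is the value of x^3 + 1/x^3?"},
]

result = pipe(messages, max_new_tokens=2048, do_sample=True, temperature=0.6, top_p=0.95)
print(result[0]["generated_text"][-1]["content"])
```
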
## Benchmark

| Benchmark | Llama-3.2-3B-Instruct | Llama-3.2-3B-LongCoT |
|-----------|-----------------------|----------------------|
| Math      | 35.5                  | **52.0**             |
| GSM8K     | 77.3                  | **82.3**             |

## Inference

Example of streaming inference:

```python
import time

import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TextStreamer,
)

# Model ID from Hugging Face
model_id = "Kadins/Llama-3.2-3B-LongCoT"

# Load the pre-trained model with appropriate data type and device mapping
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for optimized performance
    device_map="auto",  # Automatically map the model to available devices
)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_id)


def stream_chat(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    """
    Generates a response using streaming inference.

    Args:
        messages (list): A list of dictionaries containing the conversation prompt.
        max_new_tokens (int): Maximum number of tokens to generate.
        top_p (float): Nucleus sampling parameter for controlling diversity.
        temperature (float): Sampling temperature to control response creativity.
    """
    # Prepare the input by applying the chat template and tokenizing
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # Ensure the output is a dictionary
    ).to(model.device)  # Move the inputs to the same device as the model

    # Initialize the TextStreamer for real-time output
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Record the start time for performance measurement
    start_time = time.time()

    # Generate the response using the model's generate method with streaming
    model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        repetition_penalty=1.1,
        top_p=top_p,
        temperature=temperature,
        streamer=streamer,  # Enable streaming of the generated tokens
    )

    # Calculate and print the total response time
    total_time = time.time() - start_time
    print(f"\n--- Response finished in {total_time:.2f} seconds ---")


def chat_loop():
    """
    Initiates an interactive chat session with the model.
    Continuously reads user input and generates model responses until the user exits.
    """
    while True:
        # Initialize the conversation with a system message
        messages = [
            {"role": "system", "content": "You are a reasoning expert and helpful assistant."},
        ]

        # Prompt the user for input
        user_input = input("\nUser: ")
        if user_input.strip().lower() in ["exit", "quit"]:
            print("Exiting chat...")
            break

        # Append the user's message to the conversation history
        messages.append({"role": "user", "content": user_input})

        print("Assistant: ", end="", flush=True)

        # Generate and stream the assistant's response
        stream_chat(messages)

        # Note: Currently, the assistant's reply is streamed directly to the console.
        # To store the assistant's reply in the conversation history, additional handling is required.


if __name__ == "__main__":
    # Start the interactive chat loop when the script is executed
    chat_loop()
```
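
As the note near the end of the script points out, the streamed reply is not stored back into `messages`. One way to keep multi-turn history, sketched below under the assumption that `model` and `tokenizer` are loaded as above (the helper name `stream_chat_with_history` is hypothetical), is to switch to `TextIteratorStreamer`, run `generate` in a background thread, and collect the streamed chunks before appending them as an assistant message:

```python
# Sketch: capture the streamed reply so it can be appended to the conversation history.
# Assumes `model` and `tokenizer` are already loaded as in the script above.
from threading import Thread

from transformers import TextIteratorStreamer


def stream_chat_with_history(messages, max_new_tokens=8192, top_p=0.95, temperature=0.6):
    # Tokenize the conversation with the chat template, as in stream_chat
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    # TextIteratorStreamer yields decoded text chunks instead of printing them itself
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    # Run generation in a background thread so the stream can be consumed here
    thread = Thread(
        target=model.generate,
        kwargs={
            **inputs,
            "max_new_tokens": max_new_tokens,
            "do_sample": True,
            "repetition_penalty": 1.1,
            "top_p": top_p,
            "temperature": temperature,
            "streamer": streamer,
        },
    )
    thread.start()

    # Print each chunk as it arrives and accumulate the full reply
    reply = ""
    for chunk in streamer:
        print(chunk, end="", flush=True)
        reply += chunk
    thread.join()

    # Keep the reply in the history so follow-up turns see the full conversation
    messages.append({"role": "assistant", "content": reply})
    return reply
```

Calling this helper in place of `stream_chat` (and moving the `messages` initialization outside the `while` loop) would turn the loop into a true multi-turn chat.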