Metadata-Version: 2.1
Name: ctransformers
Version: 0.2.27
Summary: Python bindings for the Transformer models implemented in C/C++ using GGML library.
Home-page: https://github.com/marella/ctransformers
Author: Ravindra Marella
Author-email: mv.ravindra007@gmail.com
License: MIT
Keywords: ctransformers transformers ai llm
Platform: UNKNOWN
Classifier: Development Status :: 1 - Planning
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Description-Content-Type: text/markdown
Requires-Dist: huggingface-hub
Requires-Dist: py-cpuinfo (<10.0.0,>=9.0.0)
Provides-Extra: cuda
Requires-Dist: nvidia-cublas-cu12 ; extra == 'cuda'
Requires-Dist: nvidia-cuda-runtime-cu12 ; extra == 'cuda'
Provides-Extra: gptq
Requires-Dist: exllama (==0.1.0) ; extra == 'gptq'
Provides-Extra: tests
Requires-Dist: pytest ; extra == 'tests'

# [CTransformers](https://github.com/marella/ctransformers)

[![PyPI](https://img.shields.io/pypi/v/ctransformers)](https://pypi.org/project/ctransformers/)
[![tests](https://github.com/marella/ctransformers/actions/workflows/tests.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/tests.yml)
[![build](https://github.com/marella/ctransformers/actions/workflows/build.yml/badge.svg)](https://github.com/marella/ctransformers/actions/workflows/build.yml)

Python bindings for the Transformer models implemented in C/C++ using the [GGML](https://github.com/ggerganov/ggml) library.
> Also see [ChatDocs](https://github.com/marella/chatdocs)

- [Supported Models](#supported-models)
- [Installation](#installation)
- [Usage](#usage)
  - [🤗 Transformers](#transformers)
  - [LangChain](#langchain)
  - [GPU](#gpu)
  - [GPTQ](#gptq)
- [Documentation](#documentation)
- [License](#license)

## Supported Models

| Models              | Model Type    | CUDA | Metal |
| :------------------ | ------------- | :--: | :---: |
| GPT-2               | `gpt2`        |      |       |
| GPT-J, GPT4All-J    | `gptj`        |      |       |
| GPT-NeoX, StableLM  | `gpt_neox`    |      |       |
| Falcon              | `falcon`      |  ✅  |       |
| LLaMA, LLaMA 2      | `llama`       |  ✅  |  ✅   |
| MPT                 | `mpt`         |  ✅  |       |
| StarCoder, StarChat | `gpt_bigcode` |  ✅  |       |
| Dolly V2            | `dolly-v2`    |      |       |
| Replit              | `replit`      |      |       |

## Installation

```sh
pip install ctransformers
```

## Usage

It provides a unified interface for all models:

```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("/path/to/ggml-model.bin", model_type="gpt2")

print(llm("AI is going to"))
```

[Run in Google Colab](https://colab.research.google.com/drive/1GMhYMUAv_TyZkpfvUI1NirM8-9mCXQyL)

To stream the output, set `stream=True`:

```py
for text in llm("AI is going to", stream=True):
    print(text, end="", flush=True)
```

You can load models from Hugging Face Hub directly:

```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")
```

If a model repo has multiple model files (`.bin` or `.gguf` files), specify a model file using:

```py
llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", model_file="ggml-model.bin")
```

### 🤗 Transformers

> **Note:** This is an experimental feature and may change in the future.

To use it with 🤗 Transformers, create the model and tokenizer using:

```py
from ctransformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)
tokenizer = AutoTokenizer.from_pretrained(model)
```

[Run in Google Colab](https://colab.research.google.com/drive/1FVSLfTJ2iBbQ1oU2Rqz0MkpJbaB_5Got)

You can use the 🤗 Transformers text generation pipeline:

```py
from transformers import pipeline

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("AI is going to", max_new_tokens=256))
```

You can use the 🤗 Transformers generation [parameters](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig):

```py
pipe("AI is going to", max_new_tokens=256, do_sample=True, temperature=0.8, repetition_penalty=1.1)
```

You can use 🤗 Transformers tokenizers:

```py
from ctransformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml", hf=True)  # Load model from GGML model repo.
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Load tokenizer from original model repo.
```

### LangChain

It is integrated into LangChain. See [LangChain docs](https://python.langchain.com/docs/ecosystem/integrations/ctransformers).
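For illustration, here is a minimal sketch of using it through LangChain's `CTransformers` wrapper. The import path and accepted parameters depend on the installed LangChain version, so treat this as an assumption-based example rather than the canonical integration; refer to the LangChain docs linked above for the authoritative usage.

```py
# Sketch only: assumes a LangChain version that exposes the CTransformers community wrapper.
from langchain.llms import CTransformers

llm = CTransformers(
    model="marella/gpt-2-ggml",  # Hugging Face repo or local path, as with from_pretrained().
    model_type="gpt2",
    config={"max_new_tokens": 64, "temperature": 0.8},  # Generation settings are passed as a dict here.
)

print(llm("AI is going to"))
```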
### GPU

To run some of the model layers on GPU, set the `gpu_layers` parameter:

```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML", gpu_layers=50)
```

[Run in Google Colab](https://colab.research.google.com/drive/1Ihn7iPCYiqlTotpkqa1tOhUIpJBrJ1Tp)

#### CUDA

Install CUDA libraries using:

```sh
pip install ctransformers[cuda]
```

#### ROCm

To enable ROCm support, install the `ctransformers` package using:

```sh
CT_HIPBLAS=1 pip install ctransformers --no-binary ctransformers
```

#### Metal

To enable Metal support, install the `ctransformers` package using:

```sh
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```

### GPTQ

> **Note:** This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/turboderp/exllama).

Install additional dependencies using:

```sh
pip install ctransformers[gptq]
```

Load a GPTQ model using:

```py
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
```

[Run in Google Colab](https://colab.research.google.com/drive/1SzHslJ4CiycMOgrppqecj4VYCWFnyrN0)

> If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.

It can also be used with LangChain. Low-level APIs are not fully supported.

## Documentation

### Config

| Parameter            | Type        | Description                                                      | Default |
| :------------------- | :---------- | :--------------------------------------------------------------- | :------ |
| `top_k`              | `int`       | The top-k value to use for sampling.                              | `40`    |
| `top_p`              | `float`     | The top-p value to use for sampling.                              | `0.95`  |
| `temperature`        | `float`     | The temperature to use for sampling.                              | `0.8`   |
| `repetition_penalty` | `float`     | The repetition penalty to use for sampling.                       | `1.1`   |
| `last_n_tokens`      | `int`       | The number of last tokens to use for repetition penalty.          | `64`    |
| `seed`               | `int`       | The seed value to use for sampling tokens.                        | `-1`    |
| `max_new_tokens`     | `int`       | The maximum number of new tokens to generate.                     | `256`   |
| `stop`               | `List[str]` | A list of sequences to stop generation when encountered.          | `None`  |
| `stream`             | `bool`      | Whether to stream the generated text.                             | `False` |
| `reset`              | `bool`      | Whether to reset the model state before generating text.          | `True`  |
| `batch_size`         | `int`       | The batch size to use for evaluating tokens in a single prompt.   | `8`     |
| `threads`            | `int`       | The number of threads to use for evaluating tokens.               | `-1`    |
| `context_length`     | `int`       | The maximum context length to use.                                | `-1`    |
| `gpu_layers`         | `int`       | The number of layers to run on GPU.                               | `0`     |

> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` parameter.
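As a brief sketch, these config values can be supplied when loading the model (the GPU example above already passes `gpu_layers` this way) and sampling-related ones can be overridden per call; the exact forwarding of keyword arguments into the config is assumed here, so adjust to your version if needed:

```py
from ctransformers import AutoModelForCausalLM

# Assumption: from_pretrained() forwards config keyword arguments (see the table above)
# into the model config, as the gpu_layers example suggests.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",
    context_length=2048,  # Only LLaMA, MPT and Falcon models support context_length.
    gpu_layers=0,         # Number of layers to run on GPU.
)

# Per-call overrides for sampling and stopping, as accepted by LLM.__call__ (documented below).
print(llm("AI is going to", max_new_tokens=64, temperature=0.7, stop=["\n"]))
```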
### class `AutoModelForCausalLM`

---

#### classmethod `AutoModelForCausalLM.from_pretrained`

```python
from_pretrained(
    model_path_or_repo_id: str,
    model_type: Optional[str] = None,
    model_file: Optional[str] = None,
    config: Optional[ctransformers.hub.AutoConfig] = None,
    lib: Optional[str] = None,
    local_files_only: bool = False,
    revision: Optional[str] = None,
    hf: bool = False,
    **kwargs
) → LLM
```

Loads the language model from a local file or remote repo.

**Args:**

- `model_path_or_repo_id`: The path to a model file or directory or the name of a Hugging Face Hub model repo.
- `model_type`: The model type.
- `model_file`: The name of the model file in repo or directory.
- `config`: `AutoConfig` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.
- `local_files_only`: Whether or not to only look at local files (i.e., do not try to download the model).
- `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id.
- `hf`: Whether to create a Hugging Face Transformers model.

**Returns:** `LLM` object.

### class `LLM`

### method `LLM.__init__`

```python
__init__(
    model_path: str,
    model_type: Optional[str] = None,
    config: Optional[ctransformers.llm.Config] = None,
    lib: Optional[str] = None
)
```

Loads the language model from a local file.

**Args:**

- `model_path`: The path to a model file.
- `model_type`: The model type.
- `config`: `Config` object.
- `lib`: The path to a shared library or one of `avx2`, `avx`, `basic`.

---

##### property LLM.bos_token_id

The beginning-of-sequence token.

---

##### property LLM.config

The config object.

---

##### property LLM.context_length

The context length of the model.

---

##### property LLM.embeddings

The input embeddings.

---

##### property LLM.eos_token_id

The end-of-sequence token.

---

##### property LLM.logits

The unnormalized log probabilities.

---

##### property LLM.model_path

The path to the model file.

---

##### property LLM.model_type

The model type.

---

##### property LLM.pad_token_id

The padding token.

---

##### property LLM.vocab_size

The number of tokens in the vocabulary.

---

#### method `LLM.detokenize`

```python
detokenize(tokens: Sequence[int], decode: bool = True) → Union[str, bytes]
```

Converts a list of tokens to text.

**Args:**

- `tokens`: The list of tokens.
- `decode`: Whether to decode the text as a UTF-8 string.

**Returns:** The combined text of all tokens.

---

#### method `LLM.embed`

```python
embed(
    input: Union[str, Sequence[int]],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → List[float]
```

Computes embeddings for a text or list of tokens.

> **Note:** Currently only LLaMA and Falcon models support embeddings.

**Args:**

- `input`: The input text or list of tokens to get embeddings for.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

**Returns:** The input embeddings.
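For instance, a minimal sketch of computing an embedding with a LLaMA-family GGML model (the repo name is illustrative; only LLaMA and Falcon models support embeddings, and the length of the returned vector depends on the model):

```py
from ctransformers import AutoModelForCausalLM

# Embeddings are only supported for LLaMA and Falcon models.
llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GGML")

embedding = llm.embed("AI is going to")  # Returns a List[float].
print(len(embedding))
```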
---

#### method `LLM.eval`

```python
eval(
    tokens: Sequence[int],
    batch_size: Optional[int] = None,
    threads: Optional[int] = None
) → None
```

Evaluates a list of tokens.

**Args:**

- `tokens`: The list of tokens to evaluate.
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`

---

#### method `LLM.generate`

```python
generate(
    tokens: Sequence[int],
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    reset: Optional[bool] = None
) → Generator[int, NoneType, NoneType]
```

Generates new tokens from a list of tokens.

**Args:**

- `tokens`: The list of tokens to generate tokens from.
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:** The generated tokens.

---

#### method `LLM.is_eos_token`

```python
is_eos_token(token: int) → bool
```

Checks if a token is an end-of-sequence token.

**Args:**

- `token`: The token to check.

**Returns:** `True` if the token is an end-of-sequence token, else `False`.

---

#### method `LLM.prepare_inputs_for_generation`

```python
prepare_inputs_for_generation(
    tokens: Sequence[int],
    reset: Optional[bool] = None
) → Sequence[int]
```

Removes input tokens that were evaluated in the past and updates the LLM context.

**Args:**

- `tokens`: The list of input tokens.
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:** The list of tokens to evaluate.

---

#### method `LLM.reset`

```python
reset() → None
```

Deprecated since 0.2.27.

---

#### method `LLM.sample`

```python
sample(
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None
) → int
```

Samples a token from the model.

**Args:**

- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`

**Returns:** The sampled token.

---

#### method `LLM.tokenize`

```python
tokenize(text: str, add_bos_token: Optional[bool] = None) → List[int]
```

Converts a text into a list of tokens.

**Args:**

- `text`: The text to tokenize.
- `add_bos_token`: Whether to add the beginning-of-sequence token.

**Returns:** The list of tokens.
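Putting these low-level methods together, here is a brief sketch of a token-level streaming loop built from `tokenize`, `generate`, `is_eos_token` and `detokenize`. It is an illustration, not the canonical implementation: the token cap is added here because limits such as `max_new_tokens` are assumed to be handled by the higher-level `__call__` API documented below.

```py
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("marella/gpt-2-ggml")

tokens = llm.tokenize("AI is going to")  # Convert the prompt into token ids.

# generate() yields token ids one at a time; detokenize() turns them back into text.
for i, token in enumerate(llm.generate(tokens)):
    if i >= 64 or llm.is_eos_token(token):  # Simple cap, plus stop at end-of-sequence.
        break
    print(llm.detokenize([token]), end="", flush=True)
```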
---

#### method `LLM.__call__`

```python
__call__(
    prompt: str,
    max_new_tokens: Optional[int] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    temperature: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    last_n_tokens: Optional[int] = None,
    seed: Optional[int] = None,
    batch_size: Optional[int] = None,
    threads: Optional[int] = None,
    stop: Optional[Sequence[str]] = None,
    stream: Optional[bool] = None,
    reset: Optional[bool] = None
) → Union[str, Generator[str, NoneType, NoneType]]
```

Generates text from a prompt.

**Args:**

- `prompt`: The prompt to generate text from.
- `max_new_tokens`: The maximum number of new tokens to generate. Default: `256`
- `top_k`: The top-k value to use for sampling. Default: `40`
- `top_p`: The top-p value to use for sampling. Default: `0.95`
- `temperature`: The temperature to use for sampling. Default: `0.8`
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `stop`: A list of sequences to stop generation when encountered. Default: `None`
- `stream`: Whether to stream the generated text. Default: `False`
- `reset`: Whether to reset the model state before generating text. Default: `True`

**Returns:** The generated text.

## License

[MIT](https://github.com/marella/ctransformers/blob/main/LICENSE)