Himel2k21 commited on
Commit
0160587
·
verified ·
1 Parent(s): 5f77691

Upload 24 files (#2)

Browse files

- Upload 24 files (6dcf78ea4a20b766511bfd1c01351754800a1fe6)

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ bfg.jar filter=lfs diff=lfs merge=lfs -text
Dockerfile ADDED
@@ -0,0 +1,21 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ FROM python:3.9-slim
2
+
3
+ WORKDIR /app
4
+
5
+ RUN apt-get update && apt-get install -y \
6
+ build-essential \
7
+ curl \
8
+ software-properties-common \
9
+ git \
10
+ && rm -rf /var/lib/apt/lists/*
11
+
12
+ COPY requirements.txt ./
13
+ COPY src/ ./src/
14
+
15
+ RUN pip3 install -r requirements.txt
16
+
17
+ EXPOSE 8501
18
+
19
+ HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health
20
+
21
+ ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -1,11 +1,103 @@
1
  ---
2
- title: AIAGENT
3
- emoji: 😻
4
- colorFrom: red
5
- colorTo: red
6
- sdk: static
7
- pinned: false
8
  license: apache-2.0
 
 
 
 
 
 
 
 
 
 
 
9
  ---
10
 
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
 
 
 
 
 
 
2
  license: apache-2.0
3
+ datasets:
4
+ - bigcode/the-stack-dedup
5
+ - togethercomputer/RedPajama-Data-1T
6
+ tags:
7
+ - code
8
+ - Composer
9
+ - MosaicML
10
+ - llm-foundry
11
+ - StreamingDatasets
12
+ language:
13
+ - code
14
  ---
15
 
16
+ # Replit Code V-1.5 3B
17
+
18
+ Developed by: Replit, Inc.
19
+
20
+ ## Model Description
21
+
22
+ Replit Code v1.5 is a 3.3B parameter Causal Language Model focused on **Code Completion**.
23
+
24
+ The model is trained in `bfloat16` on 1T tokens of code (~200B tokens over 5 epochs, including linear cooldown) for 30 programming languages from a subset of permissively licensed code from Bigcode's [Stack Dedup dataset](https://huggingface.co/datasets/bigcode/the-stack-dedup), a filtered natural language sample from Markdown and reStructuredText subsets from the same Stack Dedup dataset, and a dev-oriented sample from [RedPajama's StackExchange dataset](https://github.com/togethercomputer/RedPajama-Data) sourced from the [Stack Exchange Data Dump by Stack Exchange Inc](https://archive.org/details/stackexchange).
25
+
26
+ The 30 programming languages are:
27
+ ```
28
+ Java, JavaScript, C, PHP, Python, C++, C#, TypeScript, Go, CSS, HTML, Rust, Ruby, Swift, Scala, Shell, Lua, Perl, Haskell, JSX, Julia, Common Lisp, OCaml, Solidity, Scheme, R, Zig, SQL, Racket, D
29
+ ```
30
+
31
+ The context size of the model is 4096 tokens. We use the GPTNeoX tokenizer with a custom trained and optimized vocabulary of 32768 tokens. This custom vocabulary led to single-digit % points on compression while maintaining or improving coverage on our training corpus.
32
+
33
+ The model has been trained on the [MosaicML](https://www.mosaicml.com/) platform on 128 H100-80GB GPUs using their [LLM Foundry](https://github.com/mosaicml/llm-foundry) and [Composer](https://github.com/mosaicml/composer) training library built on top of PyTorch.
34
+
35
+ ## Dependencies
36
+ You will need to install the latest versions of the following dependencies:
37
+ ```
38
+ einops
39
+ torch
40
+ transformers
41
+ ```
42
+
43
+ ## How to Use
44
+
45
+ ### Generation
46
+
47
+ You can generate code using the `transformers` library as follows:
48
+
49
+ ```python
50
+ from transformers import AutoModelForCausalLM, AutoTokenizer
51
+
52
+ tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1_5-3b', trust_remote_code=True)
53
+ model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1_5-3b', trust_remote_code=True)
54
+
55
+ x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt')
56
+ y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
57
+
58
+ # decoding
59
+ generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
60
+ print(generated_code)
61
+ ```
62
+
63
+ Experiment with different decoding methods and parameters to get the best results for your use case.
64
+
65
+ ### Using Triton Implementation of Flash Attention
66
+
67
+ ```python
68
+ import torch
69
+ from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
70
+
71
+ config = AutoConfig.from_pretrained(
72
+ "replit/replit-code-v1_5-3b",
73
+ trust_remote_code=True
74
+ )
75
+ config.attn_config['attn_impl'] = 'triton'
76
+
77
+ # load model
78
+ tokenizer = AutoTokenizer.from_pretrained('replit/replit-code-v1_5-3b', trust_remote_code=True)
79
+ model = AutoModelForCausalLM.from_pretrained('replit/replit-code-v1_5-3b', config=config, trust_remote_code=True)
80
+ model.to(device='cuda:0', dtype=torch.bfloat16)
81
+
82
+ # forward pass
83
+ x = tokenizer.encode('def fibonacci(n): ', return_tensors='pt').to(device='cuda:0')
84
+ x = x.to(device='cuda:0')
85
+ y = model.generate(x, max_length=100, do_sample=True, top_p=0.95, top_k=4, temperature=0.2, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
86
+
87
+
88
+ # decoding
89
+ generated_code = tokenizer.decode(y[0], skip_special_tokens=True, clean_up_tokenization_spaces=False)
90
+ print(generated_code)
91
+ ```
92
+
93
+ Experiment with different decoding methods and parameters to get the best results for your use case. We recommend experimenting with `temperature` and `reptition_penalty`for optimal performance on your use case!
94
+
95
+ ## Intended Use
96
+
97
+ Replit intends this model be used by anyone as a foundational model for application-specific fine-tuning without strict limitations on commercial use.
98
+
99
+ The model is trained specifically for code completion tasks.
100
+
101
+
102
+ ## Limitations
103
+ The pre-training dataset may have contained offensive or inappropriate content even after applying data cleansing and toxicity and profanity filters, and such content may be reflected in model generated text. We recommend that users exercise reasonable caution when using in production systems. Do not use for any applications that may cause harm or distress to individuals or groups.
adapt_tokenizer.py ADDED
@@ -0,0 +1,40 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ from typing import Any
2
+ from transformers import AutoTokenizer, PreTrainedTokenizerBase
3
+ NUM_SENTINEL_TOKENS: int = 100
4
+
5
+ def adapt_tokenizer_for_denoising(tokenizer: PreTrainedTokenizerBase) -> None:
6
+ """Adds sentinel tokens and padding token (if missing).
7
+
8
+ Expands the tokenizer vocabulary to include sentinel tokens
9
+ used in mixture-of-denoiser tasks as well as a padding token.
10
+
11
+ All added tokens are added as special tokens. No tokens are
12
+ added if sentinel tokens and padding token already exist.
13
+ """
14
+ sentinels_to_add = [f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)]
15
+ tokenizer.add_tokens(sentinels_to_add, special_tokens=True)
16
+ if tokenizer.pad_token is None:
17
+ tokenizer.add_tokens('<pad>', special_tokens=True)
18
+ tokenizer.pad_token = '<pad>'
19
+ assert tokenizer.pad_token_id is not None
20
+ sentinels = ''.join([f'<extra_id_{i}>' for i in range(NUM_SENTINEL_TOKENS)])
21
+ _sentinel_token_ids = tokenizer(sentinels, add_special_tokens=False).input_ids
22
+ tokenizer.sentinel_token_ids = _sentinel_token_ids
23
+
24
+ class AutoTokenizerForMOD(AutoTokenizer):
25
+ """AutoTokenizer + Adaptation for MOD.
26
+
27
+ A simple wrapper around AutoTokenizer to make instantiating
28
+ an MOD-adapted tokenizer a bit easier.
29
+
30
+ MOD-adapted tokenizers have sentinel tokens (e.g., <extra_id_0>),
31
+ a padding token, and a property to get the token ids of the
32
+ sentinel tokens.
33
+ """
34
+
35
+ @classmethod
36
+ def from_pretrained(cls, *args: Any, **kwargs: Any) -> PreTrainedTokenizerBase:
37
+ """See `AutoTokenizer.from_pretrained` docstring."""
38
+ tokenizer = super().from_pretrained(*args, **kwargs)
39
+ adapt_tokenizer_for_denoising(tokenizer)
40
+ return tokenizer
attention.py ADDED
@@ -0,0 +1,306 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Attention layers."""
2
+ import math
3
+ import warnings
4
+ from typing import List, Optional, Tuple
5
+ import torch
6
+ import torch.nn as nn
7
+ from einops import rearrange
8
+ from packaging import version
9
+ from torch import nn
10
+ from .fc import FC_CLASS_REGISTRY
11
+ from .norm import NORM_CLASS_REGISTRY
12
+
13
+ def _reset_is_causal(num_query_tokens: int, num_key_tokens: int, original_is_causal: bool) -> bool:
14
+ if original_is_causal and num_query_tokens != num_key_tokens:
15
+ if num_query_tokens != 1:
16
+ raise NotImplementedError('MPT does not support query and key with different number of tokens, unless number of query tokens is 1.')
17
+ else:
18
+ return False
19
+ return original_is_causal
20
+
21
+ def scaled_multihead_dot_product_attention(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
22
+ if multiquery:
23
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
24
+ kv_n_heads = 1
25
+ elif kv_n_heads is None:
26
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
27
+ kv_n_heads = n_heads
28
+ q = rearrange(query, 'b s (h d) -> b h s d', h=n_heads)
29
+ k = rearrange(key, 'b s (h d) -> b h d s', h=kv_n_heads)
30
+ v = rearrange(value, 'b s (h d) -> b h s d', h=kv_n_heads)
31
+ if past_key_value is not None:
32
+ if len(past_key_value) != 0:
33
+ k = torch.cat([past_key_value[0], k], dim=3)
34
+ v = torch.cat([past_key_value[1], v], dim=2)
35
+ past_key_value = (k, v)
36
+ (b, _, s_q, d) = q.shape
37
+ s_k = k.size(-1)
38
+ if kv_n_heads > 1 and kv_n_heads < n_heads:
39
+ k = k.repeat_interleave(n_heads // kv_n_heads, dim=1)
40
+ v = v.repeat_interleave(n_heads // kv_n_heads, dim=1)
41
+ if softmax_scale is None:
42
+ softmax_scale = 1 / math.sqrt(d)
43
+ attn_weight = q.matmul(k) * softmax_scale
44
+ if attn_bias is not None:
45
+ _s_q = max(0, attn_bias.size(2) - s_q)
46
+ _s_k = max(0, attn_bias.size(3) - s_k)
47
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
48
+ if attn_bias.size(-1) != 1 and attn_bias.size(-1) != s_k or (attn_bias.size(-2) != 1 and attn_bias.size(-2) != s_q):
49
+ raise RuntimeError(f'attn_bias (shape: {attn_bias.shape}) is expected to broadcast to shape: {attn_weight.shape}.')
50
+ attn_weight = attn_weight + attn_bias
51
+ min_val = torch.finfo(q.dtype).min
52
+ if key_padding_mask is not None:
53
+ if attn_bias is not None:
54
+ warnings.warn('Propagating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unnecessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
55
+ attn_weight = attn_weight.masked_fill(~key_padding_mask.view((b, 1, 1, s_k)), min_val)
56
+ if is_causal and (not q.size(2) == 1):
57
+ s = max(s_q, s_k)
58
+ causal_mask = attn_weight.new_ones(s, s, dtype=torch.float32)
59
+ causal_mask = causal_mask.tril()
60
+ causal_mask = causal_mask.to(torch.bool)
61
+ causal_mask = ~causal_mask
62
+ causal_mask = causal_mask[-s_q:, -s_k:]
63
+ attn_weight = attn_weight.masked_fill(causal_mask.view(1, 1, s_q, s_k), min_val)
64
+ attn_weight = torch.softmax(attn_weight, dim=-1)
65
+ if dropout_p:
66
+ attn_weight = torch.nn.functional.dropout(attn_weight, p=dropout_p, training=training, inplace=True)
67
+ out = attn_weight.to(v.dtype).matmul(v)
68
+ out = rearrange(out, 'b h s d -> b s (h d)')
69
+ if needs_weights:
70
+ return (out, attn_weight, past_key_value)
71
+ return (out, None, past_key_value)
72
+
73
+ def check_valid_inputs(*tensors: torch.Tensor, valid_dtypes: Optional[List[torch.dtype]]=None):
74
+ if valid_dtypes is None:
75
+ valid_dtypes = [torch.float16, torch.bfloat16]
76
+ for tensor in tensors:
77
+ if tensor.dtype not in valid_dtypes:
78
+ raise TypeError(f'tensor.dtype={tensor.dtype!r} must be in valid_dtypes={valid_dtypes!r}.')
79
+ if not tensor.is_cuda:
80
+ raise TypeError(f'Inputs must be cuda tensors (tensor.is_cuda={tensor.is_cuda!r}).')
81
+
82
+ def flash_attn_fn(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
83
+ try:
84
+ from flash_attn import bert_padding, flash_attn_interface
85
+ except:
86
+ raise RuntimeError('Please install flash-attn==1.0.3.post0')
87
+ check_valid_inputs(query, key, value)
88
+ if multiquery:
89
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
90
+ kv_n_heads = 1
91
+ elif kv_n_heads is None:
92
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
93
+ kv_n_heads = n_heads
94
+ if past_key_value is not None:
95
+ if len(past_key_value) != 0:
96
+ key = torch.cat([past_key_value[0], key], dim=1)
97
+ value = torch.cat([past_key_value[1], value], dim=1)
98
+ past_key_value = (key, value)
99
+ if attn_bias is not None:
100
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
101
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
102
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
103
+ if attn_bias is not None:
104
+ raise NotImplementedError(f'attn_bias not implemented for flash attn.')
105
+ (batch_size, seqlen) = query.shape[:2]
106
+ if key_padding_mask is None:
107
+ key_padding_mask = torch.ones_like(key[:, :, 0], dtype=torch.bool)
108
+ query_padding_mask = key_padding_mask[:, -query.size(1):]
109
+ (query_unpad, indices_q, cu_seqlens_q, max_seqlen_q) = bert_padding.unpad_input(query, query_padding_mask)
110
+ query_unpad = rearrange(query_unpad, 'nnz (h d) -> nnz h d', h=n_heads)
111
+ (key_unpad, _, cu_seqlens_k, max_seqlen_k) = bert_padding.unpad_input(key, key_padding_mask)
112
+ key_unpad = rearrange(key_unpad, 'nnz (h d) -> nnz h d', h=kv_n_heads)
113
+ (value_unpad, _, _, _) = bert_padding.unpad_input(value, key_padding_mask)
114
+ value_unpad = rearrange(value_unpad, 'nnz (h d) -> nnz h d', h=kv_n_heads)
115
+ if kv_n_heads == 1:
116
+ key_unpad = key_unpad.expand(key_unpad.size(0), n_heads, key_unpad.size(-1))
117
+ value_unpad = value_unpad.expand(value_unpad.size(0), n_heads, value_unpad.size(-1))
118
+ elif kv_n_heads < n_heads:
119
+ key_unpad = key_unpad.repeat_interleave(n_heads // kv_n_heads, dim=1)
120
+ value_unpad = value_unpad.repeat_interleave(n_heads // kv_n_heads, dim=1)
121
+ dropout_p = dropout_p if training else 0.0
122
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
123
+ output_unpad = flash_attn_interface.flash_attn_unpadded_func(query_unpad, key_unpad, value_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, dropout_p, softmax_scale=softmax_scale, causal=reset_is_causal, return_attn_probs=needs_weights)
124
+ output = bert_padding.pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'), indices_q, batch_size, seqlen)
125
+ return (output, None, past_key_value)
126
+
127
+ def triton_flash_attn_fn(query: torch.Tensor, key: torch.Tensor, value: torch.Tensor, n_heads: int, kv_n_heads: Optional[int]=None, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, softmax_scale: Optional[float]=None, attn_bias: Optional[torch.Tensor]=None, key_padding_mask: Optional[torch.Tensor]=None, is_causal: bool=False, dropout_p: float=0.0, training: bool=False, needs_weights: bool=False, multiquery: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
128
+ try:
129
+ from .flash_attn_triton import flash_attn_func
130
+ except:
131
+ _installed = False
132
+ if version.parse(torch.__version__) < version.parse('2.0.0'):
133
+ _installed = True
134
+ try:
135
+ from flash_attn.flash_attn_triton import flash_attn_func
136
+ except:
137
+ _installed = False
138
+ if not _installed:
139
+ raise RuntimeError('Requirements for `attn_impl: triton` not installed. Either (1) have a CUDA-compatible GPU ' + 'and `pip install .[gpu]` if installing from llm-foundry source or ' + '`pip install triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python` ' + 'if installing from pypi, or (2) use torch attn model.attn_config.attn_impl=torch (torch attn_impl will be slow). ' + 'Note: (1) requires you have CMake and PyTorch already installed.')
140
+ check_valid_inputs(query, key, value)
141
+ if multiquery:
142
+ warnings.warn(DeprecationWarning('The direct use of the multiquery arg is deprecated. Setting kv_n_heads=1 automatically. Please set kv_n_heads=1 explicitly to remove this warning.'))
143
+ kv_n_heads = 1
144
+ elif kv_n_heads is None:
145
+ warnings.warn(DeprecationWarning('Not specifying a value for the kv_n_heads arg is deprecated. Setting kv_n_heads=n_heads automatically. Please set kv_n_heads=n_heads explicitly to remove this warning.'))
146
+ kv_n_heads = n_heads
147
+ if past_key_value is not None:
148
+ if len(past_key_value) != 0:
149
+ key = torch.cat([past_key_value[0], key], dim=1)
150
+ value = torch.cat([past_key_value[1], value], dim=1)
151
+ past_key_value = (key, value)
152
+ if attn_bias is not None:
153
+ _s_q = max(0, attn_bias.size(2) - query.size(1))
154
+ _s_k = max(0, attn_bias.size(3) - key.size(1))
155
+ attn_bias = attn_bias[:, :, _s_q:, _s_k:]
156
+ if dropout_p:
157
+ raise NotImplementedError(f'Dropout not implemented for attn_impl: triton.')
158
+ dropout_p = dropout_p if training else 0.0
159
+ if needs_weights:
160
+ raise NotImplementedError(f'attn_impl: triton cannot return attn weights.')
161
+ if key_padding_mask is not None:
162
+ warnings.warn('Propagating key_padding_mask to the attention module ' + 'and applying it within the attention module can cause ' + 'unnecessary computation/memory usage. Consider integrating ' + 'into attn_bias once and passing that to each attention ' + 'module instead.')
163
+ (b_size, s_k) = key_padding_mask.shape[:2]
164
+ if attn_bias is None:
165
+ attn_bias = query.new_zeros(b_size, 1, 1, s_k)
166
+ attn_bias = attn_bias.masked_fill(~key_padding_mask.view((b_size, 1, 1, s_k)), torch.finfo(query.dtype).min)
167
+ query = rearrange(query, 'b s (h d) -> b s h d', h=n_heads)
168
+ key = rearrange(key, 'b s (h d) -> b s h d', h=kv_n_heads)
169
+ value = rearrange(value, 'b s (h d) -> b s h d', h=kv_n_heads)
170
+ if kv_n_heads == 1:
171
+ key = key.repeat(1, 1, n_heads, 1)
172
+ value = value.repeat(1, 1, n_heads, 1)
173
+ elif kv_n_heads < n_heads:
174
+ key = key.repeat_interleave(n_heads // kv_n_heads, dim=2)
175
+ value = value.repeat_interleave(n_heads // kv_n_heads, dim=2)
176
+ reset_is_causal = _reset_is_causal(query.size(1), key.size(1), is_causal)
177
+ attn_output = flash_attn_func(query, key, value, attn_bias, reset_is_causal, softmax_scale)
178
+ output = attn_output.view(*attn_output.shape[:2], -1)
179
+ return (output, None, past_key_value)
180
+
181
+ class GroupedQueryAttention(nn.Module):
182
+ """Grouped Query Attention (GQA) is a generalization of Multi-head (MHA).
183
+
184
+ and Multi-query attention (MQA).
185
+
186
+ This allows the user to set a variable of number of kv_n_heads, rather than
187
+ just n_heads or 1, as in MHA and MQA. Using torch or triton attention
188
+ implementation enables user to also use additive bias.
189
+ """
190
+
191
+ def __init__(self, d_model: int, n_heads: int, kv_n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None):
192
+ super().__init__()
193
+ self.attn_impl = attn_impl
194
+ self.clip_qkv = clip_qkv
195
+ self.qk_ln = qk_ln
196
+ self.d_model = d_model
197
+ self.n_heads = n_heads
198
+ self.kv_n_heads = kv_n_heads
199
+ self.head_dim = d_model // n_heads
200
+ if self.kv_n_heads <= 0:
201
+ raise ValueError('kv_n_heads should be greater than zero.')
202
+ if self.kv_n_heads > self.n_heads:
203
+ raise ValueError('The number of KV heads should be less than or equal to Q heads.')
204
+ if self.n_heads % self.kv_n_heads != 0:
205
+ raise ValueError('Each Q head should get the same number of KV heads, so n_heads must be divisible by kv_n_heads.')
206
+ self.softmax_scale = softmax_scale
207
+ if self.softmax_scale is None:
208
+ self.softmax_scale = 1 / math.sqrt(self.d_model / self.n_heads)
209
+ self.attn_dropout_p = attn_pdrop
210
+ fc_kwargs = {}
211
+ if fc_type != 'te':
212
+ fc_kwargs['device'] = device
213
+ self.Wqkv = FC_CLASS_REGISTRY[fc_type](self.d_model, self.d_model + 2 * self.kv_n_heads * self.head_dim, **fc_kwargs)
214
+ fuse_splits = [i * self.head_dim for i in range(1, self.n_heads + 2 * self.kv_n_heads)]
215
+ self.Wqkv._fused = (0, fuse_splits)
216
+ if self.qk_ln:
217
+ norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
218
+ self.q_ln = norm_class(self.d_model, device=device)
219
+ self.k_ln = norm_class(self.kv_n_heads * self.head_dim, device=device)
220
+ if self.attn_impl == 'flash':
221
+ self.attn_fn = flash_attn_fn
222
+ elif self.attn_impl == 'triton':
223
+ self.attn_fn = triton_flash_attn_fn
224
+ elif self.attn_impl == 'torch':
225
+ self.attn_fn = scaled_multihead_dot_product_attention
226
+ else:
227
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
228
+ self.out_proj = FC_CLASS_REGISTRY[fc_type](self.d_model, self.d_model, **fc_kwargs)
229
+ self.out_proj._is_residual = True
230
+
231
+ def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, is_causal: bool=True, needs_weights: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
232
+ qkv = self.Wqkv(x)
233
+ if self.clip_qkv:
234
+ qkv = qkv.clamp(min=-self.clip_qkv, max=self.clip_qkv)
235
+ (query, key, value) = qkv.split([self.d_model, self.kv_n_heads * self.head_dim, self.kv_n_heads * self.head_dim], dim=2)
236
+ key_padding_mask = attention_mask
237
+ if self.qk_ln:
238
+ dtype = query.dtype
239
+ query = self.q_ln(query).to(dtype)
240
+ key = self.k_ln(key).to(dtype)
241
+ (context, attn_weights, past_key_value) = self.attn_fn(query, key, value, self.n_heads, self.kv_n_heads, past_key_value=past_key_value, softmax_scale=self.softmax_scale, attn_bias=attn_bias, key_padding_mask=key_padding_mask, is_causal=is_causal, dropout_p=self.attn_dropout_p, training=self.training, needs_weights=needs_weights)
242
+ return (self.out_proj(context), attn_weights, past_key_value)
243
+
244
+ class MultiheadAttention(GroupedQueryAttention):
245
+ """Multi-head self attention.
246
+
247
+ Using torch or triton attention implementation enables user to also use
248
+ additive bias.
249
+ """
250
+
251
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None):
252
+ super().__init__(d_model=d_model, n_heads=n_heads, kv_n_heads=n_heads, attn_impl=attn_impl, clip_qkv=clip_qkv, qk_ln=qk_ln, softmax_scale=softmax_scale, attn_pdrop=attn_pdrop, norm_type=norm_type, fc_type=fc_type, device=device)
253
+
254
+ class MultiQueryAttention(GroupedQueryAttention):
255
+ """Multi-Query self attention.
256
+
257
+ Using torch or triton attention implementation enables user to also use
258
+ additive bias.
259
+ """
260
+
261
+ def __init__(self, d_model: int, n_heads: int, attn_impl: str='triton', clip_qkv: Optional[float]=None, qk_ln: bool=False, softmax_scale: Optional[float]=None, attn_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None):
262
+ super().__init__(d_model=d_model, n_heads=n_heads, kv_n_heads=1, attn_impl=attn_impl, clip_qkv=clip_qkv, qk_ln=qk_ln, softmax_scale=softmax_scale, attn_pdrop=attn_pdrop, norm_type=norm_type, fc_type=fc_type, device=device)
263
+
264
+ def attn_bias_shape(attn_impl: str, n_heads: int, seq_len: int, alibi: bool, prefix_lm: bool, causal: bool, use_sequence_id: bool) -> Optional[Tuple[int, int, int, int]]:
265
+ if attn_impl == 'flash':
266
+ return None
267
+ elif attn_impl in ['torch', 'triton']:
268
+ if alibi:
269
+ if (prefix_lm or not causal) or use_sequence_id:
270
+ return (1, n_heads, seq_len, seq_len)
271
+ return (1, n_heads, 1, seq_len)
272
+ elif prefix_lm or use_sequence_id:
273
+ return (1, 1, seq_len, seq_len)
274
+ return None
275
+ else:
276
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
277
+
278
+ def build_attn_bias(attn_impl: str, attn_bias: torch.Tensor, n_heads: int, seq_len: int, causal: bool=False, alibi: bool=False, alibi_bias_max: int=8) -> Optional[torch.Tensor]:
279
+ if attn_impl == 'flash':
280
+ return None
281
+ elif attn_impl in ['torch', 'triton']:
282
+ if alibi:
283
+ (device, dtype) = (attn_bias.device, attn_bias.dtype)
284
+ attn_bias = attn_bias.add(build_alibi_bias(n_heads, seq_len, full=not causal, alibi_bias_max=alibi_bias_max, device=device, dtype=dtype))
285
+ return attn_bias
286
+ else:
287
+ raise ValueError(f'attn_impl={attn_impl!r} is an invalid setting.')
288
+
289
+ def gen_slopes(n_heads: int, alibi_bias_max: int=8, device: Optional[torch.device]=None) -> torch.Tensor:
290
+ _n_heads = 2 ** math.ceil(math.log2(n_heads))
291
+ m = torch.arange(1, _n_heads + 1, dtype=torch.float32, device=device)
292
+ m = m.mul(alibi_bias_max / _n_heads)
293
+ slopes = 1.0 / torch.pow(2, m)
294
+ if _n_heads != n_heads:
295
+ slopes = torch.concat([slopes[1::2], slopes[::2]])[:n_heads]
296
+ return slopes.view(1, n_heads, 1, 1)
297
+
298
+ def build_alibi_bias(n_heads: int, seq_len: int, full: bool=False, alibi_bias_max: int=8, device: Optional[torch.device]=None, dtype: Optional[torch.dtype]=None) -> torch.Tensor:
299
+ alibi_bias = torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, 1, seq_len)
300
+ if full:
301
+ alibi_bias = alibi_bias - torch.arange(1 - seq_len, 1, dtype=torch.int32, device=device).view(1, 1, seq_len, 1)
302
+ alibi_bias = alibi_bias.abs().mul(-1)
303
+ slopes = gen_slopes(n_heads, alibi_bias_max, device=device)
304
+ alibi_bias = alibi_bias * slopes
305
+ return alibi_bias.to(dtype=dtype)
306
+ ATTN_CLASS_REGISTRY = {'multihead_attention': MultiheadAttention, 'multiquery_attention': MultiQueryAttention, 'grouped_query_attention': GroupedQueryAttention}
bfg.jar ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:1a75e9390541f4b55d9c01256b361b815c1e0a263e2fb3d072b55c2911ead0b7
3
+ size 14483456
blocks.py ADDED
@@ -0,0 +1,41 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """GPT Blocks used for the GPT Model."""
2
+ from typing import Any, Dict, Optional, Tuple
3
+ import torch
4
+ import torch.nn as nn
5
+ from .attention import ATTN_CLASS_REGISTRY
6
+ from .ffn import FFN_CLASS_REGISTRY, build_ffn
7
+ from .norm import NORM_CLASS_REGISTRY
8
+
9
+ class MPTBlock(nn.Module):
10
+
11
+ def __init__(self, d_model: int, n_heads: int, expansion_ratio: int, attn_config: Optional[Dict]=None, ffn_config: Optional[Dict]=None, resid_pdrop: float=0.0, norm_type: str='low_precision_layernorm', fc_type: str='torch', device: Optional[str]=None, **kwargs: Any):
12
+ if attn_config is None:
13
+ attn_config = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
14
+ if ffn_config is None:
15
+ ffn_config = {'ffn_type': 'mptmlp'}
16
+ del kwargs
17
+ super().__init__()
18
+ norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
19
+ assert isinstance(attn_config['attn_type'], str)
20
+ attn_class = ATTN_CLASS_REGISTRY[attn_config['attn_type']]
21
+ args_to_exclude_in_attn_class = {'attn_type', 'prefix_lm', 'alibi', 'attn_uses_sequence_id', 'alibi_bias_max'}
22
+ attn_config_subset_for_attn_class = {k: v for (k, v) in attn_config.items() if k not in args_to_exclude_in_attn_class}
23
+ self.norm_1 = norm_class(d_model, device=device)
24
+ self.attn = attn_class(d_model=d_model, n_heads=n_heads, fc_type=fc_type, device=device, **attn_config_subset_for_attn_class)
25
+ self.norm_2 = None
26
+ if not getattr(FFN_CLASS_REGISTRY[ffn_config['ffn_type']], '_has_norm', False):
27
+ self.norm_2 = norm_class(d_model, device=device)
28
+ self.ffn = build_ffn(d_model=d_model, expansion_ratio=expansion_ratio, device=device, **ffn_config)
29
+ self.resid_attn_dropout = nn.Dropout(resid_pdrop)
30
+ self.resid_ffn_dropout = nn.Dropout(resid_pdrop)
31
+
32
+ def forward(self, x: torch.Tensor, past_key_value: Optional[Tuple[torch.Tensor, torch.Tensor]]=None, attn_bias: Optional[torch.Tensor]=None, attention_mask: Optional[torch.ByteTensor]=None, is_causal: bool=True, output_attentions: bool=False) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor, torch.Tensor]]]:
33
+ a = self.norm_1(x)
34
+ (b, attn_weights, past_key_value) = self.attn(a, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=is_causal, needs_weights=output_attentions)
35
+ x = x + self.resid_attn_dropout(b)
36
+ m = x
37
+ if self.norm_2 is not None:
38
+ m = self.norm_2(x)
39
+ n = self.ffn(m)
40
+ x = x + self.resid_ffn_dropout(n)
41
+ return (x, attn_weights, past_key_value)
config.json ADDED
@@ -0,0 +1,55 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "architectures": [
3
+ "MPTForCausalLM"
4
+ ],
5
+ "attn_config": {
6
+ "alibi": true,
7
+ "alibi_bias_max": 8,
8
+ "attn_impl": "torch",
9
+ "attn_pdrop": 0.0,
10
+ "attn_type": "grouped_query_attention",
11
+ "attn_uses_sequence_id": false,
12
+ "clip_qkv": null,
13
+ "kv_n_heads": 8,
14
+ "prefix_lm": false,
15
+ "qk_ln": false,
16
+ "softmax_scale": null
17
+ },
18
+ "auto_map": {
19
+ "AutoConfig": "configuration_mpt.MPTConfig",
20
+ "AutoModelForCausalLM": "modeling_mpt.MPTForCausalLM"
21
+ },
22
+ "d_model": 3072,
23
+ "emb_pdrop": 0.0,
24
+ "embedding_fraction": 1.0,
25
+ "expansion_ratio": 4,
26
+ "fc_type": "torch",
27
+ "ffn_config": {
28
+ "fc_type": "torch",
29
+ "ffn_type": "mptmlp"
30
+ },
31
+ "init_config": {
32
+ "emb_init_std": null,
33
+ "emb_init_uniform_lim": null,
34
+ "fan_mode": "fan_in",
35
+ "init_div_is_residual": true,
36
+ "init_gain": 0.0,
37
+ "init_nonlinearity": "relu",
38
+ "init_std": null,
39
+ "name": "kaiming_normal_"
40
+ },
41
+ "init_device": "cpu",
42
+ "learned_pos_emb": false,
43
+ "logit_scale": null,
44
+ "max_seq_len": 4096,
45
+ "model_type": "mpt",
46
+ "n_heads": 24,
47
+ "n_layers": 32,
48
+ "no_bias": true,
49
+ "norm_type": "low_precision_layernorm",
50
+ "resid_pdrop": 0.0,
51
+ "torch_dtype": "bfloat16",
52
+ "transformers_version": "4.33.3",
53
+ "use_cache": false,
54
+ "vocab_size": 32768
55
+ }
configuration_mpt.py ADDED
@@ -0,0 +1,140 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """A HuggingFace-style model configuration."""
2
+ import warnings
3
+ from typing import Any, Dict, Optional, Union
4
+ from transformers import PretrainedConfig
5
+ attn_config_defaults: Dict = {'attn_type': 'multihead_attention', 'attn_pdrop': 0.0, 'attn_impl': 'triton', 'qk_ln': False, 'clip_qkv': None, 'softmax_scale': None, 'prefix_lm': False, 'attn_uses_sequence_id': False, 'alibi': False, 'alibi_bias_max': 8}
6
+ ffn_config_defaults: Dict = {'ffn_type': 'mptmlp'}
7
+ init_config_defaults: Dict = {'name': 'kaiming_normal_', 'fan_mode': 'fan_in', 'init_nonlinearity': 'relu', 'init_div_is_residual': True, 'emb_init_std': None, 'emb_init_uniform_lim': None, 'init_std': None, 'init_gain': 0.0}
8
+
9
+ class MPTConfig(PretrainedConfig):
10
+ model_type = 'mpt'
11
+
12
+ def __init__(self, d_model: int=2048, n_heads: int=16, n_layers: int=24, expansion_ratio: int=4, max_seq_len: int=2048, vocab_size: int=50368, resid_pdrop: float=0.0, emb_pdrop: float=0.0, learned_pos_emb: bool=True, attn_config: Dict=attn_config_defaults, ffn_config: Dict=ffn_config_defaults, init_device: str='cpu', logit_scale: Optional[Union[float, str]]=None, no_bias: bool=False, embedding_fraction: float=1.0, norm_type: str='low_precision_layernorm', use_cache: bool=False, init_config: Dict=init_config_defaults, fc_type: str='torch', verbose: Optional[int]=None, **kwargs: Any):
13
+ """The MPT configuration class.
14
+
15
+ Args:
16
+ d_model (int): The size of the embedding dimension of the model.
17
+ n_heads (int): The number of attention heads.
18
+ n_layers (int): The number of layers in the model.
19
+ expansion_ratio (int): The ratio of the up/down scale in the ffn.
20
+ max_seq_len (int): The maximum sequence length of the model.
21
+ vocab_size (int): The size of the vocabulary.
22
+ resid_pdrop (float): The dropout probability applied to the attention output before combining with residual.
23
+ emb_pdrop (float): The dropout probability for the embedding layer.
24
+ learned_pos_emb (bool): Whether to use learned positional embeddings
25
+ attn_config (Dict): A dictionary used to configure the model's attention module:
26
+ attn_type (str): type of attention to use. Options: multihead_attention, multiquery_attention, grouped_query_attention
27
+ attn_pdrop (float): The dropout probability for the attention layers.
28
+ attn_impl (str): The attention implementation to use. One of 'torch', 'flash', or 'triton'.
29
+ qk_ln (bool): Whether to apply layer normalization to the queries and keys in the attention layer.
30
+ clip_qkv (Optional[float]): If not None, clip the queries, keys, and values in the attention layer to
31
+ this value.
32
+ softmax_scale (Optional[float]): If not None, scale the softmax in the attention layer by this value. If None,
33
+ use the default scale of ``1/sqrt(d_keys)``.
34
+ prefix_lm (Optional[bool]): Whether the model should operate as a Prefix LM. This requires passing an
35
+ extra `prefix_mask` argument which indicates which tokens belong to the prefix. Tokens in the prefix
36
+ can attend to one another bi-directionally. Tokens outside the prefix use causal attention.
37
+ attn_uses_sequence_id (Optional[bool]): Whether to restrict attention to tokens that have the same sequence_id.
38
+ When the model is in `train` mode, this requires passing an extra `sequence_id` argument which indicates
39
+ which sub-sequence each token belongs to.
40
+ Defaults to ``False`` meaning any provided `sequence_id` will be ignored.
41
+ alibi (bool): Whether to use the alibi bias instead of position embeddings.
42
+ alibi_bias_max (int): The maximum value of the alibi bias.
43
+ kv_n_heads (Optional[int]): For grouped_query_attention only, allow user to specify number of kv heads.
44
+ ffn_config (Dict): A dictionary used to configure the model's ffn module:
45
+ ffn_type (str): type of ffn to use. Options: mptmlp, te_ln_mlp
46
+ init_device (str): The device to use for parameter initialization.
47
+ logit_scale (Optional[Union[float, str]]): If not None, scale the logits by this value.
48
+ no_bias (bool): Whether to use bias in all layers.
49
+ verbose (int): The verbosity level. 0 is silent.
50
+ embedding_fraction (float): The fraction to scale the gradients of the embedding layer by.
51
+ norm_type (str): choose type of norm to use
52
+ use_cache (bool): Whether or not the model should return the last key/values attentions
53
+ init_config (Dict): A dictionary used to configure the model initialization:
54
+ init_config.name: The parameter initialization scheme to use. Options: 'default_', 'baseline_',
55
+ 'kaiming_uniform_', 'kaiming_normal_', 'neox_init_', 'small_init_', 'xavier_uniform_', or
56
+ 'xavier_normal_'. These mimic the parameter initialization methods in PyTorch.
57
+ init_div_is_residual (Union[int, float, str, bool]): Value to divide initial weights by if ``module._is_residual`` is True.
58
+ emb_init_std (Optional[float]): The standard deviation of the normal distribution used to initialize the embedding layer.
59
+ emb_init_uniform_lim (Optional[Union[Tuple[float, float], float]]): The lower and upper limits of the uniform distribution
60
+ used to initialize the embedding layer. Mutually exclusive with ``emb_init_std``.
61
+ init_std (float): The standard deviation of the normal distribution used to initialize the model,
62
+ if using the baseline_ parameter initialization scheme.
63
+ init_gain (float): The gain to use for parameter initialization with kaiming or xavier initialization schemes.
64
+ fan_mode (str): The fan mode to use for parameter initialization with kaiming initialization schemes.
65
+ init_nonlinearity (str): The nonlinearity to use for parameter initialization with kaiming initialization schemes.
66
+ ---
67
+ See llmfoundry.models.utils.param_init_fns.py for info on other param init config options
68
+ fc_type (str): choose fc layer implementation. Options: torch and te. te layers support fp8 when using H100 GPUs.
69
+ """
70
+ self.d_model = d_model
71
+ self.n_heads = n_heads
72
+ self.n_layers = n_layers
73
+ self.expansion_ratio = expansion_ratio
74
+ self.max_seq_len = max_seq_len
75
+ self.vocab_size = vocab_size
76
+ self.resid_pdrop = resid_pdrop
77
+ self.emb_pdrop = emb_pdrop
78
+ self.learned_pos_emb = learned_pos_emb
79
+ self.attn_config = attn_config
80
+ self.ffn_config = ffn_config
81
+ self.init_device = init_device
82
+ self.logit_scale = logit_scale
83
+ self.no_bias = no_bias
84
+ self.embedding_fraction = embedding_fraction
85
+ self.norm_type = norm_type
86
+ self.use_cache = use_cache
87
+ self.init_config = init_config
88
+ self.fc_type = fc_type
89
+ if verbose is not None:
90
+ warnings.warn(DeprecationWarning('verbose argument for MPTConfig is now ignored and will be removed. Use python_log_level instead.'))
91
+ if 'name' in kwargs:
92
+ del kwargs['name']
93
+ if 'loss_fn' in kwargs:
94
+ del kwargs['loss_fn']
95
+ if self.attn_config.get('alibi', False):
96
+ self.learned_pos_emb = False
97
+ warnings.warn(f'alibi is turned on, setting `learned_pos_emb` to `False.`')
98
+ super().__init__(**kwargs)
99
+ self._validate_config()
100
+
101
+ def _set_config_defaults(self, config: Dict[str, Any], config_defaults: Dict[str, Any]) -> Dict[str, Any]:
102
+ for (k, v) in config_defaults.items():
103
+ if k not in config:
104
+ config[k] = v
105
+ return config
106
+
107
+ def _validate_config(self) -> None:
108
+ self.attn_config = self._set_config_defaults(self.attn_config, attn_config_defaults)
109
+ self.ffn_config = self._set_config_defaults(self.ffn_config, ffn_config_defaults)
110
+ self.init_config = self._set_config_defaults(self.init_config, init_config_defaults)
111
+ if self.d_model % self.n_heads != 0:
112
+ raise ValueError('d_model must be divisible by n_heads')
113
+ if any((prob < 0 or prob > 1 for prob in [self.attn_config['attn_pdrop'], self.resid_pdrop, self.emb_pdrop])):
114
+ raise ValueError("self.attn_config['attn_pdrop'], resid_pdrop, emb_pdrop are probabilities and must be between 0 and 1")
115
+ if self.attn_config['attn_impl'] not in ['torch', 'flash', 'triton']:
116
+ raise ValueError(f"Unknown attn_impl={self.attn_config['attn_impl']}")
117
+ if self.attn_config['prefix_lm'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
118
+ raise NotImplementedError('prefix_lm only implemented with torch and triton attention.')
119
+ if self.attn_config['alibi'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
120
+ raise NotImplementedError('alibi only implemented with torch and triton attention.')
121
+ if self.attn_config['attn_uses_sequence_id'] and self.attn_config['attn_impl'] not in ['torch', 'triton']:
122
+ raise NotImplementedError('attn_uses_sequence_id only implemented with torch and triton attention.')
123
+ if self.embedding_fraction > 1 or self.embedding_fraction <= 0:
124
+ raise ValueError('model.embedding_fraction must be between 0 (exclusive) and 1 (inclusive)!')
125
+ if isinstance(self.logit_scale, str) and self.logit_scale != 'inv_sqrt_d_model':
126
+ raise ValueError(f"self.logit_scale={self.logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
127
+ if self.init_config.get('name', None) is None:
128
+ raise ValueError(f"self.init_config={self.init_config!r} 'name' needs to be set.")
129
+ if not self.learned_pos_emb and (not self.attn_config['alibi']):
130
+ warnings.warn(f'Positional information not being provided to the model using either learned_pos_emb or alibi.')
131
+ if self.fc_type == 'te' or self.ffn_config['ffn_type'] == 'te_ln_mlp':
132
+ try:
133
+ import transformer_engine.pytorch as te
134
+ del te
135
+ except:
136
+ raise ImportError('TransformerEngine import fail. `fc_type: te` requires TransformerEngine be installed. ' + 'The required version of transformer_engine also requires FlashAttention v1.0.6 is installed:\n' + 'pip install flash-attn==1.0.6 --no-build-isolation \n' + 'pip install git+https://github.com/NVIDIA/TransformerEngine.git@144e4888b2cdd60bd52e706d5b7a79cb9c1a7156')
137
+ if self.ffn_config['ffn_type'] == 'mptmlp':
138
+ self.ffn_config['fc_type'] = self.fc_type
139
+ elif self.ffn_config['ffn_type'] == 'te_ln_mlp':
140
+ self.ffn_config['bias'] = not self.no_bias
custom_embedding.py ADDED
@@ -0,0 +1,10 @@
 
 
 
 
 
 
 
 
 
 
 
1
+ import torch.nn as nn
2
+ import torch.nn.functional as F
3
+ from torch import Tensor
4
+
5
+ class SharedEmbedding(nn.Embedding):
6
+
7
+ def forward(self, input: Tensor, unembed: bool=False) -> Tensor:
8
+ if unembed:
9
+ return F.linear(input, self.weight)
10
+ return super().forward(input)
fc.py ADDED
@@ -0,0 +1,7 @@
 
 
 
 
 
 
 
 
1
+ from torch import nn
2
+ FC_CLASS_REGISTRY = {'torch': nn.Linear}
3
+ try:
4
+ import transformer_engine.pytorch as te
5
+ FC_CLASS_REGISTRY['te'] = te.Linear
6
+ except:
7
+ pass
ffn.py ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """GPT Blocks used for the GPT Model."""
2
+ from typing import Any, Optional
3
+ import torch
4
+ import torch.nn as nn
5
+ from .fc import FC_CLASS_REGISTRY
6
+ try:
7
+ import transformer_engine.pytorch as te
8
+ except:
9
+ te = None
10
+
11
+ class MPTMLP(nn.Module):
12
+
13
+ def __init__(self, d_model: int, expansion_ratio: int, fc_type: str='torch', device: Optional[str]=None):
14
+ super().__init__()
15
+ fc_kwargs = {}
16
+ if fc_type != 'te':
17
+ fc_kwargs['device'] = device
18
+ self.up_proj = FC_CLASS_REGISTRY[fc_type](d_model, expansion_ratio * d_model, **fc_kwargs)
19
+ self.act = nn.GELU(approximate='none')
20
+ self.down_proj = FC_CLASS_REGISTRY[fc_type](expansion_ratio * d_model, d_model, **fc_kwargs)
21
+ self.down_proj._is_residual = True
22
+
23
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
24
+ return self.down_proj(self.act(self.up_proj(x)))
25
+ FFN_CLASS_REGISTRY = {'mptmlp': MPTMLP}
26
+ if te is not None:
27
+ te.LayerNormMLP._has_norm = True
28
+ FFN_CLASS_REGISTRY['te_ln_mlp'] = te.LayerNormMLP
29
+
30
+ def build_ffn(d_model: int, expansion_ratio: int, fc_type: str='torch', device: Optional[str]=None, **kwargs: Any) -> nn.Module:
31
+ ffn_type = kwargs.pop('ffn_type')
32
+ if ffn_type == 'mptmlp':
33
+ if len(kwargs) > 0:
34
+ raise ValueError(f'MPTMLP got an unexpected keyword argument: {kwargs}')
35
+ return MPTMLP(d_model=d_model, expansion_ratio=expansion_ratio, fc_type=fc_type, device=device)
36
+ elif ffn_type == 'te_ln_mlp':
37
+ assert te is not None
38
+ return te.LayerNormMLP(hidden_size=d_model, ffn_hidden_size=d_model * expansion_ratio, **kwargs)
39
+ raise ValueError(f'ffn_type={ffn_type!r} not recognized.')
flash_attn_triton.py ADDED
@@ -0,0 +1,484 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Copied from https://github.com/HazyResearch/flash-attention/blob/eff9fe6b8076df59d64d7a3f464696738a3c7c24/flash_attn/flash_attn_triton.py
3
+ update imports to use 'triton_pre_mlir'
4
+
5
+ *Experimental* implementation of FlashAttention in Triton.
6
+ Tested with triton==2.0.0.dev20221202.
7
+ Triton 2.0 has a new backend (MLIR) but seems like it doesn't yet work for head dimensions
8
+ other than 64:
9
+ https://github.com/openai/triton/blob/d376020f90002757eea3ea9475d4f7cfc2ec5ead/python/triton/ops/flash_attention.py#L207
10
+ We'll update this implementation with the new Triton backend once this is fixed.
11
+
12
+ We use the FlashAttention implementation from Phil Tillet a starting point.
13
+ https://github.com/openai/triton/blob/master/python/tutorials/06-fused-attention.py
14
+
15
+ Changes:
16
+ - Implement both causal and non-causal attention.
17
+ - Implement both self-attention and cross-attention.
18
+ - Support arbitrary seqlens (not just multiples of 128), for both forward and backward.
19
+ - Support all head dimensions up to 128 (not just 16, 32, 64, 128), for both forward and backward.
20
+ - Support attention bias.
21
+ - Speed up the forward pass a bit, and only store the LSE instead of m and l.
22
+ - Make the backward for d=128 much faster by reducing register spilling.
23
+ - Optionally parallelize the backward pass across seqlen_k, to deal with the case of
24
+ small batch size * nheads.
25
+
26
+ Caution:
27
+ - This is an *experimental* implementation. The forward pass should be quite robust but
28
+ I'm not 100% sure that the backward pass doesn't have race conditions (due to the Triton compiler).
29
+ - This implementation has only been tested on A100.
30
+ - If you plan to use headdim other than 64 and 128, you should test for race conditions
31
+ (due to the Triton compiler), as done in tests/test_flash_attn.py
32
+ "test_flash_attn_triton_race_condition". I've tested and fixed many race conditions
33
+ for different head dimensions (40, 48, 64, 128, 80, 88, 96), but I'm still not 100% confident
34
+ that there are none left for other head dimensions.
35
+
36
+ Differences between this Triton version and the CUDA version:
37
+ - Triton version doesn't support dropout.
38
+ - Triton forward is generally faster than CUDA forward, while Triton backward is
39
+ generally slower than CUDA backward. Overall Triton forward + backward is slightly slower
40
+ than CUDA forward + backward.
41
+ - Triton version doesn't support different sequence lengths in a batch (i.e., RaggedTensor/NestedTensor).
42
+ - Triton version supports attention bias, while CUDA version doesn't.
43
+ """
44
+ import math
45
+ import torch
46
+ import triton_pre_mlir as triton
47
+ import triton_pre_mlir.language as tl
48
+
49
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
50
+ @triton.jit
51
+ def _fwd_kernel(Q, K, V, Bias, Out, Lse, TMP, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_ob, stride_oh, stride_om, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
52
+ start_m = tl.program_id(0)
53
+ off_hb = tl.program_id(1)
54
+ off_b = off_hb // nheads
55
+ off_h = off_hb % nheads
56
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
57
+ offs_n = tl.arange(0, BLOCK_N)
58
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
59
+ q_ptrs = Q + off_b * stride_qb + off_h * stride_qh + (offs_m[:, None] * stride_qm + offs_d[None, :])
60
+ k_ptrs = K + off_b * stride_kb + off_h * stride_kh + (offs_n[:, None] * stride_kn + offs_d[None, :])
61
+ v_ptrs = V + off_b * stride_vb + off_h * stride_vh + (offs_n[:, None] * stride_vn + offs_d[None, :])
62
+ if BIAS_TYPE == 'vector':
63
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + offs_n
64
+ elif BIAS_TYPE == 'matrix':
65
+ b_ptrs = Bias + off_b * stride_bb + off_h * stride_bh + (offs_m[:, None] * stride_bm + offs_n[None, :])
66
+ t_ptrs = TMP + off_hb * seqlen_q_rounded + offs_m
67
+ lse_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
68
+ m_i = tl.zeros([BLOCK_M], dtype=tl.float32) - float('inf')
69
+ acc_o = tl.zeros([BLOCK_M, BLOCK_HEADDIM], dtype=tl.float32)
70
+ if EVEN_M & EVEN_N:
71
+ if EVEN_HEADDIM:
72
+ q = tl.load(q_ptrs)
73
+ else:
74
+ q = tl.load(q_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
75
+ elif EVEN_HEADDIM:
76
+ q = tl.load(q_ptrs, mask=offs_m[:, None] < seqlen_q, other=0.0)
77
+ else:
78
+ q = tl.load(q_ptrs, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
79
+ end_n = seqlen_k if not IS_CAUSAL else tl.minimum((start_m + 1) * BLOCK_M, seqlen_k)
80
+ for start_n in range(0, end_n, BLOCK_N):
81
+ start_n = tl.multiple_of(start_n, BLOCK_N)
82
+ if EVEN_N & EVEN_M:
83
+ if EVEN_HEADDIM:
84
+ k = tl.load(k_ptrs + start_n * stride_kn)
85
+ else:
86
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=offs_d[None, :] < headdim, other=0.0)
87
+ elif EVEN_HEADDIM:
88
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
89
+ else:
90
+ k = tl.load(k_ptrs + start_n * stride_kn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
91
+ qk = tl.zeros([BLOCK_M, BLOCK_N], dtype=tl.float32)
92
+ qk += tl.dot(q, k, trans_b=True)
93
+ if not EVEN_N:
94
+ qk += tl.where((start_n + offs_n)[None, :] < seqlen_k, 0, float('-inf'))
95
+ if IS_CAUSAL:
96
+ qk += tl.where(offs_m[:, None] >= (start_n + offs_n)[None, :], 0, float('-inf'))
97
+ if BIAS_TYPE != 'none':
98
+ if BIAS_TYPE == 'vector':
99
+ if EVEN_N:
100
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
101
+ else:
102
+ bias = tl.load(b_ptrs + start_n, mask=start_n + offs_n < seqlen_k, other=0.0).to(tl.float32)
103
+ bias = bias[None, :]
104
+ elif BIAS_TYPE == 'matrix':
105
+ if EVEN_M & EVEN_N:
106
+ bias = tl.load(b_ptrs + start_n).to(tl.float32)
107
+ else:
108
+ bias = tl.load(b_ptrs + start_n, mask=(offs_m[:, None] < seqlen_q) & ((start_n + offs_n)[None, :] < seqlen_k), other=0.0).to(tl.float32)
109
+ qk = qk * softmax_scale + bias
110
+ m_ij = tl.maximum(tl.max(qk, 1), lse_i)
111
+ p = tl.exp(qk - m_ij[:, None])
112
+ else:
113
+ m_ij = tl.maximum(tl.max(qk, 1) * softmax_scale, lse_i)
114
+ p = tl.exp(qk * softmax_scale - m_ij[:, None])
115
+ l_ij = tl.sum(p, 1)
116
+ acc_o_scale = tl.exp(m_i - m_ij)
117
+ tl.store(t_ptrs, acc_o_scale)
118
+ acc_o_scale = tl.load(t_ptrs)
119
+ acc_o = acc_o * acc_o_scale[:, None]
120
+ if EVEN_N & EVEN_M:
121
+ if EVEN_HEADDIM:
122
+ v = tl.load(v_ptrs + start_n * stride_vn)
123
+ else:
124
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=offs_d[None, :] < headdim, other=0.0)
125
+ elif EVEN_HEADDIM:
126
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=(start_n + offs_n)[:, None] < seqlen_k, other=0.0)
127
+ else:
128
+ v = tl.load(v_ptrs + start_n * stride_vn, mask=((start_n + offs_n)[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
129
+ p = p.to(v.dtype)
130
+ acc_o += tl.dot(p, v)
131
+ m_i = m_ij
132
+ l_i_new = tl.exp(lse_i - m_ij) + l_ij
133
+ lse_i = m_ij + tl.log(l_i_new)
134
+ o_scale = tl.exp(m_i - lse_i)
135
+ tl.store(t_ptrs, o_scale)
136
+ o_scale = tl.load(t_ptrs)
137
+ acc_o = acc_o * o_scale[:, None]
138
+ start_m = tl.program_id(0)
139
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
140
+ lse_ptrs = Lse + off_hb * seqlen_q_rounded + offs_m
141
+ tl.store(lse_ptrs, lse_i)
142
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
143
+ out_ptrs = Out + off_b * stride_ob + off_h * stride_oh + (offs_m[:, None] * stride_om + offs_d[None, :])
144
+ if EVEN_M:
145
+ if EVEN_HEADDIM:
146
+ tl.store(out_ptrs, acc_o)
147
+ else:
148
+ tl.store(out_ptrs, acc_o, mask=offs_d[None, :] < headdim)
149
+ elif EVEN_HEADDIM:
150
+ tl.store(out_ptrs, acc_o, mask=offs_m[:, None] < seqlen_q)
151
+ else:
152
+ tl.store(out_ptrs, acc_o, mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
153
+
154
+ @triton.jit
155
+ def _bwd_preprocess_do_o_dot(Out, DO, Delta, stride_ob, stride_oh, stride_om, stride_dob, stride_doh, stride_dom, nheads, seqlen_q, seqlen_q_rounded, headdim, BLOCK_M: tl.constexpr, BLOCK_HEADDIM: tl.constexpr):
156
+ start_m = tl.program_id(0)
157
+ off_hb = tl.program_id(1)
158
+ off_b = off_hb // nheads
159
+ off_h = off_hb % nheads
160
+ offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
161
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
162
+ o = tl.load(Out + off_b * stride_ob + off_h * stride_oh + offs_m[:, None] * stride_om + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
163
+ do = tl.load(DO + off_b * stride_dob + off_h * stride_doh + offs_m[:, None] * stride_dom + offs_d[None, :], mask=(offs_m[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0).to(tl.float32)
164
+ delta = tl.sum(o * do, axis=1)
165
+ tl.store(Delta + off_hb * seqlen_q_rounded + offs_m, delta)
166
+
167
+ @triton.jit
168
+ def _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr):
169
+ if EVEN_N & EVEN_M:
170
+ if EVEN_HEADDIM:
171
+ tl.store(dv_ptrs, dv)
172
+ tl.store(dk_ptrs, dk)
173
+ else:
174
+ tl.store(dv_ptrs, dv, mask=offs_d[None, :] < headdim)
175
+ tl.store(dk_ptrs, dk, mask=offs_d[None, :] < headdim)
176
+ elif EVEN_HEADDIM:
177
+ tl.store(dv_ptrs, dv, mask=offs_n[:, None] < seqlen_k)
178
+ tl.store(dk_ptrs, dk, mask=offs_n[:, None] < seqlen_k)
179
+ else:
180
+ tl.store(dv_ptrs, dv, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
181
+ tl.store(dk_ptrs, dk, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim))
182
+
183
+ @triton.jit
184
+ def _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD: tl.constexpr, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
185
+ begin_m = 0 if not IS_CAUSAL else start_n * BLOCK_N // BLOCK_M * BLOCK_M
186
+ offs_qm = begin_m + tl.arange(0, BLOCK_M)
187
+ offs_n = start_n * BLOCK_N + tl.arange(0, BLOCK_N)
188
+ offs_m = tl.arange(0, BLOCK_M)
189
+ offs_d = tl.arange(0, BLOCK_HEADDIM)
190
+ q_ptrs = Q + (offs_qm[:, None] * stride_qm + offs_d[None, :])
191
+ k_ptrs = K + (offs_n[:, None] * stride_kn + offs_d[None, :])
192
+ v_ptrs = V + (offs_n[:, None] * stride_vn + offs_d[None, :])
193
+ do_ptrs = DO + (offs_qm[:, None] * stride_dom + offs_d[None, :])
194
+ dq_ptrs = DQ + (offs_qm[:, None] * stride_dqm + offs_d[None, :])
195
+ if BIAS_TYPE == 'vector':
196
+ b_ptrs = Bias + offs_n
197
+ elif BIAS_TYPE == 'matrix':
198
+ b_ptrs = Bias + (offs_qm[:, None] * stride_bm + offs_n[None, :])
199
+ dv = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
200
+ dk = tl.zeros([BLOCK_N, BLOCK_HEADDIM], dtype=tl.float32)
201
+ if begin_m >= seqlen_q:
202
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
203
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
204
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
205
+ return
206
+ if EVEN_N & EVEN_M:
207
+ if EVEN_HEADDIM:
208
+ k = tl.load(k_ptrs)
209
+ v = tl.load(v_ptrs)
210
+ else:
211
+ k = tl.load(k_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
212
+ v = tl.load(v_ptrs, mask=offs_d[None, :] < headdim, other=0.0)
213
+ elif EVEN_HEADDIM:
214
+ k = tl.load(k_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
215
+ v = tl.load(v_ptrs, mask=offs_n[:, None] < seqlen_k, other=0.0)
216
+ else:
217
+ k = tl.load(k_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
218
+ v = tl.load(v_ptrs, mask=(offs_n[:, None] < seqlen_k) & (offs_d[None, :] < headdim), other=0.0)
219
+ num_block_m = tl.cdiv(seqlen_q, BLOCK_M)
220
+ for start_m in range(begin_m, num_block_m * BLOCK_M, BLOCK_M):
221
+ start_m = tl.multiple_of(start_m, BLOCK_M)
222
+ offs_m_curr = start_m + offs_m
223
+ if EVEN_M & EVEN_HEADDIM:
224
+ q = tl.load(q_ptrs)
225
+ elif EVEN_HEADDIM:
226
+ q = tl.load(q_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0)
227
+ else:
228
+ q = tl.load(q_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
229
+ qk = tl.dot(q, k, trans_b=True)
230
+ if not EVEN_N:
231
+ qk = tl.where(offs_n[None, :] < seqlen_k, qk, float('-inf'))
232
+ if IS_CAUSAL:
233
+ qk = tl.where(offs_m_curr[:, None] >= offs_n[None, :], qk, float('-inf'))
234
+ if BIAS_TYPE != 'none':
235
+ tl.debug_barrier()
236
+ if BIAS_TYPE == 'vector':
237
+ if EVEN_N:
238
+ bias = tl.load(b_ptrs).to(tl.float32)
239
+ else:
240
+ bias = tl.load(b_ptrs, mask=offs_n < seqlen_k, other=0.0).to(tl.float32)
241
+ bias = bias[None, :]
242
+ elif BIAS_TYPE == 'matrix':
243
+ if EVEN_M & EVEN_N:
244
+ bias = tl.load(b_ptrs).to(tl.float32)
245
+ else:
246
+ bias = tl.load(b_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_n[None, :] < seqlen_k), other=0.0).to(tl.float32)
247
+ qk = qk * softmax_scale + bias
248
+ if not EVEN_M & EVEN_HEADDIM:
249
+ tl.debug_barrier()
250
+ lse_i = tl.load(LSE + offs_m_curr)
251
+ if BIAS_TYPE == 'none':
252
+ p = tl.exp(qk * softmax_scale - lse_i[:, None])
253
+ else:
254
+ p = tl.exp(qk - lse_i[:, None])
255
+ if EVEN_M & EVEN_HEADDIM:
256
+ do = tl.load(do_ptrs)
257
+ else:
258
+ do = tl.load(do_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0)
259
+ dv += tl.dot(p.to(do.dtype), do, trans_a=True)
260
+ if not EVEN_M & EVEN_HEADDIM:
261
+ tl.debug_barrier()
262
+ dp = tl.dot(do, v, trans_b=True)
263
+ if not EVEN_HEADDIM:
264
+ tl.debug_barrier()
265
+ Di = tl.load(D + offs_m_curr)
266
+ ds = (p * (dp - Di[:, None]) * softmax_scale).to(q.dtype)
267
+ dk += tl.dot(ds, q, trans_a=True)
268
+ if not EVEN_M & EVEN_HEADDIM:
269
+ tl.debug_barrier()
270
+ if not ATOMIC_ADD:
271
+ if EVEN_M & EVEN_HEADDIM:
272
+ dq = tl.load(dq_ptrs, eviction_policy='evict_last')
273
+ dq += tl.dot(ds, k)
274
+ tl.store(dq_ptrs, dq, eviction_policy='evict_last')
275
+ elif EVEN_HEADDIM:
276
+ dq = tl.load(dq_ptrs, mask=offs_m_curr[:, None] < seqlen_q, other=0.0, eviction_policy='evict_last')
277
+ dq += tl.dot(ds, k)
278
+ tl.store(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q, eviction_policy='evict_last')
279
+ else:
280
+ dq = tl.load(dq_ptrs, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), other=0.0, eviction_policy='evict_last')
281
+ dq += tl.dot(ds, k)
282
+ tl.store(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim), eviction_policy='evict_last')
283
+ else:
284
+ dq = tl.dot(ds, k)
285
+ if EVEN_M & EVEN_HEADDIM:
286
+ tl.atomic_add(dq_ptrs, dq)
287
+ elif EVEN_HEADDIM:
288
+ tl.atomic_add(dq_ptrs, dq, mask=offs_m_curr[:, None] < seqlen_q)
289
+ else:
290
+ tl.atomic_add(dq_ptrs, dq, mask=(offs_m_curr[:, None] < seqlen_q) & (offs_d[None, :] < headdim))
291
+ dq_ptrs += BLOCK_M * stride_dqm
292
+ q_ptrs += BLOCK_M * stride_qm
293
+ do_ptrs += BLOCK_M * stride_dom
294
+ if BIAS_TYPE == 'matrix':
295
+ b_ptrs += BLOCK_M * stride_bm
296
+ dv_ptrs = DV + (offs_n[:, None] * stride_dvn + offs_d[None, :])
297
+ dk_ptrs = DK + (offs_n[:, None] * stride_dkn + offs_d[None, :])
298
+ _bwd_store_dk_dv(dk_ptrs, dv_ptrs, dk, dv, offs_n, offs_d, seqlen_k, headdim, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM)
299
+
300
+ def init_to_zero(name):
301
+ return lambda nargs: nargs[name].zero_()
302
+
303
+ @triton.autotune(configs=[triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': False}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ')), triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128, 'SEQUENCE_PARALLEL': True}, num_warps=8, num_stages=1, pre_hook=init_to_zero('DQ'))], key=['CACHE_KEY_SEQLEN_Q', 'CACHE_KEY_SEQLEN_K', 'BIAS_TYPE', 'IS_CAUSAL', 'BLOCK_HEADDIM'])
304
+ @triton.heuristics({'EVEN_M': lambda args: args['seqlen_q'] % args['BLOCK_M'] == 0, 'EVEN_N': lambda args: args['seqlen_k'] % args['BLOCK_N'] == 0, 'EVEN_HEADDIM': lambda args: args['headdim'] == args['BLOCK_HEADDIM']})
305
+ @triton.jit
306
+ def _bwd_kernel(Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qb, stride_qh, stride_qm, stride_kb, stride_kh, stride_kn, stride_vb, stride_vh, stride_vn, stride_bb, stride_bh, stride_bm, stride_dob, stride_doh, stride_dom, stride_dqb, stride_dqh, stride_dqm, stride_dkb, stride_dkh, stride_dkn, stride_dvb, stride_dvh, stride_dvn, nheads, seqlen_q, seqlen_k, seqlen_q_rounded, headdim, CACHE_KEY_SEQLEN_Q, CACHE_KEY_SEQLEN_K, BIAS_TYPE: tl.constexpr, IS_CAUSAL: tl.constexpr, BLOCK_HEADDIM: tl.constexpr, SEQUENCE_PARALLEL: tl.constexpr, EVEN_M: tl.constexpr, EVEN_N: tl.constexpr, EVEN_HEADDIM: tl.constexpr, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
307
+ off_hb = tl.program_id(1)
308
+ off_b = off_hb // nheads
309
+ off_h = off_hb % nheads
310
+ Q += off_b * stride_qb + off_h * stride_qh
311
+ K += off_b * stride_kb + off_h * stride_kh
312
+ V += off_b * stride_vb + off_h * stride_vh
313
+ DO += off_b * stride_dob + off_h * stride_doh
314
+ DQ += off_b * stride_dqb + off_h * stride_dqh
315
+ DK += off_b * stride_dkb + off_h * stride_dkh
316
+ DV += off_b * stride_dvb + off_h * stride_dvh
317
+ if BIAS_TYPE != 'none':
318
+ Bias += off_b * stride_bb + off_h * stride_bh
319
+ D += off_hb * seqlen_q_rounded
320
+ LSE += off_hb * seqlen_q_rounded
321
+ if not SEQUENCE_PARALLEL:
322
+ num_block_n = tl.cdiv(seqlen_k, BLOCK_N)
323
+ for start_n in range(0, num_block_n):
324
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=False, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
325
+ else:
326
+ start_n = tl.program_id(0)
327
+ _bwd_kernel_one_col_block(start_n, Q, K, V, Bias, DO, DQ, DK, DV, LSE, D, softmax_scale, stride_qm, stride_kn, stride_vn, stride_bm, stride_dom, stride_dqm, stride_dkn, stride_dvn, seqlen_q, seqlen_k, headdim, ATOMIC_ADD=True, BIAS_TYPE=BIAS_TYPE, IS_CAUSAL=IS_CAUSAL, BLOCK_HEADDIM=BLOCK_HEADDIM, EVEN_M=EVEN_M, EVEN_N=EVEN_N, EVEN_HEADDIM=EVEN_HEADDIM, BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N)
328
+
329
+ def _flash_attn_forward(q, k, v, bias=None, causal=False, softmax_scale=None):
330
+ (batch, seqlen_q, nheads, d) = q.shape
331
+ (_, seqlen_k, _, _) = k.shape
332
+ assert k.shape == (batch, seqlen_k, nheads, d)
333
+ assert v.shape == (batch, seqlen_k, nheads, d)
334
+ assert d <= 128, 'FlashAttention only supports head dimensions up to 128'
335
+ assert q.dtype == k.dtype == v.dtype, 'All tensors must have the same type'
336
+ assert q.dtype in [torch.float16, torch.bfloat16], 'Only support fp16 and bf16'
337
+ assert q.is_cuda and k.is_cuda and v.is_cuda
338
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
339
+ has_bias = bias is not None
340
+ bias_type = 'none'
341
+ if has_bias:
342
+ assert bias.dtype in [q.dtype, torch.float]
343
+ assert bias.is_cuda
344
+ assert bias.dim() == 4
345
+ if bias.stride(-1) != 1:
346
+ bias = bias.contiguous()
347
+ if bias.shape[2:] == (1, seqlen_k):
348
+ bias_type = 'vector'
349
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
350
+ bias_type = 'matrix'
351
+ else:
352
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
353
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
354
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
355
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
356
+ lse = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
357
+ tmp = torch.empty((batch, nheads, seqlen_q_rounded), device=q.device, dtype=torch.float32)
358
+ o = torch.empty_like(q)
359
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
360
+ BLOCK = 128
361
+ num_warps = 4 if d <= 64 else 8
362
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
363
+ _fwd_kernel[grid](q, k, v, bias, o, lse, tmp, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, o.stride(0), o.stride(2), o.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM, BLOCK_M=BLOCK, BLOCK_N=BLOCK, num_warps=num_warps, num_stages=1)
364
+ return (o, lse, softmax_scale)
365
+
366
+ def _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=None, causal=False, softmax_scale=None):
367
+ if do.stride(-1) != 1:
368
+ do = do.contiguous()
369
+ (batch, seqlen_q, nheads, d) = q.shape
370
+ (_, seqlen_k, _, _) = k.shape
371
+ assert d <= 128
372
+ seqlen_q_rounded = math.ceil(seqlen_q / 128) * 128
373
+ assert lse.shape == (batch, nheads, seqlen_q_rounded)
374
+ assert q.stride(-1) == k.stride(-1) == v.stride(-1) == o.stride(-1) == 1
375
+ assert dq.stride(-1) == dk.stride(-1) == dv.stride(-1) == 1
376
+ softmax_scale = softmax_scale or 1.0 / math.sqrt(d)
377
+ dq_accum = torch.empty_like(q, dtype=torch.float32)
378
+ delta = torch.empty_like(lse)
379
+ BLOCK_HEADDIM = max(triton.next_power_of_2(d), 16)
380
+ grid = lambda META: (triton.cdiv(seqlen_q, META['BLOCK_M']), batch * nheads)
381
+ _bwd_preprocess_do_o_dot[grid](o, do, delta, o.stride(0), o.stride(2), o.stride(1), do.stride(0), do.stride(2), do.stride(1), nheads, seqlen_q, seqlen_q_rounded, d, BLOCK_M=128, BLOCK_HEADDIM=BLOCK_HEADDIM)
382
+ has_bias = bias is not None
383
+ bias_type = 'none'
384
+ if has_bias:
385
+ assert bias.dtype in [q.dtype, torch.float]
386
+ assert bias.is_cuda
387
+ assert bias.dim() == 4
388
+ assert bias.stride(-1) == 1
389
+ if bias.shape[2:] == (1, seqlen_k):
390
+ bias_type = 'vector'
391
+ elif bias.shape[2:] == (seqlen_q, seqlen_k):
392
+ bias_type = 'matrix'
393
+ else:
394
+ raise RuntimeError('Last 2 dimensions of bias must be (1, seqlen_k) or (seqlen_q, seqlen_k)')
395
+ bias = bias.expand(batch, nheads, seqlen_q, seqlen_k)
396
+ bias_strides = (bias.stride(0), bias.stride(1), bias.stride(2)) if has_bias else (0, 0, 0)
397
+ grid = lambda META: (triton.cdiv(seqlen_k, META['BLOCK_N']) if META['SEQUENCE_PARALLEL'] else 1, batch * nheads)
398
+ _bwd_kernel[grid](q, k, v, bias, do, dq_accum, dk, dv, lse, delta, softmax_scale, q.stride(0), q.stride(2), q.stride(1), k.stride(0), k.stride(2), k.stride(1), v.stride(0), v.stride(2), v.stride(1), *bias_strides, do.stride(0), do.stride(2), do.stride(1), dq_accum.stride(0), dq_accum.stride(2), dq_accum.stride(1), dk.stride(0), dk.stride(2), dk.stride(1), dv.stride(0), dv.stride(2), dv.stride(1), nheads, seqlen_q, seqlen_k, seqlen_q_rounded, d, seqlen_q // 32, seqlen_k // 32, bias_type, causal, BLOCK_HEADDIM)
399
+ dq.copy_(dq_accum)
400
+
401
+ class FlashAttnQKVPackedFunc(torch.autograd.Function):
402
+
403
+ @staticmethod
404
+ def forward(ctx, qkv, bias=None, causal=False, softmax_scale=None):
405
+ """
406
+ qkv: (batch, seqlen, 3, nheads, headdim)
407
+ bias: optional, shape broadcastable to (batch, nheads, seqlen, seqlen).
408
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen).
409
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen, seqlen)
410
+ """
411
+ if qkv.stride(-1) != 1:
412
+ qkv = qkv.contiguous()
413
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], bias=bias, causal=causal, softmax_scale=softmax_scale)
414
+ ctx.save_for_backward(qkv, o, lse, bias)
415
+ ctx.causal = causal
416
+ return o
417
+
418
+ @staticmethod
419
+ def backward(ctx, do):
420
+ (qkv, o, lse, bias) = ctx.saved_tensors
421
+ assert not ctx.needs_input_grad[1], 'FlashAttention does not support bias gradient yet'
422
+ with torch.inference_mode():
423
+ dqkv = torch.empty_like(qkv)
424
+ _flash_attn_backward(do, qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2], o, lse, dqkv[:, :, 0], dqkv[:, :, 1], dqkv[:, :, 2], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
425
+ return (dqkv, None, None, None)
426
+ flash_attn_qkvpacked_func = FlashAttnQKVPackedFunc.apply
427
+
428
+ class FlashAttnKVPackedFunc(torch.autograd.Function):
429
+
430
+ @staticmethod
431
+ def forward(ctx, q, kv, bias=None, causal=False, softmax_scale=None):
432
+ """
433
+ q: (batch, seqlen_q, nheads, headdim)
434
+ kv: (batch, seqlen_k, 2, nheads, headdim)
435
+ bias: optional, shape broadcastable to (batch, nheads, seqlen_q, seqlen_k).
436
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
437
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
438
+ """
439
+ (q, kv) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, kv]]
440
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, kv[:, :, 0], kv[:, :, 1], bias=bias, causal=causal, softmax_scale=softmax_scale)
441
+ ctx.save_for_backward(q, kv, o, lse, bias)
442
+ ctx.causal = causal
443
+ return o
444
+
445
+ @staticmethod
446
+ def backward(ctx, do):
447
+ (q, kv, o, lse, bias) = ctx.saved_tensors
448
+ if len(ctx.needs_input_grad) >= 3:
449
+ assert not ctx.needs_input_grad[2], 'FlashAttention does not support bias gradient yet'
450
+ with torch.inference_mode():
451
+ dq = torch.empty_like(q)
452
+ dkv = torch.empty_like(kv)
453
+ _flash_attn_backward(do, q, kv[:, :, 0], kv[:, :, 1], o, lse, dq, dkv[:, :, 0], dkv[:, :, 1], bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
454
+ return (dq, dkv, None, None, None)
455
+ flash_attn_kvpacked_func = FlashAttnKVPackedFunc.apply
456
+
457
+ class FlashAttnFunc(torch.autograd.Function):
458
+
459
+ @staticmethod
460
+ def forward(ctx, q, k, v, bias=None, causal=False, softmax_scale=None):
461
+ """
462
+ q: (batch_size, seqlen_q, nheads, headdim)
463
+ k, v: (batch_size, seqlen_k, nheads, headdim)
464
+ bias: optional, shape broadcastable to (batch, nheads, seqlen_q, seqlen_k).
465
+ For example, ALiBi mask for causal would have shape (1, nheads, 1, seqlen_k).
466
+ ALiBi mask for non-causal would have shape (1, nheads, seqlen_q, seqlen_k)
467
+ """
468
+ (q, k, v) = [x if x.stride(-1) == 1 else x.contiguous() for x in [q, k, v]]
469
+ (o, lse, ctx.softmax_scale) = _flash_attn_forward(q, k, v, bias=bias, causal=causal, softmax_scale=softmax_scale)
470
+ ctx.save_for_backward(q, k, v, o, lse, bias)
471
+ ctx.causal = causal
472
+ return o
473
+
474
+ @staticmethod
475
+ def backward(ctx, do):
476
+ (q, k, v, o, lse, bias) = ctx.saved_tensors
477
+ assert not ctx.needs_input_grad[3], 'FlashAttention does not support bias gradient yet'
478
+ with torch.inference_mode():
479
+ dq = torch.empty_like(q)
480
+ dk = torch.empty_like(k)
481
+ dv = torch.empty_like(v)
482
+ _flash_attn_backward(do, q, k, v, o, lse, dq, dk, dv, bias=bias, causal=ctx.causal, softmax_scale=ctx.softmax_scale)
483
+ return (dq, dk, dv, None, None, None)
484
+ flash_attn_func = FlashAttnFunc.apply
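
The three `autograd.Function` wrappers above expose the Triton kernels as `flash_attn_qkvpacked_func`, `flash_attn_kvpacked_func`, and `flash_attn_func`. As a rough orientation, here is a minimal usage sketch for `flash_attn_func`, assuming it is importable from the file above; the tensor sizes and the vector-style bias are illustrative assumptions, not values used anywhere in this repository. Inputs must be fp16/bf16 CUDA tensors of shape `(batch, seqlen, nheads, headdim)` with `headdim <= 128`, and `bias`, if given, must broadcast to `(batch, nheads, seqlen_q, seqlen_k)`.

```python
import torch

# Illustrative sizes only; any fp16/bf16 CUDA tensors with headdim <= 128 work.
batch, seqlen, nheads, headdim = 2, 1024, 16, 64

q = torch.randn(batch, seqlen, nheads, headdim, device='cuda',
                dtype=torch.bfloat16, requires_grad=True)
k = torch.randn_like(q, requires_grad=True)
v = torch.randn_like(q, requires_grad=True)

# A 'vector' bias (e.g. ALiBi-style for the causal case) broadcastable to
# (batch, nheads, seqlen_q, seqlen_k); a full 'matrix' bias of that shape also works.
bias = torch.randn(1, nheads, 1, seqlen, device='cuda', dtype=torch.bfloat16)

# autograd.Function.apply takes positional arguments: (q, k, v, bias, causal, softmax_scale).
out = flash_attn_func(q, k, v, bias, True, None)  # -> (batch, seqlen, nheads, headdim)
out.sum().backward()                              # gradients flow to q, k, v; bias gradients are unsupported
```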
generation_config.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.33.3",
4
+ "use_cache": false
5
+ }
hf_prefixlm_converter.py ADDED
@@ -0,0 +1,420 @@
1
+ """Converts Huggingface Causal LM to Prefix LM.
2
+
3
+ Conversion does lightweight surgery on a HuggingFace
4
+ Causal LM to convert it to a Prefix LM.
5
+
6
+ Prefix LMs accept a `bidirectional_mask` input in `forward`
7
+ and treat the input prompt as the prefix in `generate`.
8
+ """
9
+ import math
10
+ import warnings
11
+ from types import MethodType
12
+ from typing import Any, List, MutableMapping, Optional, Tuple, Union
13
+ import torch
14
+ from transformers.models.bloom.modeling_bloom import BaseModelOutputWithPastAndCrossAttentions, BloomForCausalLM, BloomModel, CausalLMOutputWithCrossAttentions, CrossEntropyLoss
15
+ from transformers.models.bloom.modeling_bloom import _expand_mask as _expand_mask_bloom
16
+ from transformers.models.bloom.modeling_bloom import _make_causal_mask as _make_causal_mask_bloom
17
+ from transformers.models.bloom.modeling_bloom import logging
18
+ from transformers.models.gpt2.modeling_gpt2 import GPT2LMHeadModel
19
+ from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoForCausalLM
20
+ from transformers.models.gpt_neox.modeling_gpt_neox import GPTNeoXForCausalLM
21
+ from transformers.models.gptj.modeling_gptj import GPTJForCausalLM
22
+ from transformers.models.opt.modeling_opt import OPTForCausalLM
23
+ from transformers.models.opt.modeling_opt import _expand_mask as _expand_mask_opt
24
+ from transformers.models.opt.modeling_opt import _make_causal_mask as _make_causal_mask_opt
25
+ logger = logging.get_logger(__name__)
26
+ _SUPPORTED_GPT_MODELS = (GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM)
27
+ CAUSAL_GPT_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM]
28
+
29
+ def _convert_gpt_causal_lm_to_prefix_lm(model: CAUSAL_GPT_TYPES) -> CAUSAL_GPT_TYPES:
30
+ """Converts a GPT-style Causal LM to a Prefix LM.
31
+
32
+ Supported HuggingFace model classes:
33
+ - `GPT2LMHeadModel`
34
+ - `GPTNeoForCausalLM`
35
+ - `GPTNeoXForCausalLM`
36
+ - `GPTJForCausalLM`
37
+
38
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
39
+ """
40
+ if hasattr(model, '_prefix_lm_converted'):
41
+ return model
42
+ assert isinstance(model, _SUPPORTED_GPT_MODELS)
43
+ assert model.config.add_cross_attention == False, 'Only supports GPT-style decoder-only models'
44
+
45
+ def _get_attn_modules(model: CAUSAL_GPT_TYPES) -> List[torch.nn.Module]:
46
+ """Helper that gets a list of the model's attention modules.
47
+
48
+ Each module has a `bias` buffer used for causal masking. The Prefix LM
49
+ conversion adds logic to dynamically manipulate these biases to support
50
+ Prefix LM attention masking.
51
+ """
52
+ attn_modules = []
53
+ if isinstance(model, GPTNeoXForCausalLM):
54
+ blocks = model.gpt_neox.layers
55
+ else:
56
+ blocks = model.transformer.h
57
+ for block in blocks:
58
+ if isinstance(model, GPTNeoForCausalLM):
59
+ if block.attn.attention_type != 'global':
60
+ continue
61
+ attn_module = block.attn.attention
62
+ elif isinstance(model, GPTNeoXForCausalLM):
63
+ attn_module = block.attention
64
+ else:
65
+ attn_module = block.attn
66
+ attn_modules.append(attn_module)
67
+ return attn_modules
68
+ setattr(model, '_original_forward', getattr(model, 'forward'))
69
+ setattr(model, '_original_generate', getattr(model, 'generate'))
70
+
71
+ def forward(self: CAUSAL_GPT_TYPES, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[Tuple[torch.Tensor]]]=None, attention_mask: Optional[torch.FloatTensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, token_type_ids: Optional[torch.LongTensor]=None, position_ids: Optional[torch.LongTensor]=None, head_mask: Optional[torch.FloatTensor]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
72
+ """Wraps original forward to enable PrefixLM attention."""
73
+
74
+ def call_og_forward():
75
+ if isinstance(self, GPTNeoXForCausalLM):
76
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
77
+ else:
78
+ return self._original_forward(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, token_type_ids=token_type_ids, position_ids=position_ids, head_mask=head_mask, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
79
+ if bidirectional_mask is None:
80
+ return call_og_forward()
81
+ assert isinstance(bidirectional_mask, torch.Tensor)
82
+ attn_modules = _get_attn_modules(model)
83
+ (b, s) = bidirectional_mask.shape
84
+ max_length = attn_modules[0].bias.shape[-1]
85
+ if s > max_length:
86
+ raise ValueError(f'bidirectional_mask sequence length (={s}) exceeds the ' + f'max length allowed by the model ({max_length}).')
87
+ assert s <= max_length
88
+ if s < max_length:
89
+ pad = torch.zeros((int(b), int(max_length - s)), dtype=bidirectional_mask.dtype, device=bidirectional_mask.device)
90
+ bidirectional_mask = torch.cat([bidirectional_mask, pad], dim=1)
91
+ bidirectional = bidirectional_mask.unsqueeze(1).unsqueeze(1)
92
+ for attn_module in attn_modules:
93
+ assert isinstance(attn_module.bias, torch.Tensor)
94
+ attn_module.bias.data = torch.logical_or(attn_module.bias.data, bidirectional)
95
+ output = call_og_forward()
96
+ for attn_module in attn_modules:
97
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
98
+ return output
99
+
100
+ def generate(self: CAUSAL_GPT_TYPES, *args: Any, **kwargs: Any):
101
+ """Wraps original generate to enable PrefixLM attention."""
102
+ attn_modules = _get_attn_modules(model)
103
+ for attn_module in attn_modules:
104
+ attn_module.bias.data[:] = 1
105
+ output = self._original_generate(*args, **kwargs)
106
+ for attn_module in attn_modules:
107
+ attn_module.bias.data = torch.tril(attn_module.bias.data[0, 0])[None, None]
108
+ return output
109
+ setattr(model, 'forward', MethodType(forward, model))
110
+ setattr(model, 'generate', MethodType(generate, model))
111
+ setattr(model, '_prefix_lm_converted', True)
112
+ return model
113
+
114
+ def _convert_bloom_causal_lm_to_prefix_lm(model: BloomForCausalLM) -> BloomForCausalLM:
115
+ """Converts a BLOOM Causal LM to a Prefix LM.
116
+
117
+ Supported HuggingFace model classes:
118
+ - `BloomForCausalLM`
119
+
120
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
121
+ """
122
+ if hasattr(model, '_prefix_lm_converted'):
123
+ return model
124
+ assert isinstance(model, BloomForCausalLM)
125
+ assert model.config.add_cross_attention == False, 'Only supports BLOOM decoder-only models'
126
+
127
+ def _prepare_attn_mask(self: BloomModel, attention_mask: torch.Tensor, bidirectional_mask: Optional[torch.Tensor], input_shape: Tuple[int, int], past_key_values_length: int) -> torch.BoolTensor:
128
+ combined_attention_mask = None
129
+ device = attention_mask.device
130
+ (_, src_length) = input_shape
131
+ if src_length > 1:
132
+ combined_attention_mask = _make_causal_mask_bloom(input_shape, device=device, past_key_values_length=past_key_values_length)
133
+ if bidirectional_mask is not None:
134
+ assert attention_mask.shape == bidirectional_mask.shape
135
+ expanded_bidirectional_mask = _expand_mask_bloom(bidirectional_mask, tgt_length=src_length)
136
+ combined_attention_mask = torch.logical_and(combined_attention_mask, expanded_bidirectional_mask)
137
+ expanded_attn_mask = _expand_mask_bloom(attention_mask, tgt_length=src_length)
138
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask | combined_attention_mask
139
+ return combined_attention_mask
140
+
141
+ def _build_alibi_tensor(self: BloomModel, batch_size: int, query_length: int, key_length: int, dtype: torch.dtype, device: torch.device) -> torch.Tensor:
142
+ num_heads = self.config.n_head
143
+ closest_power_of_2 = 2 ** math.floor(math.log2(num_heads))
144
+ base = torch.tensor(2 ** (-2 ** (-(math.log2(closest_power_of_2) - 3))), device=device, dtype=torch.float32)
145
+ powers = torch.arange(1, 1 + closest_power_of_2, device=device, dtype=torch.int32)
146
+ slopes = torch.pow(base, powers)
147
+ if closest_power_of_2 != num_heads:
148
+ extra_base = torch.tensor(2 ** (-2 ** (-(math.log2(2 * closest_power_of_2) - 3))), device=device, dtype=torch.float32)
149
+ num_remaining_heads = min(closest_power_of_2, num_heads - closest_power_of_2)
150
+ extra_powers = torch.arange(1, 1 + 2 * num_remaining_heads, 2, device=device, dtype=torch.int32)
151
+ slopes = torch.cat([slopes, torch.pow(extra_base, extra_powers)], dim=0)
152
+ qa = torch.arange(query_length, device=device, dtype=torch.int32).view(-1, 1)
153
+ ka = torch.arange(key_length, device=device, dtype=torch.int32).view(1, -1)
154
+ diffs = qa - ka + key_length - query_length
155
+ diffs = -diffs.abs()
156
+ alibi = slopes.view(1, num_heads, 1, 1) * diffs.view(1, 1, query_length, key_length)
157
+ alibi = alibi.expand(batch_size, -1, -1, -1).reshape(-1, query_length, key_length)
158
+ return alibi.to(dtype)
159
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
160
+
161
+ def transformer_forward(self: BloomModel, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.LongTensor]=None, inputs_embeds: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments: Any) -> Union[Tuple[torch.Tensor, ...], BaseModelOutputWithPastAndCrossAttentions]:
162
+ if deprecated_arguments.pop('position_ids', False) is not False:
163
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed in v5.0.0. ' + 'You can safely ignore passing `position_ids`.', FutureWarning)
164
+ if len(deprecated_arguments) > 0:
165
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
166
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
167
+ output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
168
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
169
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
170
+ if input_ids is not None and inputs_embeds is not None:
171
+ raise ValueError('You cannot specify both input_ids and inputs_embeds at the same time')
172
+ elif input_ids is not None:
173
+ (batch_size, seq_length) = input_ids.shape
174
+ elif inputs_embeds is not None:
175
+ (batch_size, seq_length, _) = inputs_embeds.shape
176
+ else:
177
+ raise ValueError('You have to specify either input_ids or inputs_embeds')
178
+ if past_key_values is None:
179
+ past_key_values = tuple([None] * len(self.h))
180
+ head_mask = self.get_head_mask(head_mask, self.config.n_layer)
181
+ if inputs_embeds is None:
182
+ inputs_embeds = self.word_embeddings(input_ids)
183
+ hidden_states = self.word_embeddings_layernorm(inputs_embeds)
184
+ presents = () if use_cache else None
185
+ all_self_attentions = () if output_attentions else None
186
+ all_hidden_states = () if output_hidden_states else None
187
+ seq_length_with_past = seq_length
188
+ past_key_values_length = 0
189
+ if past_key_values[0] is not None:
190
+ tmp = past_key_values[0][0]
191
+ past_key_values_length = tmp.shape[2]
192
+ seq_length_with_past = seq_length_with_past + past_key_values_length
193
+ if attention_mask is None:
194
+ attention_mask = torch.ones((batch_size, seq_length_with_past), device=hidden_states.device)
195
+ else:
196
+ attention_mask = attention_mask.to(hidden_states.device)
197
+ alibi = self._build_alibi_tensor(batch_size=batch_size, query_length=seq_length, key_length=seq_length_with_past, dtype=hidden_states.dtype, device=hidden_states.device)
198
+ causal_mask = self._prepare_attn_mask(attention_mask, bidirectional_mask, input_shape=(batch_size, seq_length), past_key_values_length=past_key_values_length)
199
+ for (i, (block, layer_past)) in enumerate(zip(self.h, past_key_values)):
200
+ if output_hidden_states:
201
+ hst = (hidden_states,)
202
+ all_hidden_states = all_hidden_states + hst
203
+ if self.gradient_checkpointing and self.training:
204
+ if use_cache:
205
+ logger.warning('`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...')
206
+ use_cache = False
207
+
208
+ def create_custom_forward(module: torch.nn.Module):
209
+
210
+ def custom_forward(*inputs: Any):
211
+ return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
212
+ return custom_forward
213
+ outputs = torch.utils.checkpoint.checkpoint(create_custom_forward(block), hidden_states, alibi, causal_mask, head_mask[i])
214
+ else:
215
+ outputs = block(hidden_states, layer_past=layer_past, attention_mask=causal_mask, head_mask=head_mask[i], use_cache=use_cache, output_attentions=output_attentions, alibi=alibi)
216
+ hidden_states = outputs[0]
217
+ if use_cache is True:
218
+ presents = presents + (outputs[1],)
219
+ if output_attentions:
220
+ oa = (outputs[2 if use_cache else 1],)
221
+ all_self_attentions = all_self_attentions + oa
222
+ hidden_states = self.ln_f(hidden_states)
223
+ if output_hidden_states:
224
+ hst = (hidden_states,)
225
+ all_hidden_states = all_hidden_states + hst
226
+ if not return_dict:
227
+ return tuple((v for v in [hidden_states, presents, all_hidden_states, all_self_attentions] if v is not None))
228
+ return BaseModelOutputWithPastAndCrossAttentions(last_hidden_state=hidden_states, past_key_values=presents, hidden_states=all_hidden_states, attentions=all_self_attentions)
229
+ setattr(model.transformer, '_prepare_attn_mask', MethodType(_prepare_attn_mask, model.transformer))
230
+ setattr(model.transformer, '_build_alibi_tensor', MethodType(_build_alibi_tensor, model.transformer))
231
+ setattr(model.transformer, 'forward', MethodType(transformer_forward, model.transformer))
232
+ KeyValueT = Tuple[torch.Tensor, torch.Tensor]
233
+
234
+ def forward(self: BloomForCausalLM, input_ids: Optional[torch.LongTensor]=None, past_key_values: Optional[Tuple[KeyValueT, ...]]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.Tensor]=None, head_mask: Optional[torch.Tensor]=None, inputs_embeds: Optional[torch.Tensor]=None, labels: Optional[torch.Tensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None, **deprecated_arguments: Any) -> Union[Tuple[torch.Tensor], CausalLMOutputWithCrossAttentions]:
235
+ """Replacement forward method for BloomCausalLM."""
236
+ if deprecated_arguments.pop('position_ids', False) is not False:
237
+ warnings.warn('`position_ids` have no functionality in BLOOM and will be removed ' + 'in v5.0.0. You can safely ignore passing `position_ids`.', FutureWarning)
238
+ if len(deprecated_arguments) > 0:
239
+ raise ValueError(f'Got unexpected arguments: {deprecated_arguments}')
240
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
241
+ transformer_outputs = self.transformer(input_ids, past_key_values=past_key_values, attention_mask=attention_mask, bidirectional_mask=bidirectional_mask, head_mask=head_mask, inputs_embeds=inputs_embeds, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
242
+ hidden_states = transformer_outputs[0]
243
+ lm_logits = self.lm_head(hidden_states)
244
+ loss = None
245
+ if labels is not None:
246
+ shift_logits = lm_logits[..., :-1, :].contiguous()
247
+ shift_labels = labels[..., 1:].contiguous()
248
+ (batch_size, seq_length, vocab_size) = shift_logits.shape
249
+ loss_fct = CrossEntropyLoss()
250
+ loss = loss_fct(shift_logits.view(batch_size * seq_length, vocab_size), shift_labels.view(batch_size * seq_length))
251
+ if not return_dict:
252
+ output = (lm_logits,) + transformer_outputs[1:]
253
+ return (loss,) + output if loss is not None else output
254
+ return CausalLMOutputWithCrossAttentions(loss=loss, logits=lm_logits, past_key_values=transformer_outputs.past_key_values, hidden_states=transformer_outputs.hidden_states, attentions=transformer_outputs.attentions)
255
+
256
+ def prepare_inputs_for_generation(self: BloomForCausalLM, input_ids: torch.LongTensor, past: Optional[torch.Tensor]=None, attention_mask: Optional[torch.Tensor]=None, **kwargs: Any) -> dict:
257
+ del kwargs
258
+ if past:
259
+ input_ids = input_ids[:, -1].unsqueeze(-1)
260
+ bidirectional_mask = None
261
+ if past[0][0].shape[0] == input_ids.shape[0]:
262
+ past = self._convert_to_bloom_cache(past)
263
+ else:
264
+ bidirectional_mask = torch.ones_like(input_ids)
265
+ return {'input_ids': input_ids, 'past_key_values': past, 'use_cache': True, 'attention_mask': attention_mask, 'bidirectional_mask': bidirectional_mask}
266
+ setattr(model, 'forward', MethodType(forward, model))
267
+ setattr(model, 'prepare_inputs_for_generation', MethodType(prepare_inputs_for_generation, model))
268
+ setattr(model, '_prefix_lm_converted', True)
269
+ return model
270
+
271
+ def _convert_opt_causal_lm_to_prefix_lm(model: OPTForCausalLM) -> OPTForCausalLM:
272
+ """Converts an OPT Causal LM to a Prefix LM.
273
+
274
+ Supported HuggingFace model classes:
275
+ - `OPTForCausalLM`
276
+
277
+ See `convert_hf_causal_lm_to_prefix_lm` for more details.
278
+ """
279
+ if hasattr(model, '_prefix_lm_converted'):
280
+ return model
281
+ assert isinstance(model, OPTForCausalLM)
282
+ assert model.config.add_cross_attention == False, 'Only supports OPT decoder-only models'
283
+ setattr(model, '_original_forward', getattr(model, 'forward'))
284
+ setattr(model, '_original_generate', getattr(model, 'generate'))
285
+ model.model.decoder.bidirectional_mask = None
286
+
287
+ def _prepare_decoder_attention_mask(self: torch.nn.Module, attention_mask: Optional[torch.Tensor], input_shape: Tuple[int, int], inputs_embeds: Optional[torch.Tensor], past_key_values_length: int):
288
+ combined_attention_mask = None
289
+ if input_shape[-1] > 1:
290
+ assert inputs_embeds is not None
291
+ if self.bidirectional_mask == 'g':
292
+ (bsz, src_length) = input_shape
293
+ combined_attention_mask = torch.zeros((bsz, 1, src_length, src_length + past_key_values_length), dtype=inputs_embeds.dtype, device=inputs_embeds.device)
294
+ else:
295
+ combined_attention_mask = _make_causal_mask_opt(input_shape, inputs_embeds.dtype, past_key_values_length=past_key_values_length).to(inputs_embeds.device)
296
+ if self.bidirectional_mask is not None:
297
+ assert attention_mask is not None
298
+ assert attention_mask.shape == self.bidirectional_mask.shape
299
+ expanded_bidirectional_mask = _expand_mask_opt(self.bidirectional_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
300
+ combined_attention_mask = torch.maximum(expanded_bidirectional_mask, combined_attention_mask)
301
+ if attention_mask is not None:
302
+ assert inputs_embeds is not None
303
+ expanded_attn_mask = _expand_mask_opt(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(inputs_embeds.device)
304
+ combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
305
+ return combined_attention_mask
306
+ setattr(model.model.decoder, '_prepare_decoder_attention_mask', MethodType(_prepare_decoder_attention_mask, model.model.decoder))
307
+
308
+ def forward(self: OPTForCausalLM, input_ids: Optional[torch.LongTensor]=None, attention_mask: Optional[torch.Tensor]=None, bidirectional_mask: Optional[torch.ByteTensor]=None, head_mask: Optional[torch.Tensor]=None, past_key_values: Optional[List[torch.FloatTensor]]=None, inputs_embeds: Optional[torch.FloatTensor]=None, labels: Optional[torch.LongTensor]=None, use_cache: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, return_dict: Optional[bool]=None):
309
+
310
+ def call_og_forward():
311
+ return self._original_forward(input_ids=input_ids, attention_mask=attention_mask, head_mask=head_mask, past_key_values=past_key_values, inputs_embeds=inputs_embeds, labels=labels, use_cache=use_cache, output_attentions=output_attentions, output_hidden_states=output_hidden_states, return_dict=return_dict)
312
+ if bidirectional_mask is None:
313
+ return call_og_forward()
314
+ self.model.decoder.bidirectional_mask = bidirectional_mask
315
+ try:
316
+ outputs = call_og_forward()
317
+ except:
318
+ self.model.decoder.bidirectional_mask = None
319
+ raise
320
+ self.model.decoder.bidirectional_mask = None
321
+ return outputs
322
+
323
+ def generate(self: OPTForCausalLM, *args: tuple, **kwargs: Any):
324
+ """Wraps original generate to enable PrefixLM-style attention."""
325
+ self.model.decoder.bidirectional_mask = 'g'
326
+ try:
327
+ output = self._original_generate(*args, **kwargs)
328
+ except:
329
+ self.model.decoder.bidirectional_mask = None
330
+ raise
331
+ self.model.decoder.bidirectional_mask = None
332
+ return output
333
+ setattr(model, 'forward', MethodType(forward, model))
334
+ setattr(model, 'generate', MethodType(generate, model))
335
+ setattr(model, '_prefix_lm_converted', True)
336
+ return model
337
+ _SUPPORTED_HF_MODELS = _SUPPORTED_GPT_MODELS + (BloomForCausalLM, OPTForCausalLM)
338
+ CAUSAL_LM_TYPES = Union[GPT2LMHeadModel, GPTJForCausalLM, GPTNeoForCausalLM, GPTNeoXForCausalLM, BloomForCausalLM, OPTForCausalLM]
339
+
340
+ def convert_hf_causal_lm_to_prefix_lm(model: CAUSAL_LM_TYPES) -> CAUSAL_LM_TYPES:
341
+ """Converts a HuggingFace Causal LM to a Prefix LM.
342
+
343
+ Supported HuggingFace model classes:
344
+ - `GPT2LMHeadModel`
345
+ - `GPTNeoForCausalLM`
346
+ - `GPTNeoXForCausalLM`
347
+ - `GPTJForCausalLM`
348
+ - `BloomForCausalLM`
349
+ - `OPTForCausalLM`
350
+
351
+ Conversion to a Prefix LM is done by modifying the `forward` method, and possibly also the
352
+ `generate` method and/or select underlying methods depending on the model class.
353
+
354
+ These changes preserve the model API, but add a new input to `forward`: "bidirectional_mask".
355
+
356
+ Notes on training:
357
+ To actually train the converted model as a Prefix LM, training batches will need to indicate
358
+ the prefix/target structure by including `bidirectional_mask` as part of the batch inputs.
359
+
360
+ **This is not a standard input and requires custom layers either within or after your dataloader.**
361
+
362
+ In addition to adding `bidirectional_mask` to the batch, this custom code should modify `labels`
363
+ such that `batch['labels'][batch['bidirectional_mask'] == 1] == -100`.
364
+ That is, the prefix portion of the sequence should not generate any loss. Loss should only be
365
+ generated by the target portion of the sequence.
366
+
367
+ Notes on `GPTNeoForCausalLM`:
368
+ To simplify the implementation, "global" and "local" attention layers are handled differently.
369
+ For "global" layers, we handle conversion as described above. For "local" layers, which use a
370
+ causal attention mask within a restricted local window, we do not alter the masking.
371
+
372
+ Notes on `forward` method conversion:
373
+ After conversion, the `forward` method will handle a new input, `bidirectional_mask`,
374
+ which should be a [batch_size, seq_length] byte tensor, where 1 indicates token positions
375
+ belonging to the prefix (prefix tokens can attend to one another bidirectionally), and
376
+ 0 indicates token positions belonging to the target.
377
+
378
+ The new `forward` method will incorporate `bidirectional_mask` (if supplied) into the existing
379
+ causal mask, call the original `forward` method, and (if the causal mask is a buffer) reset
380
+ the causal masks before returning the result.
381
+
382
+ Notes on `generate` method conversion:
383
+ After conversion, the `generate` method will have the same signature but will internally
384
+ convert all causal masks to be purely bidirectional, call the original `generate` method, and
385
+ (where appropriate) reset the causal masks before returning the result.
386
+
387
+ This works thanks to the logic of the HuggingFace `generate` API, which first encodes the token
388
+ "prompt" passed to `generate` (which is treated as the prefix) and then sequentially generates
389
+ each new token. Encodings are cached as generation happens, so all prefix tokens can attend to one
390
+ another (as expected in a Prefix LM) and generated tokens can only attend to prefix tokens and
391
+ previously-generated tokens (also as expected in a Prefix LM).
392
+
393
+ To preserve the API, the original methods are renamed to `_original_forward` and
394
+ `_original_generate`, and replaced with new `forward` and `generate` methods that wrap
395
+ them, respectively, although implementation details vary by model class.
396
+ """
397
+ if isinstance(model, _SUPPORTED_GPT_MODELS):
398
+ return _convert_gpt_causal_lm_to_prefix_lm(model)
399
+ elif isinstance(model, BloomForCausalLM):
400
+ return _convert_bloom_causal_lm_to_prefix_lm(model)
401
+ elif isinstance(model, OPTForCausalLM):
402
+ return _convert_opt_causal_lm_to_prefix_lm(model)
403
+ else:
404
+ raise TypeError(f'Cannot convert model to Prefix LM. ' + f'Model does not belong to set of supported HF models:' + f'\n{_SUPPORTED_HF_MODELS}')
405
+
406
+ def add_bidirectional_mask_if_missing(batch: MutableMapping):
407
+ """Attempts to add bidirectional_mask to batch if missing.
408
+
409
+ Raises:
410
+ KeyError if bidirectional_mask is missing and can't be inferred
411
+ """
412
+ if 'bidirectional_mask' not in batch:
413
+ if batch.get('mode', None) == 'icl_task':
414
+ batch['bidirectional_mask'] = batch['attention_mask'].clone()
415
+ for (i, continuation_indices) in enumerate(batch['continuation_indices']):
416
+ batch['bidirectional_mask'][i, continuation_indices] = 0
417
+ elif 'labels' in batch and 'attention_mask' in batch:
418
+ batch['bidirectional_mask'] = torch.logical_and(torch.eq(batch['attention_mask'], 1), torch.eq(batch['labels'], -100)).type_as(batch['attention_mask'])
419
+ else:
420
+ raise KeyError('No bidirectional_mask in batch and not sure how to construct one.')
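
Taken together, `convert_hf_causal_lm_to_prefix_lm` and `add_bidirectional_mask_if_missing` imply a specific batch layout: `bidirectional_mask` marks prefix positions with 1, and those positions should carry `-100` labels so they generate no loss. Below is a minimal sketch of that layout, assuming the helpers above are in scope and using a GPT-2 checkpoint and toy prompt purely for illustration.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = convert_hf_causal_lm_to_prefix_lm(GPT2LMHeadModel.from_pretrained('gpt2'))
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')

prefix = tokenizer('Translate to French: Hello, world.', return_tensors='pt')
target = tokenizer(' Bonjour, le monde.', return_tensors='pt')

input_ids = torch.cat([prefix.input_ids, target.input_ids], dim=1)
attention_mask = torch.ones_like(input_ids)

# 1 = prefix token (attends bidirectionally), 0 = target token (causal only).
bidirectional_mask = torch.cat(
    [torch.ones_like(prefix.input_ids), torch.zeros_like(target.input_ids)], dim=1)

# Prefix positions must not contribute to the loss, as the docstring above requires.
labels = input_ids.clone()
labels[bidirectional_mask == 1] = -100

out = model(input_ids=input_ids, attention_mask=attention_mask,
            bidirectional_mask=bidirectional_mask, labels=labels)
print(out.loss)
```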
meta_init_context.py ADDED
@@ -0,0 +1,99 @@
1
+ from contextlib import contextmanager
2
+ from typing import Any, Callable, Optional
3
+ import torch
4
+ import torch.nn as nn
5
+
6
+ @contextmanager
7
+ def init_empty_weights(include_buffers: bool=False):
8
+ """Meta initialization context manager.
9
+
10
+ A context manager under which models are initialized with all parameters
11
+ on the meta device, therefore creating an empty model. Useful when just
12
+ initializing the model would blow the available RAM.
13
+
14
+ Args:
15
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
16
+ not to also put all buffers on the meta device while initializing.
17
+
18
+ Example:
19
+ ```python
20
+ import torch.nn as nn
21
+
22
+ # Initialize a model with 100 billions parameters in no time and without using any RAM.
23
+ with init_empty_weights():
24
+ tst = nn.Sequential(*[nn.Linear(10000, 10000) for _ in range(1000)])
25
+ ```
26
+
27
+ <Tip warning={true}>
28
+
29
+ Any model created under this context manager has no weights. As such you can't do something like
30
+ `model.to(some_device)` with it. To load weights inside your empty model, see [`load_checkpoint_and_dispatch`].
31
+
32
+ </Tip>
33
+ """
34
+ with init_on_device(torch.device('meta'), include_buffers=include_buffers) as f:
35
+ yield f
36
+
37
+ @contextmanager
38
+ def init_on_device(device: torch.device, include_buffers: bool=False):
39
+ """Device initialization context manager.
40
+
41
+ A context manager under which models are initialized with all parameters
42
+ on the specified device.
43
+
44
+ Args:
45
+ device (`torch.device`): Device to initialize all parameters on.
46
+ include_buffers (`bool`, *optional*, defaults to `False`): Whether or
47
+ not to also put all buffers on the meta device while initializing.
48
+
49
+ Example:
50
+ ```python
51
+ import torch.nn as nn
52
+
53
+ with init_on_device(device=torch.device("cuda")):
54
+ tst = nn.Linear(100, 100) # on `cuda` device
55
+ ```
56
+ """
57
+ old_register_parameter = nn.Module.register_parameter
58
+ if include_buffers:
59
+ old_register_buffer = nn.Module.register_buffer
60
+
61
+ def register_empty_parameter(self: torch.nn.Module, name: str, param: Optional[torch.nn.Parameter]):
62
+ old_register_parameter(self, name, param)
63
+ if param is not None:
64
+ parameter = self._parameters[name]
65
+ assert parameter is not None
66
+ param_cls = type(parameter)
67
+ kwargs = parameter.__dict__
68
+ self._parameters[name] = param_cls(parameter.to(device), **kwargs)
69
+
70
+ def register_empty_buffer(self: torch.nn.Module, name: str, tensor: Optional[torch.Tensor], persistent: bool=True):
71
+ old_register_buffer(self, name, tensor, persistent=persistent)
72
+ if tensor is not None:
73
+ named_buffer = self._buffers[name]
74
+ assert named_buffer is not None
75
+ self._buffers[name] = named_buffer.to(device)
76
+ if include_buffers:
77
+ tensor_constructors_to_patch = {torch_function_name: getattr(torch, torch_function_name) for torch_function_name in ['empty', 'zeros', 'ones', 'full']}
78
+ else:
79
+ tensor_constructors_to_patch = {}
80
+
81
+ def patch_tensor_constructor(fn: Callable):
82
+
83
+ def wrapper(*args: Any, **kwargs: Any):
84
+ kwargs['device'] = device
85
+ return fn(*args, **kwargs)
86
+ return wrapper
87
+ try:
88
+ nn.Module.register_parameter = register_empty_parameter
89
+ if include_buffers:
90
+ nn.Module.register_buffer = register_empty_buffer
91
+ for torch_function_name in tensor_constructors_to_patch.keys():
92
+ setattr(torch, torch_function_name, patch_tensor_constructor(getattr(torch, torch_function_name)))
93
+ yield
94
+ finally:
95
+ nn.Module.register_parameter = old_register_parameter
96
+ if include_buffers:
97
+ nn.Module.register_buffer = old_register_buffer
98
+ for (torch_function_name, old_torch_function) in tensor_constructors_to_patch.items():
99
+ setattr(torch, torch_function_name, old_torch_function)
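
As the docstrings above note, both context managers only change where parameters land at construction time. A short sketch of the intended flow, assuming `init_empty_weights` from this file is in scope (the module and sizes are arbitrary placeholders):

```python
import torch.nn as nn

# Build a skeleton whose parameters live on the meta device and occupy no memory.
with init_empty_weights(include_buffers=True):
    skeleton = nn.Sequential(nn.Linear(4096, 4096), nn.LayerNorm(4096))

print(next(skeleton.parameters()).device)  # meta

# Meta tensors hold no data: a forward pass (or model.to(some_device)) will fail
# until real weights are loaded into the module, e.g. from a checkpoint state_dict.
```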
modeling_mpt.py ADDED
@@ -0,0 +1,323 @@
1
+ """A simple, flexible implementation of a GPT model.
2
+
3
+ Inspired by https://github.com/karpathy/minGPT/blob/master/mingpt/model.py
4
+ """
5
+ import math
6
+ import warnings
7
+ from typing import Any, Dict, List, Mapping, MutableMapping, Optional, Tuple, Union
8
+ import torch
9
+ import torch.nn as nn
10
+ import torch.nn.functional as F
11
+ from transformers import PreTrainedModel, PreTrainedTokenizerBase
12
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast
13
+ from .attention import attn_bias_shape, build_attn_bias
14
+ from .blocks import MPTBlock
15
+ from .custom_embedding import SharedEmbedding
16
+ from .fc import FC_CLASS_REGISTRY as FC_CLASS_REGISTRY
17
+ from .ffn import FFN_CLASS_REGISTRY as FFN_CLASS_REGISTRY
18
+ from .ffn import MPTMLP as MPTMLP
19
+ from .ffn import build_ffn as build_ffn
20
+ from .norm import NORM_CLASS_REGISTRY
21
+ from .configuration_mpt import MPTConfig
22
+ from .adapt_tokenizer import AutoTokenizerForMOD, adapt_tokenizer_for_denoising
23
+ from .hf_prefixlm_converter import add_bidirectional_mask_if_missing, convert_hf_causal_lm_to_prefix_lm
24
+ from .meta_init_context import init_empty_weights
25
+ from .param_init_fns import generic_param_init_fn_, MODEL_INIT_REGISTRY
26
+ try:
27
+ from .flash_attn_triton import flash_attn_func as flash_attn_func
28
+ except:
29
+ pass
30
+ import logging
31
+ log = logging.getLogger(__name__)
32
+
33
+ class MPTPreTrainedModel(PreTrainedModel):
34
+ config_class = MPTConfig
35
+ base_model_prefix = 'model'
36
+ _no_split_modules = ['MPTBlock']
37
+
38
+ class MPTModel(MPTPreTrainedModel):
39
+
40
+ def __init__(self, config: MPTConfig):
41
+ config._validate_config()
42
+ super().__init__(config)
43
+ self.attn_impl = config.attn_config['attn_impl']
44
+ self.prefix_lm = config.attn_config['prefix_lm']
45
+ self.attn_uses_sequence_id = config.attn_config['attn_uses_sequence_id']
46
+ self.alibi = config.attn_config['alibi']
47
+ self.alibi_bias_max = config.attn_config['alibi_bias_max']
48
+ self.learned_pos_emb = config.learned_pos_emb
49
+ if config.init_device == 'mixed':
50
+ if dist.get_local_rank() == 0:
51
+ config.init_device = 'cpu'
52
+ else:
53
+ config.init_device = 'meta'
54
+ if config.norm_type.lower() not in NORM_CLASS_REGISTRY.keys():
55
+ norm_options = ' | '.join(NORM_CLASS_REGISTRY.keys())
56
+ raise NotImplementedError(f'Requested norm type ({config.norm_type}) is not implemented within this repo (Options: {norm_options}).')
57
+ norm_class = NORM_CLASS_REGISTRY[config.norm_type.lower()]
58
+ self.embedding_fraction = config.embedding_fraction
59
+ self.wte = SharedEmbedding(config.vocab_size, config.d_model, device=config.init_device)
60
+ if self.learned_pos_emb:
61
+ self.wpe = torch.nn.Embedding(config.max_seq_len, config.d_model, device=config.init_device)
62
+ self.emb_drop = nn.Dropout(config.emb_pdrop)
63
+ self.blocks = nn.ModuleList([MPTBlock(device=config.init_device, **config.to_dict()) for _ in range(config.n_layers)])
64
+ self.norm_f = norm_class(config.d_model, device=config.init_device)
65
+ if config.init_device != 'meta':
66
+ log.info(f'We recommend using config.init_device="meta" with Composer + FSDP for faster initialization.')
67
+ self.apply(self.param_init_fn)
68
+ self.is_causal = not self.prefix_lm
69
+ self._attn_bias_initialized = False
70
+ self.attn_bias = None
71
+ self.attn_bias_shape = attn_bias_shape(self.attn_impl, config.n_heads, config.max_seq_len, self.alibi, prefix_lm=self.prefix_lm, causal=self.is_causal, use_sequence_id=self.attn_uses_sequence_id)
72
+ if config.no_bias:
73
+ for module in self.modules():
74
+ if hasattr(module, 'bias') and isinstance(module.bias, nn.Parameter):
75
+ log.info(f'Removing bias ({module.bias}) from {module}.')
76
+ module.register_parameter('bias', None)
77
+ log.debug(self)
78
+ log.debug(f"Using {self.config.init_config['name']} initialization.")
79
+
80
+ def get_input_embeddings(self) -> nn.Embedding:
81
+ return self.wte
82
+
83
+ def set_input_embeddings(self, value: nn.Embedding) -> None:
84
+ self.wte = value
85
+
86
+ @torch.no_grad()
87
+ def _attn_bias(self, device: torch.device, dtype: torch.dtype, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None) -> Tuple[Optional[torch.Tensor], Optional[torch.ByteTensor]]:
88
+ if not self._attn_bias_initialized:
89
+ if self.attn_bias_shape:
90
+ self.attn_bias = torch.zeros(self.attn_bias_shape, device=device, dtype=dtype)
91
+ self.attn_bias = build_attn_bias(self.attn_impl, self.attn_bias, self.config.n_heads, self.config.max_seq_len, causal=self.is_causal, alibi=self.alibi, alibi_bias_max=self.alibi_bias_max)
92
+ self._attn_bias_initialized = True
93
+ if self.attn_impl == 'flash':
94
+ return (self.attn_bias, attention_mask)
95
+ if self.attn_bias is not None:
96
+ self.attn_bias = self.attn_bias.to(dtype=dtype, device=device)
97
+ attn_bias = self.attn_bias
98
+ if self.prefix_lm:
99
+ assert isinstance(attn_bias, torch.Tensor)
100
+ assert isinstance(prefix_mask, torch.Tensor)
101
+ attn_bias = self._apply_prefix_mask(attn_bias, prefix_mask)
102
+ if self.attn_uses_sequence_id and sequence_id is not None:
103
+ assert isinstance(attn_bias, torch.Tensor)
104
+ attn_bias = self._apply_sequence_id(attn_bias, sequence_id)
105
+ if attention_mask is not None:
106
+ s_k = attention_mask.shape[-1]
107
+ if attn_bias is None:
108
+ attn_bias = torch.zeros((1, 1, 1, s_k), device=device, dtype=dtype)
109
+ else:
110
+ _s_k = max(0, attn_bias.size(-1) - s_k)
111
+ attn_bias = attn_bias[:, :, :, _s_k:]
112
+ if prefix_mask is not None and attention_mask.shape != prefix_mask.shape:
113
+ raise ValueError(f'attention_mask shape={attention_mask.shape} ' + f'and prefix_mask shape={prefix_mask.shape} are not equal.')
114
+ min_val = torch.finfo(attn_bias.dtype).min
115
+ attn_bias = attn_bias.masked_fill(~attention_mask.view(-1, 1, 1, s_k), min_val)
116
+ return (attn_bias, None)
117
+
118
+ def _apply_prefix_mask(self, attn_bias: torch.Tensor, prefix_mask: torch.Tensor) -> torch.Tensor:
119
+ (s_k, s_q) = attn_bias.shape[-2:]
120
+ if s_k != self.config.max_seq_len or s_q != self.config.max_seq_len:
121
+ raise ValueError('attn_bias does not match the expected shape. ' + f'The last two dimensions should both be {self.config.max_seq_len} ' + f'but are {s_k} and {s_q}.')
122
+ seq_len = prefix_mask.shape[-1]
123
+ if seq_len > self.config.max_seq_len:
124
+ raise ValueError(f'prefix_mask sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
125
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
126
+ causal = torch.tril(torch.ones((seq_len, seq_len), dtype=torch.bool, device=prefix_mask.device)).view(1, 1, seq_len, seq_len)
127
+ prefix = prefix_mask.view(-1, 1, 1, seq_len)
128
+ cannot_attend = ~torch.logical_or(causal, prefix.bool())
129
+ min_val = torch.finfo(attn_bias.dtype).min
130
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
131
+ return attn_bias
132
+
133
+ def _apply_sequence_id(self, attn_bias: torch.Tensor, sequence_id: torch.LongTensor) -> torch.Tensor:
134
+ seq_len = sequence_id.shape[-1]
135
+ if seq_len > self.config.max_seq_len:
136
+ raise ValueError(f'sequence_id sequence length cannot exceed max_seq_len={self.config.max_seq_len}')
137
+ attn_bias = attn_bias[..., :seq_len, :seq_len]
138
+ cannot_attend = torch.logical_not(torch.eq(sequence_id.view(-1, seq_len, 1), sequence_id.view(-1, 1, seq_len))).unsqueeze(1)
139
+ min_val = torch.finfo(attn_bias.dtype).min
140
+ attn_bias = attn_bias.masked_fill(cannot_attend, min_val)
141
+ return attn_bias
142
+
143
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.Tensor]=None) -> BaseModelOutputWithPast:
144
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
145
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
146
+ if attention_mask is not None:
147
+ attention_mask = attention_mask.bool()
148
+ if prefix_mask is not None:
149
+ prefix_mask = prefix_mask.bool()
150
+ if not return_dict:
151
+ raise NotImplementedError('return_dict False is not implemented yet for MPT')
152
+ if output_attentions:
153
+ if self.attn_impl != 'torch':
154
+ raise NotImplementedError('output_attentions is not implemented for MPT when using attn_impl `flash` or `triton`.')
155
+ if self.training and attention_mask is not None and (attention_mask[:, 0].sum() != attention_mask.shape[0]):
156
+ raise NotImplementedError('MPT does not support training with left padding.')
157
+ if self.prefix_lm and prefix_mask is None:
158
+ raise ValueError('prefix_mask is a required argument when MPT is configured with prefix_lm=True.')
159
+ if inputs_embeds is not None:
160
+ raise NotImplementedError('inputs_embeds is not implemented for MPT.')
161
+ if self.training:
162
+ if self.attn_uses_sequence_id and sequence_id is None:
163
+ raise ValueError('sequence_id is a required argument when MPT is configured with attn_uses_sequence_id=True ' + 'and the model is in train mode.')
164
+ elif self.attn_uses_sequence_id is False and sequence_id is not None:
165
+ warnings.warn('MPT received non-None input for `sequence_id` but is configured with attn_uses_sequence_id=False. ' + 'This input will be ignored. If you want the model to use `sequence_id`, set attn_uses_sequence_id to True.')
166
+ S = input_ids.size(1)
167
+ assert S <= self.config.max_seq_len, f'Cannot forward input with seq_len={S}, this model only supports seq_len<={self.config.max_seq_len}'
168
+ tok_emb = self.wte(input_ids)
169
+ if self.learned_pos_emb:
170
+ past_position = 0
171
+ if past_key_values is not None:
172
+ if len(past_key_values) != self.config.n_layers:
173
+ raise ValueError(f'past_key_values must provide a past_key_value for each attention ' + f'layer in the network (len(past_key_values)={len(past_key_values)!r}; self.config.n_layers={self.config.n_layers!r}).')
174
+ past_position = past_key_values[0][0].size(1)
175
+ if self.attn_impl == 'torch':
176
+ past_position = past_key_values[0][0].size(3)
177
+ if S + past_position > self.config.max_seq_len:
178
+ raise ValueError(f'Cannot forward input with past sequence length {past_position} and current sequence length ' + f'{S + 1}, this model only supports total sequence length <= {self.config.max_seq_len}.')
179
+ pos = torch.arange(past_position, S + past_position, dtype=torch.long, device=input_ids.device).unsqueeze(0)
180
+ if attention_mask is not None:
181
+ pos = torch.clamp(pos - torch.cumsum((~attention_mask).to(torch.int32), dim=1)[:, past_position:], min=0)
182
+ pos_emb = self.wpe(pos)
183
+ x = tok_emb + pos_emb
184
+ else:
185
+ x = tok_emb
186
+ if self.embedding_fraction == 1:
187
+ x = self.emb_drop(x)
188
+ else:
189
+ x_shrunk = x * self.embedding_fraction + x.detach() * (1 - self.embedding_fraction)
190
+ assert isinstance(self.emb_drop, nn.Module)
191
+ x = self.emb_drop(x_shrunk)
192
+ (attn_bias, attention_mask) = self._attn_bias(device=x.device, dtype=torch.float32, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id)
193
+ if use_cache and past_key_values is None:
194
+ past_key_values = [() for _ in range(self.config.n_layers)]
195
+ all_hidden_states = () if output_hidden_states else None
196
+ all_self_attns = () if output_attentions else None
197
+ for (b_idx, block) in enumerate(self.blocks):
198
+ if output_hidden_states:
199
+ assert all_hidden_states is not None
200
+ all_hidden_states = all_hidden_states + (x,)
201
+ past_key_value = past_key_values[b_idx] if past_key_values is not None else None
202
+ (x, attn_weights, past_key_value) = block(x, past_key_value=past_key_value, attn_bias=attn_bias, attention_mask=attention_mask, is_causal=self.is_causal, output_attentions=bool(output_attentions))
203
+ if past_key_values is not None:
204
+ past_key_values[b_idx] = past_key_value
205
+ if output_attentions:
206
+ assert all_self_attns is not None
207
+ all_self_attns = all_self_attns + (attn_weights,)
208
+ x = self.norm_f(x)
209
+ if output_hidden_states:
210
+ assert all_hidden_states is not None
211
+ all_hidden_states = all_hidden_states + (x,)
212
+ return BaseModelOutputWithPast(last_hidden_state=x, past_key_values=past_key_values, hidden_states=all_hidden_states, attentions=all_self_attns)
213
+
214
+ def param_init_fn(self, module: nn.Module) -> None:
215
+ init_fn_name = self.config.init_config['name']
216
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
217
+
218
+ def fsdp_wrap_fn(self, module: nn.Module) -> bool:
219
+ return isinstance(module, MPTBlock)
220
+
221
+ def activation_checkpointing_fn(self, module: nn.Module) -> bool:
222
+ return isinstance(module, MPTBlock)
223
+
224
+ class MPTForCausalLM(MPTPreTrainedModel):
225
+
226
+ def __init__(self, config: MPTConfig):
227
+ super().__init__(config)
228
+ if not config.tie_word_embeddings:
229
+ raise ValueError('MPTForCausalLM only supports tied word embeddings')
230
+ log.info(f'Instantiating an MPTForCausalLM model from {__file__}')
231
+ self.transformer: MPTModel = MPTModel(config)
232
+ for child in self.transformer.children():
233
+ if isinstance(child, torch.nn.ModuleList):
234
+ continue
235
+ if isinstance(child, torch.nn.Module):
236
+ child._fsdp_wrap = True
237
+ self.logit_scale = None
238
+ if config.logit_scale is not None:
239
+ logit_scale = config.logit_scale
240
+ if isinstance(logit_scale, str):
241
+ if logit_scale == 'inv_sqrt_d_model':
242
+ logit_scale = 1 / math.sqrt(config.d_model)
243
+ else:
244
+ raise ValueError(f"logit_scale={logit_scale!r} is not recognized as an option; use numeric value or 'inv_sqrt_d_model'.")
245
+ self.logit_scale = logit_scale
246
+
247
+ def get_input_embeddings(self) -> nn.Embedding:
248
+ return self.transformer.wte
249
+
250
+ def set_input_embeddings(self, value: Union[SharedEmbedding, nn.Embedding]) -> None:
251
+ self.transformer.wte = value
252
+
253
+ def get_output_embeddings(self) -> nn.Embedding:
254
+ return self.transformer.wte
255
+
256
+ def set_output_embeddings(self, new_embeddings: Union[SharedEmbedding, nn.Embedding]) -> None:
257
+ self.transformer.wte = new_embeddings
258
+
259
+ def set_decoder(self, decoder: MPTModel) -> None:
260
+ self.transformer = decoder
261
+
262
+ def get_decoder(self) -> MPTModel:
263
+ return self.transformer
264
+
265
+ def forward(self, input_ids: torch.LongTensor, past_key_values: Optional[List[Tuple[torch.FloatTensor]]]=None, attention_mask: Optional[torch.ByteTensor]=None, prefix_mask: Optional[torch.ByteTensor]=None, sequence_id: Optional[torch.LongTensor]=None, labels: Optional[torch.LongTensor]=None, return_dict: Optional[bool]=None, output_attentions: Optional[bool]=None, output_hidden_states: Optional[bool]=None, use_cache: Optional[bool]=None, inputs_embeds: Optional[torch.FloatTensor]=None) -> CausalLMOutputWithPast:
266
+ return_dict = return_dict if return_dict is not None else self.config.return_dict
267
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
268
+ if inputs_embeds is not None:
269
+ raise NotImplementedError('inputs_embeds has to be None (for hf/peft support).')
270
+ outputs = self.transformer(input_ids=input_ids, past_key_values=past_key_values, attention_mask=attention_mask, prefix_mask=prefix_mask, sequence_id=sequence_id, return_dict=return_dict, output_attentions=output_attentions, output_hidden_states=output_hidden_states, use_cache=use_cache)
271
+ logits = self.transformer.wte(outputs.last_hidden_state.to(self.transformer.wte.weight.device), True)
272
+ if self.logit_scale is not None:
273
+ if self.logit_scale == 0:
274
+ warnings.warn(f'Multiplying logits by self.logit_scale={self.logit_scale!r}. This will produce uniform (uninformative) outputs.')
275
+ logits *= self.logit_scale
276
+ loss = None
277
+ if labels is not None:
278
+ _labels = torch.roll(labels, shifts=-1)
279
+ _labels[:, -1] = -100
280
+ loss = F.cross_entropy(logits.view(-1, logits.size(-1)), _labels.to(logits.device).view(-1))
281
+ return CausalLMOutputWithPast(loss=loss, logits=logits, past_key_values=outputs.past_key_values, hidden_states=outputs.hidden_states, attentions=outputs.attentions)
282
+
283
+ def param_init_fn(self, module: nn.Module) -> None:
284
+ init_fn_name = self.config.init_config['name']
285
+ MODEL_INIT_REGISTRY[init_fn_name](module=module, n_layers=self.config.n_layers, d_model=self.config.d_model, **self.config.init_config)
286
+
287
+ def fsdp_wrap_fn(self, module: nn.Module) -> bool:
288
+ return isinstance(module, MPTBlock)
289
+
290
+ def activation_checkpointing_fn(self, module: nn.Module) -> bool:
291
+ return isinstance(module, MPTBlock)
292
+
293
+ def prepare_inputs_for_generation(self, input_ids: torch.Tensor, past_key_values: Optional[List[Tuple[torch.Tensor, torch.Tensor]]]=None, inputs_embeds: Optional[torch.Tensor]=None, **kwargs: Any) -> Dict[str, Any]:
294
+ if inputs_embeds is not None:
295
+ raise NotImplementedError('inputs_embeds is not implemented for MPT yet')
296
+ attention_mask = kwargs['attention_mask'].bool()
297
+ if attention_mask[:, -1].sum() != attention_mask.shape[0]:
298
+ raise NotImplementedError('MPT does not support generation with right padding.')
299
+ if self.transformer.attn_uses_sequence_id and self.training:
300
+ sequence_id = torch.zeros_like(input_ids[:1])
301
+ else:
302
+ sequence_id = None
303
+ if past_key_values is not None:
304
+ input_ids = input_ids[:, -1].unsqueeze(-1)
305
+ if self.transformer.prefix_lm:
306
+ prefix_mask = torch.ones_like(attention_mask)
307
+ if kwargs.get('use_cache') == False:
308
+ raise NotImplementedError('MPT with prefix_lm=True does not support use_cache=False.')
309
+ else:
310
+ prefix_mask = None
311
+ return {'input_ids': input_ids, 'attention_mask': attention_mask, 'prefix_mask': prefix_mask, 'sequence_id': sequence_id, 'past_key_values': past_key_values, 'use_cache': kwargs.get('use_cache', True)}
312
+
313
+ @staticmethod
314
+ def _reorder_cache(past_key_values: List[Tuple[torch.Tensor, torch.Tensor]], beam_idx: torch.LongTensor) -> List[Tuple[torch.Tensor, ...]]:
315
+ """Used by HuggingFace generate when using beam search with kv-caching.
316
+
317
+ See https://github.com/huggingface/transformers/blob/3ec7a47664ebe40c40f4b722f6bb1cd30c3821ec/src/transformers/models/gpt2/modeling_gpt2.py#L1122-L1133
318
+ for an example in transformers.
319
+ """
320
+ reordered_past = []
321
+ for layer_past in past_key_values:
322
+ reordered_past += [tuple((past_state.index_select(0, beam_idx) for past_state in layer_past))]
323
+ return reordered_past
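
The two classes above (`MPTModel` and `MPTForCausalLM`) are the custom modeling code that `transformers` executes when `trust_remote_code=True` is passed. A minimal usage sketch follows; the repo id is an assumption, so point it at wherever these files are actually hosted:

```python
# Sketch only: load the custom MPT classes defined in this repo through transformers.
# The repo id below is illustrative, not confirmed by this diff.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "replit/replit-code-v1_5-3b"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, trust_remote_code=True, torch_dtype=torch.bfloat16
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```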
norm.py ADDED
@@ -0,0 +1,57 @@
1
+ from typing import Dict, List, Optional, Type, Union
2
+ import torch
3
+
4
+ def _cast_if_autocast_enabled(tensor: torch.Tensor) -> torch.Tensor:
5
+ if torch.is_autocast_enabled():
6
+ if tensor.device.type == 'cuda':
7
+ dtype = torch.get_autocast_gpu_dtype()
8
+ elif tensor.device.type == 'cpu':
9
+ dtype = torch.get_autocast_cpu_dtype()
10
+ else:
11
+ raise NotImplementedError()
12
+ return tensor.to(dtype=dtype)
13
+ return tensor
14
+
15
+ class LPLayerNorm(torch.nn.LayerNorm):
16
+
17
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, elementwise_affine: bool=True, device: Optional[torch.device]=None, dtype: Optional[torch.dtype]=None):
18
+ super().__init__(normalized_shape=normalized_shape, eps=eps, elementwise_affine=elementwise_affine, device=device, dtype=dtype)
19
+
20
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
21
+ module_device = x.device
22
+ downcast_x = _cast_if_autocast_enabled(x)
23
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
24
+ downcast_bias = _cast_if_autocast_enabled(self.bias) if self.bias is not None else self.bias
25
+ with torch.autocast(enabled=False, device_type=module_device.type):
26
+ return torch.nn.functional.layer_norm(downcast_x, self.normalized_shape, downcast_weight, downcast_bias, self.eps)
27
+
28
+ def rms_norm(x: torch.Tensor, weight: Optional[torch.Tensor]=None, eps: float=1e-05) -> torch.Tensor:
29
+ output = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)
30
+ if weight is not None:
31
+ return output * weight
32
+ return output
33
+
34
+ class RMSNorm(torch.nn.Module):
35
+
36
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, weight: bool=True, dtype: Optional[torch.dtype]=None, device: Optional[torch.device]=None):
37
+ super().__init__()
38
+ self.eps = eps
39
+ if weight:
40
+ self.weight = torch.nn.Parameter(torch.ones(normalized_shape, dtype=dtype, device=device))
41
+ else:
42
+ self.register_parameter('weight', None)
43
+
44
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
45
+ return rms_norm(x.float(), self.weight, self.eps).to(dtype=x.dtype)
46
+
47
+ class LPRMSNorm(RMSNorm):
48
+
49
+ def __init__(self, normalized_shape: Union[int, List[int], torch.Size], eps: float=1e-05, weight: bool=True, dtype: Optional[torch.dtype]=None, device: Optional[torch.device]=None):
50
+ super().__init__(normalized_shape=normalized_shape, eps=eps, weight=weight, dtype=dtype, device=device)
51
+
52
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
53
+ downcast_x = _cast_if_autocast_enabled(x)
54
+ downcast_weight = _cast_if_autocast_enabled(self.weight) if self.weight is not None else self.weight
55
+ with torch.autocast(enabled=False, device_type=x.device.type):
56
+ return rms_norm(downcast_x, downcast_weight, self.eps).to(dtype=x.dtype)
57
+ NORM_CLASS_REGISTRY: Dict[str, Type[torch.nn.Module]] = {'layernorm': torch.nn.LayerNorm, 'low_precision_layernorm': LPLayerNorm, 'rmsnorm': RMSNorm, 'low_precision_rmsnorm': LPRMSNorm}
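
`NORM_CLASS_REGISTRY` is what `MPTModel.__init__` indexes with `config.norm_type.lower()` to pick its normalization layer. A small sketch of that lookup, assuming it is run next to `norm.py` inside a checkout of this repo:

```python
# Sketch: resolve a norm layer the same way MPTModel does.
# Assumes norm.py from this repo is importable (e.g. run from the repo root).
import torch
from norm import NORM_CLASS_REGISTRY

norm_type = "low_precision_layernorm"        # stands in for config.norm_type
norm_class = NORM_CLASS_REGISTRY[norm_type.lower()]
norm_f = norm_class(512)                     # d_model-sized normalization

x = torch.randn(2, 16, 512)
print(norm_f(x).shape)                       # torch.Size([2, 16, 512])
```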
param_init_fns.py ADDED
@@ -0,0 +1,179 @@
1
+ import math
2
+ import warnings
3
+ from collections.abc import Sequence
4
+ from functools import partial
5
+ from typing import Any, Callable, Optional, Tuple, Union
6
+ import torch
7
+ from torch import nn
8
+ from .fc import FC_CLASS_REGISTRY
9
+ from .norm import NORM_CLASS_REGISTRY
10
+ try:
11
+ import transformer_engine.pytorch as te
12
+ except Exception:
13
+ te = None
14
+
15
+ def torch_default_param_init_fn_(module: nn.Module, **kwargs: Any) -> None:
16
+ del kwargs
17
+ if hasattr(module, 'reset_parameters') and isinstance(module.reset_parameters, Callable):
18
+ module.reset_parameters()
19
+
20
+ def fused_init_helper_(module: nn.Module, init_fn_: Callable) -> None:
21
+ _fused = getattr(module, '_fused', None)
22
+ if _fused is None:
23
+ raise RuntimeError(f'Internal logic error')
24
+ assert isinstance(module.weight, torch.Tensor)
25
+ (dim, splits) = _fused
26
+ splits = (0, *splits, module.weight.size(dim))
27
+ for (s, e) in zip(splits[:-1], splits[1:]):
28
+ slice_indices = [slice(None)] * module.weight.ndim
29
+ slice_indices[dim] = slice(s, e)
30
+ init_fn_(module.weight[slice_indices])
31
+
32
+ def generic_param_init_fn_(module: nn.Module, init_fn_: Callable, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
33
+ del kwargs
34
+ init_div_is_residual = init_div_is_residual
35
+ if init_div_is_residual is False:
36
+ div_is_residual = 1.0
37
+ elif init_div_is_residual is True:
38
+ div_is_residual = math.sqrt(2 * n_layers)
39
+ elif isinstance(init_div_is_residual, float) or isinstance(init_div_is_residual, int):
40
+ div_is_residual = init_div_is_residual
41
+ elif init_div_is_residual.isnumeric():
42
+ div_is_residual = float(init_div_is_residual)
43
+ else:
44
+ div_is_residual = 1.0
45
+ raise ValueError(f'Expected init_div_is_residual to be boolean or numeric, got {init_div_is_residual}')
46
+ if isinstance(module, tuple(set(FC_CLASS_REGISTRY.values()))):
47
+ if hasattr(module, '_fused'):
48
+ fused_init_helper_(module, init_fn_)
49
+ else:
50
+ init_fn_(module.weight)
51
+ if module.bias is not None:
52
+ assert isinstance(module.bias, torch.Tensor)
53
+ torch.nn.init.zeros_(module.bias)
54
+ if init_div_is_residual is not False and getattr(module, '_is_residual', False):
55
+ with torch.no_grad():
56
+ module.weight.div_(div_is_residual)
57
+ elif isinstance(module, nn.Embedding):
58
+ if emb_init_std is not None:
59
+ std = emb_init_std
60
+ if std == 0:
61
+ warnings.warn(f'Embedding layer initialized to 0.')
62
+ emb_init_fn_ = partial(torch.nn.init.normal_, mean=0.0, std=std)
63
+ elif emb_init_uniform_lim is not None:
64
+ lim = emb_init_uniform_lim
65
+ if isinstance(lim, Sequence):
66
+ if len(lim) > 2:
67
+ raise ValueError(f'Uniform init requires a min and a max limit. User input: {lim}.')
68
+ if lim[0] == lim[1]:
69
+ warnings.warn(f'Embedding layer initialized to {lim[0]}.')
70
+ else:
71
+ if lim == 0:
72
+ warnings.warn(f'Embedding layer initialized to 0.')
73
+ lim = [-lim, lim]
74
+ (a, b) = lim
75
+ emb_init_fn_ = partial(torch.nn.init.uniform_, a=a, b=b)
76
+ else:
77
+ emb_init_fn_ = init_fn_
78
+ emb_init_fn_(module.weight)
79
+ elif isinstance(module, tuple(set(NORM_CLASS_REGISTRY.values()))):
80
+ if hasattr(module, 'weight') and isinstance(module.weight, torch.Tensor):
81
+ torch.nn.init.ones_(module.weight)
82
+ if hasattr(module, 'bias') and isinstance(module.bias, torch.Tensor):
83
+ torch.nn.init.zeros_(module.bias)
84
+ elif isinstance(module, nn.MultiheadAttention):
85
+ if module._qkv_same_embed_dim:
86
+ assert module.in_proj_weight is not None
87
+ assert module.q_proj_weight is None and module.k_proj_weight is None and (module.v_proj_weight is None)
88
+ assert d_model is not None
89
+ _d = d_model
90
+ splits = (0, _d, 2 * _d, 3 * _d)
91
+ for (s, e) in zip(splits[:-1], splits[1:]):
92
+ init_fn_(module.in_proj_weight[s:e])
93
+ else:
94
+ assert module.q_proj_weight is not None and module.k_proj_weight is not None and (module.v_proj_weight is not None)
95
+ assert module.in_proj_weight is None
96
+ init_fn_(module.q_proj_weight)
97
+ init_fn_(module.k_proj_weight)
98
+ init_fn_(module.v_proj_weight)
99
+ if module.in_proj_bias is not None:
100
+ torch.nn.init.zeros_(module.in_proj_bias)
101
+ if module.bias_k is not None:
102
+ torch.nn.init.zeros_(module.bias_k)
103
+ if module.bias_v is not None:
104
+ torch.nn.init.zeros_(module.bias_v)
105
+ init_fn_(module.out_proj.weight)
106
+ if init_div_is_residual is not False and getattr(module.out_proj, '_is_residual', False):
107
+ with torch.no_grad():
108
+ module.out_proj.weight.div_(div_is_residual)
109
+ if module.out_proj.bias is not None:
110
+ torch.nn.init.zeros_(module.out_proj.bias)
111
+ elif te is not None and isinstance(module, te.LayerNormMLP):
112
+ if isinstance(module.layer_norm_weight, torch.Tensor):
113
+ torch.nn.init.ones_(module.layer_norm_weight)
114
+ if isinstance(module.layer_norm_bias, torch.Tensor):
115
+ torch.nn.init.zeros_(module.layer_norm_bias)
116
+ init_fn_(module.fc1_weight)
117
+ if module.fc1_bias is not None:
118
+ assert isinstance(module.fc1_bias, torch.Tensor)
119
+ torch.nn.init.zeros_(module.fc1_bias)
120
+ init_fn_(module.fc2_weight)
121
+ if module.fc2_bias is not None:
122
+ assert isinstance(module.fc2_bias, torch.Tensor)
123
+ torch.nn.init.zeros_(module.fc2_bias)
124
+ with torch.no_grad():
125
+ module.fc2_weight.div_(div_is_residual)
126
+ else:
127
+ for _ in module.parameters(recurse=False):
128
+ raise NotImplementedError(f'{module.__class__.__name__} parameters are not initialized by param_init_fn.')
129
+
130
+ def _normal_init_(std: float, mean: float=0.0) -> Callable:
131
+ return partial(torch.nn.init.normal_, mean=mean, std=std)
132
+
133
+ def _normal_param_init_fn_(module: nn.Module, std: float, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
134
+ del kwargs
135
+ init_fn_ = _normal_init_(std=std)
136
+ generic_param_init_fn_(module=module, init_fn_=init_fn_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
137
+
138
+ def baseline_param_init_fn_(module: nn.Module, init_std: Optional[float], n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
139
+ del kwargs
140
+ if init_std is None:
141
+ raise ValueError("You must set model.init_config['init_std'] to a float value to use the default initialization scheme.")
142
+ _normal_param_init_fn_(module=module, std=init_std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
143
+
144
+ def small_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
145
+ del kwargs
146
+ std = math.sqrt(2 / (5 * d_model))
147
+ _normal_param_init_fn_(module=module, std=std, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
148
+
149
+ def neox_param_init_fn_(module: nn.Module, n_layers: int, d_model: int, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, **kwargs: Any) -> None:
150
+ """From section 2.3.1 of GPT-NeoX-20B:
151
+
152
+ An Open-Source Autoregressive Language Model — Black et al. (2022)
153
+ see https://github.com/EleutherAI/gpt-neox/blob/9610391ab319403cef079b438edd016a2443af54/megatron/model/init_functions.py#L151
154
+ and https://github.com/EleutherAI/gpt-neox/blob/main/megatron/model/transformer.py
155
+ """
156
+ del kwargs
157
+ residual_div = n_layers / math.sqrt(10)
158
+ small_param_init_fn_(module=module, d_model=d_model, n_layers=n_layers, init_div_is_residual=residual_div, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
159
+
160
+ def kaiming_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', **kwargs: Any) -> None:
161
+ del kwargs
162
+ kaiming_uniform_ = partial(nn.init.kaiming_uniform_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
163
+ generic_param_init_fn_(module=module, init_fn_=kaiming_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
164
+
165
+ def kaiming_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, fan_mode: str='fan_in', init_nonlinearity: str='leaky_relu', **kwargs: Any) -> None:
166
+ del kwargs
167
+ kaiming_normal_ = partial(torch.nn.init.kaiming_normal_, a=init_gain, mode=fan_mode, nonlinearity=init_nonlinearity)
168
+ generic_param_init_fn_(module=module, init_fn_=kaiming_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
169
+
170
+ def xavier_uniform_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, **kwargs: Any) -> None:
171
+ del kwargs
172
+ xavier_uniform_ = partial(torch.nn.init.xavier_uniform_, gain=init_gain)
173
+ generic_param_init_fn_(module=module, init_fn_=xavier_uniform_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
174
+
175
+ def xavier_normal_param_init_fn_(module: nn.Module, n_layers: int, d_model: Optional[int]=None, init_div_is_residual: Union[int, float, str, bool]=True, emb_init_std: Optional[float]=None, emb_init_uniform_lim: Optional[Union[Tuple[float, float], float]]=None, init_gain: float=0, **kwargs: Any) -> None:
176
+ del kwargs
177
+ xavier_normal_ = partial(torch.nn.init.xavier_normal_, gain=init_gain)
178
+ generic_param_init_fn_(module=module, init_fn_=xavier_normal_, d_model=d_model, n_layers=n_layers, init_div_is_residual=init_div_is_residual, emb_init_std=emb_init_std, emb_init_uniform_lim=emb_init_uniform_lim)
179
+ MODEL_INIT_REGISTRY = {'default_': torch_default_param_init_fn_, 'baseline_': baseline_param_init_fn_, 'kaiming_uniform_': kaiming_uniform_param_init_fn_, 'kaiming_normal_': kaiming_normal_param_init_fn_, 'neox_init_': neox_param_init_fn_, 'small_init_': small_param_init_fn_, 'xavier_uniform_': xavier_uniform_param_init_fn_, 'xavier_normal_': xavier_normal_param_init_fn_}
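
`MODEL_INIT_REGISTRY` is the dispatch table behind `config.init_config['name']`: `MPTModel.param_init_fn` looks the name up here and applies the chosen function to every submodule via `self.apply`. As a self-contained illustration (not this repo's code path), the `small_init_` scheme reduces to a normal init with std = sqrt(2 / (5 * d_model)):

```python
# Illustrative sketch of the "small_init_" scheme registered above:
# weights ~ N(0, sqrt(2 / (5 * d_model))), biases zeroed, applied via Module.apply.
import math
import torch
from torch import nn

d_model = 512  # stands in for config.d_model
std = math.sqrt(2 / (5 * d_model))

def init_linear(module: nn.Module) -> None:
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

block = nn.Sequential(nn.Linear(d_model, d_model), nn.Linear(d_model, d_model))
block.apply(init_linear)             # mirrors MPTModel's self.apply(self.param_init_fn)
print(block[0].weight.std().item())  # close to `std`
```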
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:57458118149b4e0318854179a731a76dec1de8b4deed5262865e52f17c173a3f
3
+ size 135
requirements.txt ADDED
@@ -0,0 +1,3 @@
1
+ altair
2
+ pandas
3
+ streamlit
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<fim_prefix>",
4
+ "<fim_middle>",
5
+ "<fim_suffix>",
6
+ "<fim_eot>",
7
+ "<fim_pad>",
8
+ "[INST]",
9
+ "[/INST]",
10
+ "<extra_id_2>",
11
+ "<extra_id_3>",
12
+ "<extra_id_4>",
13
+ "<extra_id_5>",
14
+ "<extra_id_6>",
15
+ "<extra_id_7>",
16
+ "<extra_id_8>",
17
+ "<extra_id_9>",
18
+ "<extra_id_10>"
19
+ ],
20
+ "bos_token": "<|endoftext|>",
21
+ "eos_token": "<|endoftext|>",
22
+ "unk_token": "<|endoftext|>"
23
+ }
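
The `<fim_*>` entries above are fill-in-the-middle sentinels. How they are meant to be ordered is not documented in this diff; the sketch below uses the common prefix/suffix/middle (PSM) layout purely as an assumption:

```python
# Hypothetical FIM prompt built from the sentinel tokens declared above.
# The prefix/suffix/middle ordering is a common convention, not confirmed by this repo.
prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
fim_prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
# A FIM-trained model would be expected to emit the missing middle span,
# typically terminating with <fim_eot>.
print(fim_prompt)
```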
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "additional_special_tokens": [
4
+ "<fim_prefix>",
5
+ "<fim_middle>",
6
+ "<fim_suffix>",
7
+ "<fim_eot>",
8
+ "<fim_pad>",
9
+ "[INST]",
10
+ "[/INST]",
11
+ "<extra_id_2>",
12
+ "<extra_id_3>",
13
+ "<extra_id_4>",
14
+ "<extra_id_5>",
15
+ "<extra_id_6>",
16
+ "<extra_id_7>",
17
+ "<extra_id_8>",
18
+ "<extra_id_9>",
19
+ "<extra_id_10>"
20
+ ],
21
+ "bos_token": "<|endoftext|>",
22
+ "clean_up_tokenization_spaces": false,
23
+ "eos_token": "<|endoftext|>",
24
+ "model_max_length": 4096,
25
+ "tokenizer_class": "GPTNeoXTokenizer",
26
+ "unk_token": "<|endoftext|>"
27
+ }
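
With `tokenizer.json` and the two JSON files above in place, the tokenizer can be loaded straight from a local checkout of the repo; a short sketch:

```python
# Sketch: load the tokenizer shipped with this repo from a local checkout.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")      # run from the repo root
print(tok.model_max_length)                   # 4096, per tokenizer_config.json
print(tok.additional_special_tokens[:5])      # the <fim_*> sentinels come first
ids = tok("print('hello world')")["input_ids"]
print(len(ids), tok.decode(ids))
```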