Commit 795dc75 (parent: 81885f7)
static/tabs.html · +20 -19
@@ -93,7 +93,8 @@ a:visited {
 <p>
 <b>Dataset Streaming</b>
 Usually, data is stored on disk and needs to be fully or partially loaded into CPU memory to be used for training.
-Large datasets used for pre-training measure in <a href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a>
+Large datasets used for pre-training measure in <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a>
+or even <a target="_blank" rel="noopener noreferrer" href="https://laion.ai/laion-400-open-dataset/">terabytes</a>.
 This can pose a significant problem, as most desktops and cheap cloud instances simply do not have that much disk space.
 Furthermore, downloading the dataset over the internet would take hours before one can even begin training.
 <!--Changing the dataset means downloading a new dataset in full and using additional disk space.-->
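The dataset streaming described in this hunk boils down to iterating over a remote dataset without downloading it first. A minimal sketch with the Hugging Face `datasets` library (the dataset name, buffer size, and field name are illustrative):

```python
from datasets import load_dataset

# streaming=True downloads nothing up front: samples are fetched lazily
# over HTTP as the training loop consumes them, so the full dataset
# never has to fit on local disk.
dataset = load_dataset("c4", "en", split="train", streaming=True)

# Streamed datasets are shuffled approximately, over a rolling buffer.
dataset = dataset.shuffle(buffer_size=10_000, seed=42)

for i, sample in enumerate(dataset):
    print(sample["text"][:80])  # peek at a few samples
    if i == 2:
        break
```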
@@ -106,7 +107,7 @@ a:visited {
 </p>
 <center>
 Here's a tutorial for using these techniques:<br>
-<a href="https://colab.research.google.com/gist/justheuristic/75f6a2a731f05a213a55cd2c8a458aaf/fine-tune-a-language-model-with-dataset-streaming-and-8-bit-optimizers.ipynb">
+<a target="_blank" rel="noopener noreferrer" href="https://colab.research.google.com/gist/justheuristic/75f6a2a731f05a213a55cd2c8a458aaf/fine-tune-a-language-model-with-dataset-streaming-and-8-bit-optimizers.ipynb">
 <img src="https://colab.research.google.com/assets/colab-badge.svg" width=360px>
 </a>
 </center>
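The linked notebook pairs dataset streaming with 8-bit optimizers. A minimal sketch of the latter using the bitsandbytes library (the model and hyperparameters are placeholders):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(768, 768).cuda()  # placeholder model

# Adam8bit stores the optimizer statistics (first and second momenta)
# in 8-bit precision, cutting optimizer memory roughly 4x vs. 32-bit Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 768, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```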
@@ -159,7 +160,7 @@ a:visited {
 <li>
 <p>
 Another defense is replacing the naive averaging of the peers' gradients with an <b>aggregation technique robust to outliers</b>.
-<a href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a>
+<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a>
 suggested such a technique (named CenteredClip) and proved that it does not significantly affect the model's convergence.
 </p>
 
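For intuition about CenteredClip: instead of averaging raw gradient vectors, it iteratively re-centers the estimate while clipping each peer's offset to a ball of radius tau, so a few malicious outliers cannot drag the result arbitrarily far. A simplified single-machine sketch (tau and the iteration count are illustrative; the actual protocol runs this aggregation across peers):

```python
import torch

def centered_clip(peer_grads: torch.Tensor, tau: float = 1.0, n_iters: int = 5) -> torch.Tensor:
    """Robustly aggregate one flattened gradient per peer; peer_grads has shape (n_peers, dim)."""
    v = peer_grads.mean(dim=0)  # start from the naive average
    for _ in range(n_iters):
        diffs = peer_grads - v                               # each peer's offset from the center
        norms = diffs.norm(dim=1, keepdim=True).clamp_min(1e-12)
        clipped = diffs * torch.clamp(tau / norms, max=1.0)  # shrink offsets longer than tau
        v = v + clipped.mean(dim=0)                          # re-center on the clipped average
    return v
```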
@@ -172,7 +173,7 @@ a:visited {
 </p>
 
 <p>
-Recently, <a href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a>
+Recently, <a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a>
 proposed a robust aggregation protocol for decentralized systems that does not require this assumption.
 This protocol uses CenteredClip as a subroutine but is able to detect and ban participants who perform it incorrectly.
 </p>
@@ -182,54 +183,54 @@ a:visited {
 <div role="tabpanel" class="tab-pane" id="tab3">
 <p>In this section, we provide a roadmap for you to run the collaborative training yourself.</p>
 <p>
-<b>Confused?</b> Feel free to ask questions in our <a href="https://discord.gg/uGugx9zYvN">Discord</a>!
+<b>Confused?</b> Feel free to ask questions in our <a target="_blank" rel="noopener noreferrer" href="https://discord.gg/uGugx9zYvN">Discord</a>!
 </p>
 <ol>
 <li>
 Set up dataset streaming:
 <ul>
 <li>
-<a href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to Hugging Face Hub
-in a streaming-friendly format (<a href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>).
+<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to Hugging Face Hub
+in a streaming-friendly format (<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>).
 </li>
 <li>Set up dataset streaming (see the "Efficient Training" section).</li>
 </ul>
 </li>
 <li>
-Write the code for training peers (<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>):
+Write the code for training peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>):
 <ul>
 <li>Implement your model, set up dataset streaming, and write the training loop.</li>
 <li>
 Get familiar with the hivemind library
-(e.g., via the <a href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>).
+(e.g., via the <a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>).
 </li>
 <li>
 In the training loop, wrap your PyTorch optimizer with
-<a href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a>
-(<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>).
+<a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a>
+(<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>).
 </li>
 </ul>
 </li>
 <li>
-<b>(optional)</b> Write the code for auxiliary peers (<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>):
+<b>(optional)</b> Write the code for auxiliary peers (<a target="_blank" rel="noopener noreferrer" href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>):
 <ul>
 <li>
 Auxiliary peers are a special kind of peer responsible for
-logging loss and other metrics (e.g., to <a href="https://wandb.ai/">Weights & Biases</a>)
-and uploading model checkpoints (e.g., to <a href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>).
+logging loss and other metrics (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://wandb.ai/">Weights & Biases</a>)
+and uploading model checkpoints (e.g., to <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>).
 </li>
 <li>
 Such peers don't need to calculate gradients and may be run on cheap machines without GPUs.
 </li>
 <li>
 They can serve as a convenient entry point to
-<a href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a>
+<a target="_blank" rel="noopener noreferrer" href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a>
 (i.e., their address can be specified as <code>initial_peers</code>).
 </li>
 <li>
 It is useful to fix their address by providing the <code>host_maddrs</code> and <code>identity_path</code>
 arguments to <code>hivemind.DHT</code>
-(these are forwarded to the underlying <a href="https://libp2p.io/">libp2p</a> daemon).
+(these are forwarded to the underlying <a target="_blank" rel="noopener noreferrer" href="https://libp2p.io/">libp2p</a> daemon).
 </li>
 </ul>
 </li>
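The roadmap in this hunk compresses two pieces of hivemind code. First, wrapping a PyTorch optimizer with hivemind.Optimizer (step 2). A minimal sketch following the hivemind quickstart; the run id, multiaddress, batch sizes, and model are all placeholders:

```python
import torch
import hivemind

# Join the swarm via any existing peer; the very first peer omits initial_peers.
dht = hivemind.DHT(
    initial_peers=["/ip4/203.0.113.7/tcp/31337/p2p/Qm..."],  # placeholder multiaddress
    start=True,
)

model = torch.nn.Linear(768, 2)  # placeholder model
base_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Peers accumulate gradients until the swarm has processed target_batch_size
# samples in total, then average gradients and step together.
optimizer = hivemind.Optimizer(
    dht=dht,
    run_id="my_run",              # peers with the same run_id train together
    batch_size_per_step=32,       # samples per local forward/backward pass
    target_batch_size=4096,       # global batch size per collaborative step
    optimizer=base_optimizer,
    use_local_updates=False,      # average gradients rather than parameters
    matchmaking_time=3.0,
    averaging_timeout=10.0,
    verbose=True,
)
# From here, use it like a regular optimizer in the training loop:
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```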
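Second, the fixed-address auxiliary peer from step 3. A sketch, with the port and key path as placeholders:

```python
import hivemind

# A fixed TCP port plus a persistent libp2p identity key keep this peer's
# multiaddress stable across restarts, so other peers can hardcode it
# as one of their initial_peers.
dht = hivemind.DHT(
    host_maddrs=["/ip4/0.0.0.0/tcp/31337"],  # listen on a fixed port (placeholder)
    identity_path="./p2p_identity.key",      # generated on first run, reused afterwards
    start=True,
)

# Print the addresses that training peers should pass as initial_peers.
for addr in dht.get_visible_maddrs():
    print(addr)
```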
@@ -241,10 +242,10 @@ a:visited {
 People may run them online and/or download and run them on their own hardware.
 </li>
 <li>
-<a href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization
+<a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization
 with all resources related to the training
 (dataset, model, inference demo, links to a dashboard with loss and other metrics, etc.).
-Look at <a href="https://huggingface.co/training-transformers-together">ours</a> as an example.
+Look at <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/training-transformers-together">ours</a> as an example.
 </li>
 <li>
 Set up an authentication system (see the "Security" section).
@@ -255,7 +256,7 @@ a:visited {
 ban accounts that behave maliciously.
 </li>
 <li>
-Set up an inference demo for your model (e.g., using <a href="https://huggingface.co/spaces">Spaces</a>) or
+Set up an inference demo for your model (e.g., using <a target="_blank" rel="noopener noreferrer" href="https://huggingface.co/spaces">Spaces</a>) or
 a script that periodically uploads the inference results to show the training progress.
 </li>
 </ul>