Spaces:
Runtime error
Runtime error
<html lang="en"> | |
<head> | |
<title>Bootstrap Example</title> | |
<meta charset="utf-8"> | |
<meta name="viewport" content="width=device-width, initial-scale=1"> | |
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/css/bootstrap.min.css"> | |
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script> | |
<script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.4.1/js/bootstrap.min.js"></script> | |
<style> | |
.faded { | |
margin: 0 auto; | |
background: var(--window-color); | |
box-shadow: 0 0 5px 5px var(--window-color); | |
font-family: cursive; | |
font-family: "Gill Sans", sans-serif; | |
display: inline-block | |
} | |
.padded { | |
width: 100%; | |
max-width: 800px; | |
text-align: left; | |
} | |
.demo_title { | |
font-size: 32px; | |
box-shadow: 0 0 5px 5px var(--window-color); | |
font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Helvetica,Arial, | |
sans-serif,Apple Color Emoji,Segoe UI Emoji; | |
} | |
.demo_text { | |
font-size: 16px; | |
box-shadow: 0 0 5px 5px var(--window-color); | |
font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Helvetica,Arial, | |
sans-serif,Apple Color Emoji,Segoe UI Emoji; | |
} | |
.tab-group { | |
font-size: 15px; | |
} | |
.tab-content { | |
margin-top: 16px; | |
} | |
ul > li { | |
margin: 3px 0; | |
} | |
ol > li { | |
margin: 5px 0; | |
} | |
/* a:link { | |
color: #00194a; | |
text-decoration: none; | |
} | |
a:visited { | |
color: #3f004a; | |
text-decoration: none; | |
} */ | |
</style> | |
</head> | |
<body> | |
<div class="tab-group" style="width: 100%; margin:0 auto;"> | |
<div> | |
<!-- Nav tabs --> | |
<ul class="nav nav-tabs" role="tablist"> | |
<li role="presentation" class="active"><a href="#tab1" aria-controls="tab1" role="tab" data-toggle="tab">Memory-Efficient Training</a></li> | |
<li role="presentation"><a href="#tab2" aria-controls="tab2" role="tab" data-toggle="tab">Security</a></li> | |
<li role="presentation"><a href="#tab3" aria-controls="tab3" role="tab" data-toggle="tab">Make Your Own</a></li> | |
</ul> | |
<!-- Tab panes --> | |
<div class="tab-content"> | |
<div role="tabpanel" class="tab-pane active" id="tab1"> | |
<p> | |
Our aim is to train a large model in a decentralized fashion on consumer hardware or low-end cloud instances. | |
This means we need to make the model, dataset, and other memory buffers fit onto a few GB of disk, 12-16 GB of CPU RAM, | |
and 8-12 GB of GPU memory. Unfortunately, this rules out many popular techniques such as | |
<a target="_blank" rel="noopener noreferrer" href="https://arxiv.org/abs/2101.06840">ZeRO-Offload</a>: | |
there is simply not enough RAM for that. Instead, we must make better use of what limited memory we have. | |
To do this, we use two techniques: 8-bit Optimizers for GPU memory and dataset streaming for RAM & HDD. | |
</p> | |
<p> | |
<b>8-bit Optimizers:</b> | |
Using optimizers such as LAMB or Adam requires four times as much GPU memory as simply storing model parameters (8 bytes vs 2 bytes). | |
As such, for training large models with many parameters the optimizers make up the largest chunk of memory. | |
With 8-bit optimizers this memory is reduced by 75% (2 bytes) making it much easier to fit large models onto consumer GPUs. | |
</p><p> | |
Naturally, we can combine this technique with offloading: storing 8-bit optimizer states in CPU memory rather | |
than GPU memory (0 bytes GPU, 2 bytes CPU). To perform an optimizer update, we transfer the GPU gradients | |
to the CPU, perform the optimizer update, and then transfer the updated weights to the GPU. | |
We can do this for each weight one-by-one so that additional CPU memory required for the optimizer update | |
is minimal. | |
The combination of offloading and 8-bit optimizers means that we conserve GPU memory (0 bytes per parameter) | |
and also use only a limited amount of CPU memory (2 bytes per parameter). | |
</p> | |
<p> | |
<b>Dataset Streaming</b> | |
Usually data is stored on disk and needs to be fully or partially loaded into CPU memory to be used for training. | |
Large datasets used for pre-training measure in <a href="https://arxiv.org/abs/2101.00027">hundreds of gigabytes</a> or even <a href="https://laion.ai/laion-400-open-dataset/">terabytes</a>. | |
This can pose a significant problem, as most desktop and cheap cloud instance simply do not have that much space. | |
Furthermore, downloading the dataset over the internet would take up hours before one can even begin training. | |
<!--Changing the dataset means downloading a new dataset in full and using additional disk space.--> | |
</p><p> | |
To circumvent these problems, we stream the training dataset in the same way as you stream online videos. | |
Participants download a small random portion of the training dataset and immediately begin training on it, | |
while additional data is loaded in background. As such, we can train a model with virtually no memory | |
overhead from the dataset and switching to a new dataset is as simple as changing an argument to the streamer class. | |
</p> | |
</div> | |
<div role="tabpanel" class="tab-pane" id="tab2"> | |
<p>In this section, we discuss common concerns related to security of the collaborative training.</p> | |
<p> | |
<b>Q: If I join a collaborative training, do I allow other people to execute code on my computer?</b> | |
</p> | |
<p> | |
<b>A:</b> During the training, participants only exchange data (gradients, statistics, model weights) and never send code to each other. | |
No other peer can execute code on your computer. | |
</p> | |
<p> | |
To join the training, you typically need to run the code (implementing the model, data streaming, training loop, etc.) | |
from a repository or a Colab notebook provided by the authors of the experiment. | |
This is no different from running any other open source project/Colab notebook. | |
</p> | |
<p> | |
<b>Q: Can a malicious participant influence the training outcome?</b> | |
</p> | |
<p> | |
<b>A:</b> It is indeed possible unless we use some defense mechanism. | |
For instance, a malicious participant can damage model weights by sending large numbers instead of the correct gradients. | |
The same can happen due to broken hardware or misconfiguration. | |
</p> | |
<ul> | |
<li> | |
<p> | |
One possible defense is using <b>authentication</b> combined with <b>model checkpointing</b>. | |
In this case, participants should log in (e.g. with their Hugging Face account) to interact with the rest of the collaboration. | |
In turn, moderators can screen potential participants and add them to an allowlist. | |
If something goes wrong (e.g. if a participant sends invalid gradients and the model diverges), | |
the moderators remove them from the list and revert the model to the latest checkpoint unaffected by the attack. | |
</p> | |
<!-- <p><b>Spoiler (TODO): How to implement authentication in a decentralized system efficiently?</b></p>--> | |
<p> | |
Nice bonus: using this data, the moderators can acknowledge the personal contribution of each participant. | |
</p> | |
</li> | |
<li> | |
<p> | |
Another defense is replacing the naive averaging of the peers' gradients with an <b>aggregation technique robust to outliers</b>. | |
<a href="https://arxiv.org/abs/2012.10333">Karimireddy et al. (2020)</a> | |
suggested such a technique (named CenteredClip) and proved that it does not significantly affect the model's convergence. | |
</p> | |
<!-- <p><b>Spoiler (TODO): How does CenteredClip protect from outliers? (Interactive Demo)</b></p>--> | |
<p> | |
In our case, CenteredClip is useful but not enough to protect from malicious participants, | |
since it implies that the CenteredClip procedure itself is performed by a trusted server. | |
In contrast, in our decentralized system, all participants can aggregate a part of the gradients and we cannot assume all of them to be trusted. | |
</p> | |
<p> | |
Recently, <a href="https://arxiv.org/abs/2106.11257">Gorbunov et al. (2021)</a> | |
proposed a robust aggregation protocol for decentralized systems that does not require this assumption. | |
This protocol uses CenteredClip as a subroutine but is able to detect and ban participants who performed it incorrectly. | |
</p> | |
</li> | |
</ul> | |
</div> | |
<div role="tabpanel" class="tab-pane" id="tab3"> | |
<p>In this section, we provide a roadmap for you to run the collaborative training yourself.</p> | |
<p> | |
<b>Got confused?</b> Feel free to ask any questions at our <a href="https://discord.gg/uGugx9zYvN">Discord</a>! | |
</p> | |
<ol> | |
<li> | |
Set up dataset streaming: | |
<ul> | |
<li> | |
<a href="https://huggingface.co/docs/datasets/share_dataset.html">Upload</a> your dataset to Hugging Face Hub | |
in a streaming-friendly format (<a href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">example</a>). | |
</li> | |
<li>Set up dataset streaming (see the "Efficient Training" section).</li> | |
</ul> | |
</li> | |
<li> | |
Write code of training peers (<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_trainer.py">example</a>): | |
<ul> | |
<li>Implement your model, set up dataset streaming, and write the training loop.</li> | |
<li> | |
Get familiar with the hivemind library | |
(e.g., via the <a href="https://learning-at-home.readthedocs.io/en/latest/user/quickstart.html">quickstart</a>). | |
</li> | |
<li> | |
In the training loop, wrap up your PyTorch optimizer with | |
<a href="https://learning-at-home.readthedocs.io/en/latest/modules/optim.html#hivemind.optim.experimental.optimizer.Optimizer">hivemind.Optimizer</a> | |
(<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/task.py#L121">example</a>). | |
</li> | |
</ul> | |
</li> | |
<li> | |
<b>(optional)</b> Write code of auxiliary peers (<a href="https://github.com/learning-at-home/dalle-hivemind/blob/main/run_aux_peer.py">example</a>): | |
<ul> | |
<li> | |
Auxiliary peers a special kind of peers responsible for | |
logging loss and other metrics (e.g., to <a href="https://wandb.ai/">Weights & Biases</a>) | |
and uploading model checkpoints (e.g., to <a href="https://huggingface.co/docs/transformers/model_sharing">Hugging Face Hub</a>). | |
</li> | |
<li> | |
Such peers don't need to calculate gradients and may be run on cheap machines without GPUs. | |
</li> | |
<li> | |
They can serve as a convenient entry point to | |
<a href="https://learning-at-home.readthedocs.io/en/latest/modules/dht.html">hivemind.DHT</a> | |
(i.e., their address can be specified as <code>initial_peers</code>). | |
</li> | |
<li> | |
It is useful to fix their address by providing <code>host_maddrs</code> and <code>identity_path</code> | |
arguments to <code>hivemind.DHT</code> | |
(these are forwarded to the underlying <a href="https://libp2p.io/">libp2p</a> daemon). | |
</li> | |
</ul> | |
</li> | |
<li> | |
<b>(optional)</b> Make it easier for other people to join: | |
<ul> | |
<li> | |
Create notebooks for free GPU providers (Google Colab, Kaggle, AWS SageMaker, etc.). | |
People may run them online and/or download and run them on their own hardware. | |
</li> | |
<li> | |
<a href="https://huggingface.co/organizations/new">Create</a> a Hugging Face organization | |
with all resources related to the training | |
(dataset, model, inference demo, links to a dashboard with loss and other metrics, etc.). | |
Look at <a href="https://huggingface.co/training-transformers-together">ours</a> as an example. | |
</li> | |
<li> | |
Set up an authentication system (see the "Security" section). | |
For example, you can ask people to join your organization with their Hugging Face accounts | |
(Hugging Face allows to share a link for joining or manually approve new participants). | |
This allows you to screen participants, | |
acknowledge their contributions (e.g., make a leaderboard), and | |
ban accounts who behave maliciously. | |
</li> | |
<li> | |
Set up an inference demo for your model (e.g., using <a href="https://huggingface.co/spaces">Spaces</a>) or | |
a script that periodically uploads the inference results to show the training progress. | |
</li> | |
</ul> | |
</li> | |
</ol> | |
</div> | |
</div> | |
</div> | |
</div> | |
</body> |