""" This specific file was bodged together by ham-handed hedgehogs. If something looks wrong, it's because it is. If you're not a hedgehog, you shouldn't reuse this code. Use this instead: https://docs.streamlit.io/library/get-started """ import streamlit as st from st_helpers import make_header, content_text, content_title, cite, make_footer, make_tabs from charts import draw_current_progress st.set_page_config(page_title="Training Transformers Together", layout="centered") st.markdown("## Full demo content will be posted here on December 7th!") make_header() content_text(f""" There was a time when you could comfortably train state-of-the-art vision and language models at home on your workstation. The first convolutional neural net to beat ImageNet ({cite("AlexNet", "https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf")}) was trained for 5-6 days on two gamer-grade GPUs. In contrast, today's TOP-1 ImageNet model ({cite("CoAtNet", "https://arxiv.org/abs/2106.04803")}) takes 20,000 TPU-v3 days. And things are even worse in the NLP world: training {cite("GPT‑3", "https://arxiv.org/abs/2005.14165")} on a top-tier server with 8x A100 would take decades.""") content_text(f""" So, can individual researchers and small labs still train state-of-the-art? Yes we can! All it takes is for a bunch of us to come together. In fact, we're doing it right now and you're invited to join! """, vspace_before=12) draw_current_progress() content_text(f""" We're training a model similar to {cite("OpenAI DALL-E", "https://openai.com/blog/dall-e/")}, that is, a transformer "language model" that generates images from text description. It is trained on {cite("LAION-400M", "https://laion.ai/laion-400-open-dataset/")}, the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on the {cite("dalle‑pytorch", "https://github.com/lucidrains/DALLE-pytorch")} implementation by {cite("Phil Wang", "https://github.com/lucidrains")} with a few tweaks to make it communication-efficient. """, vspace_after=8) with st.expander("How to train efficiently over the internet?"): content_text(f""" Modern distributed training algorithms are designed for HPC networks with 10-100 gigabit per second bandwidth. In turn, a typical Internet connection runs at 10-100 megabits per second: that’s three orders of magnitude slower. To make distributed training over the Internet efficient, you need to win back these three orders of magnitude. """) content_text(f""" This may seem daunting at first, but in reality, DL researchers have already made all the necessary pieces for solving this puzzle:

| Speed-up (AllReduce) | Existing technique |
|----------------------|--------------------|
| 4-16x | Large-batch training: {cite("You et al. (2019)", "https://arxiv.org/abs/1904.00962")} proposed a way to train neural networks efficiently with larger batches and, hence, fewer communication rounds (see the first sketch below the table). |
| 4-64x | Gradient compression: from simple {cite("8-bit quantization", "https://arxiv.org/abs/1511.04561")} to advanced techniques such as {cite("Deep Gradient Compression", "https://arxiv.org/abs/1712.01887")}, {cite("PowerSGD", "https://arxiv.org/abs/1905.13727")}, {cite("1-bit Adam", "https://arxiv.org/abs/2102.02888")}, and many others. As a rule of thumb, you can safely reduce communication by 16-64x. More extreme compression is often possible, but it may affect stability or final quality (a toy quantizer is sketched below the table). |
| 4-24x | Parameter sharing: reusing parameters between model layers results in a model with fewer parameters and, hence, fewer gradients to communicate. {cite("Lan et al. (2019)", "https://arxiv.org/abs/1909.11942")} and {cite("Xue et al. (2021)", "https://arxiv.org/pdf/2107.11817.pdf")} propose efficient parameter sharing techniques for NLP and vision (see the sharing sketch below the table). |
| 1.5-2x | Overlapping computation with communication: running network communication in the background while computing the next portion of gradients. This is a {cite("long-standing trick from HPC", "https://ur.booksc.eu/book/1624068/2d0506")} that was recently adapted for DL training. {cite("Ren et al. (2021)", "https://arxiv.org/abs/2101.06840")} show that updating parameters in the background while computing the next batch of gradients does not hurt convergence (the last sketch below the table illustrates the idea). |
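
To make these techniques concrete, here are a few minimal, self-contained PyTorch sketches.
They are simplified illustrations under our own assumptions, not the code this project actually runs.
First, large-batch training: if each peer accumulates gradients over many micro-batches,
peers only need to exchange gradients once per (very large) optimizer step.
LAMB adds layer-wise learning rate scaling on top of this, which we omit here:

```python
import torch

def train_with_large_batches(model, optimizer, data_loader, accumulation_steps=64):
    # Accumulate gradients over many micro-batches locally; gradients only need
    # to be averaged across peers once per optimizer step, so the number of
    # communication rounds drops roughly by a factor of accumulation_steps.
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(data_loader):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            # in distributed training, gradient averaging would happen here
            optimizer.step()
            optimizer.zero_grad()
```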
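
Next, gradient compression. The toy quantizer below packs each gradient tensor into int8 values
plus a single float32 scale, roughly a 4x reduction over sending raw float32 gradients;
real schemes such as the cited 8-bit approximations, PowerSGD, or 1-bit Adam use bucketing,
low-rank factorization, and error feedback to compress much further:

```python
import torch

def quantize_8bit(grad):
    # Map a gradient tensor to int8 plus one float scale, cutting the bytes
    # sent over the network by about 4x compared to float32.
    scale = grad.abs().max() / 127.0 + 1e-12
    quantized = (grad / scale).round().clamp(-127, 127).to(torch.int8)
    return quantized, scale

def dequantize_8bit(quantized, scale):
    # Approximate reconstruction on the receiving side.
    return quantized.to(torch.float32) * scale
```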
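
Then, parameter sharing. In the ALBERT-style toy model below, a single transformer block is reused
at every depth, so there is only one block's worth of parameters, and therefore of gradients,
to synchronize between peers:

```python
import torch.nn as nn

class SharedLayerTransformer(nn.Module):
    # One transformer block reused at every depth (ALBERT-style): the model
    # still runs num_layers forward passes, but only a single block's
    # parameters have to be stored and communicated.
    def __init__(self, d_model=1024, nhead=16, num_layers=24):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.num_layers = num_layers

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_block(x)
        return x
```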
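
Finally, overlapping computation with communication. The sketch below uses torch.distributed's
asynchronous all-reduce as a stand-in for whatever averaging protocol is actually deployed:
each gradient starts traveling over the network as soon as backward() produces it, while the
remaining layers are still being processed. In practice you would wait on the returned handles
and divide by the number of peers before calling optimizer.step():

```python
import torch.distributed as dist

def overlap_gradient_averaging(model):
    # Launch an asynchronous all-reduce as soon as each parameter's gradient is
    # ready, so network transfer runs in the background while backward() keeps
    # computing gradients for the remaining layers.
    handles = []

    def hook(grad):
        handles.append(dist.all_reduce(grad, async_op=True))
        return grad

    for param in model.parameters():
        if param.requires_grad:
            param.register_hook(hook)
    return handles  # wait on all handles before the optimizer step
```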