Mais Alheraki committed on
Commit
c474fbe
·
1 Parent(s): 4d75517

Update README.md

Files changed (1)
  1. README.md +62 -81
README.md CHANGED
@@ -18,97 +18,78 @@ pinned: false
  <!-- The classes below are necessary for correct rendering -->
  <div class="lg:col-span-3">
  <img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/logo.png" width="380" alt="CALM Logo" />
- <p class="mb-2" style="font-size:64px">
  CALM: Collaborative Arabic Language Model
  </p>
  <p class="mb-2">
- The CALM project is a joint effort led by <u><a href="https://sdaia.gov.sa/ncai/?Lang=en">NCAI</a></u> in collaboration with <u><a href="https://yandex.com/">Yandex</a> and <a href="https://huggingface.co/">HuggingFace</a></u> to train an Arabic language model with volunteers from around the globe. The project is an adaptation of the framework proposed at the NeurIPS 2021 demonstration: <u><a href="https://huggingface.co/training-transformers-together">Training Transformers Together</a></u>.
- TODO
- In this demo, we train a model similar to <u><a target="_blank" href="https://openai.com/blog/dall-e/">OpenAI DALL-E</a></u>:
- a Transformer "language model" that generates images from text descriptions.
- Training happens collaboratively: volunteers from all over the Internet contribute to the training using hardware available to them.
- We use <u><a target="_blank" href="https://laion.ai/laion-400-open-dataset/">LAION-400M</a></u>,
- the world's largest openly available image-text-pair dataset with 400 million samples. Our model is based on
- the <u><a target="_blank" href="https://github.com/lucidrains/DALLE-pytorch">dalle-pytorch</a></u> implementation
- by <u><a target="_blank" href="https://github.com/lucidrains">Phil Wang</a></u> with a few tweaks to make it communication-efficient.
  </p>
  <p class="mb-2">
- See details about how to join and how it works on <u><a target="_blank" href="https://training-transformers-together.github.io/">our website</a></u>.
  </p>
  <p class="mb-2">
- This organization gathers people participating in the collaborative training and provides links to the necessary resources:
  </p>
  <ul class="mb-2">
- <li>👉 Starter kits for <u><a target="_blank" href="https://colab.research.google.com/drive/1BqTWcfsvNQwQqqCRKMKp1_jvQ5L1BhCY?usp=sharing">Google Colab</a></u> and <u><a target="_blank" href="https://www.kaggle.com/yhn112/training-transformers-together/">Kaggle</a></u> (an easy way to join the training)</li>
- <li>👉 <u><a target="_blank" href="https://huggingface.co/spaces/training-transformers-together/Dashboard">Dashboard</a></u> (the current training state: loss, number of peers, etc.)</li>
- <li>👉 <u><a target="_blank" href="https://colab.research.google.com/drive/1Vkb-4nhEEH1a5vrKtpL4MTNiUTPdpPUl?usp=sharing">Colab notebook for running inference</a></u></li>
- <li>👉 <u><a target="_blank" href="https://huggingface.co/training-transformers-together/dalle-demo-v1">Model weights</a></u> (the latest checkpoint)</li>
- <li>👉 Weights & Biases plots for <u><a target="_blank" href="https://wandb.ai/learning-at-home/dalle-hivemind/runs/3l7q56ht">aux peers</a></u> (aggregating the metrics) and actual <u><a target="_blank" href="https://wandb.ai/learning-at-home/dalle-hivemind-trainers">trainers</a></u> (contributing their GPUs)</li>
- <li>👉 <u><a target="_blank" href="https://github.com/learning-at-home/dalle-hivemind">Code</a></u></li>
- <li>👉 <u><a target="_blank" href="https://huggingface.co/datasets/laion/laion_100m_vqgan_f8">Dataset</a></u></li>
  </ul>
  <p class="mb-2">
- Feel free to reach us on <u><a target="_blank" href="https://discord.gg/uGugx9zYvN">Discord</a></u> if you have any questions 🙂
  </p>
  </div>
-
-
- <div class="lg:col-span-3">
-
-
-
- <!-- this is actually not markdown, please use regular HTML -->
-
- #
-
-
- One of the main obstacles facing many researchers in the Arabic NLP community is the lack of computing resources needed to train large models. Models with leading performance on Arabic NLP tasks, such as <a href="https://github.com/aub-mind/arabert">AraBERT</a>, <a href="https://github.com/CAMeL-Lab/CAMeLBERT">CamelBERT</a>, <a href="https://huggingface.co/aubmindlab/araelectra-base-generator">AraELECTRA</a>, and <a href="https://huggingface.co/qarib">QARiB</a>, took days to train on TPUs. In the spirit of the democratization of AI and community enablement, a core value at NCAI, CALM aims to demonstrate the effectiveness of collaborative training and to form a community of volunteers for Arabic NLP (ANLP) researchers with basic-level cloud GPUs who wish to train their own models collaboratively.
-
- CALM trains a single BERT model on a dataset that combines MSA from the OSCAR and Arabic Wikipedia corpora with dialectal data for the Gulf region, drawn from existing open-source datasets. Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the background to reduce local memory consumption. Gradients are computed and aggregated in a distributed manner, based on the computing abilities of each participating volunteer. Details of the distributed training process are described in the paper <a href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">Distributed Deep Learning in Open Collaborations</a>.
-
- <h2>How to participate in training?</h2>
-
- To join the collaborative training, all you have to do is keep a notebook running for <b>at least 15 minutes</b>; you're free to close it after that and join again at another time. There are a few steps to complete before running the notebook:
-
- <ol class="list-decimal">
- <li>Create an account on <a href="https://huggingface.co">Hugging Face</a>.</li>
- <li>Join the <a href="https://huggingface.co/CALM">NCAI-CALM Organization</a> on Hugging Face through the invitation link shared with you by email.</li>
- <li>Get your Access Token; it is required later in the notebook
- <ol>
- <li>Go to your <a href="https://huggingface.co">HF account</a></li>
- <li>Go to Settings ⇒ Access Tokens</li>
- <li>Generate a new Access Token, entering any name under "What's this token for"</li>
- <li>Select the <code>read</code> role</li>
- <li>Copy your access token</li>
- <li>Paste it in the execution prompt in the notebook</li>
- </ol>
- </li>
- </ol>
-
- <h2>Start training</h2>
-
- <p>Pick one of the following methods to run the training code.
- <br /><em>NOTE: Kaggle gives you around 40 hours of GPU time per week, so it is preferred over Colab unless you have Colab Pro or Colab Pro+.</em></p>
-
- <ul class="list-decimal">
- <li>
- <span><a href="https://www.kaggle.com/prmais/volunteer-gpu-notebook">
- <img style="display:inline;margin:0px" src="https://img.shields.io/badge/kaggle-Open%20in%20Kaggle-blue.svg"/>
- </a></span> <b>(recommended)</b> <br />
- </li>
- <li>
- <span><a href="https://colab.research.google.com/github/NCAI-Research/CALM/blob/main/notebooks/volunteer-gpu-notebook.ipynb">
- <img style="display:inline;margin:0px" src="https://colab.research.google.com/assets/colab-badge.svg"/>
- </a></span>
- </li>
- <li><b>Running locally</b>
- <br />If you have additional local GPUs, please visit our Discord channel for instructions on setting them up.
- </li>
- </ul>
-
- <h2>Issues or questions?</h2>
- We are here to provide any assistance needed; please make sure to join our <span><a href="https://discord.gg/vRNN9ua2">
- <img style="display:inline;margin:0px" src="https://badgen.net/badge/icon/discord?icon=discord&label"/>
- </a></span>
-
- </div>
  <!-- The classes below are necessary for correct rendering -->
  <div class="lg:col-span-3">
  <img src="https://raw.githubusercontent.com/NCAI-Research/CALM/main/assets/logo.png" width="380" alt="CALM Logo" />
+ <p class="h1 mb-2">
  CALM: Collaborative Arabic Language Model
  </p>
  <p class="mb-2">
+ The CALM project is a joint effort led by <u><a target="_blank" href="https://sdaia.gov.sa/ncai/?Lang=en">NCAI</a></u> in collaboration with
+ <u><a target="_blank" href="https://yandex.com/">Yandex</a> and <a href="https://huggingface.co/">HuggingFace</a></u> to train an Arabic language model with
+ volunteers from around the globe. The project is an adaptation of the framework proposed at the NeurIPS 2021 demonstration:
+ <u><a target="_blank" href="https://huggingface.co/training-transformers-together">Training Transformers Together</a></u>.
  </p>
  <p class="mb-2">
+ One of the main obstacles facing many researchers in the Arabic NLP community is the lack of computing resources needed to train large models. Models with
+ leading performance on Arabic NLP tasks, such as <a target="_blank" href="https://github.com/aub-mind/arabert">AraBERT</a>,
+ <a href="https://github.com/CAMeL-Lab/CAMeLBERT" target="_blank">CamelBERT</a>,
+ <a href="https://huggingface.co/aubmindlab/araelectra-base-generator" target="_blank">AraELECTRA</a>, and <a href="https://huggingface.co/qarib">QARiB</a>,
+ took days to train on TPUs. In the spirit of the democratization of AI and community enablement, a core value at NCAI, CALM aims to demonstrate the effectiveness
+ of collaborative training and to form a community of volunteers for Arabic NLP (ANLP) researchers with basic-level cloud GPUs who wish to train their own models collaboratively.
  </p>
  <p class="mb-2">
+ CALM trains a single BERT model on a dataset that combines MSA from the OSCAR and Arabic Wikipedia corpora with dialectal data for the Gulf region, drawn from
+ existing open-source datasets. Each volunteer GPU trains the model locally at its own pace on a portion of the dataset while another portion is streamed in the
+ background to reduce local memory consumption. Gradients are computed and aggregated in a distributed manner, based on the computing abilities of each participating
+ volunteer. Details of the distributed training process are described in the paper
+ <a target="_blank" href="https://papers.nips.cc/paper/2021/hash/41a60377ba920919939d83326ebee5a1-Abstract.html">Distributed Deep Learning in Open Collaborations</a>.
  </p>
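+
+ <p class="mb-2">
+ For illustration only, here is an editor's sketch (not the project's actual training script; the run name, peer address, and batch sizes are made-up placeholders) of how a volunteer peer can pair a streamed dataset with collaborative gradient averaging using the <code>hivemind</code> library underlying this framework:
+ </p>
+
+ <pre><code class="language-python">
+ # Hedged sketch of a volunteer peer: stream data shards and let hivemind
+ # average gradients across the swarm. All names and addresses below are
+ # illustrative placeholders, not CALM's real configuration.
+ import hivemind
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForMaskedLM
+
+ model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
+
+ # Streaming keeps local memory small: samples arrive in the background
+ # instead of the whole corpus being downloaded up front.
+ dataset = load_dataset("oscar", "unshuffled_deduplicated_ar", streaming=True)
+
+ # Join the distributed hash table that connects volunteer peers.
+ dht = hivemind.DHT(initial_peers=["/ip4/1.2.3.4/tcp/31337/p2p/PEER_ID"], start=True)
+
+ # hivemind.Optimizer accumulates gradients from all peers until the swarm
+ # reaches target_batch_size, then averages them and applies one global step,
+ # so each volunteer contributes at whatever pace its GPU allows.
+ opt = hivemind.Optimizer(
+     dht=dht,
+     run_id="calm-demo",  # placeholder experiment name
+     optimizer=torch.optim.AdamW(model.parameters(), lr=1e-4),
+     batch_size_per_step=4,   # samples this peer processes per local step
+     target_batch_size=4096,  # global batch size across the whole swarm
+     use_local_updates=False,
+     verbose=True,
+ )
+ # An ordinary PyTorch loop then calls loss.backward() and opt.step();
+ # a real update is applied only once the swarm's global batch is full.
+ </code></pre>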
+
+ <p class="h2 mb-2">
+ How to participate in training?
+ </p>
+ <p class="mb-2">
+ To join the collaborative training, all you have to do is keep a notebook running for <b>at least 15 minutes</b>; you're free to close it after that and join again
+ at another time. There are a few steps to complete before running the notebook:
+ </p>
+
  <ul class="mb-2">
+ <li>👉 Create an account on <a href="https://huggingface.co">Hugging Face</a>.</li>
+ <li>👉 Join the <a href="https://huggingface.co/CALM">NCAI-CALM Organization</a> on Hugging Face through the invitation link shared with you by email.</li>
+ <li>👉 Get your Access Token; it is required later in the notebook.
+ <ul class="mb-2">
+ <li>👉 Go to your <a href="https://huggingface.co">HF account</a></li>
+ <li>👉 Go to Settings ⇒ Access Tokens</li>
+ <li>👉 Generate a new Access Token, entering any name under "What's this token for"</li>
+ <li>👉 Select the <code>read</code> role</li>
+ <li>👉 Copy your access token</li>
+ <li>👉 Paste it in the execution prompt in the notebook (see the sketch below)</li>
+ </ul>
+ </li>
  </ul>
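+
+ <p class="mb-2">
+ For orientation only (an editor's sketch; the notebook's own prompt may look different), pasting the token boils down to authenticating with <code>huggingface_hub</code>:
+ </p>
+
+ <pre><code class="language-python">
+ # Hedged sketch: authenticate this machine against the Hugging Face Hub
+ # using the read-role Access Token copied from Settings / Access Tokens.
+ from huggingface_hub import login, whoami
+
+ login(token="hf_xxx")    # replace hf_xxx with your own Access Token
+ print(whoami()["name"])  # sanity check: prints your HF username
+ </code></pre>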
+
+ <p class="h2 mb-2">
+ Start training
+ </p>
+ <p class="mb-2">Pick one of the following methods to run the training code.
+ <br /><em>NOTE: Kaggle gives you around 40 hours of GPU time per week, so it is preferred over Colab unless you have Colab Pro or Colab Pro+.</em></p>
+ <ul class="mb-2">
+ <li><span><a href="https://www.kaggle.com/prmais/volunteer-gpu-notebook">
+ <img style="display:inline;margin:0px" src="https://img.shields.io/badge/kaggle-Open%20in%20Kaggle-blue.svg"/>
+ </a></span> <b>(recommended)</b><br />
+ </li>
+ <li><span><a href="https://colab.research.google.com/github/NCAI-Research/CALM/blob/main/notebooks/volunteer-gpu-notebook.ipynb">
+ <img style="display:inline;margin:0px" src="https://colab.research.google.com/assets/colab-badge.svg"/>
+ </a></span>
+ </li>
+ <li>Running locally
+ <br />If you have additional local GPUs, please visit our Discord channel for instructions on setting them up.
+ </li>
+ </ul>
+
+ <p class="h2 mb-2">
+ Issues or questions?
+ </p>
+
  <p class="mb-2">
+ Feel free to reach us on <u><a target="_blank" href="https://discord.gg/vRNN9ua2">Discord</a></u> if you have any questions 🙂
  </p>
  </div>