winglian committed on Aug 29, 2023

Commit 5ac3392 (unverified) · 1 Parent(s): e356b29

support for datasets with multiple names (#480)

* support for datasets with multiple names

* update docs

Files changed (2)

README.md

@@ -328,6 +328,15 @@ See [examples](examples) for quick start. It is recommended to duplicate and mod
       name: enron_emails
       type: completion # format from earlier
+  # huggingface repo with multiple named configurations/subsets
+  datasets:
+    - path: bigcode/commitpackft
+      name:
+        - ruby
+        - python
+        - typescript
+      type: ... # unimplemented custom format
   # local
   datasets:
     - path: data.jsonl # or json

src/axolotl/utils/data.py

@@ -134,8 +134,17 @@ def load_tokenized_prepared_datasets(
             seed = 42
         datasets = []
+        def for_d_in_datasets(dataset_configs):
+            for dataset in dataset_configs:
+                if dataset.name and isinstance(dataset.name, list):
+                    for name in dataset.name:
+                        yield DictDefault({**dataset, "name": name})
+                else:
+                    yield dataset
         # pylint: disable=invalid-name
-        for d in cfg.datasets:
+        for d in for_d_in_datasets(cfg.datasets):
             ds: Union[Dataset, DatasetDict] = None
             ds_from_hub = False
             try:
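
The expansion performed by `for_d_in_datasets` can be tried in isolation with a minimal sketch like the one below. It is an illustration, not the code from this commit: plain `dict` objects stand in for axolotl's `DictDefault` so the snippet runs without axolotl installed, and the function name `expand_named_datasets` and the sample `configs` list are hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator


def expand_named_datasets(
    dataset_configs: Iterable[Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield one config per name when `name` is a non-empty list.

    Hypothetical stand-in for `for_d_in_datasets`; uses plain dicts
    instead of axolotl's DictDefault (an assumption for portability).
    """
    for dataset in dataset_configs:
        name = dataset.get("name")
        if name and isinstance(name, list):
            for single_name in name:
                # Copy the entry and override only the `name` key.
                yield {**dataset, "name": single_name}
        else:
            # Scalar, missing, or empty `name`: pass through unchanged.
            yield dataset


# Sample entries mirroring the README snippet in this commit.
configs = [
    {
        "path": "bigcode/commitpackft",
        "name": ["ruby", "python", "typescript"],
        "type": "...",
    },
    {"path": "data.jsonl", "type": "completion"},
]

expanded = list(expand_named_datasets(configs))
for cfg in expanded:
    print(cfg["path"], cfg.get("name"))
```

Each list entry becomes its own dataset config with a scalar `name`, so the rest of the loading loop does not have to handle lists. Because the check is `name and isinstance(name, list)`, an empty list is falsy and the entry is passed through unmodified.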