Datasets documentation
Main classes
DatasetInfo
class datasets.DatasetInfo
( description: str = <factory>, citation: str = <factory>, homepage: str = <factory>, license: str = <factory>, features: typing.Optional[datasets.features.features.Features] = None, post_processed: typing.Optional[datasets.info.PostProcessedInfo] = None, supervised_keys: typing.Optional[datasets.info.SupervisedKeysData] = None, task_templates: typing.Optional[typing.List[datasets.tasks.base.TaskTemplate]] = None, builder_name: typing.Optional[str] = None, dataset_name: typing.Optional[str] = None, config_name: typing.Optional[str] = None, version: typing.Union[str, datasets.utils.version.Version, NoneType] = None, splits: typing.Optional[dict] = None, download_checksums: typing.Optional[dict] = None, download_size: typing.Optional[int] = None, post_processing_size: typing.Optional[int] = None, dataset_size: typing.Optional[int] = None, size_in_bytes: typing.Optional[int] = None )
Parameters
- description (str) — A description of the dataset.
- citation (str) — A BibTeX citation of the dataset.
- homepage (str) — A URL to the official homepage for the dataset.
- license (str) — The dataset’s license. It can be the name of the license or a paragraph containing the terms of the license.
- features (Features, optional) — The features used to specify the dataset’s column types.
- post_processed (PostProcessedInfo, optional) — Information regarding the resources of a possible post-processing of a dataset. For example, it can contain the information of an index.
- supervised_keys (SupervisedKeysData, optional) — Specifies the input feature and the label for supervised learning if applicable for the dataset (legacy from TFDS).
- builder_name (str, optional) — The name of the GeneratorBasedBuilder subclass used to create the dataset. Usually matched to the corresponding script name. It is also the snake_case version of the dataset builder class name.
- config_name (str, optional) — The name of the configuration derived from BuilderConfig.
- version (str or Version, optional) — The version of the dataset.
- splits (dict, optional) — The mapping between split name and metadata.
- download_checksums (dict, optional) — The mapping between the URL to download the dataset’s checksums and corresponding metadata.
- download_size (int, optional) — The size of the files to download to generate the dataset, in bytes.
- post_processing_size (int, optional) — Size of the dataset in bytes after post-processing, if any.
- dataset_size (int, optional) — The combined size in bytes of the Arrow tables for all splits.
- size_in_bytes (int, optional) — The combined size in bytes of all files associated with the dataset (downloaded files + Arrow files).
- task_templates (List[TaskTemplate], optional) — The task templates to prepare the dataset for during training and evaluation. Each template casts the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
- **config_kwargs (additional keyword arguments) — Keyword arguments to be passed to the BuilderConfig and used in the DatasetBuilder.
Information about a dataset.
DatasetInfo documents a dataset, including its name, version, and features.
See the constructor arguments and properties for a full list.
Not all fields are known on construction and may be updated later.
from_directory
( dataset_info_dir: str, fs = 'deprecated', storage_options: typing.Optional[dict] = None )
Parameters
- dataset_info_dir (str) — The directory containing the metadata file. This should be the root directory of a specific dataset version.
- fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem used to download the files from. Deprecated in 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.9.0.
Create DatasetInfo from the JSON file in dataset_info_dir.
This function updates all the dynamically generated fields (num_examples, hash, time of creation,…) of the DatasetInfo.
This will overwrite all previous metadata.
write_to_directory
( dataset_info_dir, pretty_print = False, fs = 'deprecated', storage_options: typing.Optional[dict] = None )
Parameters
- dataset_info_dir (str) — Destination directory.
- pretty_print (bool, defaults to False) — If True, the JSON will be pretty-printed with the indent level of 4.
- fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem used to download the files from. Deprecated in 2.9.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.9.0.
Write DatasetInfo and license (if present) as JSON files to dataset_info_dir.
Dataset
The base class Dataset implements a Dataset backed by an Apache Arrow table.
class datasets.Dataset
( arrow_table: Table, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_table: typing.Optional[datasets.table.Table] = None, fingerprint: typing.Optional[str] = None )
A Dataset backed by an Arrow table.
add_column
( name: str, column: typing.Union[list, numpy.array], new_fingerprint: str )
Add column to Dataset.
Added in 1.7
add_item
( item: dict, new_fingerprint: str )
Add item to Dataset.
Added in 1.7
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> new_review = {'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
>>> ds = ds.add_item(new_review)
>>> ds[-1]
{'label': 0, 'text': 'this movie is the absolute worst thing I have ever seen'}
from_file
( filename: str, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_filename: typing.Optional[str] = None, in_memory: bool = False )
Parameters
- filename (str) — File name of the dataset.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- indices_filename (str, optional) — File names of the indices.
- in_memory (bool, defaults to False) — Whether to copy the data in-memory.
Instantiate a Dataset backed by an Arrow table at filename.
from_buffer
( buffer: Buffer, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, indices_buffer: typing.Optional[pyarrow.lib.Buffer] = None )
Instantiate a Dataset backed by an Arrow buffer.
from_pandas
( df: DataFrame, features: typing.Optional[datasets.features.features.Features] = None, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None, preserve_index: typing.Optional[bool] = None )
Parameters
- df (pandas.DataFrame) — Dataframe that contains the dataset.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
- preserve_index (bool, optional) — Whether to store the index as an additional column in the resulting Dataset. The default of None will store the index as a column, except for RangeIndex which is stored as metadata only. Use preserve_index=True to force it to be stored as a column.
Convert pandas.DataFrame to a pyarrow.Table to create a Dataset.
The column types in the resulting Arrow Table are inferred from the dtypes of the pandas.Series in the
DataFrame. In the case of non-object Series, the NumPy dtype is translated to its Arrow equivalent. In the
case of object, we need to guess the datatype by looking at the Python objects in this Series.
Be aware that Series of the object dtype don’t carry enough information to always lead to a meaningful Arrow
type. In the case that we cannot infer a type, e.g. because the DataFrame is of length 0 or the Series only
contains None/nan objects, the type is set to null. This behavior can be avoided by constructing explicit
features and passing them to this function.
from_dict
( mapping: dict, features: typing.Optional[datasets.features.features.Features] = None, info: typing.Optional[datasets.info.DatasetInfo] = None, split: typing.Optional[datasets.splits.NamedSplit] = None )
Parameters
- mapping (Mapping) — Mapping of strings to Arrays or Python lists.
- features (Features, optional) — Dataset features.
- info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- split (NamedSplit, optional) — Name of the dataset split.
Convert dict to a pyarrow.Table to create a Dataset.
from_generator
( generator: typing.Callable, features: typing.Optional[datasets.features.features.Features] = None, cache_dir: str = None, keep_in_memory: bool = False, gen_kwargs: typing.Optional[dict] = None, num_proc: typing.Optional[int] = None, **kwargs )
Parameters
- generator (Callable) — A generator function that yields examples.
- features (Features, optional) — Dataset features.
- cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- gen_kwargs (dict, optional) — Keyword arguments to be passed to the generator callable. You can define a sharded dataset by passing the list of shards in gen_kwargs.
- num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. Added in 2.7.0.
- **kwargs (additional keyword arguments) — Keyword arguments to be passed to GeneratorConfig.
Create a Dataset from a generator.
data
The Apache Arrow table backing the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.data
MemoryMappedTable
text: string
label: int64
----
text: [["compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .","the soundtrack alone is worth the price of admission .","rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .","beneath the film's obvious determination to shock at any cost lies considerable skill and determination , backed by sheer nerve .","bielinsky is a filmmaker of impressive talent .","so beautifully acted and directed , it's clear that washington most certainly has a new career ahead of him if he so chooses .","a visual spectacle full of stunning images and effects .","a gentle and engrossing character study .","it's enough to watch huppert scheming , with her small , intelligent eyes as steady as any noir villain , and to enjoy the perfectly pitched web of tension that chabrol spins .","an engrossing portrait of uncompromising artists trying to create something original against the backdrop of a corporate music industry that only seems to care about the bottom line .",...,"ultimately , jane learns her place as a girl , softens up and loses some of the intensity that made her an interesting character to begin with .","ah-nuld's action hero days might be over .","it's clear why deuces wild , which was shot two years ago , has been gathering dust on mgm's shelf .","feels like nothing quite so much as a middle-aged moviemaker's attempt to surround himself with beautiful , half-naked women .","when the precise nature of matthew's predicament finally comes into sharp focus , the revelation fails to justify the build-up .","this picture is murder by numbers , and as easy to be bored by as your abc's , despite a few whopping shootouts .","hilarious musical comedy though stymied by accents thick as mud .","if you are into splatter movies , then you will probably have a reasonably good time with the salton sea .","a dull , 
simple-minded and stereotypical tale of drugs , death and mind-numbing indifference on the inner-city streets .","the feature-length stretch . . . strains the show's concept ."]]
label: [[1,1,1,1,1,1,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0]]
cache_files
The cache files containing the Apache Arrow table backing the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.cache_files
[{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]
num_columns
Number of columns in the dataset.
num_rows
Number of rows in the dataset (same as Dataset.__len__()).
column_names
Names of the columns in the dataset.
shape
Shape of the dataset (number of rows, number of columns).
unique
( column: str ) → list
Parameters
- column (str) — Column name (list all the column names with column_names).
Returns
list
List of unique elements in the given column.
Return a list of the unique elements in a column.
This is implemented in the low-level backend and is therefore very fast.
flatten
( new_fingerprint: typing.Optional[str] = None, max_depth = 16 ) → Dataset
Flatten the table. Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("squad", split="train")
>>> ds.features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
>>> ds.flatten()
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
    num_rows: 87599
})
cast
( features: Features, batch_size: typing.Optional[int] = 1000, keep_in_memory: bool = False, load_from_cache_file: typing.Optional[bool] = None, cache_file_name: typing.Optional[str] = None, writer_batch_size: typing.Optional[int] = 1000, num_proc: typing.Optional[int] = None ) → Dataset
Parameters
- features (Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. str <-> ClassLabel, you should use map() to update the Dataset.
- batch_size (int, defaults to 1000) — Number of examples per batch provided to cast. If batch_size <= 0 or batch_size == None then provide the full dataset as a single batch to cast.
- keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- load_from_cache_file (bool, defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map().
- num_proc (int, optional, defaults to None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
Returns
A copy of the dataset with casted features.
Cast the dataset to a new set of features.
Example:
>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds.features.copy()
>>> new_features['label'] = ClassLabel(names=['bad', 'good'])
>>> new_features['text'] = Value('large_string')
>>> ds = ds.cast(new_features)
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}
cast_column
( column: str, feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image], new_fingerprint: typing.Optional[str] = None )
Cast column to feature for decoding.
Example:
>>> from datasets import load_dataset, ClassLabel
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='string', id=None)}
remove_columns
( column_names: typing.Union[str, typing.List[str]], new_fingerprint: typing.Optional[str] = None ) → Dataset
Parameters
- column_names (Union[str, List[str]]) — Name of the column(s) to remove.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them.
You can also remove a column using map() with remove_columns, but the present method
does not copy the data of the remaining columns to a new dataset and is thus faster.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.remove_columns('label')
Dataset({
    features: ['text'],
    num_rows: 1066
})
>>> ds.remove_columns(column_names=ds.column_names) # Removing all the columns returns an empty dataset with the `num_rows` property set to 0
Dataset({
    features: [],
    num_rows: 0
})
rename_column
( original_column_name: str, new_column_name: str, new_fingerprint: typing.Optional[str] = None ) → Dataset
Parameters
- original_column_name (str) — Name of the column to rename.
- new_column_name (str) — New name for the column.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset with a renamed column.
Rename a column in the dataset, and move the features associated to the original column under the new column name.
rename_columns
( column_mapping: typing.Dict[str, str], new_fingerprint: typing.Optional[str] = None ) → Dataset
Parameters
- column_mapping (Dict[str, str]) — A mapping of columns to rename to their new names.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names.
select_columns
( column_names: typing.Union[str, typing.List[str]], new_fingerprint: typing.Optional[str] = None ) → Dataset
Parameters
- column_names (Union[str, List[str]]) — Name of the column(s) to keep.
- new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Returns
A copy of the dataset object which only consists of selected columns.
Select one or several column(s) in the dataset and the features associated to them.
class_encode_column
( column: str, include_nulls: bool = False )
Parameters
- column (str) — The name of the column to cast (list all the column names with column_names).
- include_nulls (bool, defaults to False) — Whether to include null values in the class labels. If True, the null values will be encoded as the "None" class label. Added in 1.14.2.
Casts the given column as ClassLabel and updates the table.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("boolq", split="validation")
>>> ds.features
{'answer': Value(dtype='bool', id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
>>> ds = ds.class_encode_column('answer')
>>> ds.features
{'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
__len__
Number of rows in the dataset.
__iter__
Iterate through the examples.
If a formatting is set with Dataset.set_format(), rows will be returned with the selected format.
iter
( batch_size: int, drop_last_batch: bool = False )
Iterate through the batches of size batch_size.
If a formatting is set with Dataset.set_format(), rows will be returned with the selected format.
formatted_as
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
set_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly.
The format type (for example “numpy”) is used to format batches when using __getitem__.
It’s also possible to use custom transforms for formatting using set_transform().
It is possible to call map() after calling set_format. Since map may add new columns, the list of formatted columns
gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted as:
new formatted columns = (all columns - previously unformatted columns)
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['text', 'label'])
>>> ds.format
{'type': 'numpy',
'format_kwargs': {},
'columns': ['text', 'label'],
'output_all_columns': False}
set_transform
( transform: typing.Optional[typing.Callable], columns: typing.Optional[typing.List] = None, output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
As with set_format(), this can be reset using reset_format().
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
>>> def encode(batch):
...     return tokenizer(batch['text'], padding=True, truncation=True, return_tensors='pt')
>>> ds.set_transform(encode)
>>> ds[0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1]),
 'input_ids': tensor([  101, 29353,  2135, 15102,  1996,  9428, 20868,  2890,  8663,  6895,
         20470,  2571,  3663,  2090,  4603,  3017,  3008,  1998,  2037, 24211,
         5637,  1998, 11690,  2336,  1012,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0])}
reset_format
Reset __getitem__ return format to python objects and all columns.
Same as self.set_format().
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.set_format(type='numpy', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds.format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
>>> ds.reset_format()
>>> ds.format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
with_format
( type: typing.Optional[str] = None, columns: typing.Optional[typing.List] = None, output_all_columns: bool = False, **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly.
The format type (for example “numpy”) is used to format batches when using __getitem__.
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format returns a new Dataset object.
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds.format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
>>> ds = ds.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds.format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'tensorflow'}
with_transform
( transform: typing.Optional[typing.Callable], columns: typing.Optional[typing.List] = None, output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
As with set_format(), this can be reset using reset_format().
Contrary to set_transform(), with_transform returns a new Dataset object.
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> def encode(example):
...     return tokenizer(example["text"], padding=True, truncation=True, return_tensors='pt')
>>> ds = ds.with_transform(encode)
>>> ds[0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1]),
 'input_ids': tensor([  101, 18027, 16310, 16001,  1103,  9321,   178, 11604,  7235,  6617,
         1742,  2165,  2820,  1206,  6588, 22572, 12937,  1811,  2153,  1105,
         1147, 12890, 19587,  6463,  1105, 15026,  1482,   119,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0])}
__getitem__
Can be used to index columns (by string names) or rows (by integer index or iterable of indices or bools).
cleanup_cache_files
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one.
Be careful when running this command that no other process is currently using other cache files.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None )
Parameters
- 
							function (Callable) — Function with one of the following signatures:
- function(example: Dict[str, Any]) -> Dict[str, Any] if batched=False and with_indices=False and with_rank=False
- function(example: Dict[str, Any], *extra_args) -> Dict[str, Any] if batched=False and with_indices=True and/or with_rank=True (one extra arg for each)
- function(batch: Dict[str, List]) -> Dict[str, List] if batched=True and with_indices=False and with_rank=False
- function(batch: Dict[str, List], *extra_args) -> Dict[str, List] if batched=True and with_indices=True and/or with_rank=True (one extra arg for each)
 For advanced usage, the function can also return a pyarrow.Table. Moreover, if your function returns nothing (None), then map will run your function and return the dataset unchanged. If no function is provided, it defaults to the identity function: lambda x: x.
- 
							with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ....
- 
							with_rank (bool, defaults to False) — Provide process rank to function. Note that in this case the signature of function should be def function(example[, idx], rank): ....
- 
							input_columns (Optional[Union[str, List[str]]], defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- 
							batched (bool, defaults to False) — Provide batch of examples to function.
- 
							batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, provide the full dataset as a single batch to function.
- 
							drop_last_batch (bool, defaults to False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- 
							remove_columns (Optional[Union[str, List[str]]], defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- 
							keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- 
							load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- 
							cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- 
							features (Optional[datasets.Features], defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
- 
							disable_nullable (bool, defaults to False) — Disallow null values in the table.
- 
							fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
- 
							num_proc (int, optional, defaults to None) — Max number of processes when generating cache. Already cached shards are loaded sequentially.
- 
							suffix_template (str) — If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. Defaults to "_{rank:05d}_of_{num_proc:05d}". For example, if cache_file_name is "processed.arrow", then for rank=1 and num_proc=4, the resulting file would be "processed_00001_of_00004.arrow" for the default suffix.
- 
							new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
- 
							desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.
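The suffix_template naming rule described in the parameter list can be reproduced with plain string formatting. The helper below, shard_cache_name, is a hypothetical name used only for illustration; it is not part of the library and only mirrors the documented example.

```python
import os

def shard_cache_name(cache_file_name, rank, num_proc,
                     suffix_template="_{rank:05d}_of_{num_proc:05d}"):
    """Hypothetical helper: insert the formatted suffix before the file extension."""
    base, ext = os.path.splitext(cache_file_name)
    return base + suffix_template.format(rank=rank, num_proc=num_proc) + ext

name = shard_cache_name("processed.arrow", rank=1, num_proc=4)
print(name)  # processed_00001_of_00004.arrow
```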
Apply a function to all the examples in the table (individually or in batches) and update the table. If your function returns a column that already exists, then it overwrites it.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}.
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> ds[0:3]["text"]
['Review: compassionately explores the seemingly irreconcilable situation between conservative christian parents and their estranged gay and lesbian children .',
 'Review: the soundtrack alone is worth the price of admission .',
 'Review: rodriguez does a splendid job of racial profiling hollywood style--casting excellent latin actors of all ages--a trend long overdue .']
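# process a batch of examples
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda batch: tokenizer(batch["text"]), batched=True)
# set number of processors
>>> ds = ds.map(add_prefix, num_proc=4)
filter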
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None suffix_template: str = '_{rank:05d}_of_{num_proc:05d}' new_fingerprint: typing.Optional[str] = None desc: typing.Optional[str] = None )
Parameters
- 
							function (Callable) — Callable with one of the following signatures:
- function(example: Dict[str, Any]) -> bool if with_indices=False, batched=False
- function(example: Dict[str, Any], indices: int) -> bool if with_indices=True, batched=False
- function(example: Dict[str, List]) -> List[bool] if with_indices=False, batched=True
- function(example: Dict[str, List], indices: List[int]) -> List[bool] if with_indices=True, batched=True
 If no function is provided, defaults to an always-True function: lambda x: True.
- 
							with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ....
- 
							input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- 
							batched (bool, defaults to False) — Provide batch of examples to function.
- 
							batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched = True. If batched = False, one example per batch is passed to function. If batch_size <= 0 or batch_size == None, provide the full dataset as a single batch to function.
- 
							keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- 
							load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- 
							cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- 
							fn_kwargs (dict, optional) — Keyword arguments to be passed to function.
- 
							num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- 
							suffix_template (str) — If cache_file_name is specified, then this suffix will be added at the end of the base name of each cache file. For example, if cache_file_name is "processed.arrow", then for rank = 1 and num_proc = 4, the resulting file would be "processed_00001_of_00004.arrow" for the default suffix (default _{rank:05d}_of_{num_proc:05d}).
- 
							new_fingerprint (str, optional) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
- 
							desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes the examples for which the filter function returns True.
select
< source >( indices: typing.Iterable keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- 
							indices (range, list, iterable, ndarray or Series) — Range, list or 1D-array of integer indices for indexing. If the indices correspond to a contiguous range, the Arrow table is simply sliced. However, passing a list of indices that are not contiguous creates an indices mapping, which is much less efficient, but still faster than recreating an Arrow table made of the requested rows.
- 
							keep_in_memory (bool, defaults to False) — Keep the indices mapping in memory instead of writing it to a cache file.
- 
							indices_cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- 
							new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset with rows selected following the list/array of indices.
sort
< source >( column_names: typing.Union[str, typing.Sequence[str]] reverse: typing.Union[bool, typing.Sequence[bool]] = False kind = 'deprecated' null_placement: str = 'at_end' keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- 
							column_names (Union[str, Sequence[str]]) — Column name(s) to sort by.
- 
							reverse (Union[bool, Sequence[bool]], defaults to False) — If True, sort by descending order rather than ascending. If a single bool is provided, the value is applied to the sorting of all column names. Otherwise a list of bools with the same length and order as column_names must be provided.
- 
							kind (str, optional) — Pandas algorithm for sorting selected in {quicksort, mergesort, heapsort, stable}. The default is quicksort. Note that both stable and mergesort use timsort under the covers and, in general, the actual implementation will vary with data type. The mergesort option is retained for backwards compatibility. Deprecated: kind was deprecated in version 2.10.0 and will be removed in 3.0.0.
- 
							null_placement (str, defaults to at_end) — Put None values at the beginning if at_start or first, or at the end if at_end or last. Added in 1.14.2.
- 
							keep_in_memory (bool, defaults to False) — Keep the sorted indices in memory instead of writing it to a cache file.
- 
							load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
- 
							indices_cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the sorted indices instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory.
- 
							new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new dataset sorted according to a single or multiple columns.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset('rotten_tomatoes', split='validation')
>>> ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> sorted_ds = ds.sort('label')
>>> sorted_ds['label'][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False])
>>> another_sorted_ds['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
shuffle
< source >( seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 new_fingerprint: typing.Optional[str] = None )
Parameters
- 
							seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
- 
							generator (numpy.random.Generator, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- 
							keep_in_memory (bool, defaults to False) — Keep the shuffled indices in memory instead of writing it to a cache file.
- 
							load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the shuffled indices can be identified, use it instead of recomputing.
- 
							indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the shuffled indices instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- 
							new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create a new Dataset where the rows are shuffled.
Currently shuffling uses NumPy random generators. You can either supply a NumPy Generator to use, or a seed to initialize NumPy’s default random generator (PCG64).
Shuffling takes the list of indices [0:len(my_dataset)] and shuffles it to create an indices mapping.
However, as soon as your Dataset has an indices mapping, reading from it can become up to 10x slower.
This is because there is an extra step to get the row index to read using the indices mapping, and most importantly, you aren’t reading contiguous chunks of data anymore.
To restore the speed, you’d need to rewrite the entire dataset on your disk again using Dataset.flatten_indices(), which removes the indices mapping.
This may take a lot of time depending on the size of your dataset though:
my_dataset[0]  # fast
my_dataset = my_dataset.shuffle(seed=42)
my_dataset[0]  # up to 10x slower
my_dataset = my_dataset.flatten_indices()  # rewrite the shuffled dataset on disk as contiguous chunks of data
my_dataset[0]  # fast again
In this case, we recommend switching to an IterableDataset and leveraging its fast approximate shuffling method IterableDataset.shuffle().
It only shuffles the order of the shards and adds a shuffle buffer to your dataset, which keeps the speed of your dataset optimal:
my_iterable_dataset = my_dataset.to_iterable_dataset(num_shards=128)
for example in my_iterable_dataset:  # fast
    pass
shuffled_iterable_dataset = my_iterable_dataset.shuffle(seed=42, buffer_size=100)
for example in shuffled_iterable_dataset:  # as fast as before
    pass
train_test_split
< source >( test_size: typing.Union[float, int, NoneType] = None train_size: typing.Union[float, int, NoneType] = None shuffle: bool = True stratify_by_column: typing.Optional[str] = None seed: typing.Optional[int] = None generator: typing.Optional[numpy.random._generator.Generator] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None train_indices_cache_file_name: typing.Optional[str] = None test_indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 train_new_fingerprint: typing.Optional[str] = None test_new_fingerprint: typing.Optional[str] = None )
Parameters
- 
							test_size (float or int, optional) — Size of the test split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
- 
							train_size (float or int, optional) — Size of the train split. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
- 
							shuffle (bool, optional, defaults to True) — Whether or not to shuffle the data before splitting.
- 
							stratify_by_column (str, optional, defaults to None) — The column name of labels to be used to perform stratified split of data.
- 
							seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state.
- 
							generator (numpy.random.Generator, optional) — Numpy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- 
							keep_in_memory (bool, defaults to False) — Keep the splits indices in memory instead of writing it to a cache file.
- 
							load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the splits indices can be identified, use it instead of recomputing.
- 
							train_indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the train split indices instead of the automatically generated cache file name.
- 
							test_indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the test split indices instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- 
							train_new_fingerprint (str, optional, defaults to None) — The new fingerprint of the train set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
- 
							test_new_fingerprint (str, optional, defaults to None) — The new fingerprint of the test set after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Return a dictionary (datasets.DatasetDict) with two random train and test subsets (train and test Dataset splits).
Splits are created from the dataset according to test_size, train_size and shuffle.
This method is similar to scikit-learn train_test_split.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="validation")
>>> ds.train_test_split(test_size=0.2, shuffle=True)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 852
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 214
    })
})
# set a seed
>>> ds = ds.train_test_split(test_size=0.2, seed=42)
# stratified split
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds.train_test_split(test_size=0.2, stratify_by_column="label")
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
shard
< source >( num_shards: int index: int contiguous: bool = False keep_in_memory: bool = False indices_cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- 
							num_shards (int) — How many shards to split the dataset into.
- 
							index (int) — Which shard to select and return.
- 
							contiguous (bool, defaults to False) — Whether to select contiguous blocks of indices for shards.
- 
							keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- 
							indices_cache_file_name (str, optional) — Provide the name of a path for the cache file. It is used to store the indices of each shard instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
Return the index-nth shard from dataset split into num_shards pieces.
This shards deterministically. dset.shard(n, i) will contain all elements of dset whose
index mod n = i.
dset.shard(n, i, contiguous=True) will instead split dset into contiguous chunks,
so it can be easily concatenated back together after processing. If len(dset) % n == l, then the
first l shards will have length (len(dset) // n) + 1, and the remaining shards will have length (len(dset) // n).
datasets.concatenate_datasets([dset.shard(n, i, contiguous=True) for i in range(n)]) will return
a dataset with the same order as the original.
Be sure to shard before using any randomizing operator (such as shuffle).
It is best if the shard operator is used early in the dataset pipeline.
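The two sharding rules described above can be modelled in plain Python on row indices. The function shard_indices is a hypothetical stand-in written for this illustration, not the library implementation; it only reproduces the documented arithmetic.

```python
def shard_indices(length, num_shards, index, contiguous=False):
    """Hypothetical model: row indices that end up in shard `index`."""
    if contiguous:
        # The first (length % num_shards) shards get one extra row.
        div, mod = divmod(length, num_shards)
        start = div * index + min(index, mod)
        end = start + div + (1 if index < mod else 0)
        return list(range(start, end))
    # Default rule: every row whose index mod num_shards equals `index`.
    return list(range(index, length, num_shards))

strided = [shard_indices(10, 3, i) for i in range(3)]
blocks = [shard_indices(10, 3, i, contiguous=True) for i in range(3)]

print(strided)  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
print(blocks)   # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

Concatenating the contiguous blocks in order gives back 0..9, which is why contiguous shards can be reassembled in the original order.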
to_tf_dataset
< source >( batch_size: typing.Optional[int] = None columns: typing.Union[str, typing.List[str], NoneType] = None shuffle: bool = False collate_fn: typing.Optional[typing.Callable] = None drop_remainder: bool = False collate_fn_args: typing.Union[typing.Dict[str, typing.Any], NoneType] = None label_cols: typing.Union[str, typing.List[str], NoneType] = None prefetch: bool = True num_workers: int = 0 num_test_batches: int = 20 )
Parameters
- 
							batch_size (int, optional) — Size of batches to load from the dataset. Defaults to None, which implies that the dataset won’t be batched, but the returned dataset can be batched later with tf_dataset.batch(batch_size).
- 
							columns (List[str] or str, optional) — Dataset column(s) to load in the tf.data.Dataset. Column names that are created by the collate_fn and that do not exist in the original dataset can be used.
- 
							shuffle (bool, defaults to False) — Shuffle the dataset order when loading. Recommended True for training, False for validation/evaluation.
- 
							drop_remainder (bool, defaults to False) — Drop the last incomplete batch when loading. Ensures that all batches yielded by the dataset will have the same length on the batch dimension.
- 
							collate_fn (Callable, optional) — A function or callable object (such as a DataCollator) that will collate lists of samples into a batch.
- 
							collate_fn_args (Dict, optional) — An optional dict of keyword arguments to be passed to the collate_fn.
- 
							label_cols (List[str] or str, defaults to None) — Dataset column(s) to load as labels. Note that many models compute loss internally rather than letting Keras do it, in which case passing the labels here is optional, as long as they’re in the input columns.
- 
							prefetch (bool, defaults to True) — Whether to run the dataloader in a separate thread and maintain a small buffer of batches for training. Improves performance by allowing data to be loaded in the background while the model is training.
- 
							num_workers (int, defaults to 0) — Number of workers to use for loading the dataset. Only supported on Python versions >= 3.8.
- 
							num_test_batches (int, defaults to 20) — Number of batches to use to infer the output signature of the dataset. The higher this number, the more accurate the signature will be, but the longer it will take to create the dataset.
Create a tf.data.Dataset from the underlying Dataset. This tf.data.Dataset will load and collate batches from
the Dataset, and is suitable for passing to methods like model.fit() or model.predict(). The dataset will yield
dicts for both inputs and labels unless the dict would contain only a single key, in which case a raw
tf.Tensor is yielded instead.
push_to_hub
< source >( repo_id: str config_name: str = 'default' split: typing.Optional[str] = None private: typing.Optional[bool] = False token: typing.Optional[str] = None branch: typing.Optional[str] = None max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[int] = None embed_external_files: bool = True )
Parameters
- 
							repo_id (str) — The ID of the repository to push to in the following format: <user>/<dataset_name> or <org>/<dataset_name>. Also accepts <dataset_name>, which will default to the namespace of the logged-in user.
- 
							config_name (str, defaults to "default") — The configuration name of a dataset.
- 
							split (str, optional) — The name of the split that will be given to that dataset. Defaults to self.split.
- 
							private (bool, optional, defaults to False) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
- 
							token (str, optional) — An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged in.
- 
							branch (str, optional) — The git branch on which to push the dataset. This defaults to the default branch as specified in your repository, which defaults to "main".
- 
							max_shard_size (int or str, optional, defaults to "500MB") — The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like "5MB").
- 
							num_shards (int, optional) — Number of shards to write. By default the number of shards depends on max_shard_size. Added in 2.8.0.
- 
							embed_external_files (bool, defaults to True) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type Audio and Image: remove local path information and embed file content in the Parquet files.
Pushes the dataset to the hub as a Parquet dataset. The dataset is pushed using HTTP requests and does not require git or git-lfs to be installed.
The resulting Parquet files are self-contained by default. If your dataset contains Image or Audio
data, the Parquet files will store the bytes of your images or audio files.
You can disable this by setting embed_external_files to False.
save_to_disk
< source >( dataset_path: typing.Union[str, bytes, os.PathLike] fs = 'deprecated' max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Optional[int] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None )
Parameters
- 
							dataset_path (str) — Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset directory where the dataset will be saved to.
- 
							fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem where the dataset will be saved to. Deprecated in 2.8.0: fs was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- 
							max_shard_size (int or str, optional, defaults to "500MB") — The maximum size of the dataset shards to be saved to the filesystem. If expressed as a string, needs to be digits followed by a unit (like "50MB").
- 
							num_shards (int, optional) — Number of shards to write. By default the number of shards depends on max_shard_size and num_proc. Added in 2.8.0.
- 
							num_proc (int, optional) — Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.8.0.
Saves a dataset to a dataset directory, or in a filesystem using any implementation of fsspec.spec.AbstractFileSystem.
All the Image() and Audio() data are stored in the arrow files. If you want to store paths or URLs, please use the Value("string") type.
load_from_disk
< source >( dataset_path: str fs = 'deprecated' keep_in_memory: typing.Optional[bool] = None storage_options: typing.Optional[dict] = None ) → Dataset or DatasetDict
Parameters
- 
							dataset_path (str) — Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset directory where the dataset will be loaded from.
- 
							fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem where the dataset will be loaded from. Deprecated in 2.8.0: fs was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- 
							keep_in_memory (bool, defaults to None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the improve performance section.
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.8.0.
Returns
- If dataset_path is a path of a dataset directory, the dataset requested.
- If dataset_path is a path of a dataset dict directory, a datasets.DatasetDict with each split.
Loads a dataset that was previously saved using save_to_disk from a dataset directory, or from a
filesystem using any implementation of fsspec.spec.AbstractFileSystem.
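The keep_in_memory default can be read as a small decision rule. The sketch below is one plausible reading of it, not the library's code: resolve_keep_in_memory is a hypothetical helper, in_memory_max_size stands in for datasets.config.IN_MEMORY_MAX_SIZE, and the strict less-than comparison against the dataset size is an assumption.

```python
def resolve_keep_in_memory(keep_in_memory, dataset_size, in_memory_max_size=0):
    """Decide whether to copy the dataset in memory.

    in_memory_max_size stands in for datasets.config.IN_MEMORY_MAX_SIZE;
    0 means the automatic in-memory copy is disabled.
    """
    if keep_in_memory is not None:
        return keep_in_memory  # an explicit True/False always wins
    if not in_memory_max_size or not dataset_size:
        return False  # feature disabled or size unknown: stay on disk
    return dataset_size < in_memory_max_size  # assumed comparison
```

With the default limit of 0, leaving keep_in_memory as None therefore keeps the data memory-mapped from disk.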
flatten_indices
< source >( keep_in_memory: bool = False cache_file_name: typing.Optional[str] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False num_proc: typing.Optional[int] = None new_fingerprint: typing.Optional[str] = None )
Parameters
- 
							keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- 
							cache_file_name (str, optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name.
- 
							writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups; a lower value consumes less temporary memory while running map.
- 
							features (Optional[datasets.Features], defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
- 
							disable_nullable (bool, defaults to False) — Disallow null values in the table.
- 
							num_proc (int, optional, defaults to None) — Max number of processes when generating cache. Already cached shards are loaded sequentially.
- 
							new_fingerprint (str, optional, defaults to None) — The new fingerprint of the dataset after transform. If None, the new fingerprint is computed using a hash of the previous fingerprint, and the transform arguments.
Create and cache a new Dataset by flattening the indices mapping.
to_csv
< source >(
			path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]
				batch_size: typing.Optional[int] = None
				num_proc: typing.Optional[int] = None
				**to_csv_kwargs
				
			)
			→
				int
Parameters
- 
							path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- 
							batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- 
							num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
- 
							**to_csv_kwargs (additional keyword arguments) —
Parameters to pass to pandas’s pandas.DataFrame.to_csv. Changed in 2.10.0: index now defaults to False if not specified. If you would like to write the index, pass index=True and also set a name for the index column by passing index_label.
Returns
int
The number of characters or bytes written.
Exports the dataset to CSV.
to_pandas
< source >( batch_size: typing.Optional[int] = None batched: bool = False )
Parameters
- 
							batched (bool) — Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once).
- 
							batch_size (int, optional) — The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
Returns the dataset as a pandas.DataFrame. Can also return a generator for large datasets.
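With batched=True the method hands back a generator rather than one object holding every row. The pattern can be sketched in plain Python as follows; iter_batches is a hypothetical helper and lists of dicts stand in for the DataFrame batches, so this only illustrates the slicing behaviour, not the library's implementation.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive slices of at most batch_size rows."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


rows = [{"id": i} for i in range(5)]
batches = list(iter_batches(rows, batch_size=2))
```

The last batch is simply shorter when the row count is not a multiple of batch_size.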
to_dict
< source >( batch_size: typing.Optional[int] = None batched = 'deprecated' )
Parameters
- 
							batched (bool) — Set to True to return a generator that yields the dataset as batches of batch_size rows. Defaults to False (returns the whole dataset at once). Deprecated in 2.11.0: use .iter(batch_size=batch_size) followed by .to_dict() on the individual batches instead.
- 
							batch_size (int, optional) — The size (number of rows) of the batches if batched is True. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
Returns the dataset as a Python dict. Can also return a generator for large datasets.
to_json
< source >(
			path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]
				batch_size: typing.Optional[int] = None
				num_proc: typing.Optional[int] = None
				**to_json_kwargs
				
			)
			→
				int
Parameters
- 
							path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- 
							batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- 
							num_proc (int, optional) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing. batch_size in this case defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE but feel free to make it 5x or 10x of the default value if you have sufficient compute power.
- 
							**to_json_kwargs (additional keyword arguments) —
Parameters to pass to pandas’s pandas.DataFrame.to_json. Changed in 2.11.0: index now defaults to False if orient is "split" or "table". If you would like to write the index, pass index=True.
Returns
int
The number of characters or bytes written.
Export the dataset to JSON Lines or JSON.
to_parquet
< source >(
			path_or_buf: typing.Union[str, bytes, os.PathLike, typing.BinaryIO]
				batch_size: typing.Optional[int] = None
				**parquet_writer_kwargs
				
			)
			→
				int
Parameters
- 
							path_or_buf (PathLike or FileOrBuffer) — Either a path to a file or a BinaryIO.
- 
							batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- 
							**parquet_writer_kwargs (additional keyword arguments) —
Parameters to pass to PyArrow’s pyarrow.parquet.ParquetWriter.
Returns
int
The number of characters or bytes written.
Exports the dataset to Parquet.
to_sql
< source >(
			name: str
				con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')]
				batch_size: typing.Optional[int] = None
				**sql_writer_kwargs
				
			)
			→
				int
Parameters
- 
							name (str) — Name of SQL table.
- 
							con (str or sqlite3.Connection or sqlalchemy.engine.Connection or sqlalchemy.engine.Engine) — A URI string or a SQLite3/SQLAlchemy connection object used to write to a database.
- 
							batch_size (int, optional) — Size of the batch to load in memory and write at once. Defaults to datasets.config.DEFAULT_MAX_BATCH_SIZE.
- 
							**sql_writer_kwargs (additional keyword arguments) —
Parameters to pass to pandas’s pandas.DataFrame.to_sql. Changed in 2.11.0: index now defaults to False if not specified. If you would like to write the index, pass index=True and also set a name for the index column by passing index_label.
Returns
int
The number of records written.
Exports the dataset to a SQL database.
add_faiss_index
< source >( column: str index_name: typing.Optional[str] = None device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None batch_size: int = 1000 train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )
Parameters
- 
							column (str) — The column of the vectors to add to the index.
- 
							index_name (str, optional) — The index_name/identifier of the index. This is the index_name that is used to call get_nearest_examples() or search(). By default it corresponds to column.
- 
							device (Union[int, List[int]], optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
- 
							string_factory (str, optional) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
- 
							metric_type (int, optional) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
- 
							custom_index (faiss.Index, optional) — Custom Faiss index that you already have instantiated and configured for your needs.
- 
							batch_size (int) — Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000. Added in 2.4.0.
- 
							train_size (int, optional) — If the index needs a training step, specifies how many vectors will be used to train the index.
- 
							faiss_verbose (bool, defaults to False) — Enable the verbosity of the Faiss index.
- 
							dtype (data-type) — The dtype of the numpy arrays that are indexed. Default is np.float32.
Add a dense index using Faiss for fast retrieval.
By default the index is done over the vectors of the specified column.
You can specify device if you want to run it on GPU (device must be the GPU index).
You can find more information about Faiss here:
- For string factory
Example:
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds_with_embeddings = ds.map(lambda example: {'embeddings': embed(example['line'])})
>>> ds_with_embeddings.add_faiss_index(column='embeddings')
>>> # query
>>> scores, retrieved_examples = ds_with_embeddings.get_nearest_examples('embeddings', embed('my new query'), k=10)
>>> # save index
>>> ds_with_embeddings.save_faiss_index('embeddings', 'my_index.faiss')
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> # load index
>>> ds.load_faiss_index('embeddings', 'my_index.faiss')
>>> # query
>>> scores, retrieved_examples = ds.get_nearest_examples('embeddings', embed('my new query'), k=10)
add_faiss_index_from_external_arrays
< source >( external_arrays: array index_name: str device: typing.Optional[int] = None string_factory: typing.Optional[str] = None metric_type: typing.Optional[int] = None custom_index: typing.Optional[ForwardRef('faiss.Index')] = None batch_size: int = 1000 train_size: typing.Optional[int] = None faiss_verbose: bool = False dtype = <class 'numpy.float32'> )
Parameters
- 
							external_arrays (np.array) — If you want to use arrays from outside the lib for the index, you can set external_arrays. It will use external_arrays to create the Faiss index instead of the arrays in the given column.
- 
							index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call get_nearest_examples() or search().
- 
							device (Union[int, List[int]], optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
- 
							string_factory (str, optional) — This is passed to the index factory of Faiss to create the index. Default index class is IndexFlat.
- 
							metric_type (int, optional) — Type of metric. Ex: faiss.METRIC_INNER_PRODUCT or faiss.METRIC_L2.
- 
							custom_index (faiss.Index, optional) — Custom Faiss index that you already have instantiated and configured for your needs.
- 
							batch_size (int, optional) — Size of the batch to use while adding vectors to the FaissIndex. Default value is 1000.Added in 2.4.0 
- 
							train_size (int, optional) — If the index needs a training step, specifies how many vectors will be used to train the index.
- 
							faiss_verbose (bool, defaults to False) — Enable the verbosity of the Faiss index.
- 
							dtype (numpy.dtype) — The dtype of the numpy arrays that are indexed. Default is np.float32.
Add a dense index using Faiss for fast retrieval.
The index is created using the vectors of external_arrays.
You can specify device if you want to run it on GPU (device must be the GPU index).
You can find more information about Faiss here:
- For string factory
save_faiss_index
< source >( index_name: str file: typing.Union[str, pathlib.PurePath] storage_options: typing.Optional[typing.Dict] = None )
Parameters
- 
							index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.
- 
							file (str) — The path to the serialized faiss index on disk or remote URI (e.g. "s3://my-bucket/index.faiss").
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.11.0.
Save a FaissIndex on disk.
load_faiss_index
< source >( index_name: str file: typing.Union[str, pathlib.PurePath] device: typing.Union[int, typing.List[int], NoneType] = None storage_options: typing.Optional[typing.Dict] = None )
Parameters
- 
							index_name (str) — The index_name/identifier of the index. This is the index_name that is used to call .get_nearest or .search.
- 
							file (str) — The path to the serialized faiss index on disk or remote URI (e.g. "s3://my-bucket/index.faiss").
- 
							device (Union[int, List[int]], optional) — If positive integer, this is the index of the GPU to use. If negative integer, use all GPUs. If a list of positive integers is passed in, run only on those GPUs. By default it uses the CPU.
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.11.0.
Load a FaissIndex from disk.
If you want to do additional configurations, you can have access to the faiss index object by doing
.get_index(index_name).faiss_index to make it fit your needs.
add_elasticsearch_index
< source >( column: str index_name: typing.Optional[str] = None host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('elasticsearch.Elasticsearch')] = None es_index_name: typing.Optional[str] = None es_index_config: typing.Optional[dict] = None )
Parameters
- 
							column (str) — The column of the documents to add to the index.
- 
							index_name (str, optional) — The index_name/identifier of the index. This is the index name that is used to call get_nearest_examples() or Dataset.search(). By default it corresponds to column.
- 
							host (str, optional, defaults to localhost) — Host where ElasticSearch is running.
- 
							port (int, optional, defaults to 9200) — Port where ElasticSearch is running.
- 
							es_client (elasticsearch.Elasticsearch, optional) — The elasticsearch client used to create the index if host and port are None.
- 
							es_index_name (str, optional) — The elasticsearch index name used to create the index.
- 
							es_index_config (dict, optional) — The configuration of the elasticsearch index. Default config is:
Add a text index using ElasticSearch for fast retrieval. This is done in-place.
Example:
>>> es_client = elasticsearch.Elasticsearch()
>>> ds = datasets.load_dataset('crime_and_punish', split='train')
>>> ds.add_elasticsearch_index(column='line', es_client=es_client, es_index_name="my_es_index")
>>> scores, retrieved_examples = ds.get_nearest_examples('line', 'my new query', k=10)
load_elasticsearch_index
< source >( index_name: str es_index_name: str host: typing.Optional[str] = None port: typing.Optional[int] = None es_client: typing.Optional[ForwardRef('Elasticsearch')] = None es_index_config: typing.Optional[dict] = None )
Parameters
- 
							index_name (str) — The index_name/identifier of the index. This is the index name that is used to call get_nearest or search.
- 
							es_index_name (str) — The name of elasticsearch index to load.
- 
							host (str, optional, defaults to localhost) — Host where ElasticSearch is running.
- 
							port (int, optional, defaults to 9200) — Port where ElasticSearch is running.
- 
							es_client (elasticsearch.Elasticsearch, optional) — The elasticsearch client used to create the index if host and port are None.
- 
							es_index_config (dict, optional) — The configuration of the elasticsearch index. Default config is:
Load an existing text index using ElasticSearch for fast retrieval.
List the index_name/identifiers of all the attached indexes.
drop_index
< source >( index_name: str )
Drop the index with the specified column.
search
< source >(
			index_name: str
				query: typing.Union[str, <built-in function array>]
				k: int = 10
				**kwargs
				
			)
			→
				(scores, indices)
Parameters
- 
							index_name (str) — The name/identifier of the index.
- 
							query (Union[str, np.ndarray]) — The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
- 
							k (int) — The number of examples to retrieve.
Returns
(scores, indices)
A tuple of (scores, indices) where:
- scores (List[List[float]]): the retrieval scores from either FAISS (IndexFlatL2 by default) or ElasticSearch of the retrieved examples
- indices (List[List[int]]): the indices of the retrieved examples
Find the nearest examples indices in the dataset to the query.
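To make the (scores, indices) return value concrete, here is a brute-force sketch of what a flat L2 vector index computes for one query. It assumes squared L2 distances as scores (smaller means closer), which is what a flat L2 index reports; brute_force_search is a hypothetical stand-in written in plain Python, not the Faiss or datasets API, and an ElasticSearch text index scores differently.

```python
def brute_force_search(vectors, query, k=10):
    """Return (scores, indices) of the k vectors closest to query.

    Scores are squared L2 distances, so smaller means closer.
    """
    scored = []
    for index, vector in enumerate(vectors):
        distance = sum((a - b) ** 2 for a, b in zip(vector, query))
        scored.append((distance, index))
    scored.sort()  # ascending distance, ties broken by index
    top = scored[:k]
    scores = [distance for distance, _ in top]
    indices = [index for _, index in top]
    return scores, indices


vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
scores, indices = brute_force_search(vectors, query=[1.0, 1.0], k=2)
```

Here the second vector is nearest (distance 1.0), followed by the first (distance 2.0).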
search_batch
< source >(
			index_name: str
				queries: typing.Union[typing.List[str], <built-in function array>]
				k: int = 10
				**kwargs
				
			)
			→
				(total_scores, total_indices)
Parameters
- 
							index_name (str) — The index_name/identifier of the index.
- 
							queries (Union[List[str], np.ndarray]) — The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
- 
							k (int) — The number of examples to retrieve per query.
Returns
(total_scores, total_indices)
A tuple of (total_scores, total_indices) where:
- total_scores (List[List[float]]): the retrieval scores from either FAISS (IndexFlatL2 by default) or ElasticSearch of the retrieved examples per query
- total_indices (List[List[int]]): the indices of the retrieved examples per query
Find the nearest examples indices in the dataset to the query.
get_nearest_examples
< source >(
			index_name: str
				query: typing.Union[str, <built-in function array>]
				k: int = 10
				**kwargs
				
			)
			→
				(scores, examples)
Parameters
- 
							index_name (str) — The index_name/identifier of the index.
- 
							query (Union[str, np.ndarray]) — The query as a string if index_name is a text index or as a numpy array if index_name is a vector index.
- 
							k (int) — The number of examples to retrieve.
Returns
(scores, examples)
A tuple of (scores, examples) where:
- scores (List[float]): the retrieval scores from either FAISS (IndexFlatL2 by default) or ElasticSearch of the retrieved examples
- examples (dict): the retrieved examples
Find the nearest examples in the dataset to the query.
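The examples part of the result is a single dict keyed by column name, with one list entry per retrieved row. A minimal sketch of that shape, assuming the retrieved row positions are already known; gather_examples is a hypothetical helper and a plain dict of lists stands in for the dataset.

```python
def gather_examples(columns, indices):
    """Select rows by position and return them as a dict of columns."""
    return {name: [values[i] for i in indices] for name, values in columns.items()}


columns = {"line": ["a", "b", "c"], "id": [0, 1, 2]}
examples = gather_examples(columns, indices=[2, 0])
```

The rows keep the order of the indices, so the best match comes first in every column.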
get_nearest_examples_batch
< source >(
			index_name: str
				queries: typing.Union[typing.List[str], <built-in function array>]
				k: int = 10
				**kwargs
				
			)
			→
				(total_scores, total_examples)
Parameters
- 
							index_name (str) — The index_name/identifier of the index.
- 
							queries (Union[List[str], np.ndarray]) — The queries as a list of strings if index_name is a text index or as a numpy array if index_name is a vector index.
- 
							k (int) — The number of examples to retrieve per query.
Returns
(total_scores, total_examples)
A tuple of (total_scores, total_examples) where:
- total_scores (List[List[float]]): the retrieval scores from either FAISS (IndexFlatL2 by default) or ElasticSearch of the retrieved examples per query
- total_examples (List[dict]): the retrieved examples per query
Find the nearest examples in the dataset to the query.
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
from_csv
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False num_proc: typing.Optional[int] = None **kwargs )
Parameters
- 
							path_or_paths (path-like or list of path-like) — Path(s) of the CSV file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to pandas.read_csv.
Create Dataset from CSV file(s).
from_json
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False field: typing.Optional[str] = None num_proc: typing.Optional[int] = None **kwargs )
Parameters
- 
							path_or_paths (path-like or list of path-like) — Path(s) of the JSON or JSON Lines file(s).
- split (NamedSplit, optional) — Split name to be assigned to the dataset.
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							field (str, optional) — Field name of the JSON file in which the dataset is contained.
- 
							num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to JsonConfig.
Create Dataset from JSON or JSON Lines file(s).
from_parquet
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[typing.List[str]] = None num_proc: typing.Optional[int] = None **kwargs )
Parameters
- 
							path_or_paths (path-like or list of path-like) — Path(s) of the Parquet file(s).
- 
							split (NamedSplit, optional) — Split name to be assigned to the dataset.
- 
							features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							columns (List[str], optional) — If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- 
							num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to ParquetConfig.
Create Dataset from Parquet file(s).
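The prefix rule for the columns argument can be illustrated with a small sketch. It assumes the prefix is matched on whole dotted path segments, so 'a' selects 'a.b' but not a separate field named 'ab'; select_columns is a hypothetical helper and the flat list of dotted names is a simplification of a real nested schema.

```python
def select_columns(all_fields, requested):
    """Keep fields that equal a requested name or are nested under it."""
    selected = []
    for field in all_fields:
        for name in requested:
            if field == name or field.startswith(name + "."):
                selected.append(field)
                break
    return selected


fields = ["a.b", "a.c", "a.d.e", "ab", "x"]
picked = select_columns(fields, ["a"])
```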
from_text
< source >( path_or_paths: typing.Union[str, bytes, os.PathLike, typing.List[typing.Union[str, bytes, os.PathLike]]] split: typing.Optional[datasets.splits.NamedSplit] = None features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False num_proc: typing.Optional[int] = None **kwargs )
Parameters
- 
							path_or_paths (path-like or list of path-like) — Path(s) of the text file(s).
- 
							split (NamedSplit, optional) — Split name to be assigned to the dataset.
- 
							features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. This is helpful if the dataset is made of multiple files. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to TextConfig.
Create Dataset from text file(s).
from_sql
< source >( sql: typing.Union[str, ForwardRef('sqlalchemy.sql.Selectable')] con: typing.Union[str, ForwardRef('sqlalchemy.engine.Connection'), ForwardRef('sqlalchemy.engine.Engine'), ForwardRef('sqlite3.Connection')] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )
Parameters
- 
							sql (str or sqlalchemy.sql.Selectable) — SQL query to be executed or a table name.
- 
							con (str or sqlite3.Connection or sqlalchemy.engine.Connection or sqlalchemy.engine.Engine) — A URI string used to instantiate a database connection or a SQLite3/SQLAlchemy connection object.
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to SqlConfig.
Create Dataset from SQL query or database table.
Example:
>>> # Fetch a database table
>>> ds = Dataset.from_sql("test_data", "postgres:///db_name")
>>> # Execute a SQL query on the table
>>> ds = Dataset.from_sql("SELECT sentence FROM test_data", "postgres:///db_name")
>>> # Use a Selectable object to specify the query
>>> from sqlalchemy import select, text
>>> stmt = select([text("sentence")]).select_from(text("test_data"))
>>> ds = Dataset.from_sql(stmt, "postgres:///db_name")
The returned dataset can only be cached if con is specified as a URI string.
prepare_for_task
< source >( task: typing.Union[str, datasets.tasks.base.TaskTemplate] id: int = 0 )
Parameters
- 
							task (Union[str, TaskTemplate]) — The task to prepare the dataset for during training and evaluation. If str, supported tasks include "text-classification" and "question-answering". If TaskTemplate, must be one of the task templates in datasets.tasks.
- 
							id (int, defaults to 0) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.
align_labels_with_mapping
< source >( label2id: typing.Dict label_column: str )
Align the dataset’s label ID and label name mapping to match an input label2id mapping.
This is useful when you want to ensure that a model’s predicted labels are aligned with the dataset.
The alignment is done using the lowercase label names.
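The matching on lowercase names can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: align_label_ids is a hypothetical helper, dataset_names plays the role of the dataset's id-to-name mapping, and every dataset label name is assumed to appear in label2id.

```python
def align_label_ids(label_ids, dataset_names, label2id):
    """Remap integer labels so they follow the ids given in label2id.

    dataset_names[i] is the dataset's name for label id i.
    """
    # Compare names case-insensitively by lowercasing both sides.
    target = {name.lower(): new_id for name, new_id in label2id.items()}
    aligned = []
    for label_id in label_ids:
        name = dataset_names[label_id].lower()
        aligned.append(target[name])
    return aligned


dataset_names = ["entailment", "neutral", "contradiction"]
label2id = {"CONTRADICTION": 0, "NEUTRAL": 1, "ENTAILMENT": 2}
aligned = align_label_ids([0, 1, 2, 0], dataset_names, label2id)
```

Rows labelled "entailment" (id 0 in the dataset) end up with id 2, the value the target mapping assigns to that name.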
Example:
>>> # dataset with mapping {'entailment': 0, 'neutral': 1, 'contradiction': 2}
>>> ds = load_dataset("glue", "mnli", split="train")
>>> # mapping to align with
>>> label2id = {'CONTRADICTION': 0, 'NEUTRAL': 1, 'ENTAILMENT': 2}
>>> ds_aligned = ds.align_labels_with_mapping(label2id, "label")
datasets.concatenate_datasets
< source >( dsets: typing.List[~DatasetType] info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None axis: int = 0 )
Parameters
- 
							dsets (List[datasets.Dataset]) — List of Datasets to concatenate.
- 
							info (DatasetInfo, optional) — Dataset information, like description, citation, etc.
- 
							split (NamedSplit, optional) — Name of the dataset split.
- 
							axis ({0, 1}, defaults to 0) — Axis to concatenate over, where 0 means over rows (vertically) and 1 means over columns (horizontally). Added in 1.6.0.
Converts a list of Dataset with the same schema into a single Dataset.
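The two values of axis behave differently enough that a small sketch may help. Below, dicts of column lists stand in for datasets and concatenate is a hypothetical helper; the validation rules (same columns for axis=0, distinct column names and equal row counts for axis=1) are assumptions of this sketch, not a statement of the library's exact checks.

```python
def concatenate(tables, axis=0):
    """Concatenate column dicts over rows (axis=0) or columns (axis=1)."""
    if axis == 0:
        names = list(tables[0])
        if any(list(table) != names for table in tables):
            raise ValueError("axis=0 needs the same columns in every table")
        return {name: [v for table in tables for v in table[name]] for name in names}
    if axis == 1:
        merged = {}
        for table in tables:
            for name, values in table.items():
                if name in merged:
                    raise ValueError(f"duplicate column {name!r}")
                merged[name] = list(values)
        lengths = {len(values) for values in merged.values()}
        if len(lengths) > 1:
            raise ValueError("axis=1 needs the same number of rows in every table")
        return merged
    raise ValueError("axis must be 0 or 1")


rows = concatenate([{"a": [0, 1]}, {"a": [2]}], axis=0)
cols = concatenate([{"a": [0, 1]}, {"b": ["x", "y"]}], axis=1)
```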
datasets.interleave_datasets
< source >( datasets: typing.List[~DatasetType] probabilities: typing.Optional[typing.List[float]] = None seed: typing.Optional[int] = None info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None stopping_strategy: typing.Literal['first_exhausted', 'all_exhausted'] = 'first_exhausted' ) → Dataset or IterableDataset
Parameters
- 
							datasets (List[Dataset] or List[IterableDataset]) — List of datasets to interleave.
- 
							probabilities (List[float], optional, defaults to None) — If specified, the new dataset is constructed by sampling examples from one source at a time according to these probabilities.
- 
							seed (int, optional, defaults to None) — The random seed used to choose a source for each example.
- 
							info (DatasetInfo, optional) — Dataset information, like description, citation, etc. Added in 2.4.0.
- 
							split (NamedSplit, optional) — Name of the dataset split. Added in 2.4.0.
- 
							stopping_strategy (str, defaults to first_exhausted) — Two strategies are available: first_exhausted and all_exhausted. The default, first_exhausted, is an undersampling strategy: the dataset construction stops as soon as one dataset has run out of samples. all_exhausted is an oversampling strategy: the dataset construction stops as soon as every sample of every dataset has been added at least once. Note that with all_exhausted the interleaved dataset can get enormous: with no probabilities, the resulting dataset will have max_length_datasets * nb_dataset samples; with given probabilities, it will have more samples if some datasets have a very low probability of being visited.
Returns
The return type depends on the input datasets parameter: Dataset if the input is a list of Dataset, IterableDataset if the input is a list of IterableDataset.
Interleave several datasets (sources) into a single dataset. The new dataset is constructed by alternating between the sources to get the examples.
You can use this function on a list of Dataset objects, or on a list of IterableDataset objects.
- If probabilities is None (default) the new dataset is constructed by cycling between each source to get the examples.
- If probabilities is not None, the new dataset is constructed by getting examples from a random source at a time according to the provided probabilities.
The resulting dataset ends when one of the source datasets runs out of examples, except when stopping_strategy is all_exhausted,
in which case the resulting dataset ends when all datasets have run out of examples at least once.
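For the case without probabilities, the cycling and the two stopping strategies can be sketched in a few lines of plain Python. Here interleave is a hypothetical helper working on lists, not the library's implementation; it assumes non-empty sources and that shorter sources wrap around to their start when oversampling.

```python
def interleave(sources, stopping_strategy="first_exhausted"):
    """Cycle through sources round by round, without probabilities."""
    if stopping_strategy == "first_exhausted":
        rounds = min(len(source) for source in sources)
    elif stopping_strategy == "all_exhausted":
        rounds = max(len(source) for source in sources)
    else:
        raise ValueError(f"unknown stopping strategy {stopping_strategy!r}")
    result = []
    for r in range(rounds):
        for source in sources:
            # Shorter sources wrap around when oversampling.
            result.append(source[r % len(source)])
    return result


d1, d2, d3 = [0, 1, 2], [10, 11, 12, 13], [20, 21, 22, 23, 24]
under = interleave([d1, d2, d3])
over = interleave([d1, d2, d3], stopping_strategy="all_exhausted")
```

With sources of lengths 3, 4 and 5, undersampling yields 3 * 3 = 9 items and oversampling yields 5 * 3 = 15, in line with the max_length_datasets * nb_dataset size given for all_exhausted.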
Note for iterable datasets:
In a distributed setup or in PyTorch DataLoader workers, the stopping strategy is applied per process. Therefore the "first_exhausted" strategy on a sharded iterable dataset can generate fewer samples in total (up to 1 missing sample per subdataset per worker).
Example:
For regular datasets (map-style):
>>> from datasets import Dataset, interleave_datasets
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22]})
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 10, 0, 1, 2, 21, 0, 11, 1, 2, 0, 1, 12, 2, 10, 0, 22]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> dataset["a"]
[10, 0, 11, 1, 2]
>>> dataset = interleave_datasets([d1, d2, d3])
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> d1 = Dataset.from_dict({"a": [0, 1, 2]})
>>> d2 = Dataset.from_dict({"a": [10, 11, 12, 13]})
>>> d3 = Dataset.from_dict({"a": [20, 21, 22, 23, 24]})
>>> dataset = interleave_datasets([d1, d2, d3])
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22]
>>> dataset = interleave_datasets([d1, d2, d3], stopping_strategy="all_exhausted")
>>> dataset["a"]
[0, 10, 20, 1, 11, 21, 2, 12, 22, 0, 13, 23, 1, 10, 24]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42)
>>> dataset["a"]
[10, 0, 11, 1, 2]
>>> dataset = interleave_datasets([d1, d2, d3], probabilities=[0.7, 0.2, 0.1], seed=42, stopping_strategy="all_exhausted")
>>> dataset["a"]
[10, 0, 11, 1, 2, 20, 12, 13, ..., 0, 1, 2, 0, 24]
For datasets in streaming mode (iterable):
>>> from datasets import load_dataset, interleave_datasets
>>> d1 = load_dataset("oscar", "unshuffled_deduplicated_en", split="train", streaming=True)
>>> d2 = load_dataset("oscar", "unshuffled_deduplicated_fr", split="train", streaming=True)
>>> dataset = interleave_datasets([d1, d2])
>>> iterator = iter(dataset)
>>> next(iterator)
{'text': 'Mtendere Village was inspired by the vision...}
>>> next(iterator)
{'text': "Média de débat d'idées, de culture...}
datasets.distributed.split_dataset_by_node
< source >( dataset: DatasetType rank: int world_size: int ) → Dataset or IterableDataset
Parameters
- dataset (Dataset or IterableDataset) — The dataset to split by node.
- rank (int) — Rank of the current node.
- world_size (int) — Total number of nodes.
Returns
The dataset to be used on the node at rank rank.
Split a dataset for the node at rank rank in a pool of nodes of size world_size.
For map-style datasets:
Each node is assigned a chunk of data, e.g. rank 0 is given the first chunk of the dataset. To maximize data loading throughput, chunks are made of contiguous data on disk if possible.
For iterable datasets:
If the dataset has a number of shards that is a multiple of world_size (i.e. if dataset.n_shards % world_size == 0),
then the shards are evenly assigned across the nodes, which is the most efficient option.
Otherwise, each node keeps 1 example out of world_size, skipping the other examples.
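The two assignment schemes can be sketched on plain Python lists. Both helper names below are hypothetical, and the exact chunk boundaries are an assumption made for illustration; the library may divide a map-style dataset differently.

```python
from typing import Any, List, Sequence


def contiguous_chunk(examples: Sequence[Any], rank: int, world_size: int) -> List[Any]:
    """Map-style idea: each rank gets one contiguous chunk of the data."""
    base, extra = divmod(len(examples), world_size)
    # Assumption: the first `extra` ranks receive one additional example.
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return list(examples[start:end])


def keep_one_in_world_size(examples: Sequence[Any], rank: int, world_size: int) -> List[Any]:
    """Iterable fallback: keep 1 example out of world_size, skip the others."""
    return [ex for i, ex in enumerate(examples) if i % world_size == rank]


data = list(range(10))
chunks = [contiguous_chunk(data, rank, 3) for rank in range(3)]
strided = [keep_one_in_world_size(data, rank, 3) for rank in range(3)]
```

In both schemes every example ends up on exactly one rank; the contiguous variant favours sequential reads, while the strided variant works without knowing the length in advance.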
When applying transforms on a dataset, the data are stored in cache files. The caching mechanism allows reloading an existing cache file if it has already been computed.
Reloading a dataset is possible because the cache files are named using the dataset fingerprint, which is updated after each transform.
If caching is disabled, the library will no longer reload cached dataset files when applying transforms to the datasets. More precisely, if caching is disabled:
- cache files are always recreated
- cache files are written to a temporary directory that is deleted when the session closes
- cache files are named using a random hash instead of the dataset fingerprint
- use save_to_disk() to save a transformed dataset, or it will be deleted when the session closes
- caching doesn’t affect load_dataset(). If you want to regenerate a dataset from scratch you should use the download_mode parameter in load_dataset().
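The idea of naming cache entries after a fingerprint that changes with every transform can be modelled in a few lines. This is a toy sketch only: ToyCache and update_fingerprint are hypothetical names, the hashing scheme is an assumption, and an in-memory dict stands in for the cache files on disk.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def update_fingerprint(fingerprint: str, transform_name: str) -> str:
    """Derive a new fingerprint from the previous one and the transform applied."""
    digest = hashlib.sha256(f"{fingerprint}:{transform_name}".encode("utf-8"))
    return digest.hexdigest()[:16]


class ToyCache:
    """In-memory stand-in for cache files keyed by dataset fingerprint."""

    def __init__(self, enabled: bool = True) -> None:
        self.enabled = enabled
        self.files: Dict[str, List[int]] = {}
        self.computations = 0  # how many times a transform was actually run

    def apply(self, data: List[int], fingerprint: str,
              transform: Callable[[int], int]) -> Tuple[List[int], str]:
        new_fingerprint = update_fingerprint(fingerprint, transform.__name__)
        if self.enabled and new_fingerprint in self.files:
            return self.files[new_fingerprint], new_fingerprint  # reload
        result = [transform(x) for x in data]
        self.computations += 1
        if self.enabled:
            self.files[new_fingerprint] = result
        return result, new_fingerprint


def double(x: int) -> int:
    return 2 * x


cached = ToyCache(enabled=True)
cached.apply([1, 2, 3], "root", double)
cached.apply([1, 2, 3], "root", double)      # same fingerprint: reloaded

uncached = ToyCache(enabled=False)
uncached.apply([1, 2, 3], "root", double)
uncached.apply([1, 2, 3], "root", double)    # caching disabled: recomputed
```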
DatasetDict
Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset objects as values.
It also has dataset transform methods like map or filter, to process all the splits at once.
A dictionary (dict of str: datasets.Dataset) with dataset transforms methods (map, filter, etc.)
The Apache Arrow tables backing each split.
The cache files containing the Apache Arrow table backing each split.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.cache_files
{'test': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-test.arrow'}],
 'train': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-train.arrow'}],
 'validation': [{'filename': '/root/.cache/huggingface/datasets/rotten_tomatoes_movie_review/default/1.0.0/40d411e45a6ce3484deed7cc15b82a53dad9a72aafd9f86f8f227134bec5ca46/rotten_tomatoes_movie_review-validation.arrow'}]}
Number of columns in each split of the dataset.
Number of rows in each split of the dataset (same as datasets.Dataset.__len__()).
Names of the columns in each split of the dataset.
Shape of each split of the dataset (number of rows, number of columns).
unique
< source >( column: str ) → Dict[str, list]
Parameters
- column (str) — Column name (list all the column names with column_names).
Returns
Dict[str, list]
Dictionary of unique elements in the given column.
Return a list of the unique elements in a column for each split.
This is implemented in the low-level backend and is therefore very fast.
Clean up all cache files in the dataset cache directory, except the currently used cache file if there is one. Be careful when running this command that no other process is currently using other cache files.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False with_rank: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 features: typing.Optional[datasets.features.features.Features] = None disable_nullable: bool = False fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )
Parameters
- function (callable) — With one of the following signatures:
  - function(example: Dict[str, Any]) -> Dict[str, Any] if batched=False and with_indices=False
  - function(example: Dict[str, Any], indices: int) -> Dict[str, Any] if batched=False and with_indices=True
  - function(batch: Dict[str, List]) -> Dict[str, List] if batched=True and with_indices=False
  - function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List] if batched=True and with_indices=True
  For advanced usage, the function can also return a pyarrow.Table. Moreover, if your function returns nothing (None), then map will run your function and return the dataset unchanged.
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ....
- with_rank (bool, defaults to False) — Provide process rank to function. Note that in this case the signature of function should be def function(example[, idx], rank): ....
- input_columns ([Union[str, List[str]]], optional, defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, then provide the full dataset as a single batch to function.
- drop_last_batch (bool, defaults to False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- remove_columns ([Union[str, List[str]]], optional, defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_names ([Dict[str, str]], optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- features ([datasets.Features], optional, defaults to None) — Use a specific Features to store the cache file instead of the automatically generated one.
- disable_nullable (bool, defaults to False) — Disallow null values in the table.
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
- num_proc (int, optional, defaults to None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while mapping examples.
Apply a function to all the elements in the table (individually or in batches) and update the table (if function does update examples). The transformation is applied to all the datasets of the dataset dictionary.
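The function signatures listed in the parameters can be made concrete with a small stand-in driver. The helper toy_map below is hypothetical and operates on a list of dicts; it is a simplified sketch (it assumes a batched function returns as many rows as it received) and not how the library applies the function.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def toy_map(rows: List[Row], function: Callable, batched: bool = False,
            with_indices: bool = False, batch_size: int = 1000) -> List[Row]:
    """Call `function` with the signature implied by batched/with_indices."""
    out: List[Row] = []
    if not batched:
        for i, row in enumerate(rows):
            update = function(dict(row), i) if with_indices else function(dict(row))
            out.append({**row, **(update or {})})  # None leaves the row unchanged
        return out
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [r[key] for r in chunk] for key in chunk[0]}  # dict of lists
        indices = list(range(start, start + len(chunk)))
        update = function(batch, indices) if with_indices else function(batch)
        merged = {**batch, **(update or {})}
        out.extend({key: merged[key][j] for key in merged} for j in range(len(chunk)))
    return out


def add_prefix(example: Row) -> Row:
    # batched=False, with_indices=False: one example in, one dict out
    return {"text": "Review: " + example["text"]}


def add_length(batch: Dict[str, List], indices: List[int]) -> Dict[str, List]:
    # batched=True, with_indices=True: columns are lists, indices is a list
    return {"n_chars": [len(t) for t in batch["text"]], "idx": indices}


rows = [{"text": "good"}, {"text": "bad"}, {"text": "fine"}]
prefixed = toy_map(rows, add_prefix)
measured = toy_map(rows, add_length, batched=True, with_indices=True, batch_size=2)
```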
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> ds["train"][0:3]["text"]
['Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .',
 'Review: effective but too-tepid biopic']
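# process a batch of examples
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda example: tokenizer(example["text"]), batched=True)
# set number of processors
>>> ds = ds.map(add_prefix, num_proc=4)
filter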
< source >( function with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None num_proc: typing.Optional[int] = None desc: typing.Optional[str] = None )
Parameters
- function (callable) — With one of the following signatures:
  - function(example: Dict[str, Any]) -> bool if with_indices=False, batched=False
  - function(example: Dict[str, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Dict[str, List]) -> List[bool] if with_indices=False, batched=True
  - function(example: Dict[str, List], indices: List[int]) -> List[bool] if with_indices=True, batched=True
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ....
- input_columns ([Union[str, List[str]]], optional, defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, then provide the full dataset as a single batch to function.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- cache_file_names ([Dict[str, str]], optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the results of the computation instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
- num_proc (int, optional, defaults to None) — Number of processes for multiprocessing. By default it doesn’t use multiprocessing.
- desc (str, optional, defaults to None) — Meaningful description to be displayed alongside the progress bar while filtering examples.
Apply a filter function to all the elements in the table in batches and update the table so that the dataset only includes the examples for which the filter function returns True. The transformation is applied to all the datasets of the dataset dictionary.
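For the batched case, the filter function receives a dict of lists and returns one boolean per example. The sketch below models that contract on a list of dicts; toy_filter is a hypothetical helper written for illustration, not the library's code.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def toy_filter(rows: List[Row], function: Callable[[Dict[str, List]], List[bool]],
               batch_size: int = 1000) -> List[Row]:
    """Keep the rows whose entry in the returned boolean mask is True."""
    kept: List[Row] = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [r[key] for r in chunk] for key in chunk[0]}  # dict of lists
        mask = function(batch)  # one bool per example in the batch
        kept.extend(row for row, keep in zip(chunk, mask) if keep)
    return kept


def is_positive(batch: Dict[str, List]) -> List[bool]:
    # Batched signature: Dict[str, List] -> List[bool]
    return [label == 1 for label in batch["label"]]


rows = [
    {"text": "great", "label": 1},
    {"text": "dull", "label": 0},
    {"text": "fun", "label": 1},
]
positives = toy_filter(rows, is_positive, batch_size=2)
```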
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.filter(lambda x: x["label"] == 1)
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 4265
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 533
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 533
    })
})
sort
< source >( column_names: typing.Union[str, typing.Sequence[str]] reverse: typing.Union[bool, typing.Sequence[bool]] = False kind = 'deprecated' null_placement: str = 'at_end' keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- column_names (Union[str, Sequence[str]]) — Column name(s) to sort by.
- reverse (Union[bool, Sequence[bool]], defaults to False) — If True, sort by descending order rather than ascending. If a single bool is provided, the value is applied to the sorting of all column names. Otherwise a list of bools with the same length and order as column_names must be provided.
- kind (str, optional) — Pandas algorithm for sorting selected in {quicksort, mergesort, heapsort, stable}. The default is quicksort. Note that both stable and mergesort use timsort under the covers and, in general, the actual implementation will vary with data type. The mergesort option is retained for backwards compatibility. Deprecated in 2.8.0: kind was deprecated in version 2.10.0 and will be removed in 3.0.0.
- null_placement (str, defaults to at_end) — Put None values at the beginning if at_start or first, or at the end if at_end or last.
- keep_in_memory (bool, defaults to False) — Keep the sorted indices in memory instead of writing them to a cache file.
- load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the sorted indices can be identified, use it instead of recomputing.
- indices_cache_file_names ([Dict[str, str]], optional, defaults to None) — Provide the name of a path for the cache file. It is used to store the indices mapping instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. A higher value gives smaller cache files, a lower value consumes less temporary memory.
Create a new dataset sorted according to a single or multiple columns.
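How a per-column reverse list interacts with several sort columns can be shown with a pure-Python sketch. The helper sort_rows is hypothetical and ignores null_placement (it assumes no None values); it only models the ordering semantics, not the library's implementation.

```python
from typing import Any, Dict, List, Sequence, Union

Row = Dict[str, Any]


def sort_rows(rows: List[Row], column_names: Union[str, Sequence[str]],
              reverse: Union[bool, Sequence[bool]] = False) -> List[Row]:
    """Sort by several columns, each with its own ascending/descending flag."""
    if isinstance(column_names, str):
        column_names = [column_names]
    if isinstance(reverse, bool):
        reverse = [reverse] * len(column_names)  # one flag applied to all columns
    if len(reverse) != len(column_names):
        raise ValueError("reverse must have the same length as column_names")
    result = list(rows)
    # Stable sorts applied from the last key to the first give a multi-key sort.
    for column, descending in reversed(list(zip(column_names, reverse))):
        result.sort(key=lambda row: row[column], reverse=descending)
    return result


rows = [
    {"label": 0, "text": "b"},
    {"label": 1, "text": "c"},
    {"label": 1, "text": "a"},
    {"label": 0, "text": "a"},
]
by_label_desc_text_asc = sort_rows(rows, ["label", "text"], reverse=[True, False])
```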
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset('rotten_tomatoes')
>>> ds['train']['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
>>> sorted_ds = ds.sort('label')
>>> sorted_ds['train']['label'][:10]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> another_sorted_ds = ds.sort(['label', 'text'], reverse=[True, False])
>>> another_sorted_ds['train']['label'][:10]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
shuffle
< source >( seeds: typing.Union[int, typing.Dict[str, typing.Optional[int]], NoneType] = None seed: typing.Optional[int] = None generators: typing.Union[typing.Dict[str, numpy.random._generator.Generator], NoneType] = None keep_in_memory: bool = False load_from_cache_file: typing.Optional[bool] = None indices_cache_file_names: typing.Union[typing.Dict[str, typing.Optional[str]], NoneType] = None writer_batch_size: typing.Optional[int] = 1000 )
Parameters
- seeds (Dict[str, int] or int, optional) — A seed to initialize the default BitGenerator if generator=None. If None, then fresh, unpredictable entropy will be pulled from the OS. If an int or array_like[ints] is passed, then it will be passed to SeedSequence to derive the initial BitGenerator state. You can provide one seed per dataset in the dataset dictionary.
- seed (int, optional) — A seed to initialize the default BitGenerator if generator=None. Alias for seeds (a ValueError is raised if both are provided).
- generators (Dict[str, np.random.Generator], optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy). You have to provide one generator per dataset in the dataset dictionary.
- keep_in_memory (bool, defaults to False) — Keep the dataset in memory instead of writing it to a cache file.
- load_from_cache_file (Optional[bool], defaults to True if caching is enabled) — If a cache file storing the current computation from function can be identified, use it instead of recomputing.
- indices_cache_file_names (Dict[str, str], optional) — Provide the name of a path for the cache file. It is used to store the indices mappings instead of the automatically generated cache file name. You have to provide one cache_file_name per dataset in the dataset dictionary.
- writer_batch_size (int, defaults to 1000) — Number of rows per write operation for the cache file writer. This value is a good trade-off between memory usage during the processing, and processing speed. A higher value makes the processing do fewer lookups, a lower value consumes less temporary memory while running map.
Create a new Dataset where the rows are shuffled.
The transformation is applied to all the datasets of the dataset dictionary.
Currently shuffling uses numpy random generators. You can either supply a NumPy BitGenerator to use, or a seed to initiate NumPy’s default random generator (PCG64).
set_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns).
The format is set for every dataset in the dataset dictionary.
It is possible to call map after calling set_format. Since map may add new columns, then the list of formatted columns
gets updated. In this case, if you apply map on a dataset to add a new column, then this column will be formatted:
new formatted columns = (all columns - previously unformatted columns)
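The rule above is plain set arithmetic, which a short sketch can make explicit. The function name formatted_columns_after_map is hypothetical; it only restates the formula on lists of column names.

```python
from typing import List


def formatted_columns_after_map(columns_before: List[str],
                                formatted_before: List[str],
                                columns_after: List[str]) -> List[str]:
    """new formatted columns = all columns - previously unformatted columns."""
    previously_unformatted = set(columns_before) - set(formatted_before)
    return [c for c in columns_after if c not in previously_unformatted]


# 'text' was left unformatted by set_format, then map added 'input_ids'.
new_formatted = formatted_columns_after_map(
    columns_before=["text", "label"],
    formatted_before=["label"],
    columns_after=["text", "label", "input_ids"],
)
```

Here the column added by map joins the formatted columns, while text stays unformatted.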
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
>>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds["train"].format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
Reset __getitem__ return format to python objects and all columns.
The transformation is applied to all the datasets of the dataset dictionary.
Same as self.set_format()
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, padding=True), batched=True)
>>> ds.set_format(type="numpy", columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds["train"].format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'numpy'}
>>> ds.reset_format()
>>> ds["train"].format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
formatted_as
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
To be used in a with statement. Set __getitem__ return format (type and columns).
The transformation is applied to all the datasets of the dataset dictionary.
with_format
< source >( type: typing.Optional[str] = None columns: typing.Optional[typing.List] = None output_all_columns: bool = False **format_kwargs )
Parameters
- type (str, optional) — Output type selected in [None, 'numpy', 'torch', 'tensorflow', 'pandas', 'arrow', 'jax']. None means __getitem__ returns python objects (default).
- columns (List[str], optional) — Columns to format in the output. None means __getitem__ returns all columns (default).
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects).
- **format_kwargs (additional keyword arguments) — Keyword arguments passed to the convert function like np.array, torch.tensor or tensorflow.ragged.constant.
Set __getitem__ return format (type and columns). The data formatting is applied on-the-fly.
The format type (for example “numpy”) is used to format batches when using __getitem__.
The format is set for every dataset in the dataset dictionary.
It’s also possible to use custom transforms for formatting using with_transform().
Contrary to set_format(), with_format returns a new DatasetDict object with new Dataset objects.
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> ds = ds.map(lambda x: tokenizer(x['text'], truncation=True, padding=True), batched=True)
>>> ds["train"].format
{'columns': ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': None}
>>> ds = ds.with_format(type='tensorflow', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
>>> ds["train"].format
{'columns': ['input_ids', 'token_type_ids', 'attention_mask', 'label'],
 'format_kwargs': {},
 'output_all_columns': False,
 'type': 'tensorflow'}
with_transform
< source >( transform: typing.Optional[typing.Callable] columns: typing.Optional[typing.List] = None output_all_columns: bool = False )
Parameters
- transform (Callable, optional) — User-defined formatting transform, replaces the format defined by set_format(). A formatting function is a callable that takes a batch (as a dict) as input and returns a batch. This function is applied right before returning the objects in __getitem__.
- columns (List[str], optional) — Columns to format in the output. If specified, then the input batch of the transform only contains those columns.
- output_all_columns (bool, defaults to False) — Keep un-formatted columns as well in the output (as python objects). If set to True, then the other un-formatted columns are kept with the output of the transform.
Set __getitem__ return format using this transform. The transform is applied on-the-fly on batches when __getitem__ is called.
The transform is set for every dataset in the dataset dictionary.
As with set_format(), this can be reset using reset_format().
Contrary to set_transform(), with_transform returns a new DatasetDict object with new Dataset objects.
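The on-the-fly behaviour, and the fact that a new object is returned, can be modelled with a tiny stand-in class. ToySplit is hypothetical, and wrapping a single row into a one-element batch before calling the transform is an assumption made for this sketch.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Any]
Batch = Dict[str, List[Any]]


class ToySplit:
    """Stand-in for one split: stores raw rows, formats them when accessed."""

    def __init__(self, rows: List[Row],
                 transform: Optional[Callable[[Batch], Batch]] = None) -> None:
        self._rows = rows
        self._transform = transform

    def with_transform(self, transform: Callable[[Batch], Batch]) -> "ToySplit":
        # Returns a new object; the original keeps its own (absent) transform.
        return ToySplit(self._rows, transform)

    def __getitem__(self, index: int) -> Row:
        row = self._rows[index]
        if self._transform is None:
            return row
        batch = {key: [value] for key, value in row.items()}  # batch of size 1
        formatted = self._transform(batch)  # applied only now, at access time
        return {key: values[0] for key, values in formatted.items()}


def shout(batch: Batch) -> Batch:
    return {"text": [t.upper() for t in batch["text"]]}


plain = ToySplit([{"text": "a fine film"}])
loud = plain.with_transform(shout)
```

Nothing is precomputed or stored: loud[0] runs shout each time it is accessed, while plain[0] still returns the raw row.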
Example:
>>> from datasets import load_dataset
>>> from transformers import AutoTokenizer
>>> ds = load_dataset("rotten_tomatoes")
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
>>> def encode(example):
...     return tokenizer(example['text'], truncation=True, padding=True, return_tensors="pt")
>>> ds = ds.with_transform(encode)
>>> ds["train"][0]
{'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 'input_ids': tensor([  101,  1103,  2067,  1110, 17348,  1106,  1129,  1103,  6880,  1432,
        112,   188,  1207,   107, 14255,  1389,   107,  1105,  1115,  1119,
        112,   188,  1280,  1106,  1294,   170, 24194,  1256,  3407,  1190,
        170, 11791,  5253,   188,  1732,  7200, 10947, 12606,  2895,   117,
        179,  7766,   118,   172, 15554,  1181,  3498,  6961,  3263,  1137,
        188,  1566,  7912, 14516,  6997,   119,   102]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0])}
Flatten the Apache Arrow Table of each split (nested features are flattened). Each column with a struct type is flattened into one column per struct field. Other columns are left unchanged.
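The naming scheme for flattened columns (parent.field) can be illustrated on a plain nested dict. The helper flatten_row is hypothetical and treats every dict value as a struct; it is a sketch of the idea, not of the Arrow-level operation.

```python
from typing import Any, Dict


def flatten_row(row: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Turn each struct-like (dict) column into one column per field."""
    flat: Dict[str, Any] = {}
    for name, value in row.items():
        if isinstance(value, dict):
            # Recurse so that nested structs are flattened as well.
            for field, field_value in flatten_row(value, sep).items():
                flat[f"{name}{sep}{field}"] = field_value
        else:
            flat[name] = value  # non-struct columns are left unchanged
    return flat


row = {
    "id": "q1",
    "question": "Who?",
    "answers": {"text": ["someone"], "answer_start": [3]},
}
flat_row = flatten_row(row)
```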
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("squad")
>>> ds["train"].features
{'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
>>> ds.flatten()
DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers.text', 'answers.answer_start'],
        num_rows: 10570
    })
})
cast
< source >( features: Features )
Parameters
- features (Features) — New features to cast the dataset to. The name and order of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
Cast the dataset to a new set of features. The transformation is applied to all the datasets of the dataset dictionary.
You can also change the features using Dataset.map() with the features argument, but cast
is in-place (doesn’t copy the data to a new dataset) and is thus faster.
Example:
>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("rotten_tomatoes")
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds["train"].features.copy()
>>> new_features['label'] = ClassLabel(names=['bad', 'good'])
>>> new_features['text'] = Value('large_string')
>>> ds = ds.cast(new_features)
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}
cast_column
< source >( column: str feature )
Cast column to feature for decoding.
Example:
>>> from datasets import load_dataset, ClassLabel
>>> ds = load_dataset("rotten_tomatoes")
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='string', id=None)}
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] )
Remove one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
You can also remove a column using Dataset.map() with remove_columns but the present method
is in-place (doesn’t copy the data to a new dataset) and is thus faster.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.remove_columns("label")
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1066
    })
})
rename_column
< source >( original_column_name: str new_column_name: str )
Rename a column in the dataset and move the features associated to the original column under the new column name. The transformation is applied to all the datasets of the dataset dictionary.
You can also rename a column using map() with remove_columns but the present method:
- takes care of moving the original features under the new column name.
- doesn’t copy the data to a new dataset and is thus much faster.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.rename_column("label", "label_new")
DatasetDict({
    train: Dataset({
        features: ['text', 'label_new'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text', 'label_new'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text', 'label_new'],
        num_rows: 1066
    })
})
rename_columns
< source >( column_mapping: typing.Dict[str, str] ) → DatasetDict
Parameters
- column_mapping (Dict[str, str]) — A mapping of columns to rename to their new names.
Returns
DatasetDict
A copy of the dataset with renamed columns.
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The transformation is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.rename_columns({'text': 'text_new', 'label': 'label_new'})
DatasetDict({
    train: Dataset({
        features: ['text_new', 'label_new'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text_new', 'label_new'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text_new', 'label_new'],
        num_rows: 1066
    })
})
select_columns
< source >( column_names: typing.Union[str, typing.List[str]] )
Select one or several column(s) from each split in the dataset and the features associated to the column(s).
The transformation is applied to all the splits of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes")
>>> ds.select_columns("text")
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 8530
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1066
    })
    test: Dataset({
        features: ['text'],
        num_rows: 1066
    })
})
class_encode_column
< source >( column: str include_nulls: bool = False )
Casts the given column as ClassLabel and updates the tables.
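To make the effect concrete, here is a pure-Python sketch of what class encoding amounts to. It is not the library's implementation: it assumes the class names are the sorted string forms of the distinct values, which is consistent with the example output shown for this method, and the helper name encode_as_class_labels is hypothetical.

```python
from typing import Any, List, Tuple


def encode_as_class_labels(values: List[Any]) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: map raw column values to (class names, integer ids)."""
    # Assumption: names are the sorted string forms of the distinct values.
    names = sorted({str(value) for value in values})
    index = {name: i for i, name in enumerate(names)}
    # Each value is replaced by the position of its name in `names`.
    return names, [index[str(value)] for value in values]


names, ids = encode_as_class_labels([True, False, True])
print(names)  # ['False', 'True']
print(ids)    # [1, 0, 1]
```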
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("boolq")
>>> ds["train"].features
{'answer': Value(dtype='bool', id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
>>> ds = ds.class_encode_column("answer")
>>> ds["train"].features
{'answer': ClassLabel(num_classes=2, names=['False', 'True'], id=None),
 'passage': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None)}
push_to_hub
< source >( repo_id config_name: str = 'default' private: typing.Optional[bool] = False token: typing.Optional[str] = None branch: NoneType = None max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Union[typing.Dict[str, int], NoneType] = None embed_external_files: bool = True )
Parameters
- 
							repo_id (str) — The ID of the repository to push to, in the following format: <user>/<dataset_name> or <org>/<dataset_name>. Also accepts <dataset_name>, which will default to the namespace of the logged-in user.
- 
							private (bool, optional) — Whether the dataset repository should be set to private or not. Only affects repository creation: a repository that already exists will not be affected by that parameter.
- 
							config_name (str) — Configuration name of a dataset. Defaults to “default”.
- 
							token (str, optional) — An optional authentication token for the Hugging Face Hub. If no token is passed, will default to the token saved locally when logging in with huggingface-cli login. Will raise an error if no token is passed and the user is not logged in.
- 
							branch (str, optional) — The git branch on which to push the dataset.
- 
							max_shard_size (int or str, optional, defaults to "500MB") — The maximum size of the dataset shards to be uploaded to the hub. If expressed as a string, needs to be digits followed by a unit (like "500MB" or "1GB").
- 
							num_shards (Dict[str, int], optional) — Number of shards to write. By default the number of shards depends on max_shard_size. Use a dictionary to define a different num_shards for each split. Added in 2.8.0.
- 
							embed_external_files (bool, defaults to True) — Whether to embed file bytes in the shards. In particular, this will do the following before the push for the fields of type:
Pushes the DatasetDict to the hub as a Parquet dataset. The DatasetDict is pushed using HTTP requests and does not need either git or git-lfs to be installed.
Each dataset split will be pushed independently. The pushed dataset will keep the original split names.
The resulting Parquet files are self-contained by default: if your dataset contains Image or Audio
data, the Parquet files will store the bytes of your images or audio files.
You can disable this by setting embed_external_files to False.
Example:
>>> dataset_dict.push_to_hub("<organization>/<dataset_id>")
>>> dataset_dict.push_to_hub("<organization>/<dataset_id>", private=True)
>>> dataset_dict.push_to_hub("<organization>/<dataset_id>", max_shard_size="1GB")
>>> dataset_dict.push_to_hub("<organization>/<dataset_id>", num_shards={"train": 1024, "test": 8})
save_to_disk
< source >( dataset_dict_path: typing.Union[str, bytes, os.PathLike] fs = 'deprecated' max_shard_size: typing.Union[str, int, NoneType] = None num_shards: typing.Union[typing.Dict[str, int], NoneType] = None num_proc: typing.Optional[int] = None storage_options: typing.Optional[dict] = None )
Parameters
- 
							dataset_dict_path (str) — Path (e.g. dataset/train) or remote URI (e.g. s3://my-bucket/dataset/train) of the dataset dict directory where the dataset dict will be saved to.
- 
							fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem where the dataset will be saved to. Deprecated in 2.8.0: fs was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- 
							max_shard_size (int or str, optional, defaults to "500MB") — The maximum size of the dataset shards to be saved to the filesystem. If expressed as a string, needs to be digits followed by a unit (like "50MB").
- 
							num_shards (Dict[str, int], optional) — Number of shards to write. By default the number of shards depends on max_shard_size and num_proc. You need to provide the number of shards for each dataset in the dataset dictionary. Use a dictionary to define a different num_shards for each split. Added in 2.8.0.
- 
							num_proc (int, optional, defaults to None) — Number of processes when downloading and generating the dataset locally. Multiprocessing is disabled by default. Added in 2.8.0.
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.8.0.
Saves a dataset dict to a filesystem using fsspec.spec.AbstractFileSystem.
All the Image() and Audio() data are stored in the arrow files. If you want to store paths or urls, please use the Value("string") type.
load_from_disk
< source >( dataset_dict_path: typing.Union[str, bytes, os.PathLike] fs = 'deprecated' keep_in_memory: typing.Optional[bool] = None storage_options: typing.Optional[dict] = None )
Parameters
- 
							dataset_dict_path (str) — Path (e.g. "dataset/train") or remote URI (e.g. "s3://my-bucket/dataset/train") of the dataset dict directory where the dataset dict will be loaded from.
- 
							fs (fsspec.spec.AbstractFileSystem, optional) — Instance of the remote filesystem where the dataset will be loaded from. Deprecated in 2.8.0: fs was deprecated in version 2.8.0 and will be removed in 3.0.0. Please use storage_options instead, e.g. storage_options=fs.storage_options.
- 
							keep_in_memory (bool, defaults to None) — Whether to copy the dataset in-memory. If None, the dataset will not be copied in-memory unless explicitly enabled by setting datasets.config.IN_MEMORY_MAX_SIZE to nonzero. See more details in the improve performance section.
- 
							storage_options (dict, optional) — Key/value pairs to be passed on to the file-system backend, if any. Added in 2.8.0.
Load a dataset that was previously saved using save_to_disk from a filesystem using fsspec.spec.AbstractFileSystem.
from_csv
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )
Parameters
- 
							path_or_paths (dict of path-like) — Path(s) of the CSV file(s).
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to pandas.read_csv.
Create DatasetDict from CSV file(s).
from_json
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )
Parameters
- 
							path_or_paths (dict of path-like) — Path(s) of the JSON Lines file(s).
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to JsonConfig.
Create DatasetDict from JSON Lines file(s).
from_parquet
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False columns: typing.Optional[typing.List[str]] = None **kwargs )
Parameters
- 
							path_or_paths (dict of path-like) — Path(s) of the Parquet file(s).
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							columns (List[str], optional) — If not None, only these columns will be read from the file. A column name may be a prefix of a nested field, e.g. 'a' will select 'a.b', 'a.c', and 'a.d.e'.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to ParquetConfig.
Create DatasetDict from Parquet file(s).
from_text
< source >( path_or_paths: typing.Dict[str, typing.Union[str, bytes, os.PathLike]] features: typing.Optional[datasets.features.features.Features] = None cache_dir: str = None keep_in_memory: bool = False **kwargs )
Parameters
- 
							path_or_paths (dict of path-like) — Path(s) of the text file(s).
- features (Features, optional) — Dataset features.
- 
							cache_dir (str, optional, defaults to "~/.cache/huggingface/datasets") — Directory to cache data.
- 
							keep_in_memory (bool, defaults to False) — Whether to copy the data in-memory.
- 
							**kwargs (additional keyword arguments) —
Keyword arguments to be passed to TextConfig.
Create DatasetDict from text file(s).
prepare_for_task
< source >( task: typing.Union[str, datasets.tasks.base.TaskTemplate] id: int = 0 )
Parameters
- 
							task (Union[str, TaskTemplate]) — The task to prepare the dataset for during training and evaluation. If str, supported tasks include:
- "text-classification"
- "question-answering"
 If TaskTemplate, must be one of the task templates in datasets.tasks.
- 
							id (int, defaults to 0) — The id required to unambiguously identify the task template when multiple task templates of the same type are supported.
Prepare a dataset for the given task by casting the dataset’s Features to standardized column names and types as detailed in datasets.tasks.
Casts datasets.DatasetInfo.features according to a task-specific schema. Intended for single-use only, so all task templates are removed from datasets.DatasetInfo.task_templates after casting.
IterableDataset
The base class IterableDataset implements an iterable Dataset backed by Python generators.
class datasets.IterableDataset
< source >( ex_iterable: _BaseExamplesIterable info: typing.Optional[datasets.info.DatasetInfo] = None split: typing.Optional[datasets.splits.NamedSplit] = None formatting: typing.Optional[datasets.iterable_dataset.FormattingConfig] = None shuffling: typing.Optional[datasets.iterable_dataset.ShufflingConfig] = None distributed: typing.Optional[datasets.iterable_dataset.DistributedConfig] = None token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None format_type = 'deprecated' )
A Dataset backed by an iterable.
from_generator
< source >( generator: typing.Callable features: typing.Optional[datasets.features.features.Features] = None gen_kwargs: typing.Optional[dict] = None ) → IterableDataset
Parameters
- 
							generator (Callable) — A generator function that yields examples.
- 
							features (Features, optional) — Dataset features.
- 
							gen_kwargs (dict, optional) — Keyword arguments to be passed to the generator callable. You can define a sharded iterable dataset by passing the list of shards in gen_kwargs. This can be used to improve shuffling and when iterating over the dataset with multiple workers.
Returns
IterableDataset
Create an Iterable Dataset from a generator.
Example:
>>> from datasets import IterableDataset
>>> def gen():
...     yield {"text": "Good", "label": 0}
...     yield {"text": "Bad", "label": 1}
...
>>> ds = IterableDataset.from_generator(gen)

>>> def gen(shards):
...     for shard in shards:
...         with open(shard) as f:
...             for line in f:
...                 yield {"line": line}
...
>>> shards = [f"data{i}.txt" for i in range(32)]
>>> ds = IterableDataset.from_generator(gen, gen_kwargs={"shards": shards})
>>> ds = ds.shuffle(seed=42, buffer_size=10_000)  # shuffles the shards order + uses a shuffle buffer
>>> from torch.utils.data import DataLoader
>>> dataloader = DataLoader(ds.with_format("torch"), num_workers=4)  # give each worker a subset of 32/4=8 shards
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDataset
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> next(iter(ds))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
>>> ds = ds.remove_columns("label")
>>> next(iter(ds))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
select_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDataset
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> next(iter(ds))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .', 'label': 1}
>>> ds = ds.select_columns("text")
>>> next(iter(ds))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
cast_column
< source >( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] ) → IterableDataset
Cast column to feature for decoding.
Example:
>>> from datasets import load_dataset, Audio
>>> ds = load_dataset("PolyAI/minds14", name="en-US", split="train", streaming=True)
>>> ds.features
{'audio': Audio(sampling_rate=8000, mono=True, decode=True, id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan',  'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR',  'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None),
 'path': Value(dtype='string', id=None),
 'transcription': Value(dtype='string', id=None)}
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16000))
>>> ds.features
{'audio': Audio(sampling_rate=16000, mono=True, decode=True, id=None),
 'english_transcription': Value(dtype='string', id=None),
 'intent_class': ClassLabel(num_classes=14, names=['abroad', 'address', 'app_error', 'atm_limit', 'balance', 'business_loan',  'card_issues', 'cash_deposit', 'direct_debit', 'freeze', 'high_value_payment', 'joint_account', 'latest_transactions', 'pay_bill'], id=None),
 'lang_id': ClassLabel(num_classes=14, names=['cs-CZ', 'de-DE', 'en-AU', 'en-GB', 'en-US', 'es-ES', 'fr-FR', 'it-IT', 'ko-KR',  'nl-NL', 'pl-PL', 'pt-PT', 'ru-RU', 'zh-CN'], id=None),
 'path': Value(dtype='string', id=None),
 'transcription': Value(dtype='string', id=None)}
cast
< source >( features: Features ) → IterableDataset
Parameters
- 
							features (Features) —
New features to cast the dataset to.
The name of the fields in the features must match the current column names.
The type of the data must also be convertible from one type to the other.
For non-trivial conversion, e.g. string <-> ClassLabel, you should use map() to update the Dataset.
Returns
IterableDataset
A copy of the dataset with cast features.
Cast the dataset to a new set of features.
Example:
>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds.features.copy()
>>> new_features["label"] = ClassLabel(names=["bad", "good"])
>>> new_features["text"] = Value("large_string")
>>> ds = ds.cast(new_features)
>>> ds.features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}
iter
< source >( batch_size: int drop_last_batch: bool = False )
Iterate through the batches of size batch_size.
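A pure-Python sketch of the batching behaviour, assuming each batch is a dictionary mapping column names to lists of values (the batch layout described for map with batched=True). The function name iter_batches is hypothetical and this is not the library's implementation.

```python
from typing import Any, Dict, Iterable, Iterator, List


def iter_batches(
    examples: Iterable[Dict[str, Any]], batch_size: int, drop_last_batch: bool = False
) -> Iterator[Dict[str, List[Any]]]:
    """Hypothetical helper: group examples into dict-of-lists batches."""
    rows: List[Dict[str, Any]] = []
    for example in examples:
        rows.append(example)
        if len(rows) == batch_size:
            yield {key: [row[key] for row in rows] for key in rows[0]}
            rows = []
    # The remainder forms a smaller final batch unless it is dropped.
    if rows and not drop_last_batch:
        yield {key: [row[key] for row in rows] for key in rows[0]}


examples = [{"text": f"review {i}", "label": i % 2} for i in range(5)]
batches = list(iter_batches(examples, batch_size=2))
print([len(batch["text"]) for batch in batches])  # [2, 2, 1]
print(len(list(iter_batches(examples, batch_size=2, drop_last_batch=True))))  # 2
```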
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None features: typing.Optional[datasets.features.features.Features] = None fn_kwargs: typing.Optional[dict] = None )
Parameters
- 
							function (Callable, optional, defaults to None) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
- function(example: Dict[str, Any]) -> Dict[str, Any] if batched=False and with_indices=False
- function(example: Dict[str, Any], idx: int) -> Dict[str, Any] if batched=False and with_indices=True
- function(batch: Dict[str, List]) -> Dict[str, List] if batched=True and with_indices=False
- function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List] if batched=True and with_indices=True
 For advanced usage, the function can also return a pyarrow.Table. Moreover if your function returns nothing (None), then map will run your function and return the dataset unchanged. If no function is provided, default to identity function: lambda x: x.
- 
							with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ....
- 
							input_columns (Optional[Union[str, List[str]]], defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- 
							batched (bool, defaults to False) — Provide batch of examples to function.
- 
							batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True. If batch_size <= 0 or batch_size == None, the full dataset is provided as a single batch to function.
- 
							drop_last_batch (bool, defaults to False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- 
							remove_columns (List[str], optional, defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- 
							features (Features, optional, defaults to None) — Feature types of the resulting dataset.
- 
							fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}.
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.
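The batched case can be illustrated with a small pure-Python sketch. It mimics the semantics described above (dict-of-lists batches, returned columns overwriting existing ones, a batch function that may return a different number of examples) but is not the library's implementation; batched_map and split_words are hypothetical names, and the sketch assumes all columns have the same length after the function output is merged in.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Batch = Dict[str, List[Any]]


def batched_map(
    examples: Iterable[Dict[str, Any]],
    function: Callable[[Batch], Batch],
    batch_size: int,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: lazily apply `function` to dict-of-lists batches."""

    def apply(rows: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        batch = {key: [row[key] for row in rows] for key in rows[0]}
        # Columns returned by the function overwrite columns of the same name.
        merged = {**batch, **function(batch)}
        # Assumption: after merging, every column has the same length.
        n_out = len(next(iter(merged.values())))
        for i in range(n_out):
            yield {key: values[i] for key, values in merged.items()}

    rows: List[Dict[str, Any]] = []
    for example in examples:
        rows.append(example)
        if len(rows) == batch_size:
            yield from apply(rows)
            rows = []
    if rows:  # the last batch may hold fewer than batch_size examples
        yield from apply(rows)


def split_words(batch: Batch) -> Batch:
    # Returns more examples than it receives: one per word.
    return {"text": [word for text in batch["text"] for word in text.split()]}


reviews = [{"text": "Hello there !"}, {"text": "Good morning"}, {"text": "Bye"}]
words = [row["text"] for row in batched_map(reviews, split_words, batch_size=2)]
print(words)  # ['Hello', 'there', '!', 'Good', 'morning', 'Bye']
```

Nothing is computed until the generator is consumed, which mirrors the on-the-fly behaviour of map on an iterable dataset.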
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> list(ds.take(3))
[{'label': 1,
 'text': 'Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
 'text': 'Review: the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'},
 {'label': 1, 'text': 'Review: effective but too-tepid biopic'}]
rename_column
< source >( original_column_name: str new_column_name: str ) → IterableDataset
Rename a column in the dataset, and move the features associated to the original column under the new column name.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> next(iter(ds))
{'label': 1,
 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
>>> ds = ds.rename_column("text", "movie_review")
>>> next(iter(ds))
{'label': 1,
 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
filter
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None )
Parameters
- 
							function (Callable) — Callable with one of the following signatures:
- function(example: Dict[str, Any]) -> bool if with_indices=False, batched=False
- function(example: Dict[str, Any], indices: int) -> bool if with_indices=True, batched=False
- function(example: Dict[str, List]) -> List[bool] if with_indices=False, batched=True
- function(example: Dict[str, List], indices: List[int]) -> List[bool] if with_indices=True, batched=True
 If no function is provided, defaults to an always True function: lambda x: True.
- 
							with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ....
- 
							input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- 
							batched (bool, defaults to False) — Provide batch of examples to function.
- 
							batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True.
- 
							fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
Apply a filter function to all the elements so that the dataset only includes the examples for which the function returns True. The filtering is done on-the-fly when iterating over the dataset.
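As a rough illustration of on-the-fly filtering, the following pure-Python sketch covers the non-batched signatures only. It is not the library's implementation, and lazy_filter is a hypothetical name.

```python
from typing import Any, Callable, Dict, Iterable, Iterator


def lazy_filter(
    examples: Iterable[Dict[str, Any]],
    function: Callable[..., bool],
    with_indices: bool = False,
) -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: yield only the examples for which `function` is true."""
    for idx, example in enumerate(examples):
        keep = function(example, idx) if with_indices else function(example)
        if keep:
            yield example


examples = [
    {"text": "a", "label": 0},
    {"text": "b", "label": 1},
    {"text": "c", "label": 0},
]
# Building the generator evaluates nothing; the predicate runs during iteration.
negatives = lazy_filter(examples, lambda x: x["label"] == 0)
print([x["text"] for x in negatives])  # ['a', 'c']
later_rows = lazy_filter(examples, lambda x, i: i >= 1, with_indices=True)
print([x["text"] for x in later_rows])  # ['b', 'c']
```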
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> ds = ds.filter(lambda x: x["label"] == 0)
>>> list(ds.take(3))
[{'label': 0, 'text': 'simplistic , silly and tedious .'},
 {'label': 0,
 'text': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."},
 {'label': 0,
 'text': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}]
shuffle
< source >( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )
Parameters
- 
							seed (int, optional, defaults to None) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
- 
							generator (numpy.random.Generator, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- 
							buffer_size (int, defaults to 1000) — Size of the buffer.
Randomly shuffles the elements of this dataset.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer,
replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or
equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size is set to 1000, then shuffle will
initially select a random element from only the first 1000 elements in the buffer. Once an element is
selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element,
maintaining the 1000 element buffer.
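The buffering scheme described above can be sketched in plain Python. This is an illustration, not the library's code: buffer_shuffle is a hypothetical name, Python's random module stands in for the NumPy generator, and the way the remaining buffer is drained at the end is an assumption of the sketch.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffer_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Hypothetical helper: approximate shuffling with a fixed-size buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        # Pick a random slot, emit its element, and refill the slot with the new item.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Assumption: once the input is exhausted, the rest is emitted in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10_000), buffer_size=1000, seed=42))
print(shuffled[0] < 1000)  # True: the first pick comes from the first 1000 elements
print(sorted(shuffled) == list(range(10_000)))  # True: every element appears exactly once
```

With a buffer of size 1 the sketch returns the input order unchanged, which shows why small buffers give only a local shuffle.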
If the dataset is made of several shards, the order of the shards is also shuffled. However, if the order has been fixed by using skip() or take(), the order of the shards is kept unchanged.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> list(ds.take(3))
[{'label': 1,
 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'}]
>>> shuffled_ds = ds.shuffle(seed=42)
>>> list(shuffled_ds.take(3))
[{'label': 1,
 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."},
 {'label': 1,
 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'},
 {'label': 1,
 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}]
skip
Create a new IterableDataset that skips the first n elements.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> list(ds.take(3))
[{'label': 1,
 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'}]
>>> ds = ds.skip(1)
>>> list(ds.take(3))
[{'label': 1,
 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'},
 {'label': 1,
 'text': 'if you sometimes like to go to the movies to have fun , wasabi is a good place to start .'}]
take
Create a new IterableDataset with only the first n elements.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", split="train", streaming=True)
>>> small_ds = ds.take(2)
>>> list(small_ds)
[{'label': 1,
 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'}]
DatasetInfo object containing all the metadata in the dataset.
NamedSplit object corresponding to a named dataset split.
IterableDatasetDict
Dictionary with split names as keys (‘train’, ‘test’ for example), and IterableDataset objects as values.
map
< source >( function: typing.Optional[typing.Callable] = None with_indices: bool = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: int = 1000 drop_last_batch: bool = False remove_columns: typing.Union[str, typing.List[str], NoneType] = None fn_kwargs: typing.Optional[dict] = None )
Parameters
- function (Callable, optional, defaults to None) — Function applied on-the-fly on the examples when you iterate on the dataset. It must have one of the following signatures:
  - function(example: Dict[str, Any]) -> Dict[str, Any] if batched=False and with_indices=False
  - function(example: Dict[str, Any], idx: int) -> Dict[str, Any] if batched=False and with_indices=True
  - function(batch: Dict[str, List]) -> Dict[str, List] if batched=True and with_indices=False
  - function(batch: Dict[str, List], indices: List[int]) -> Dict[str, List] if batched=True and with_indices=True
  For advanced usage, the function can also return a pyarrow.Table. Moreover, if your function returns nothing (None), then map will run your function and return the dataset unchanged. If no function is provided, it defaults to the identity function: lambda x: x.
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx[, rank]): ....
- input_columns ([Union[str, List[str]]], optional, defaults to None) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True.
- drop_last_batch (bool, defaults to False) — Whether a last batch smaller than the batch_size should be dropped instead of being processed by the function.
- remove_columns ([List[str]], optional, defaults to None) — Remove a selection of columns while doing the mapping. Columns will be removed before updating the examples with the output of function, i.e. if function is adding columns with names in remove_columns, these columns will be kept.
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
Apply a function to all the examples in the iterable dataset (individually or in batches) and update them. If your function returns a column that already exists, then it overwrites it. The function is applied on-the-fly on the examples when iterating over the dataset. The transformation is applied to all the datasets of the dataset dictionary.
You can specify whether the function should be batched or not with the batched parameter:
- If batched is False, then the function takes 1 example in and should return 1 example. An example is a dictionary, e.g. {"text": "Hello there !"}.
- If batched is True and batch_size is 1, then the function takes a batch of 1 example as input and can return a batch with 1 or more examples. A batch is a dictionary, e.g. a batch of 1 example is {"text": ["Hello there !"]}.
- If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have fewer than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.
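To make the batched signature more concrete, here is a small sketch of a function that could be passed as map(function, batched=True). The name split_into_words and the sample batch are made up for illustration, and the function is called directly rather than through the library, so it only shows the dict-of-lists shape of a batch going in and out.

```python
def split_into_words(batch):
    """Batched function: takes a dict of lists and returns a dict of lists.

    The output may hold a different number of examples than the input,
    as long as every returned column has the same length.
    """
    words, labels = [], []
    for text, label in zip(batch["text"], batch["label"]):
        for word in text.split():
            words.append(word)
            labels.append(label)  # repeat the label for each new example
    return {"text": words, "label": labels}


# A batch of 2 examples, in the dict-of-lists layout described above.
batch = {"text": ["Hello there !", "Hi"], "label": [1, 0]}
out = split_into_words(batch)
print(out)
```

Returning every column with the same new length keeps the batch consistent when the number of examples changes.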
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> def add_prefix(example):
...     example["text"] = "Review: " + example["text"]
...     return example
>>> ds = ds.map(add_prefix)
>>> next(iter(ds["train"]))
{'label': 1,
 'text': 'Review: the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
filter
< source >( function: typing.Optional[typing.Callable] = None with_indices = False input_columns: typing.Union[str, typing.List[str], NoneType] = None batched: bool = False batch_size: typing.Optional[int] = 1000 fn_kwargs: typing.Optional[dict] = None )
Parameters
- function (Callable) — Callable with one of the following signatures:
  - function(example: Dict[str, Any]) -> bool if with_indices=False, batched=False
  - function(example: Dict[str, Any], indices: int) -> bool if with_indices=True, batched=False
  - function(example: Dict[str, List]) -> List[bool] if with_indices=False, batched=True
  - function(example: Dict[str, List], indices: List[int]) -> List[bool] if with_indices=True, batched=True
  If no function is provided, defaults to an always True function: lambda x: True.
- with_indices (bool, defaults to False) — Provide example indices to function. Note that in this case the signature of function should be def function(example, idx): ....
- input_columns (str or List[str], optional) — The columns to be passed into function as positional arguments. If None, a dict mapping to all formatted columns is passed as one argument.
- batched (bool, defaults to False) — Provide batch of examples to function.
- batch_size (int, optional, defaults to 1000) — Number of examples per batch provided to function if batched=True.
- fn_kwargs (Dict, optional, defaults to None) — Keyword arguments to be passed to function.
Apply a filter function to all the elements so that the dataset only includes examples according to the filter function. The filtering is done on-the-fly when iterating over the dataset. The filtering is applied to all the datasets of the dataset dictionary.
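As a rough illustration of the batched filter signature listed in the parameters (a dict of lists in, one boolean per example out), the sketch below calls a hypothetical keep_negative function directly on a hand-written batch. The function name and the sample data are assumptions for illustration only; in practice such a function would be passed as filter(function, batched=True).

```python
def keep_negative(batch):
    """Batched filter function: return one boolean per example in the batch."""
    return [label == 0 for label in batch["label"]]


# Hand-written batch in the dict-of-lists layout.
batch = {"label": [1, 0, 0], "text": ["good", "bad", "dull"]}
mask = keep_negative(batch)

# Show which examples the mask would keep.
kept = [text for text, keep in zip(batch["text"], mask) if keep]
print(mask, kept)
```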
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.filter(lambda x: x["label"] == 0)
>>> list(ds["train"].take(3))
[{'label': 0, 'text': 'simplistic , silly and tedious .'},
 {'label': 0,
 'text': "it's so laddish and juvenile , only teenage boys could possibly find it funny ."},
 {'label': 0,
 'text': 'exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .'}]
shuffle
< source >( seed = None generator: typing.Optional[numpy.random._generator.Generator] = None buffer_size: int = 1000 )
Parameters
- seed (int, optional, defaults to None) — Random seed that will be used to shuffle the dataset. It is used to sample from the shuffle buffer and also to shuffle the data shards.
- generator (numpy.random.Generator, optional) — NumPy random Generator to use to compute the permutation of the dataset rows. If generator=None (default), uses np.random.default_rng (the default BitGenerator (PCG64) of NumPy).
- buffer_size (int, defaults to 1000) — Size of the buffer.
Randomly shuffles the elements of this dataset. The shuffling is applied to all the datasets of the dataset dictionary.
This dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
For instance, if your dataset contains 10,000 elements but buffer_size is set to 1000, then shuffle will
initially select a random element from only the first 1000 elements in the buffer. Once an element is
selected, its space in the buffer is replaced by the next (i.e. 1,001-st) element,
maintaining the 1000 element buffer.
If the dataset is made of several shards, it also does shuffle the order of the shards.
However if the order has been fixed by using skip() or take()
then the order of the shards is kept unchanged.
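The buffer mechanism described above can be sketched in pure Python. This is only an illustration of the idea under stated assumptions, not the library's implementation: buffer_shuffle is a hypothetical helper, it uses the standard random module rather than a NumPy Generator, and it ignores shard shuffling.

```python
import random


def buffer_shuffle(iterable, buffer_size, seed=None):
    """Approximate shuffling of a stream using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer with the first buffer_size elements.
            buffer.append(item)
            continue
        # Pick a random slot, emit its element, and refill it with the new one.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The stream is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffer_shuffle(range(10_000), buffer_size=1000, seed=42))
print(shuffled[0] < 1000)  # the first element always comes from the first 1000
```

With a buffer smaller than the dataset, early outputs can only come from early inputs, which is why a buffer at least as large as the dataset is needed for a perfect shuffle.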
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> list(ds["train"].take(3))
[{'label': 1,
 'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'},
 {'label': 1,
 'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .'},
 {'label': 1, 'text': 'effective but too-tepid biopic'}]
>>> ds = ds.shuffle(seed=42)
>>> list(ds["train"].take(3))
[{'label': 1,
 'text': "a sports movie with action that's exciting on the field and a story you care about off it ."},
 {'label': 1,
 'text': 'at its best , the good girl is a refreshingly adult take on adultery . . .'},
 {'label': 1,
 'text': "sam jones became a very lucky filmmaker the day wilco got dropped from their record label , proving that one man's ruin may be another's fortune ."}]
with_format
< source >( type: typing.Optional[str] = None )
Return a dataset with the specified format. This method only supports the “torch” format for now. The format is set to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> def encode(examples):
...     return tokenizer(examples["text"], truncation=True, padding="max_length")
>>> ds = ds.map(encode, batched=True, remove_columns=["text"])
>>> ds = ds.with_format("torch")
cast
< source >( features: Features ) → IterableDatasetDict
Parameters
- features (Features) — New features to cast the dataset to. The name of the fields in the features must match the current column names. The type of the data must also be convertible from one type to the other. For non-trivial conversion, e.g. string <-> ClassLabel, you should use map to update the Dataset.
Returns
A copy of the dataset with cast features.
Cast the dataset to a new set of features. The type casting is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset, ClassLabel, Value
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> new_features = ds["train"].features.copy()
>>> new_features['label'] = ClassLabel(names=['bad', 'good'])
>>> new_features['text'] = Value('large_string')
>>> ds = ds.cast(new_features)
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='large_string', id=None)}
cast_column
< source >( column: str feature: typing.Union[dict, list, tuple, datasets.features.features.Value, datasets.features.features.ClassLabel, datasets.features.translation.Translation, datasets.features.translation.TranslationVariableLanguages, datasets.features.features.Sequence, datasets.features.features.Array2D, datasets.features.features.Array3D, datasets.features.features.Array4D, datasets.features.features.Array5D, datasets.features.audio.Audio, datasets.features.image.Image] )
Cast column to feature for decoding. The type casting is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset, ClassLabel
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
 'text': Value(dtype='string', id=None)}
>>> ds = ds.cast_column('label', ClassLabel(names=['bad', 'good']))
>>> ds["train"].features
{'label': ClassLabel(num_classes=2, names=['bad', 'good'], id=None),
 'text': Value(dtype='string', id=None)}
remove_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDatasetDict
Parameters
Returns
A copy of the dataset object without the columns to remove.
Remove one or several column(s) in the dataset and the features associated to them. The removal is done on-the-fly on the examples when iterating over the dataset. The removal is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.remove_columns("label")
>>> next(iter(ds["train"]))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
rename_column
< source >( original_column_name: str new_column_name: str ) → IterableDatasetDict
Rename a column in the dataset, and move the features associated to the original column under the new column name. The renaming is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.rename_column("text", "movie_review")
>>> next(iter(ds["train"]))
{'label': 1,
 'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
rename_columns
< source >( column_mapping: typing.Dict[str, str] ) → IterableDatasetDict
Parameters
Returns
A copy of the dataset with renamed columns
Rename several columns in the dataset, and move the features associated to the original columns under the new column names. The renaming is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.rename_columns({"text": "movie_review", "label": "rating"})
>>> next(iter(ds["train"]))
{'movie_review': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .',
 'rating': 1}
select_columns
< source >( column_names: typing.Union[str, typing.List[str]] ) → IterableDatasetDict
Parameters
Returns
A copy of the dataset object with only selected columns.
Select one or several column(s) in the dataset and the features associated to them. The selection is done on-the-fly on the examples when iterating over the dataset. The selection is applied to all the datasets of the dataset dictionary.
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("rotten_tomatoes", streaming=True)
>>> ds = ds.select_columns("text")
>>> next(iter(ds["train"]))
{'text': 'the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .'}
Features
A special dictionary that defines the internal structure of a dataset.
Instantiated with a dictionary of type dict[str, FieldType], where keys are the desired column names,
and values are the type of that column.
FieldType can be one of the following:
- a Value feature specifies a single typed value, e.g. int64 or string.
- a ClassLabel feature specifies a field with a predefined set of classes which can have labels associated to them and will be stored as integers in the dataset.
- a python dict which specifies that the field is a nested field containing a mapping of sub-fields to sub-fields features. It’s possible to have nested fields of nested fields in an arbitrary manner.
- a python list or a Sequence specifies that the field contains a list of objects. The python list or Sequence should be provided with a single sub-feature as an example of the feature type hosted in this list.
  A Sequence with an internal dictionary feature will be automatically converted into a dictionary of lists. This behavior is implemented to have a compatibility layer with the TensorFlow Datasets library but may be unwanted in some cases. If you don’t want this behavior, you can use a python list instead of the Sequence.
- an Array2D, Array3D, Array4D or Array5D feature for multidimensional arrays.
- an Audio feature to store the absolute path to an audio file or a dictionary with the relative path to an audio file (“path” key) and its bytes content (“bytes” key). This feature extracts the audio data.
- an Image feature to store the absolute path to an image file, an np.ndarray object, a PIL.Image.Image object or a dictionary with the relative path to an image file (“path” key) and its bytes content (“bytes” key). This feature extracts the image data.
- Translation and TranslationVariableLanguages, the two features specific to Machine Translation.
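The conversion mentioned for a Sequence with an internal dictionary feature amounts to turning a list of dictionaries into a dictionary of lists. A minimal pure-Python sketch of that layout change follows; the helper name to_dict_of_lists and the sample rows are hypothetical, and this is not the library's code.

```python
def to_dict_of_lists(rows):
    """Turn a list of dicts (one per item) into a dict of lists (one per key)."""
    if not rows:
        return {}
    # Assumes every row has the same keys as the first one.
    return {key: [row[key] for row in rows] for key in rows[0]}


rows = [{"text": "a", "upvotes": 3}, {"text": "b", "upvotes": 5}]
columns = to_dict_of_lists(rows)
print(columns)
```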
Make a deep copy of Features.
decode_batch
< source >( batch: dict token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None )
Decode batch with custom feature decoding.
decode_column
< source >( column: list column_name: str )
Decode column with custom feature decoding.
decode_example
< source >( example: dict token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None )
Decode example with custom feature decoding.
encode_batch
< source >( batch )
Encode batch into a format for Arrow.
encode_column
< source >( column column_name: str )
Encode column into a format for Arrow.
Encode example into a format for Arrow.
Flatten the features. Every dictionary column is removed and is replaced by
all the subfields it contains. The new fields are named by concatenating the
name of the original column and the subfield name like this: <original>.<subfield>.
If a column contains nested dictionaries, then all the lower-level subfields names are
also concatenated to form new columns: <original>.<subfield>.<subsubfield>, etc.
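The naming rule can be illustrated on plain nested dictionaries. The sketch below is an assumption-laden stand-in (flatten_names is a hypothetical helper working on ordinary dicts, with strings standing in for feature types), not the Features.flatten implementation.

```python
def flatten_names(nested, prefix=""):
    """Replace nested dicts with dotted column names: <original>.<subfield>..."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so deeper levels keep extending the dotted name.
            flat.update(flatten_names(value, name))
        else:
            flat[name] = value
    return flat


nested = {"id": "string", "answers": {"text": "string", "meta": {"start": "int32"}}}
flat = flatten_names(nested)
print(flat)
```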
Example:
>>> from datasets import load_dataset
>>> ds = load_dataset("squad", split="train")
>>> ds.features.flatten()
{'answers.answer_start': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'answers.text': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None),
 'context': Value(dtype='string', id=None),
 'id': Value(dtype='string', id=None),
 'question': Value(dtype='string', id=None),
 'title': Value(dtype='string', id=None)}
from_arrow_schema
< source >( pa_schema: Schema )
Construct Features from Arrow Schema. It also checks the schema metadata for Hugging Face Datasets features. Non-nullable fields are not supported and set to nullable.
from_dict
< source >( dic ) → Features
Construct Features from dict.
Regenerate the nested feature object from a deserialized dict. We use the _type key to infer the dataclass name of the feature FieldType.
It allows for a convenient constructor syntax to define features from deserialized JSON dictionaries. This function is used in particular when deserializing a DatasetInfo that was dumped to a JSON object. This acts as an analogue to Features.from_arrow_schema and handles the recursive field-by-field instantiation, but doesn’t require any mapping to/from pyarrow, except for the fact that it takes advantage of the mapping of pyarrow primitive dtypes that Value automatically performs.
reorder_fields_as
< source >( other: Features )
Reorder Features fields to match the field order of other Features.
The order of the fields is important since it matters for the underlying arrow data. Re-ordering the fields allows to make the underlying arrow data type match.
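The idea of matching field order can be shown on plain nested dictionaries, which preserve insertion order in Python. The helper reorder_as below is hypothetical and only handles nested dicts; it is a sketch of the concept rather than the library's reorder_fields_as, which also deals with Sequence and list features.

```python
def reorder_as(source, other):
    """Return source with its (nested) keys in the same order as other."""
    if isinstance(source, dict) and isinstance(other, dict):
        if set(source) != set(other):
            raise ValueError("both mappings must have the same keys")
        # Iterate over other so the result follows its key order.
        return {key: reorder_as(source[key], other[key]) for key in other}
    return source


a = {"root": {"a": "string", "b": "string"}, "id": "int64"}
b = {"id": "int64", "root": {"b": "string", "a": "string"}}
reordered = reorder_as(a, b)
print(list(reordered), list(reordered["root"]))
```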
Example:
>>> from datasets import Features, Sequence, Value
>>> # let's say we have two features with a different order of nested fields (for a and b for example)
>>> f1 = Features({"root": Sequence({"a": Value("string"), "b": Value("string")})})
>>> f2 = Features({"root": {"b": Sequence(Value("string")), "a": Sequence(Value("string"))}})
>>> assert f1.type != f2.type
>>> # re-ordering keeps the base structure (here Sequence is defined at the root level), but makes the field order match
>>> f1.reorder_fields_as(f2)
{'root': Sequence(feature={'b': Value(dtype='string', id=None), 'a': Value(dtype='string', id=None)}, length=-1, id=None)}
>>> assert f1.reorder_fields_as(f2).type == f2.type
class datasets.Sequence
< source >( feature: typing.Any length: int = -1 id: typing.Optional[str] = None )
Construct a list of features from a single type or a dict of types. Mostly here for compatibility with tfds.
Example:
>>> from datasets import Features, Sequence, Value, ClassLabel
>>> features = Features({'post': Sequence(feature={'text': Value(dtype='string'), 'upvotes': Value(dtype='int32'), 'label': ClassLabel(num_classes=2, names=['hot', 'cold'])})})
>>> features
{'post': Sequence(feature={'text': Value(dtype='string', id=None), 'upvotes': Value(dtype='int32', id=None), 'label': ClassLabel(num_classes=2, names=['hot', 'cold'], id=None)}, length=-1, id=None)}
class datasets.ClassLabel
< source >( num_classes: dataclasses.InitVar[typing.Optional[int]] = None names: typing.List[str] = None names_file: dataclasses.InitVar[typing.Optional[str]] = None id: typing.Optional[str] = None )
Parameters
Feature type for integer class labels.
There are 3 ways to define a ClassLabel, which correspond to the 3 arguments:
- num_classes: Create 0 to (num_classes-1) labels.
- names: List of label strings.
- names_file: File containing the list of labels.
Under the hood the labels are stored as integers. You can use negative integers to represent unknown/missing labels.
Example:
>>> from datasets import Features, ClassLabel
>>> features = Features({'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'])})
>>> features
{'label': ClassLabel(num_classes=3, names=['bad', 'ok', 'good'], id=None)}
cast_storage
< source >( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.IntegerArray] ) → pa.Int64Array
Cast an Arrow array to the ClassLabel arrow storage type.
The Arrow types that can be converted to the ClassLabel pyarrow storage type are:
- pa.string()
- pa.int()
Conversion integer => class name string.
Regarding unknown/missing labels: passing negative integers raises ValueError.
Conversion class name string => integer.
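The two conversions, and the rejection of negative integers stated above, can be pictured with a plain list of names. This is a pure-Python illustration with hypothetical helpers (int_to_name, name_to_int) and made-up label names; it is not the ClassLabel code itself.

```python
names = ["bad", "ok", "good"]  # hypothetical label names, stored as 0, 1, 2


def int_to_name(value):
    """Integer => class name string; out-of-range or negative values are rejected."""
    if not 0 <= value < len(names):
        raise ValueError(f"Invalid integer class label {value}")
    return names[value]


def name_to_int(name):
    """Class name string => integer."""
    if name not in names:
        raise ValueError(f"Invalid string class label {name}")
    return names.index(name)


print(int_to_name(2), name_to_int("ok"))
```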
The Value dtypes are as follows:
- null
- bool
- int8
- int16
- int32
- int64
- uint8
- uint16
- uint32
- uint64
- float16
- float32 (alias float)
- float64 (alias double)
- time32[(s|ms)]
- time64[(us|ns)]
- timestamp[(s|ms|us|ns)]
- timestamp[(s|ms|us|ns), tz=(tzstring)]
- date32
- date64
- duration[(s|ms|us|ns)]
- decimal128(precision, scale)
- decimal256(precision, scale)
- binary
- large_binary
- string
- large_string
class datasets.Translation
< source >( languages: typing.List[str] id: typing.Optional[str] = None )
FeatureConnector for translations with fixed languages per example.
Here for compatibility with tfds.
Example:
>>> # At construction time:
>>> datasets.features.Translation(languages=['en', 'fr', 'de'])
>>> # During data generation:
>>> yield {
...         'en': 'the cat',
...         'fr': 'le chat',
...         'de': 'die katze'
... }
Flatten the Translation feature into a dictionary.
class datasets.TranslationVariableLanguages
< source >( languages: typing.Optional[typing.List] = None num_languages: typing.Optional[int] = None id: typing.Optional[str] = None ) → language or translation (variable-length 1D tf.Tensor of tf.string)
Parameters
- languages (dict) — A dictionary for each example mapping string language codes to one or more string translations. The languages present may vary from example to example.
Returns
- language or translation (variable-length 1D tf.Tensor of tf.string)
Language codes sorted in ascending order or plain text translations, sorted to align with language codes.
FeatureConnector for translations with variable languages per example.
Here for compatibility with tfds.
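The return description above (language codes sorted in ascending order, translations aligned with them) can be pictured with a small pure-Python sketch. The helper flatten_translations is hypothetical and not the library's code; ordering several translations of the same language by their text is an assumption of this sketch.

```python
def flatten_translations(translation_dict):
    """Flatten {lang: text or list of texts} into aligned, sorted lists."""
    pairs = []
    for lang, text in translation_dict.items():
        if isinstance(text, str):
            pairs.append((lang, text))
        else:
            # One (language, translation) pair per alternative translation.
            pairs.extend((lang, t) for t in text)
    pairs.sort()  # sort by language code, then by text
    return {
        "language": [lang for lang, _ in pairs],
        "translation": [text for _, text in pairs],
    }


example = {"en": "the cat", "fr": ["le chat", "la chatte"], "de": "die katze"}
flat = flatten_translations(example)
print(flat)
```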
Example:
>>> # At construction time:
>>> datasets.features.TranslationVariableLanguages(languages=['en', 'fr', 'de'])
>>> # During data generation:
>>> yield {
...         'en': 'the cat',
...         'fr': ['le chat', 'la chatte'],
...         'de': 'die katze'
... }
>>> # Tensor returned :
>>> {
...         'language': ['de', 'en', 'fr', 'fr'],
...         'translation': ['die katze', 'the cat', 'la chatte', 'le chat'],
... }
Flatten the TranslationVariableLanguages feature into a dictionary.
class datasets.Array2D
< source >( shape: tuple dtype: str id: typing.Optional[str] = None )
Create a two-dimensional array.
class datasets.Array3D
< source >( shape: tuple dtype: str id: typing.Optional[str] = None )
Create a three-dimensional array.
class datasets.Array4D
< source >( shape: tuple dtype: str id: typing.Optional[str] = None )
Create a four-dimensional array.
class datasets.Array5D
< source >( shape: tuple dtype: str id: typing.Optional[str] = None )
Create a five-dimensional array.
class datasets.Audio
< source >( sampling_rate: typing.Optional[int] = None mono: bool = True decode: bool = True id: typing.Optional[str] = None )
Parameters
- sampling_rate (int, optional) — Target sampling rate. If None, the native sampling rate is used.
- mono (bool, defaults to True) — Whether to convert the audio signal to mono by averaging samples across channels.
- decode (bool, defaults to True) — Whether to decode the audio data. If False, returns the underlying dictionary in the format {"path": audio_path, "bytes": audio_bytes}.
Audio Feature to extract audio data from an audio file.
Input: The Audio feature accepts as input:
- A str: Absolute path to the audio file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - bytes: Bytes content of the audio file.
  This is useful for archived files with sequential access.
- A dict with the keys:
  - path: String with relative path of the audio file to the archive file.
  - array: Array containing the audio sample.
  - sampling_rate: Integer corresponding to the sampling rate of the audio sample.
  This is useful for archived files with sequential access.
Example:
>>> from datasets import load_dataset, Audio
>>> ds = load_dataset("PolyAI/minds14", name="en-US", split="train")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=16000))
>>> ds[0]["audio"]
{'array': array([ 2.3443763e-05,  2.1729663e-04,  2.2145823e-04, ...,
     3.8356509e-05, -7.3497440e-06, -2.1754686e-05], dtype=float32),
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~JOINT_ACCOUNT/602ba55abb1e6d0fbce92065.wav',
 'sampling_rate': 16000}
cast_storage
< source >( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray] ) → pa.StructArray
Cast an Arrow array to the Audio arrow storage type. The Arrow types that can be converted to the Audio pyarrow storage type are:
- pa.string() - it must contain the “path” data
- pa.binary() - it must contain the audio bytes
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) - order doesn’t matter
decode_example
< source >( value: dict token_per_repo_id: typing.Union[typing.Dict[str, typing.Union[str, bool, NoneType]], NoneType] = None ) → dict
Parameters
- value (dict) — A dictionary with keys:
  - path: String with relative audio file path.
  - bytes: Bytes of the audio file.
- token_per_repo_id (dict, optional) — To access and decode audio files from private repositories on the Hub, you can pass a dictionary repo_id (str) -> token (bool or str).
Returns
dict
Decode example audio file into audio data.
embed_storage
< source >( storage: StructArray ) → pa.StructArray
Embed audio files into the Arrow array.
encode_example
< source >( value: typing.Union[str, bytes, dict] ) → dict
Encode example into a format for Arrow.
If in the decodable state, raise an error, otherwise flatten the feature into a dictionary.
class datasets.Image
< source >( decode: bool = True id: typing.Optional[str] = None )
Image Feature to read image data from an image file.
Input: The Image feature accepts as input:
- A str: Absolute path to the image file (i.e. random access is allowed).
- A dict with the keys:
  - path: String with relative path of the image file to the archive file.
  - bytes: Bytes of the image file.
  This is useful for archived files with sequential access.
- An np.ndarray: NumPy array representing an image.
- A PIL.Image.Image: PIL image object.
Examples:
>>> from datasets import load_dataset, Image
>>> ds = load_dataset("beans", split="train")
>>> ds.features["image"]
Image(decode=True, id=None)
>>> ds[0]["image"]
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=500x500 at 0x15E52E7F0>
>>> ds = ds.cast_column('image', Image(decode=False))
>>> ds[0]["image"]
{'bytes': None,
 'path': '/root/.cache/huggingface/datasets/downloads/extracted/b0a21163f78769a2cf11f58dfc767fb458fc7cea5c05dccc0144a2c0f0bc1292/train/healthy/healthy_train.85.jpg'}
cast_storage
< source >( storage: typing.Union[pyarrow.lib.StringArray, pyarrow.lib.StructArray, pyarrow.lib.ListArray] ) → pa.StructArray
Cast an Arrow array to the Image arrow storage type. The Arrow types that can be converted to the Image pyarrow storage type are:
- pa.string() - it must contain the “path” data
- pa.binary() - it must contain the image bytes
- pa.struct({"bytes": pa.binary()})
- pa.struct({"path": pa.string()})
- pa.struct({"bytes": pa.binary(), "path": pa.string()}) - order doesn’t matter
- pa.list(*) - it must contain the image array data
decode_example
< source >( value: dict token_per_repo_id = None )
Parameters
- value (str or dict) — A string with the absolute image file path, or a dictionary with keys:
  - path: String with absolute or relative image file path.
  - bytes: The bytes of the image file.
- token_per_repo_id (dict, optional) — To access and decode image files from private repositories on the Hub, you can pass a dictionary repo_id (str) -> token (bool or str).
Decode example image file into image data.
embed_storage
< source >( storage: StructArray ) → pa.StructArray
Embed image files into the Arrow array.
encode_example
< source >( value: typing.Union[str, bytes, dict, numpy.ndarray, ForwardRef('PIL.Image.Image')] )
Encode example into a format for Arrow.
If in the decodable state, return the feature itself, otherwise flatten the feature into a dictionary.
MetricInfo
class datasets.MetricInfo
< source >( description: str citation: str features: Features inputs_description: str = <factory> homepage: str = <factory> license: str = <factory> codebase_urls: typing.List[str] = <factory> reference_urls: typing.List[str] = <factory> streamable: bool = False format: typing.Optional[str] = None metric_name: typing.Optional[str] = None config_name: typing.Optional[str] = None experiment_id: typing.Optional[str] = None )
Information about a metric.
MetricInfo documents a metric, including its name, version, and features.
See the constructor arguments and properties for a full list.
Note: Not all fields are known on construction and may be updated later.
Create MetricInfo from the JSON file in metric_info_dir.
Write MetricInfo as JSON to metric_info_dir.
Also save the license separately in LICENCE.
If pretty_print is True, the JSON will be pretty-printed with the indent level of 4.
Metric
The base class Metric implements a Metric backed by one or several Dataset.
class datasets.Metric
< source >( config_name: typing.Optional[str] = None keep_in_memory: bool = False cache_dir: typing.Optional[str] = None num_process: int = 1 process_id: int = 0 seed: typing.Optional[int] = None experiment_id: typing.Optional[str] = None max_concurrent_cache_files: int = 10000 timeout: typing.Union[int, float] = 100 **kwargs )
Parameters
- 
							config_name (str) — This is used to define a hash specific to a metrics computation script and prevents the metric’s data to be overridden when the metric loading script is modified.
- 
							keep_in_memory (bool) — Keep all predictions and references in memory. Not possible in distributed settings.
- 
							cache_dir (str) — Path to a directory in which temporary prediction/references data will be stored. The data directory should be located on a shared file-system in distributed setups.
- 
							num_process (int) — Specify the total number of nodes in a distributed setting. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- 
							process_id (int) — Specify the id of the current process in a distributed setup (between 0 and num_process-1). This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- 
							seed (int, optional) — If specified, this will temporarily set numpy’s random seed when datasets.Metric.compute() is run.
- 
							experiment_id (str) — A specific experiment id. This is used if several distributed evaluations share the same file system. This is useful to compute metrics in distributed setups (in particular non-additive metrics like F1).
- 
							max_concurrent_cache_files (int) — Max number of concurrent metrics cache files (default 10000).
- 
							timeout (Union[int, float]) — Timeout in seconds for distributed setting synchronization.
A Metric is the base class and common API for all metrics.
Deprecated in 2.5.0
Use the new library 🤗 Evaluate instead: https://huggingface.co/docs/evaluate
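The parameters above mention non-additive metrics like F1. The sketch below illustrates why such metrics need the predictions of all processes before scoring: averaging the F1 of each process generally gives a different value than computing F1 over the combined data. The helper `f1_binary` and the two shards are made up for illustration and are not part of the library.

```python
def f1_binary(predictions, references):
    """F1 for the positive class (label 1); returns 0.0 when undefined."""
    tp = sum(p == 1 and r == 1 for p, r in zip(predictions, references))
    fp = sum(p == 1 and r == 0 for p, r in zip(predictions, references))
    fn = sum(p == 0 and r == 1 for p, r in zip(predictions, references))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical shards, as if two processes (process_id 0 and 1) each saw part of the data.
shard_0 = {"predictions": [1, 1, 1, 0], "references": [1, 1, 1, 1]}
shard_1 = {"predictions": [1, 0, 0, 0], "references": [0, 0, 0, 1]}

# Naive approach: score each shard separately, then average the scores.
per_process = [f1_binary(**shard) for shard in (shard_0, shard_1)]
mean_of_f1 = sum(per_process) / len(per_process)

# Correct approach: gather everything first, then score once.
global_f1 = f1_binary(
    shard_0["predictions"] + shard_1["predictions"],
    shard_0["references"] + shard_1["references"],
)

print(round(mean_of_f1, 3), round(global_f1, 3))  # 0.429 0.667
```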
add
< source >( prediction = None reference = None **kwargs )
Add one prediction and reference for the metric’s stack.
add_batch
< source >( predictions = None references = None **kwargs )
Add a batch of predictions and references for the metric’s stack.
compute
< source >( predictions = None references = None **kwargs )
Compute the metrics.
Usage of positional arguments is not allowed to prevent mistakes.
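The add / add_batch / compute flow can be sketched with a toy class. `ToyAccuracy` is a hypothetical stand-in written for this illustration and is not datasets.Metric. It only mirrors the calling convention documented above: inputs are stacked with keyword arguments and scored once at the end.

```python
class ToyAccuracy:
    """Hypothetical stand-in that mimics the add / add_batch / compute flow."""

    def __init__(self):
        self._predictions = []
        self._references = []

    def add(self, *, prediction=None, reference=None):
        # Store a single prediction/reference pair.
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions=None, references=None):
        # Store a whole batch at once.
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self, *, predictions=None, references=None):
        # In this toy, inputs passed here are stacked with the stored ones before scoring.
        if predictions is not None and references is not None:
            self.add_batch(predictions=predictions, references=references)
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        result = {"accuracy": correct / len(self._predictions)}
        # Reset the stack so the object can be reused.
        self._predictions, self._references = [], []
        return result


metric = ToyAccuracy()
metric.add(prediction=1, reference=1)
metric.add_batch(predictions=[0, 1], references=[1, 1])
result = metric.compute()

try:
    metric.compute([1], [1])  # keyword-only signature: positional arguments raise TypeError
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Making the arguments keyword-only prevents predictions and references from being swapped by accident.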
download_and_prepare
< source >( download_config: typing.Optional[datasets.download.download_config.DownloadConfig] = None dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None )
Parameters
- download_config (DownloadConfig, optional) — Specific download configuration parameters.
- dl_manager (DownloadManager, optional) — Specific download manager to use.
Downloads and prepares dataset for reading.
Filesystems
class datasets.filesystems.S3FileSystem
< source >( *args **kwargs )
Parameters
- 
							anon (bool, defaults to False) — Whether to use anonymous connection (public buckets only). If False, uses the key/secret given, or boto’s credential resolver (client_kwargs, environment variables, config files, EC2 IAM server, in that order).
- 
							key (str) — If not anonymous, use this access key ID, if specified.
- 
							secret (str) — If not anonymous, use this secret access key, if specified.
- 
							token (str) — If not anonymous, use this security token, if specified.
- 
							use_ssl (bool, defaults to True) — Whether to use SSL in connections to S3; may be faster without, but insecure. If use_ssl is also set in client_kwargs, the value set in client_kwargs will take priority.
- 
							s3_additional_kwargs (dict) — Parameters that are used when calling S3 API methods. Typically used for things like ServerSideEncryption.
- 
							client_kwargs (dict) — Parameters for the botocore client.
- 
							requester_pays (bool, defaults to False) — Whether RequesterPays buckets are supported.
- 
							default_block_size (int) — If given, the default block size value used for open(), if no specific value is given at call time. The built-in default is 5MB.
- 
							default_fill_cache (bool, defaults to True) — Whether to use cache filling with open by default. Refer to S3File.open.
- 
							default_cache_type (str, defaults to bytes) — If given, the default cache_type value used for open(). Set to none if no caching is desired. See fsspec’s documentation for other available cache_type values.
- 
							version_aware (bool, defaults to False) — Whether to support bucket versioning. If enabled, this requires the user to have the necessary IAM permissions for dealing with versioned objects.
- 
							cache_regions (bool, defaults to False) — Whether to cache bucket regions. Whenever a new bucket is used, it will first find out which region it belongs to and then use the client for that region.
- 
							asynchronous (bool, defaults to False) — Whether this instance is to be used from inside coroutines.
- 
							config_kwargs (dict) — Parameters passed to botocore.client.Config.
- 
							**kwargs — Other parameters for core session.
- 
							session (aiobotocore.session.AioSession) — Session to be used for all connections. This session will be used in place of creating a new session inside S3FileSystem. For example: aiobotocore.session.AioSession(profile='test_user').
- 
							skip_instance_cache (bool) — Control reuse of instances. Passed on to fsspec.
- 
							use_listings_cache (bool) — Control reuse of directory listings. Passed on to fsspec.
- 
							listings_expiry_time (int or float) — Control reuse of directory listings. Passed on to fsspec.
- 
							max_paths (int) — Control reuse of directory listings. Passed on to fsspec.
datasets.filesystems.S3FileSystem is a subclass of s3fs.S3FileSystem.
Users can use this class to access S3 as if it were a file system. It exposes a filesystem-like API (ls, cp, open, etc.) on top of S3 storage. Provide credentials either explicitly (key=, secret=) or with boto’s credential methods. See botocore documentation for more information. If no credentials are available, use anon=True.
Examples:
Listing files from public S3 bucket.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(anon=True)
>>> s3.ls('public-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']
Listing files from private S3 bucket using aws_access_key_id and aws_secret_access_key.
>>> import datasets
>>> s3 = datasets.filesystems.S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> s3.ls('my-private-datasets/imdb/train')
['dataset_info.json','dataset.arrow','state.json']
Using S3FileSystem with botocore.session.Session and custom aws_profile.
>>> import botocore.session
>>> from datasets.filesystems import S3FileSystem
>>> s3_session = botocore.session.Session(profile='my_profile_name')
>>> s3 = S3FileSystem(session=s3_session)
Loading dataset from S3 using S3FileSystem and load_from_disk().
>>> from datasets import load_from_disk
>>> from datasets.filesystems import S3FileSystem
>>> s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
>>> dataset = load_from_disk('s3://my-private-datasets/imdb/train', storage_options=s3.storage_options)
>>> print(len(dataset))
25000
Saving dataset to S3 using S3FileSystem and Dataset.save_to_disk().
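No code accompanies this last case in the text. The following is a sketch that mirrors the load_from_disk() example, assuming that Dataset.save_to_disk() accepts the same storage_options argument. The helper name `save_dataset_to_s3` is hypothetical, and the bucket path and credentials are placeholders.

```python
def save_dataset_to_s3(dataset, s3_uri, aws_access_key_id, aws_secret_access_key):
    """Save `dataset` (a datasets.Dataset) under `s3_uri`, e.g. 's3://my-private-datasets/imdb/train'."""
    # Imported lazily so that defining this helper does not require the optional S3 dependencies.
    from datasets.filesystems import S3FileSystem

    s3 = S3FileSystem(key=aws_access_key_id, secret=aws_secret_access_key)
    # Assumption: save_to_disk accepts storage_options, mirroring load_from_disk above.
    dataset.save_to_disk(s3_uri, storage_options=s3.storage_options)
```

Calling the helper requires valid AWS credentials and write access to the target bucket.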
datasets.filesystems.extract_path_from_uri
< source >( dataset_path: str )
Preprocesses dataset_path and removes the remote filesystem prefix (e.g. s3://).
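The behavior described above can be sketched in a few lines. `strip_remote_protocol` is a hypothetical helper written for this illustration, not the library function, and it assumes the prefix always has the form `<protocol>://`.

```python
def strip_remote_protocol(dataset_path: str) -> str:
    """Return the path without a leading '<protocol>://' prefix, if there is one."""
    if "://" in dataset_path:
        # Keep only what follows the first '://'.
        dataset_path = dataset_path.split("://", 1)[1]
    return dataset_path


remote = strip_remote_protocol("s3://my-bucket/imdb/train")
local = strip_remote_protocol("/data/imdb/train")

print(remote)  # my-bucket/imdb/train
print(local)   # /data/imdb/train
```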
datasets.filesystems.is_remote_filesystem
< source >( fs: AbstractFileSystem )
Parameters
- 
							fs (fsspec.spec.AbstractFileSystem) — An abstract super-class for pythonic file-systems, e.g. fsspec.filesystem('file') or datasets.filesystems.S3FileSystem.
Checks whether the filesystem has a remote protocol.
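One way such a check could work is to inspect the filesystem's `protocol` attribute, which fsspec filesystems expose as a string or a tuple of aliases. The sketch below assumes that a filesystem is local exactly when `'file'` is among its protocols. The function `has_remote_protocol` and the two fake classes are hypothetical stand-ins, used so the example runs without fsspec installed.

```python
def has_remote_protocol(fs) -> bool:
    """Return True unless 'file' is among the filesystem's protocols."""
    protocol = fs.protocol
    # Normalize a single string to a one-element tuple.
    protocols = (protocol,) if isinstance(protocol, str) else tuple(protocol)
    return "file" not in protocols


class FakeLocalFS:  # hypothetical stand-in for fsspec.filesystem('file')
    protocol = ("file", "local")


class FakeS3FS:  # hypothetical stand-in for an S3 filesystem
    protocol = ("s3", "s3a")


print(has_remote_protocol(FakeLocalFS()))  # False
print(has_remote_protocol(FakeS3FS()))     # True
```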
Fingerprint
Hasher that accepts python objects as inputs.