Dataset Creation and Management

This page documents functions for creating, loading, and managing datasets.

muller.dataset()

Overview

Creates a new dataset or opens an existing one at the given path.

Parameters

path (str or pathlib.Path): The full path to the dataset. Can be:
    An S3 path of the form s3://bucketname/path/to/dataset
    A local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset
read_only (bool, optional): Opens the dataset in read-only mode if True. Defaults to False.
overwrite (bool, optional): If True, overwrites the dataset if it already exists. Defaults to False.
memory_cache_size (int, optional): The size of the memory cache to be used in MB. Defaults to DEFAULT_MEMORY_CACHE_SIZE.
local_cache_size (int, optional): The size of the local filesystem cache to be used in MB. Defaults to DEFAULT_LOCAL_CACHE_SIZE.
creds (dict or str, optional): Credentials for OBS service. Defaults to None.
verbose (bool, optional): If True, logs will be printed. Defaults to True.
reset (bool, optional): If the specified dataset cannot be loaded due to a corrupted HEAD state, setting reset=True will reset HEAD changes and load the previous version. Defaults to False.
check_integrity (bool, optional): Performs an integrity check by default if the dataset has 20 or fewer tensors. Defaults to True.
lock_timeout (int, optional): Number of seconds to wait before throwing a LockException. Defaults to 0.
lock_enabled (bool, optional): If True, the dataset manages a write lock. Defaults to True.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The dataset object.

Examples

import muller

# Create a new local dataset
ds = muller.dataset("./datasets/my_dataset", overwrite=True)

# Open an existing dataset
ds = muller.dataset("./datasets/my_dataset")

# Create a dataset on remote storage
ds = muller.dataset("s3://mybucket/my_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Open in read-only mode
ds = muller.dataset("./datasets/my_dataset", read_only=True)

# Create with custom cache sizes
ds = muller.dataset("./datasets/my_dataset", memory_cache_size=512, local_cache_size=2048)

muller.load()

Overview

Load an existing dataset from the given path. Unlike dataset(), this function will raise an error if the dataset does not exist.

Parameters

path (str or pathlib.Path): The full path to the dataset.
read_only (bool, optional): Opens the dataset in read-only mode if True. Defaults to False.
memory_cache_size (int, optional): The size of the memory cache to be used in MB. Defaults to DEFAULT_MEMORY_CACHE_SIZE.
local_cache_size (int, optional): The size of the local filesystem cache to be used in MB. Defaults to DEFAULT_LOCAL_CACHE_SIZE.
creds (dict or str, optional): Credentials for OBS service. Defaults to None.
verbose (bool, optional): If True, logs will be printed. Defaults to True.
check_integrity (bool, optional): Performs an integrity check by default if the dataset has 20 or fewer tensors. Defaults to True.
lock_enabled (bool, optional): If True, the dataset manages a write lock. Defaults to True.
lock_timeout (int, optional): Number of seconds to wait before throwing a LockException. Defaults to 0.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The loaded dataset.

Examples

import muller

# Load an existing dataset
ds = muller.load("./datasets/my_dataset")

# Load in read-only mode
ds = muller.load("./datasets/my_dataset", read_only=True)

# Load from remote storage
ds = muller.load("s3://mybucket/my_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Load without integrity check for faster loading
ds = muller.load("./datasets/large_dataset", check_integrity=False)

muller.empty()

Overview

Creates an empty dataset at the specified path. This is useful when you want to create a dataset structure before adding any data.

Parameters

path (str or pathlib.Path): The full path to the dataset.
overwrite (bool, optional): If True, overwrites the dataset if it already exists. Defaults to False.
memory_cache_size (int, optional): The size of the memory cache to be used in MB. Defaults to DEFAULT_MEMORY_CACHE_SIZE.
local_cache_size (int, optional): The size of the local filesystem cache to be used in MB. Defaults to DEFAULT_LOCAL_CACHE_SIZE.
creds (dict or str, optional): Credentials used to access the dataset. Defaults to None.
verbose (bool, optional): If True, logs will be printed. Defaults to True.
lock_timeout (int, optional): Number of seconds to wait before throwing a LockException. Defaults to 0.
lock_enabled (bool, optional): If True, the dataset manages a write lock. Defaults to True.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The created empty dataset.

Examples

import muller

# Create an empty dataset
ds = muller.empty("./datasets/new_dataset")

# Create and overwrite if exists
ds = muller.empty("./datasets/new_dataset", overwrite=True)

# Create empty dataset on remote storage
ds = muller.empty("s3://mybucket/new_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Add tensors to the empty dataset
with ds:
    ds.create_tensor("images")
    ds.create_tensor("labels")

Warning

Setting overwrite=True will delete all of your data if it exists!

muller.like()

Overview

Copies the source dataset's structure to a new location. No samples are copied, only the meta/info for the dataset and its tensors. This is useful for creating a new dataset with the same schema as an existing one.

Parameters

dest (str or Dataset): Empty Dataset or Path where the new dataset will be created.
src (str or Dataset): Path or dataset object that will be used as the template for the new dataset.
tensors (List[str], optional): Names of tensors (and groups) to be replicated. If not specified, all tensors in source dataset are considered. Defaults to None.
overwrite (bool, optional): If True and a dataset exists at dest, it will be overwritten. Defaults to False.
verbose (bool, optional): If True, logs will be printed. Defaults to True.

Returns

Dataset: New dataset object with the same structure as the source.

Examples

import muller

# Create a new dataset with the same structure as an existing one
source_ds = muller.load("./datasets/source_dataset")
new_ds = muller.like(dest="./datasets/new_dataset", src=source_ds)

# Copy structure from path
new_ds = muller.like(dest="./datasets/new_dataset", src="./datasets/source_dataset")

# Copy only specific tensors
new_ds = muller.like(
    dest="./datasets/new_dataset",
    src="./datasets/source_dataset",
    tensors=["images", "labels"]
)

# Overwrite if destination exists
new_ds = muller.like(
    dest="./datasets/new_dataset",
    src="./datasets/source_dataset",
    overwrite=True
)

muller.delete()

Overview

Delete a dataset at the given path. This permanently removes all data and metadata associated with the dataset.

Parameters

path (str or pathlib.Path): The full path to the dataset to delete.
large_ok (bool, optional): If True, allows deletion of large datasets. Defaults to False.
creds (dict or str, optional): Credentials for OBS service. Defaults to None.

Returns

None

Examples

import muller

# Delete a dataset
muller.delete("./datasets/old_dataset")

# Delete a large dataset
muller.delete("./datasets/large_dataset", large_ok=True)

# Delete a remote dataset
muller.delete("s3://mybucket/old_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

Warning

This operation is irreversible. All data will be permanently deleted.

muller.get_col_info()

Overview

Get column (tensor) information from a dataset without loading the entire dataset. This is useful for quickly inspecting dataset metadata.

Parameters

path (str or pathlib.Path): The full path to the dataset.
col_name (str, optional): Name of the column (tensor) to get info for. If None or empty string, returns dataset-level info. Defaults to None.

Returns

bytes: The raw metadata content in bytes.

Examples

import json

import muller

# Get dataset-level info
info = muller.get_col_info("./datasets/my_dataset")

# Get specific tensor info
tensor_info = muller.get_col_info("./datasets/my_dataset", col_name="images")

# Parse the info (the returned bytes contain JSON)
parsed_info = json.loads(info)
print(parsed_info)

muller.from_file()

Overview

Create a dataset from a file (JSON lines format). This function reads data from a file and creates a MULLER dataset with the appropriate schema.

Parameters

ori_path (str): Path to the source file (JSON lines format).
muller_path (str): Path where the MULLER dataset will be created.
schema (dict, optional): Schema definition for the dataset. If not provided, schema will be inferred from the first record. Defaults to None.
workers (int, optional): Number of workers for parallel processing. Defaults to 0.
scheduler (str, optional): Scheduler type for compute operations. Defaults to "processed".
disable_rechunk (bool, optional): Whether to disable rechunking. Defaults to True.
progressbar (bool, optional): Whether to show progress bar. Defaults to True.
ignore_errors (bool, optional): Whether to ignore errors during processing. Defaults to True.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The created dataset.

Examples

import muller

# Create dataset from JSON lines file with inferred schema
ds = muller.from_file(
    ori_path="./data/records.jsonl",
    muller_path="./datasets/my_dataset"
)

# Create with explicit schema
schema = {
    "id": ("text", "str", None),
    "image": ("image", "uint8", "jpeg"),
    "label": ("class_label", "int32", None)
}
ds = muller.from_file(
    ori_path="./data/records.jsonl",
    muller_path="./datasets/my_dataset",
    schema=schema
)

# Use multiple workers for faster processing
ds = muller.from_file(
    ori_path="./data/large_records.jsonl",
    muller_path="./datasets/large_dataset",
    workers=4,
    progressbar=True
)

# Nested schema example
nested_schema = {
    "metadata": {
        "author": ("text", "str", None),
        "date": ("text", "str", None)
    },
    "content": ("text", "str", None)
}
ds = muller.from_file(
    ori_path="./data/nested_records.jsonl",
    muller_path="./datasets/nested_dataset",
    schema=nested_schema
)

Notes

The input file should be in JSON lines format (one JSON object per line).
Schema format: {column_name: (htype, dtype, sample_compression)} or nested dictionaries.
If schema is not provided, it will be inferred from the first record in the file.
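
The JSON lines input described in these notes can be produced with the Python standard library alone. The sketch below is illustrative only: the file name and record fields are made up, and it stops before the call to muller.from_file().

```python
import json
import os
import tempfile

# Hypothetical records. Every record uses the same keys, because the schema
# is inferred from the first record when none is passed.
records = [
    {"id": "001", "content": "first record"},
    {"id": "002", "content": "second record"},
]

# Write one JSON object per line (JSON lines format).
jsonl_path = os.path.join(tempfile.mkdtemp(), "records.jsonl")
with open(jsonl_path, "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm that each line parses on its own.
with open(jsonl_path, encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f if line.strip()]

print(len(parsed), "records written to", jsonl_path)
```

The resulting path can then be passed as ori_path.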

muller.from_dataframes()

Overview

Create a dataset from a list of dataframes (dictionaries). This is useful for creating datasets from in-memory data structures.

Parameters

dataframes (list): List of dataframes (dicts) to import.
muller_path (str): Path where the MULLER dataset will be created.
schema (dict, optional): Schema definition for the dataset. If not provided, schema will be inferred from the first record. Defaults to None.
workers (int, optional): Number of workers for parallel processing. Defaults to 0.
scheduler (str, optional): Scheduler type for compute operations. Defaults to "processed".
disable_rechunk (bool, optional): Whether to disable rechunking. Defaults to True.
progressbar (bool, optional): Whether to show progress bar. Defaults to True.
ignore_errors (bool, optional): Whether to ignore errors during processing. Defaults to True.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The created dataset.

Examples

import muller

# Create dataset from list of dictionaries
data = [
    {"id": "001", "value": 10, "label": "A"},
    {"id": "002", "value": 20, "label": "B"},
    {"id": "003", "value": 30, "label": "C"}
]
ds = muller.from_dataframes(
    dataframes=data,
    muller_path="./datasets/my_dataset"
)

# Create with explicit schema
schema = {
    "id": ("text", "str", None),
    "value": ("generic", "int32", None),
    "label": ("class_label", "str", None)
}
ds = muller.from_dataframes(
    dataframes=data,
    muller_path="./datasets/my_dataset",
    schema=schema
)

# Use multiple workers for large datasets
large_data = [{"col1": i, "col2": i*2} for i in range(10000)]
ds = muller.from_dataframes(
    dataframes=large_data,
    muller_path="./datasets/large_dataset",
    workers=4,
    progressbar=True
)

# Nested data example
nested_data = [
    {
        "user": {"name": "Alice", "age": 30},
        "score": 95
    },
    {
        "user": {"name": "Bob", "age": 25},
        "score": 87
    }
]
nested_schema = {
    "user": {
        "name": ("text", "str", None),
        "age": ("generic", "int32", None)
    },
    "score": ("generic", "int32", None)
}
ds = muller.from_dataframes(
    dataframes=nested_data,
    muller_path="./datasets/nested_dataset",
    schema=nested_schema
)

Notes

Each item in the dataframes list should be a dictionary representing one record.
Schema format: {column_name: (htype, dtype, sample_compression)} or nested dictionaries.
If schema is not provided, it will be inferred from the first record.
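
Because a schema may mix 3-tuples and nested dictionaries, it can help to check its shape before building a dataset. The helper below is a hypothetical sketch and not part of MULLER: leaf_columns is an invented name, and paths are reported as tuples because this page does not say how nested tensors are named.

```python
def leaf_columns(schema, prefix=()):
    """Yield (path, spec) pairs for every leaf of a possibly nested schema."""
    for name, spec in schema.items():
        path = prefix + (name,)
        if isinstance(spec, dict):
            # Nested dictionary: recurse into the group.
            yield from leaf_columns(spec, path)
        elif isinstance(spec, tuple) and len(spec) == 3:
            # Leaf: (htype, dtype, sample_compression).
            yield path, spec
        else:
            raise ValueError(
                f"{'.'.join(path)}: expected a 3-tuple or a dict, got {spec!r}"
            )


nested_schema = {
    "user": {
        "name": ("text", "str", None),
        "age": ("generic", "int32", None),
    },
    "score": ("generic", "int32", None),
}

leaves = list(leaf_columns(nested_schema))
for path, (htype, dtype, compression) in leaves:
    print(path, htype, dtype, compression)
```

The check only covers the structure of the schema; it does not verify that the htype, dtype, or compression values are ones MULLER accepts.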

muller.from_csv()

Overview

Create a new dataset from a CSV file. Each CSV column becomes a tensor in the dataset. Columns containing file paths (e.g., image or video files) can be loaded automatically via muller.read() using the path_columns parameter.

Parameters

csv_path (str): Path to the source CSV file.
muller_path (str): Path where the MULLER dataset will be created.
schema (dict, optional): Schema definition for the dataset. Format: {column_name: (htype, dtype, sample_compression)}. If not provided, all columns are created as generic tensors. Defaults to None.
path_columns (dict, optional): Dict mapping column names to their handling mode for file path values. Defaults to None. Supported modes:
    "read": Calls muller.read(path) to load the file as a Sample (for image/video/audio tensors).
    "text": Stores the file path as a plain text string.
workers (int, optional): Number of workers for parallel processing. Defaults to 0.
scheduler (str, optional): Scheduler type for compute operations. Defaults to "processed".
disable_rechunk (bool, optional): Whether to disable rechunking. Defaults to True.
progressbar (bool, optional): Whether to show progress bar. Defaults to True.
ignore_errors (bool, optional): Whether to ignore errors during processing. Defaults to True.
split_tensor_meta (bool, optional): Each tensor has a tensor_meta.json if True. Defaults to True.

Returns

Dataset: The created dataset.

Raises

ValueError: If csv_path or muller_path is empty, or if the CSV file does not exist or is empty.

Examples

import muller

# Create dataset from a CSV with text-only columns
ds = muller.from_csv(
    csv_path="./data/records.csv",
    muller_path="./datasets/text_dataset",
    schema={
        "name": ("text", "", "lz4"),
        "score": ("text", "", "lz4"),
    },
)

# Create dataset from a CSV where a column contains image file paths
ds = muller.from_csv(
    csv_path="./data/images.csv",
    muller_path="./datasets/image_dataset",
    schema={
        "image_path": ("image", "uint8", "jpeg"),
        "label": ("text", "", "lz4"),
    },
    path_columns={"image_path": "read"},  # load images via muller.read()
)

# Store file paths as text instead of loading the files
ds = muller.from_csv(
    csv_path="./data/images.csv",
    muller_path="./datasets/path_dataset",
    schema={
        "image_path": ("text", "", "lz4"),
        "label": ("text", "", "lz4"),
    },
    path_columns={"image_path": "text"},  # store path string as-is
)

# Use multiple workers for faster processing
ds = muller.from_csv(
    csv_path="./data/large_data.csv",
    muller_path="./datasets/large_dataset",
    schema={
        "image_path": ("image", "uint8", "jpeg"),
        "label": ("class_label", "uint32", None),
    },
    path_columns={"image_path": "read"},
    workers=4,
    progressbar=True,
)

Notes

The CSV file must have a header row. Column names in the header are used as tensor names.
All CSV values are read as strings. Use schema to specify the correct htype and dtype for each column.
Columns not listed in path_columns are appended directly as their raw CSV string values.
The path_columns parameter is only needed when a CSV column contains file paths that should be loaded as binary data (images, videos, etc.) or explicitly stored as path strings.
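
A CSV file of the shape these notes describe can be sketched with Python's csv module. The file name, column names, and image paths below are invented for illustration, and the read-back step uses csv.DictReader rather than MULLER's own reader; it only demonstrates why numeric columns arrive as strings.

```python
import csv
import os
import tempfile

csv_path = os.path.join(tempfile.mkdtemp(), "images.csv")

# The header row supplies the tensor names.
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["image_path", "label"])
    writer.writerow(["./images/cat.jpg", 0])
    writer.writerow(["./images/dog.jpg", 1])

# Reading the file back shows that every value is a string,
# including the numeric labels.
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows[0])
```

This is why the schema argument matters: without it, a column such as label has no type information beyond its raw text.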

ds.add_data_from_csv()

Overview

Append data from a CSV file into an existing dataset. Tensors must already be created before calling this method. This is useful for incrementally adding data from CSV files to a dataset that was created manually or from another source.

Parameters

csv_path (str): Path to the source CSV file.
schema (dict, optional): Schema definition. If not provided, CSV column names are used directly. Defaults to None.
path_columns (dict, optional): Dict mapping column names to their handling mode for file path values. Defaults to None. Supported modes:
    "read": Calls muller.read(path) to load the file as a Sample.
    "text": Stores the file path as a plain text string.
workers (int, optional): Number of workers for parallel processing. Defaults to 0.
scheduler (str, optional): Scheduler type for compute operations. Defaults to "processed".
disable_rechunk (bool, optional): Whether to disable rechunking. Defaults to True.
progressbar (bool, optional): Whether to show progress bar. Defaults to True.
ignore_errors (bool, optional): Whether to ignore errors during processing. Defaults to True.

Returns

Dataset: The updated dataset.

Raises

ValueError: If csv_path is empty, the CSV file cannot be read, or CSV column names do not match existing tensor names.

Examples

import muller

# Create a dataset and define tensors manually
ds = muller.dataset(path="./datasets/my_dataset", overwrite=True)
ds.create_tensor("image_path", htype="image", sample_compression="jpeg")
ds.create_tensor("label", htype="text", sample_compression="lz4")

# Append data from a CSV file
ds.add_data_from_csv(
    csv_path="./data/batch1.csv",
    path_columns={"image_path": "read"},
    workers=0,
)
print(len(ds))  # number of samples from batch1.csv

# Append more data from another CSV file
ds.add_data_from_csv(
    csv_path="./data/batch2.csv",
    path_columns={"image_path": "read"},
    workers=0,
)
print(len(ds))  # accumulated samples from both CSV files

Notes

The CSV column names must match the existing tensor names in the dataset. A ValueError is raised if they do not match.
This method can be called multiple times to incrementally append data from different CSV files.
The path_columns parameter works identically to muller.from_csv().
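
Since mismatched column names raise a ValueError, a quick pre-check of the CSV header can give a clearer message before the append starts. The sketch below is a hypothetical helper, not a MULLER function: the tensor names are passed in explicitly rather than read from a dataset, and any difference in either direction is reported, which may be stricter than the library's own check.

```python
import csv
import io


def header_mismatch(csv_file, tensor_names):
    """Return (missing, unexpected) column names relative to tensor_names."""
    header = next(csv.reader(csv_file), [])
    missing = sorted(set(tensor_names) - set(header))
    unexpected = sorted(set(header) - set(tensor_names))
    return missing, unexpected


# In-memory stand-in for a CSV file; "lable" is a deliberate typo.
sample = io.StringIO("image_path,lable\n./images/cat.jpg,cat\n")
missing, unexpected = header_mismatch(sample, ["image_path", "label"])
print("missing:", missing, "unexpected:", unexpected)
```

With a real file, the same function can be called on an open file handle (opened with newline="") before ds.add_data_from_csv().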
