# Dataset Creation and Management

This page documents functions for creating, loading, and managing datasets.

## Table of Contents

- `muller.dataset()`
- `muller.load()`
- `muller.empty()`
- `muller.like()`
- `muller.delete()`
- `muller.get_col_info()`
- `muller.from_file()`
- `muller.from_dataframes()`
## muller.dataset()

### Overview

Returns a `Dataset` object referencing either a new or an existing dataset. This is the primary function for creating or opening datasets.
### Parameters

- `path` (`str` or `pathlib.Path`): The full path to the dataset. Can be:
  - An S3 path of the form `s3://bucketname/path/to/dataset`
  - A local file system path of the form `./path/to/dataset`, `~/path/to/dataset`, or `path/to/dataset`
- `read_only` (`bool`, optional): Opens the dataset in read-only mode if `True`. Defaults to `False`.
- `overwrite` (`bool`, optional): If `True`, overwrites the dataset if it already exists. Defaults to `False`.
- `memory_cache_size` (`int`, optional): The size of the memory cache to be used, in MB. Defaults to `DEFAULT_MEMORY_CACHE_SIZE`.
- `local_cache_size` (`int`, optional): The size of the local filesystem cache to be used, in MB. Defaults to `DEFAULT_LOCAL_CACHE_SIZE`.
- `creds` (`dict` or `str`, optional): Credentials for the OBS service. Defaults to `None`.
- `verbose` (`bool`, optional): If `True`, logs will be printed. Defaults to `True`.
- `reset` (`bool`, optional): If the specified dataset cannot be loaded due to a corrupted HEAD state, setting `reset=True` will discard the HEAD changes and load the previous version. Defaults to `False`.
- `check_integrity` (`bool`, optional): Performs an integrity check if the dataset has 20 or fewer tensors. Defaults to `True`.
- `lock_timeout` (`int`, optional): Number of seconds to wait before throwing a `LockException`. Defaults to `0`.
- `lock_enabled` (`bool`, optional): If `True`, the dataset manages a write lock. Defaults to `True`.
- `split_tensor_meta` (`bool`, optional): If `True`, each tensor has its own `tensor_meta.json`. Defaults to `True`.
### Returns

- `Dataset`: The dataset object.
### Examples

```python
import muller

# Create a new local dataset
ds = muller.dataset("./datasets/my_dataset", overwrite=True)

# Open an existing dataset
ds = muller.dataset("./datasets/my_dataset")

# Create a dataset on remote storage
ds = muller.dataset("s3://mybucket/my_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Open in read-only mode
ds = muller.dataset("./datasets/my_dataset", read_only=True)

# Create with custom cache sizes (in MB)
ds = muller.dataset("./datasets/my_dataset", memory_cache_size=512, local_cache_size=2048)
```
## muller.load()

### Overview

Loads an existing dataset from the given path. Unlike `dataset()`, this function raises an error if the dataset does not exist.
### Parameters

- `path` (`str` or `pathlib.Path`): The full path to the dataset.
- `read_only` (`bool`, optional): Opens the dataset in read-only mode if `True`. Defaults to `False`.
- `memory_cache_size` (`int`, optional): The size of the memory cache to be used, in MB. Defaults to `DEFAULT_MEMORY_CACHE_SIZE`.
- `local_cache_size` (`int`, optional): The size of the local filesystem cache to be used, in MB. Defaults to `DEFAULT_LOCAL_CACHE_SIZE`.
- `creds` (`dict` or `str`, optional): Credentials for the OBS service. Defaults to `None`.
- `verbose` (`bool`, optional): If `True`, logs will be printed. Defaults to `True`.
- `check_integrity` (`bool`, optional): Performs an integrity check if the dataset has 20 or fewer tensors. Defaults to `True`.
- `lock_enabled` (`bool`, optional): If `True`, the dataset manages a write lock. Defaults to `True`.
- `lock_timeout` (`int`, optional): Number of seconds to wait before throwing a `LockException`. Defaults to `0`.
- `split_tensor_meta` (`bool`, optional): If `True`, each tensor has its own `tensor_meta.json`. Defaults to `True`.
### Returns

- `Dataset`: The loaded dataset.
### Examples

```python
import muller

# Load an existing dataset
ds = muller.load("./datasets/my_dataset")

# Load in read-only mode
ds = muller.load("./datasets/my_dataset", read_only=True)

# Load from remote storage
ds = muller.load("s3://mybucket/my_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Skip the integrity check for faster loading
ds = muller.load("./datasets/large_dataset", check_integrity=False)
```
## muller.empty()

### Overview

Creates an empty dataset at the specified path. This is useful when you want to create a dataset's structure before adding any data.
### Parameters

- `path` (`str` or `pathlib.Path`): The full path to the dataset.
- `overwrite` (`bool`, optional): If `True`, overwrites the dataset if it already exists. Defaults to `False`.
- `memory_cache_size` (`int`, optional): The size of the memory cache to be used, in MB. Defaults to `DEFAULT_MEMORY_CACHE_SIZE`.
- `local_cache_size` (`int`, optional): The size of the local filesystem cache to be used, in MB. Defaults to `DEFAULT_LOCAL_CACHE_SIZE`.
- `creds` (`dict` or `str`, optional): Credentials used to access the dataset. Defaults to `None`.
- `verbose` (`bool`, optional): If `True`, logs will be printed. Defaults to `True`.
- `lock_timeout` (`int`, optional): Number of seconds to wait before throwing a `LockException`. Defaults to `0`.
- `lock_enabled` (`bool`, optional): If `True`, the dataset manages a write lock. Defaults to `True`.
- `split_tensor_meta` (`bool`, optional): If `True`, each tensor has its own `tensor_meta.json`. Defaults to `True`.
### Returns

- `Dataset`: The created empty dataset.
### Examples

```python
import muller

# Create an empty dataset
ds = muller.empty("./datasets/new_dataset")

# Create, overwriting any existing dataset at the path
ds = muller.empty("./datasets/new_dataset", overwrite=True)

# Create an empty dataset on remote storage
ds = muller.empty("s3://mybucket/new_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})

# Add tensors to the empty dataset
with ds:
    ds.create_tensor("images")
    ds.create_tensor("labels")
```
### Warning

Setting `overwrite=True` will delete all of your existing data at the path!
## muller.like()

### Overview

Copies the source dataset's structure to a new location. No samples are copied; only the meta/info for the dataset and its tensors. This is useful for creating a new dataset with the same schema as an existing one.
### Parameters

- `dest` (`str` or `Dataset`): Empty dataset or path where the new dataset will be created.
- `src` (`str` or `Dataset`): Path or dataset object to use as the template for the new dataset.
- `tensors` (`List[str]`, optional): Names of tensors (and groups) to be replicated. If not specified, all tensors in the source dataset are considered. Defaults to `None`.
- `overwrite` (`bool`, optional): If `True` and a dataset exists at `dest`, it will be overwritten. Defaults to `False`.
- `verbose` (`bool`, optional): If `True`, logs will be printed. Defaults to `True`.
### Returns

- `Dataset`: New dataset object with the same structure as the source.
### Examples

```python
import muller

# Create a new dataset with the same structure as an existing one
source_ds = muller.load("./datasets/source_dataset")
new_ds = muller.like(dest="./datasets/new_dataset", src=source_ds)

# Copy the structure from a path
new_ds = muller.like(dest="./datasets/new_dataset", src="./datasets/source_dataset")

# Copy only specific tensors
new_ds = muller.like(
    dest="./datasets/new_dataset",
    src="./datasets/source_dataset",
    tensors=["images", "labels"]
)

# Overwrite if the destination exists
new_ds = muller.like(
    dest="./datasets/new_dataset",
    src="./datasets/source_dataset",
    overwrite=True
)
```
## muller.delete()

### Overview

Deletes a dataset at the given path. This permanently removes all data and metadata associated with the dataset.
### Parameters

- `path` (`str` or `pathlib.Path`): The full path to the dataset to delete.
- `large_ok` (`bool`, optional): If `True`, allows deletion of large datasets. Defaults to `False`.
- `creds` (`dict` or `str`, optional): Credentials for the OBS service. Defaults to `None`.
### Returns

- `None`
### Examples

```python
import muller

# Delete a dataset
muller.delete("./datasets/old_dataset")

# Delete a large dataset
muller.delete("./datasets/large_dataset", large_ok=True)

# Delete a remote dataset
muller.delete("s3://mybucket/old_dataset", creds={"aws_access_key_id": "...", "aws_secret_access_key": "..."})
```
### Warning

This operation is irreversible. All data will be permanently deleted.
## muller.get_col_info()

### Overview

Gets column (tensor) information from a dataset without loading the entire dataset. This is useful for quickly inspecting dataset metadata.
### Parameters

- `path` (`str` or `pathlib.Path`): The full path to the dataset.
- `col_name` (`str`, optional): Name of the column (tensor) to get info for. If `None` or an empty string, returns dataset-level info. Defaults to `None`.
### Returns

- `bytes`: The raw metadata content in bytes.
### Examples

```python
import json
import muller

# Get dataset-level info
info = muller.get_col_info("./datasets/my_dataset")

# Get info for a specific tensor
tensor_info = muller.get_col_info("./datasets/my_dataset", col_name="images")

# Parse the info (it is JSON-encoded bytes)
parsed_info = json.loads(info)
print(parsed_info)
```
## muller.from_file()

### Overview

Creates a dataset from a file in JSON Lines format. This function reads data from the file and creates a muller dataset with the appropriate schema.
### Parameters

- `ori_path` (`str`): Path to the source file (JSON Lines format).
- `muller_path` (`str`): Path where the muller dataset will be created.
- `schema` (`dict`, optional): Schema definition for the dataset. If not provided, the schema is inferred from the first record. Defaults to `None`.
- `workers` (`int`, optional): Number of workers for parallel processing. Defaults to `0`.
- `scheduler` (`str`, optional): Scheduler type for compute operations. Defaults to `"processed"`.
- `disable_rechunk` (`bool`, optional): Whether to disable rechunking. Defaults to `True`.
- `progressbar` (`bool`, optional): Whether to show a progress bar. Defaults to `True`.
- `ignore_errors` (`bool`, optional): Whether to ignore errors during processing. Defaults to `True`.
- `split_tensor_meta` (`bool`, optional): If `True`, each tensor has its own `tensor_meta.json`. Defaults to `True`.
### Returns

- `Dataset`: The created dataset.
### Examples

```python
import muller

# Create a dataset from a JSON Lines file with an inferred schema
ds = muller.from_file(
    ori_path="./data/records.jsonl",
    muller_path="./datasets/my_dataset"
)

# Create with an explicit schema
schema = {
    "id": ("text", "str", None),
    "image": ("image", "uint8", "jpeg"),
    "label": ("class_label", "int32", None)
}
ds = muller.from_file(
    ori_path="./data/records.jsonl",
    muller_path="./datasets/my_dataset",
    schema=schema
)

# Use multiple workers for faster processing
ds = muller.from_file(
    ori_path="./data/large_records.jsonl",
    muller_path="./datasets/large_dataset",
    workers=4,
    progressbar=True
)

# Nested schema example
nested_schema = {
    "metadata": {
        "author": ("text", "str", None),
        "date": ("text", "str", None)
    },
    "content": ("text", "str", None)
}
ds = muller.from_file(
    ori_path="./data/nested_records.jsonl",
    muller_path="./datasets/nested_dataset",
    schema=nested_schema
)
```
### Notes

- The input file should be in JSON Lines format (one JSON object per line).
- Schema format: `{column_name: (htype, dtype, sample_compression)}`, or nested dictionaries.
- If no schema is provided, it is inferred from the first record in the file.
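As a quick illustration of the JSON Lines input format that `from_file()` expects, the standard-library snippet below (no muller calls; the file path is an arbitrary example) writes a small `.jsonl` file and reads it back, one JSON object per line:

```python
import json
import tempfile
from pathlib import Path

# Each line of a JSON Lines file is one complete JSON object.
records = [
    {"id": "001", "text": "hello"},
    {"id": "002", "text": "world"},
]

path = Path(tempfile.mkdtemp()) / "records.jsonl"
with path.open("w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Reading it back: parse each line independently.
with path.open(encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

assert loaded == records
```

A file produced this way can then be passed to `from_file()` as `ori_path`.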
## muller.from_dataframes()

### Overview

Creates a dataset from a list of dataframes (dictionaries). This is useful for creating datasets from in-memory data structures.
### Parameters

- `dataframes` (`list`): List of dataframes (dicts) to import.
- `muller_path` (`str`): Path where the muller dataset will be created.
- `schema` (`dict`, optional): Schema definition for the dataset. If not provided, the schema is inferred from the first record. Defaults to `None`.
- `workers` (`int`, optional): Number of workers for parallel processing. Defaults to `0`.
- `scheduler` (`str`, optional): Scheduler type for compute operations. Defaults to `"processed"`.
- `disable_rechunk` (`bool`, optional): Whether to disable rechunking. Defaults to `True`.
- `progressbar` (`bool`, optional): Whether to show a progress bar. Defaults to `True`.
- `ignore_errors` (`bool`, optional): Whether to ignore errors during processing. Defaults to `True`.
- `split_tensor_meta` (`bool`, optional): If `True`, each tensor has its own `tensor_meta.json`. Defaults to `True`.
### Returns

- `Dataset`: The created dataset.
### Examples

```python
import muller

# Create a dataset from a list of dictionaries
data = [
    {"id": "001", "value": 10, "label": "A"},
    {"id": "002", "value": 20, "label": "B"},
    {"id": "003", "value": 30, "label": "C"}
]
ds = muller.from_dataframes(
    dataframes=data,
    muller_path="./datasets/my_dataset"
)

# Create with an explicit schema
schema = {
    "id": ("text", "str", None),
    "value": ("generic", "int32", None),
    "label": ("class_label", "str", None)
}
ds = muller.from_dataframes(
    dataframes=data,
    muller_path="./datasets/my_dataset",
    schema=schema
)

# Use multiple workers for large datasets
large_data = [{"col1": i, "col2": i * 2} for i in range(10000)]
ds = muller.from_dataframes(
    dataframes=large_data,
    muller_path="./datasets/large_dataset",
    workers=4,
    progressbar=True
)

# Nested data example
nested_data = [
    {
        "user": {"name": "Alice", "age": 30},
        "score": 95
    },
    {
        "user": {"name": "Bob", "age": 25},
        "score": 87
    }
]
nested_schema = {
    "user": {
        "name": ("text", "str", None),
        "age": ("generic", "int32", None)
    },
    "score": ("generic", "int32", None)
}
ds = muller.from_dataframes(
    dataframes=nested_data,
    muller_path="./datasets/nested_dataset",
    schema=nested_schema
)
```
### Notes

- Each item in the `dataframes` list should be a dictionary representing one record.
- Schema format: `{column_name: (htype, dtype, sample_compression)}`, or nested dictionaries.
- If no schema is provided, it is inferred from the first record.
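To make the `{column_name: (htype, dtype, sample_compression)}` shape concrete, here is a naive, hypothetical inference sketch over one record. The htype/dtype choices and the `naive_schema` helper are illustrative assumptions, not muller's actual inference logic; they only demonstrate how nested dictionaries mirror nested records:

```python
def naive_schema(record):
    """Hypothetical sketch: map one record to {name: (htype, dtype, compression)}.
    Nested dicts recurse, mirroring the nested-schema format shown above."""
    schema = {}
    for name, value in record.items():
        if isinstance(value, dict):
            schema[name] = naive_schema(value)      # nested dictionaries
        elif isinstance(value, str):
            schema[name] = ("text", "str", None)    # assumed mapping for strings
        elif isinstance(value, int):
            schema[name] = ("generic", "int32", None)  # assumed mapping for ints
        else:
            schema[name] = ("generic", None, None)  # fallback: no dtype guess
    return schema

record = {"user": {"name": "Alice", "age": 30}, "score": 95}
schema = naive_schema(record)
assert schema["score"] == ("generic", "int32", None)
assert schema["user"]["name"] == ("text", "str", None)
```

A dict built this way has the same shape as the explicit `nested_schema` passed to `from_dataframes()` in the examples above.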