Dataset Export Methods
This page documents the methods available for exporting a dataset to other formats.
Table of Contents
- ds.to_dataframe()
- ds.to_json()
- ds.to_arrow()
- ds.to_mindrecord()
- ds.write_to_parquet()
- Comparison of Export Formats
- Choosing the Right Format
ds.to_dataframe()
Overview
Convert the dataset to a pandas DataFrame. This is useful for data analysis and integration with pandas-based workflows.
Parameters
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- force (bool, optional): If True, exports the dataset regardless of its size. Datasets with more than TO_DATAFRAME_SAFE_LIMIT samples might take a long time to export. Defaults to False.
Returns
- pandas.DataFrame: The dataset as a pandas DataFrame.
Examples
import muller
ds = muller.load("./my_dataset")
# Export entire dataset to DataFrame
df = ds.to_dataframe()
print(df.head())
# Export specific tensors
df = ds.to_dataframe(tensor_list=["images", "labels"])
# Export specific rows
df = ds.to_dataframe(index_list=[0, 1, 2, 10, 20])
# Export specific tensors and rows
df = ds.to_dataframe(
tensor_list=["labels", "categories"],
index_list=[1, 2, 4, 8, 16]
)
# Export last few samples
df = ds.to_dataframe(index_list=[-1, -2, -3])
# Force export of large dataset
df = ds.to_dataframe(force=True)
# Use DataFrame for analysis
df = ds.to_dataframe()
print(df.describe())
print(df["labels"].value_counts())
Notes
- For large datasets, consider using index_list to export in batches, as in the sketch below.
- Images and other large binary data will be represented as arrays in the DataFrame.
- Use force=True carefully with large datasets, as it may consume significant memory.
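A minimal sketch of that batching pattern (assuming the dataset exposes its row count via len(ds), which is not documented on this page):
# Hypothetical batched export to bound memory use; batch_size is illustrative
batch_size = 1000
for start in range(0, len(ds), batch_size):
    indices = list(range(start, min(start + batch_size, len(ds))))
    df = ds.to_dataframe(index_list=indices)
    # process or persist each chunk here before loading the next one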
ds.to_json()
Overview
Export the dataset to JSON format. Depending on whether a path is provided, this either writes a JSON file or returns the JSON data as a string.
Parameters
- path (str, optional): Path where the JSON file will be saved. If not provided, the JSON string is returned instead. Defaults to None.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- indent (int, optional): Number of spaces used for JSON indentation. Defaults to 2.
Returns
- str or None: The JSON string if path is not provided; otherwise None (the data is written to file).
Examples
import muller
ds = muller.load("./my_dataset")
# Export to JSON file
ds.to_json("./output/dataset.json")
# Export specific tensors
ds.to_json("./output/labels_only.json", tensor_list=["labels"])
# Export specific samples
ds.to_json("./output/sample_subset.json", index_list=[0, 1, 2, 3, 4])
# Get JSON string without saving
json_str = ds.to_json()
print(json_str)
# Export with custom indentation
ds.to_json("./output/dataset.json", indent=4)
# Export filtered view
filtered = ds.filter("labels == 5")
filtered.to_json("./output/label_5_samples.json")
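Because the return value is a plain JSON string, it can be parsed back with Python's standard json module; a small sketch (the exact structure of the exported JSON is not specified here):
import json
json_str = ds.to_json()
# Parse the exported string back into Python objects for inspection
data = json.loads(json_str)
print(type(data))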
ds.to_arrow()
Overview
Convert the dataset to Apache Arrow format. This is useful for interoperability with Arrow-based tools and efficient data transfer.
Parameters
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
Returns
- pyarrow.Table: The dataset as an Arrow Table.
Examples
import muller
ds = muller.load("./my_dataset")
# Convert to Arrow Table
arrow_table = ds.to_arrow()
print(arrow_table.schema)
# Export specific tensors
arrow_table = ds.to_arrow(tensor_list=["labels", "features"])
# Export specific samples
arrow_table = ds.to_arrow(index_list=list(range(100)))
# Write to Parquet using Arrow
arrow_table = ds.to_arrow()
import pyarrow.parquet as pq
pq.write_table(arrow_table, "./output/dataset.parquet")
# Convert to pandas via Arrow
arrow_table = ds.to_arrow()
df = arrow_table.to_pandas()
# Use with Arrow datasets
arrow_table = ds.to_arrow()
import pyarrow.dataset as ds_arrow
ds_arrow.write_dataset(arrow_table, "./output/arrow_dataset", format="parquet")
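The returned Table also works directly with pyarrow.compute for in-memory filtering; a sketch, assuming the dataset has a "labels" column (the column name is illustrative):
import pyarrow.compute as pc
arrow_table = ds.to_arrow()
# Build a boolean mask over the "labels" column and keep only matching rows
mask = pc.equal(arrow_table["labels"], 5)
subset = arrow_table.filter(mask)
print(subset.num_rows)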
ds.to_mindrecord()
Overview
Export the dataset to MindRecord format, which is used by the MindSpore framework. This is useful for training models with MindSpore.
Parameters
- path (str): Path where the MindRecord files will be saved.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- num_shards (int, optional): Number of MindRecord shards to create. Defaults to 1.
- overwrite (bool, optional): If True, overwrites existing files. Defaults to False.
Returns
- None
Examples
import muller
ds = muller.load("./my_dataset")
# Export to MindRecord
ds.to_mindrecord("./output/dataset.mindrecord")
# Export with multiple shards
ds.to_mindrecord("./output/dataset.mindrecord", num_shards=8)
# Export specific tensors
ds.to_mindrecord(
"./output/images_labels.mindrecord",
tensor_list=["images", "labels"]
)
# Export subset of data
ds.to_mindrecord(
"./output/train_subset.mindrecord",
    index_list=list(range(1000))
)
# Overwrite existing files
ds.to_mindrecord(
"./output/dataset.mindrecord",
overwrite=True
)
# Export filtered view
train_ds = ds.filter("split == 'train'")
train_ds.to_mindrecord("./output/train.mindrecord", num_shards=4)
Notes
- MindRecord format is optimized for MindSpore training workflows.
- Multiple shards can improve parallel data loading performance.
- Requires MindSpore to be installed; see the read-back sketch below.
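One way to verify an export is to read the shards back with MindSpore's own reader; a sketch using mindspore.dataset.MindDataset (paths match the examples above):
import mindspore.dataset as msds
# MindDataset accepts a single MindRecord file or a list of shard files
reader = msds.MindDataset(dataset_files="./output/dataset.mindrecord")
for item in reader.create_dict_iterator(output_numpy=True):
    print(item.keys())  # column names recovered from the export
    break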
ds.write_to_parquet()
Overview
Write the dataset to Parquet format. Parquet is a columnar storage format that is efficient for analytics and widely supported.
Parameters
- path (str): Path where the Parquet file(s) will be saved.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- compression (str, optional): Compression codec to use (e.g., "snappy", "gzip", "brotli"). Defaults to "snappy".
- row_group_size (int, optional): Number of rows per row group. Defaults to None (chosen automatically).
Returns
- None
Examples
import muller
ds = muller.load("./my_dataset")
# Write to Parquet
ds.write_to_parquet("./output/dataset.parquet")
# Write with specific compression
ds.write_to_parquet("./output/dataset.parquet", compression="gzip")
# Write specific tensors
ds.write_to_parquet(
"./output/labels_only.parquet",
tensor_list=["labels", "categories"]
)
# Write subset of data
ds.write_to_parquet(
"./output/sample_subset.parquet",
    index_list=list(range(1000))
)
# Write with custom row group size
ds.write_to_parquet(
"./output/dataset.parquet",
row_group_size=10000
)
# Write filtered view
filtered = ds.filter("score > 80")
filtered.write_to_parquet("./output/high_scores.parquet")
# Write multiple partitions
train_ds = ds.filter("split == 'train'")
test_ds = ds.filter("split == 'test'")
train_ds.write_to_parquet("./output/train.parquet")
test_ds.write_to_parquet("./output/test.parquet")
Notes
- Parquet format is highly efficient for columnar data access.
- Compression reduces file size but may increase read/write time.
- Parquet files can be read by many tools, including pandas, Spark, and DuckDB; see the sketch below.
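As a quick round-trip check, the exported file can be read back with pandas or queried in place with DuckDB; a sketch assuming both packages are installed:
import pandas as pd
import duckdb
# pandas loads the whole file into a DataFrame
df = pd.read_parquet("./output/dataset.parquet")
# DuckDB can run SQL directly against the Parquet file without loading it fully
count = duckdb.sql("SELECT COUNT(*) FROM './output/dataset.parquet'").fetchall()
print(len(df), count)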
Comparison of Export Formats
| Format | Use Case | Pros | Cons |
|---|---|---|---|
| DataFrame | Data analysis, pandas workflows | Easy to use, familiar API | Memory intensive for large datasets |
| JSON | Data interchange, human-readable | Universal format, readable | Large file size, slower parsing |
| Arrow | Interoperability, efficient transfer | Fast, zero-copy, language-agnostic | Requires Arrow ecosystem |
| MindRecord | MindSpore training | Optimized for MindSpore | MindSpore-specific |
| Parquet | Analytics, data warehousing | Efficient, columnar, widely supported | Not human-readable |
Choosing the Right Format
- Use to_dataframe() for quick analysis and pandas integration
- Use to_json() for data interchange and human readability
- Use to_arrow() for efficient data transfer and Arrow ecosystem integration
- Use to_mindrecord() for MindSpore model training
- Use write_to_parquet() for efficient storage and analytics workflows