Dataset Export Methods
This page documents the methods available for exporting a dataset to other formats.
Table of Contents
- ds.to_dataframe()
- ds.to_json()
- ds.to_arrow()
- ds.to_mindrecord()
- ds.write_to_parquet()
- Comparison of Export Formats
- Choosing the Right Format
ds.to_dataframe()
Overview
Convert the dataset to a pandas DataFrame. This is useful for data analysis and integration with pandas-based workflows.
Parameters
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- force (bool, optional): If True, exports the dataset regardless of its size. Datasets with more than TO_DATAFRAME_SAFE_LIMIT samples might take a long time to export. Defaults to False.
Returns
- pandas.DataFrame: The dataset as a pandas DataFrame.
Examples
import muller
ds = muller.load("./my_dataset")
# Export entire dataset to DataFrame
df = ds.to_dataframe()
print(df.head())
# Export specific tensors
df = ds.to_dataframe(tensor_list=["images", "labels"])
# Export specific rows
df = ds.to_dataframe(index_list=[0, 1, 2, 10, 20])
# Export specific tensors and rows
df = ds.to_dataframe(
tensor_list=["labels", "categories"],
index_list=[1, 2, 4, 8, 16]
)
# Export last few samples
df = ds.to_dataframe(index_list=[-1, -2, -3])
# Force export of large dataset
df = ds.to_dataframe(force=True)
# Use DataFrame for analysis
df = ds.to_dataframe()
print(df.describe())
print(df["labels"].value_counts())
Notes
- For large datasets, consider using index_list to export in batches, as in the sketch below.
- Images and other large binary data will be represented as arrays in the DataFrame.
- Use force=True carefully with large datasets, as it may consume significant memory.
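A minimal sketch of that batching pattern (assuming the dataset exposes its row count via len(ds), which is not documented on this page):
# Hypothetical batched export to bound memory use; batch_size is illustrative
batch_size = 1000
for start in range(0, len(ds), batch_size):
    indices = list(range(start, min(start + batch_size, len(ds))))
    df = ds.to_dataframe(index_list=indices)
    # process or persist each chunk here before loading the next one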
ds.to_json()
Overview
Export the dataset to JSON format. Depending on whether a path is provided, this either writes a JSON file or returns the JSON data as a string.
Parameters
- path (str, optional): Path where the JSON file will be saved. If not provided, the JSON string is returned instead. Defaults to None.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- indent (int, optional): Number of spaces used for JSON indentation. Defaults to 2.
Returns
- str or None: The JSON string if path is not provided; otherwise None (the data is written to file).
Examples
import muller
ds = muller.load("./my_dataset")
# Export to JSON file
ds.to_json("./output/dataset.json")
# Export specific tensors
ds.to_json("./output/labels_only.json", tensor_list=["labels"])
# Export specific samples
ds.to_json("./output/sample_subset.json", index_list=[0, 1, 2, 3, 4])
# Get JSON string without saving
json_str = ds.to_json()
print(json_str)
# Export with custom indentation
ds.to_json("./output/dataset.json", indent=4)
# Export filtered view
filtered = ds.filter("labels == 5")
filtered.to_json("./output/label_5_samples.json")
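Because the return value is a plain JSON string, it can be parsed back with Python's standard json module; a small sketch (the exact structure of the exported JSON is not specified here):
import json
json_str = ds.to_json()
# Parse the exported string back into Python objects for inspection
data = json.loads(json_str)
print(type(data))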
ds.to_arrow()
Overview
Convert the dataset to Apache Arrow format. This is useful for interoperability with Arrow-based tools and efficient data transfer.
Parameters
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
Returns
- pyarrow.Table: The dataset as an Arrow Table.
Examples
import muller
ds = muller.load("./my_dataset")
# Convert to Arrow Table
arrow_table = ds.to_arrow()
print(arrow_table.schema)
# Export specific tensors
arrow_table = ds.to_arrow(tensor_list=["labels", "features"])
# Export specific samples
arrow_table = ds.to_arrow(index_list=list(range(100)))
# Write to Parquet using Arrow
arrow_table = ds.to_arrow()
import pyarrow.parquet as pq
pq.write_table(arrow_table, "./output/dataset.parquet")
# Convert to pandas via Arrow
arrow_table = ds.to_arrow()
df = arrow_table.to_pandas()
# Use with Arrow datasets
arrow_table = ds.to_arrow()
import pyarrow.dataset as ds_arrow
ds_arrow.write_dataset(arrow_table, "./output/arrow_dataset", format="parquet")
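The returned Table also works directly with pyarrow.compute for in-memory filtering; a sketch, assuming the dataset has a "labels" column (the column name is illustrative):
import pyarrow.compute as pc
arrow_table = ds.to_arrow()
# Build a boolean mask over the "labels" column and keep only matching rows
mask = pc.equal(arrow_table["labels"], 5)
subset = arrow_table.filter(mask)
print(subset.num_rows)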
ds.to_mindrecord()
Overview
Export the dataset to MindRecord format, which is used by the MindSpore framework. This is useful for training models with MindSpore.
Parameters
- path (str): Path where the MindRecord files will be saved.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- num_shards (int, optional): Number of MindRecord shards to create. Defaults to 1.
- overwrite (bool, optional): If True, overwrites existing files. Defaults to False.
Returns
- None
Examples
import muller
ds = muller.load("./my_dataset")
# Export to MindRecord
ds.to_mindrecord("./output/dataset.mindrecord")
# Export with multiple shards
ds.to_mindrecord("./output/dataset.mindrecord", num_shards=8)
# Export specific tensors
ds.to_mindrecord(
"./output/images_labels.mindrecord",
tensor_list=["images", "labels"]
)
# Export subset of data
ds.to_mindrecord(
"./output/train_subset.mindrecord",
    index_list=list(range(1000))
)
# Overwrite existing files
ds.to_mindrecord(
"./output/dataset.mindrecord",
overwrite=True
)
# Export filtered view
train_ds = ds.filter("split == 'train'")
train_ds.to_mindrecord("./output/train.mindrecord", num_shards=4)
Notes
- MindRecord format is optimized for MindSpore training workflows.
- Multiple shards can improve parallel data loading performance.
- Requires MindSpore to be installed; see the read-back sketch below.
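One way to verify an export is to read the shards back with MindSpore's own reader; a sketch using mindspore.dataset.MindDataset (paths match the examples above):
import mindspore.dataset as msds
# MindDataset accepts a single MindRecord file or a list of shard files
reader = msds.MindDataset(dataset_files="./output/dataset.mindrecord")
for item in reader.create_dict_iterator(output_numpy=True):
    print(item.keys())  # column names recovered from the export
    break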
ds.write_to_parquet()
Overview
Write the dataset to Parquet format. Parquet is a columnar storage format that is efficient for analytics and widely supported.
Parameters
- path (str): Path where the Parquet file(s) will be saved.
- tensor_list (List[str], optional): The tensor columns to export. If not provided, all tensors will be exported. Defaults to None.
- index_list (List[int], optional): The indices of the rows to export. If not provided, all rows will be exported. Defaults to None.
- compression (str, optional): Compression codec to use (e.g., "snappy", "gzip", "brotli"). Defaults to "snappy".
- row_group_size (int, optional): Number of rows per row group. Defaults to None (chosen automatically).
Returns
- None
Examples
import muller
ds = muller.load("./my_dataset")
# Write to Parquet
ds.write_to_parquet("./output/dataset.parquet")
# Write with specific compression
ds.write_to_parquet("./output/dataset.parquet", compression="gzip")
# Write specific tensors
ds.write_to_parquet(
"./output/labels_only.parquet",
tensor_list=["labels", "categories"]
)
# Write subset of data
ds.write_to_parquet(
"./output/sample_subset.parquet",
    index_list=list(range(1000))
)
# Write with custom row group size
ds.write_to_parquet(
"./output/dataset.parquet",
row_group_size=10000
)
# Write filtered view
filtered = ds.filter("score > 80")
filtered.write_to_parquet("./output/high_scores.parquet")
# Write multiple partitions
train_ds = ds.filter("split == 'train'")
test_ds = ds.filter("split == 'test'")
train_ds.write_to_parquet("./output/train.parquet")
test_ds.write_to_parquet("./output/test.parquet")
Notes
- Parquet format is highly efficient for columnar data access.
- Compression reduces file size but may increase read/write time.
- Parquet files can be read by many tools, including pandas, Spark, and DuckDB; see the sketch below.
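As a quick round-trip check, the exported file can be read back with pandas or queried in place with DuckDB; a sketch assuming both packages are installed:
import pandas as pd
import duckdb
# pandas loads the whole file into a DataFrame
df = pd.read_parquet("./output/dataset.parquet")
# DuckDB can run SQL directly against the Parquet file without loading it fully
count = duckdb.sql("SELECT COUNT(*) FROM './output/dataset.parquet'").fetchall()
print(len(df), count)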
Comparison of Export Formats
| Format | Use Case | Pros | Cons |
|---|---|---|---|
| DataFrame | Data analysis, pandas workflows | Easy to use, familiar API | Memory intensive for large datasets |
| JSON | Data interchange, human-readable | Universal format, readable | Large file size, slower parsing |
| Arrow | Interoperability, efficient transfer | Fast, zero-copy, language-agnostic | Requires Arrow ecosystem |
| MindRecord | MindSpore training | Optimized for MindSpore | MindSpore-specific |
| Parquet | Analytics, data warehousing | Efficient, columnar, widely supported | Not human-readable |
Choosing the Right Format
- Use to_dataframe() for quick analysis and pandas integration
- Use to_json() for data interchange and human readability
- Use to_arrow() for efficient data transfer and Arrow ecosystem integration
- Use to_mindrecord() for MindSpore model training
- Use write_to_parquet() for efficient storage and analytics workflows