Croissant - Summary
A lot of time is lost trying to understand a dataset: its data, its organization, and its features. Existing general-purpose formats for datasets such as schema.org and DCAT were designed more for data discovery than for the specific needs of ML (the ability to extract and combine data from structured and unstructured sources, to include metadata that enables responsible use of the data, or to describe ML usage characteristics such as defining training, test, and validation sets).
Croissant does not change how data is represented; it provides a standard way to describe and organize it. It builds upon schema.org, the de facto standard for dataset metadata on the Web, adding layers for ML-relevant metadata, data resources, data organization, and default ML semantics.
Croissant currently comes with a complete specification of the format, a set of example datasets, a Python library (mlcroissant) to validate, consume, and generate Croissant metadata, and an open-source visual editor to load, inspect, and create Croissant dataset descriptions. The editor appears to be built with Streamlit, although it is somewhat broken: when running the Docker image, I got an error after opening the "create metadata" project, though I do not get this error in the online editor.
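As a quick sketch of consuming metadata with that Python library, assuming mlcroissant's documented Dataset API (the MNIST metadata URL is the same one used in the TFDS example further down):

    import mlcroissant as mlc

    # Loading a Croissant JSON-LD file; as I understand it, the constructor
    # also validates the metadata and complains if it is malformed.
    dataset = mlc.Dataset(
        jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json"
    )

    # Inspect dataset-level metadata (name, description, license, ...).
    metadata = dataset.metadata.to_json()
    print(metadata["name"])
    print(metadata.get("license"))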

The benefits of a shared format like Croissant lie in simplifying the iterative, data-centric process of ML development—finding, cleaning, training, testing, and refining models—by reducing the “data development burden,” especially for resource-limited research and startups. A standardized format enhances dataset discoverability through metadata, streamlines data organization for easier preprocessing and analysis, and enables ML frameworks to use data with minimal code. For dataset authors, adopting Croissant increases dataset value and usability with minimal effort, supported by creation tools and ML data platforms.
Croissant datasets can be found on HuggingFace, Kaggle, OpenML, and Google Dataset Search. We can ingest the data easily via TensorFlow Datasets for use in popular ML frameworks like TensorFlow, PyTorch, and JAX. They say we can inspect and modify the metadata using the Croissant editor UI (I have not tested this extensively, but it is somewhat broken).
To publish a Croissant dataset, users can:
- Use the Croissant editor UI (on GitHub) to generate a large portion of Croissant metadata automatically by analyzing the data the user provides, and to fill in important metadata fields such as RAI properties.
- Publish the Croissant information as part of their dataset Web page to make it discoverable and reusable.
- Publish their data in one of the repositories that support Croissant, such as Kaggle, HuggingFace and OpenML, and automatically generate Croissant metadata.
TensorFlow Integration
From the TensorFlow documentation:
A CroissantBuilder defines a TFDS dataset based on a Croissant 🥐 metadata file; each of the record_set_ids specified will result in a separate ConfigBuilder. For example, to initialize a CroissantBuilder for the MNIST dataset using its Croissant 🥐 definition:

    import tensorflow_datasets as tfds

    builder = tfds.dataset_builders.CroissantBuilder(
        jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json",
        file_format='array_record',
    )
    builder.download_and_prepare()
    ds = builder.as_data_source()
    print(ds['default'][0])
What this means is that TensorFlow will be able to build a separate ConfigBuilder for each record set in the dataset (so in the case of MNIST, the different digits or the train/test/validation subsets). This enables TFDS to handle multiple record sets within a single dataset.
Need to check this, but the metadata points to a HuggingFace repo that stores the data, which TF will use for the download.
This helps convert any Croissant JSON-LD file into a TFDS dataset, making it possible to load the data into TensorFlow, JAX, and PyTorch.
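As an illustration of the PyTorch side (my own sketch, not from the TFDS docs): the data source returned by as_data_source() supports random access, so it can back a PyTorch DataLoader. The identity collate_fn is an assumption to keep each batch as a plain list of record dicts, since the raw Croissant records may not stack cleanly into tensors.

    import tensorflow_datasets as tfds
    from torch.utils.data import DataLoader

    builder = tfds.dataset_builders.CroissantBuilder(
        jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json",
        file_format='array_record',
    )
    builder.download_and_prepare()
    ds = builder.as_data_source()

    # ds['default'] supports len() and indexing, so DataLoader can sample from it.
    loader = DataLoader(ds['default'], batch_size=32, shuffle=True,
                        collate_fn=lambda batch: batch)

    batch = next(iter(loader))
    print(len(batch), type(batch[0]))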
Croissant Paper Takeaways
Apart from what has been mentioned above, here is a list of things the paper mentions:
- Developing open-source reference implementations for the Croissant format, loaders, and editor. Could these tools be integrated into Clowder?
- Another code block shows how TensorFlow is able to work with Croissant metadata to load a dataset.
- Croissant's support for Responsible AI (RAI) dataset documentation
Responsible AI (RAI) dataset documentation has become increasingly prevalent for making data transparent in terms of how it was created, what biases it reflects, and other considerations around data provenance (an example RAI property is included in the metadata sketch after this list).
- Croissant is organized around 4 layers (a minimal sketch of how these show up in a Croissant file follows this list):
- Dataset Metadata Layer
- Resources Layer
- Structure Layer
- Semantic Layer
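To make these layers concrete, here is a minimal hand-written sketch of a Croissant description, written as a Python dict for consistency with the other snippets. The field names follow my reading of the spec; the file names, URLs, and values are invented, and a real file would also need the full @context and checksums.

    # A minimal, illustrative Croissant description (not a real dataset).
    croissant = {
        "@type": "sc:Dataset",
        # Dataset Metadata Layer: general schema.org-style information.
        "name": "example-dataset",  # hypothetical
        "description": "A toy dataset used as an example.",
        "license": "https://creativecommons.org/licenses/by/4.0/",
        # RAI documentation can be attached here too (assumed property name from the RAI extension).
        "rai:dataCollection": "Collected from publicly available sources.",
        # Resources Layer: the files (or file sets) that make up the dataset.
        "distribution": [{
            "@type": "cr:FileObject",
            "@id": "data.csv",  # hypothetical file
            "contentUrl": "https://example.org/data.csv",
            "encodingFormat": "text/csv",
        }],
        # Structure Layer: record sets and fields extracted from the resources.
        "recordSet": [{
            "@type": "cr:RecordSet",
            "@id": "records",
            "field": [{
                "@type": "cr:Field",
                "@id": "records/label",
                # Semantic Layer: data types / ML semantics attached to fields.
                "dataType": "sc:Text",
                "source": {"fileObject": {"@id": "data.csv"},
                           "extract": {"column": "label"}},
            }],
        }],
    }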
Integration with Clowder
Not all data has to be stored in Clowder
A Croissant JSON-LD file could be stored in Clowder while the data itself lives elsewhere (e.g., on HuggingFace or in Parquet files). The JSON-LD file can be submitted to an extractor, which can then load the dataset from those external sources.
In the Colab notebook, for example, the metadata contained the URL of the Git repo storing the data, and the notebook was able to load the dataset using the mlcroissant Python library.
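For reference, a minimal sketch of that kind of loading with mlcroissant, assuming the record set is named "default" (consistent with the ds['default'] lookup in the TFDS snippet above):

    import mlcroissant as mlc

    dataset = mlc.Dataset(
        jsonld="https://raw.githubusercontent.com/mlcommons/croissant/main/datasets/0.8/huggingface-mnist/metadata.json"
    )

    # mlcroissant fetches the underlying files from wherever each contentUrl
    # points (here, a HuggingFace repo) and streams records back.
    for i, record in enumerate(dataset.records(record_set="default")):
        print(record)
        if i >= 2:
            break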

Clowder Download URLs in Croissant
This is conjecture on my part, but from what I see in the code, we could also store a Clowder download URL for a dataset in the Croissant metadata. A user could then download the data from Clowder through the Croissant description.
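A hedged sketch of what that could look like: a FileObject in the resources layer whose contentUrl is a Clowder file-download endpoint (the URL shape and id below are hypothetical, and the exact Clowder API route would need to be confirmed):

    # Hypothetical FileObject entry pointing at a file hosted in Clowder.
    clowder_file_object = {
        "@type": "cr:FileObject",
        "@id": "measurements.csv",  # hypothetical file name
        "contentUrl": "https://clowder.example.org/api/files/<file-id>/blob",  # hypothetical Clowder download URL
        "encodingFormat": "text/csv",
    }

Any Croissant-aware loader would then fetch the file from Clowder over HTTP like any other source; how authentication would work for private Clowder spaces is an open question.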
Croissant Editor Integration
Even in the paper, they mention that this format is not human-friendly. If we were to integrate Croissant, the first thing would be to integrate the built-in editor. It is created using Streamlit and I am not sure how sturdy it is, but we could run it in a Docker container and embed it in an iframe for use within Clowder?
Short term goal
Store Croissant JSON-LD files in Clowder and use them in extractors?
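A very rough sketch of what that could look like, assuming pyclowder's usual Extractor / process_message pattern and the mlcroissant API (the class name, record set handling, and logging are my own illustration, not an existing extractor):

    import logging

    import mlcroissant as mlc
    from pyclowder.extractors import Extractor


    class CroissantExtractor(Extractor):
        """Hypothetical extractor triggered on a Croissant JSON-LD file uploaded to Clowder."""

        def __init__(self):
            Extractor.__init__(self)
            self.setup()
            logging.getLogger("__main__").setLevel(logging.INFO)

        def process_message(self, connector, host, secret_key, resource, parameters):
            # Path of the uploaded JSON-LD file on the extractor's local disk.
            jsonld_path = resource["local_paths"][0]

            # Parse (and implicitly validate) the Croissant description; the actual
            # data may live outside Clowder (HuggingFace, Parquet files, etc.).
            dataset = mlc.Dataset(jsonld=jsonld_path)
            for record_set in dataset.metadata.record_sets:
                logging.info("Found record set: %s", record_set.name)
                # Downstream steps could iterate dataset.records(record_set=...) here.


    if __name__ == "__main__":
        CroissantExtractor().start()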
Long term goal
Kaggle and HuggingFace have APIs to generate Croissant metadata. For example, this page describes the HuggingFace API that generates it for their datasets. This would be a good long-term goal, but it is complicated and dependent on infrastructure. Since data stored in Parquet files needs to have metadata, HuggingFace seems to use this to enable generating the Croissant metadata automatically.
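To get a feel for the consumption side, here is a small sketch of fetching that generated metadata over HTTP. My understanding is that the endpoint looks roughly like the one below; the dataset id "mnist" is just an example and the exact route should be checked against the HuggingFace docs.

    import requests

    # Endpoint that (as I understand it) returns Croissant JSON-LD for a Hub
    # dataset; verify the exact route against the HuggingFace API docs.
    url = "https://huggingface.co/api/datasets/mnist/croissant"
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    croissant = response.json()
    print(croissant.get("name"))
    print(croissant.get("license"))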
I skimmed through HuggingFace's repository and found code relevant to creating this metadata.