What is the Dataset Lifecycle Framework?
Dataset Lifecycle Framework (DLF) [7] is a Kubernetes framework which
enables users to access remote data sources via a mount point within their
containerized workloads.
What is Datashim?
It is basically a pointer to the S3 data source
Datashim is a Kubernetes Framework to provide easy access to S3 and NFS
Datasets within pods. It orchestrates the provisioning of Persistent
Volume Claims and ConfigMaps needed for each Dataset.
Datashim introduces the Dataset CRD which is a pointer to existing S3 and
NFS data sources. It includes the necessary logic to map these Datasets into
Persistent Volume Claims and ConfigMaps which users can reference in their pods,
letting them focus on the workload development and not on
configuring/mounting/tuning the data access.

It even has caching!
Paper Notes
Datashim is a cloud-ready software framework targeting Kubernetes, aiming to abstract low-level, storage-related details from the user and allowing potentially any storage solution to be attached to a containerized application via a common interface. So far, Kubernetes users have had to deal with each storage solution they need in a separate way that is often customized to reflect the underlying storage infrastructure.

By using Datashim, users no longer access a specific NFS share or S3 bucket; they
instead request access to a specific dataset that they know (or are told by the administrator) contains their data. Datashim takes care of presenting the data the way the user wants: as a POSIX mount point, or as credentials injected as environment variables.
Attachment of datasets to pods is managed by a component called the admission controller, which continuously monitors newly created pods for dataset-related information.
The operator constantly monitors events on datasets and establishes the necessary connections whenever a matching reference is found (i.e., when a pod references an existing dataset). It relies on the Container Storage Interface (CSI) to implement the actual connection with the specific storage solution backing the datasets.
This allows Datashim to handle datasets for virtually any storage solution that provides a CSI plug-in implementation. A CSI plug-in instructs Kubernetes on how containers can physically access the storage medium and whether it supports dynamic provisioning (i.e., the automated creation of volumes dedicated to specific pods).
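As a sketch of how this looks in practice (following the labelling convention described in the Datashim documentation; the dataset name example-dataset is just a placeholder), the admission controller picks up labels like these on a new pod and wires in the corresponding volume:
apiVersion: v1
kind: Pod
metadata:
  name: dataset-labelled-pod
  labels:
    # tell Datashim which dataset this pod needs and how to expose it
    dataset.0.id: "example-dataset"
    dataset.0.useas: "mount"
spec:
  containers:
    - name: app
      image: alpine
      command: ["sleep", "infinity"]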
Installing using Helm (Broken right now)
Add the Datashim Helm repository with the following commands:
helm repo add datashim https://datashim-io.github.io/datashim/
helm repo update
Once this has completed, run:
helm search repo datashim --versions
to verify that the repository has been added correctly. You can now install
Datashim with:
helm install datashim datashim/datashim-charts
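If the install does succeed, a quick sanity check (a sketch; the release name datashim matches the install command above) is to confirm the release and the Dataset CRD are present:
# check the Helm release status
helm status datashim
# the Dataset CRD should now be registered
kubectl get crd | grep datashim.io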
Installing using K8s manifests
Start by creating the dlf namespace with:
kubectl create ns dlf
Then create the required pods in your local Kubernetes cluster:
kubectl apply -f https://raw.githubusercontent.com/datashim-io/datashim/master/release-tools/manifests/dlf.yaml
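Before creating any Datasets, it is worth waiting until the Datashim pods in the dlf namespace are ready, for example:
kubectl get pods -n dlf
# wait for the operator and CSI pods to become Ready
kubectl wait --for=condition=Ready pods --all -n dlf --timeout=300s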
Creating the credentials
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: ibm-hpc
stringData:
  accessKeyID: ACCESS_KEY_ID
  secretAccessKey: SECRET_ACCESSKEY
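Assuming the Secret above is saved in a file such as minio-credentials.yaml (the filename is arbitrary), apply and verify it with:
kubectl apply -f minio-credentials.yaml
kubectl get secret minio-credentials -n ibm-hpc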
Creating the CRD Dataset
A dataset specification contains all the details needed by Datashim for setting up data access: a reference to a storage source (potentially populated with pre-existing data), credentials whenever needed, and some access control information (some datasets can be read only). Even though a dataset can be created by a user, the preferred embodiment envisioned is for cluster administrators to create them and give access to specific users.
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: minio-dataset
  namespace: ibm-hpc
spec:
  local:
    readonly: "false"
    bucket: clowder
    endpoint: http://clowder2-minio.ibm-hpc.svc.cluster.local:9000/
    secret-name: minio-credentials
    type: COS
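Assuming the spec above is saved as minio-dataset.yaml, applying it and watching for the PVC that Datashim generates (it gets the same name as the Dataset) looks roughly like this:
kubectl apply -f minio-dataset.yaml
kubectl get dataset minio-dataset -n ibm-hpc
# the PVC should eventually show STATUS Bound
kubectl get pvc minio-dataset -n ibm-hpc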
The toughest part of this was figuring out the endpoint. It follows the format
http://<service-name>.<namespace>.svc.cluster.local:<port>
(for example, http://minio-service.default.svc.cluster.local:9000). In this case the service name is clowder2-minio, the namespace is ibm-hpc, and the port is 9000. Once we do this, a PVC with the same name as the Dataset is created, and when you see it bound, you can attach it to a container like this:
apiVersion: v1
kind: Pod
metadata:
  name: minio-test-pod
spec:
  containers:
    - name: minio-test-pod
      image: alpine
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: minio-dataset
          mountPath: /clowderfs
  volumes:
    - name: "minio-dataset"
      persistentVolumeClaim:
        claimName: "minio-dataset"
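Assuming the pod manifest is saved as minio-test-pod.yaml and created in the same ibm-hpc namespace as the PVC, a quick way to confirm the mount works is to list the bucket contents from inside the container:
kubectl apply -f minio-test-pod.yaml -n ibm-hpc
# list the mounted bucket contents
kubectl exec -it minio-test-pod -n ibm-hpc -- ls /clowderfs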
Notes in relation to Clowder
I had a tough time finding the endpoint, and even with the correct one it still did not work: because we are using Clowder, a bucket is not created until a file is uploaded, which is something that causes mounting issues here.
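One possible workaround (a sketch, assuming the MinIO client mc can reach the endpoint, e.g. from a pod inside the cluster or via port-forwarding, and using the credentials from above) is to create the bucket manually before mounting, or simply to upload a first file through Clowder so the bucket exists:
# register the MinIO endpoint under an alias (placeholder credentials)
mc alias set clowder-minio http://clowder2-minio.ibm-hpc.svc.cluster.local:9000 ACCESS_KEY_ID SECRET_ACCESSKEY
# create the bucket the Dataset points at
mc mb clowder-minio/clowder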
For running Datashim with an extractor
extractors:
  wordcount:
    enabled: true
    image: clowder/extractors-wordcount:latest
  rcnn-iwp-inference:
    enabled: true
    image: vismayak/rcnn_iwp_inference_extractor_k8:latest
    env:
      - name: MINIO_MOUNTED_PATH
        value: /clowderfs
    volumes:
      - name: minio-dataset
        persistentVolumeClaim:
          claimName: minio-dataset
    volumeMounts:
      - name: minio-dataset
        mountPath: /clowderfs
The volumes section defines the storage volumes that will be available to the pod, so here:
- name: minio-dataset: assigns a name to this volume that will be used to reference it within the pod
- persistentVolumeClaim: specifies that this volume is backed by a Persistent Volume Claim (PVC)
- claimName: minio-dataset: the name of the PVC that should be used (it must already exist in the same namespace)
The volumeMounts section defines how the volumes are mounted inside the container.
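To confirm the extractor actually sees the dataset, one can exec into its pod (the pod name below is a placeholder; look it up first) and inspect the path given by MINIO_MOUNTED_PATH:
kubectl get pods -n ibm-hpc
# replace the placeholder with the actual extractor pod name
kubectl exec -it <rcnn-iwp-inference-pod> -n ibm-hpc -- sh -c 'echo $MINIO_MOUNTED_PATH && ls $MINIO_MOUNTED_PATH'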