What is the Dataset Lifecycle Framework?
Dataset Lifecycle Framework (DLF) [7] is a Kubernetes framework which
enables users to access remote data sources via a mount point within their
containerized workloads.
What is Datashim?
It is basically a pointer to the S3 data source
Datashim is a Kubernetes Framework to provide easy access to S3 and NFS
Datasets within pods. It orchestrates the provisioning of Persistent
Volume Claims and ConfigMaps needed for each Dataset.
Datashim introduces the Dataset CRD which is a pointer to existing S3 and
NFS data sources. It includes the necessary logic to map these Datasets into
Persistent Volume Claims and ConfigMaps which users can reference in their pods,
letting them focus on the workload development and not on
configuring/mounting/tuning the data access.

It even has caching!
Paper Notes
Datashim is a cloud-ready software framework targeting Kubernetes, aiming to abstract low-level, storage-related details from the user and allowing potentially any storage solution to be attached to a containerized application via a common interface. So far, Kubernetes users have had to deal with each storage solution they need in a separate way that is often customized to reflect the underlying storage infrastructure.

By using Datashim, users no longer access a specific NFS share or S3 bucket; they
instead request access to a specific dataset that they know (or are told by the administrator) contains their data. Datashim takes care of presenting the data the way the user wants: as a POSIX mount point, or as credentials injected as environment variables.
Attachment of datasets to pods is managed by a component called the admission controller, which continuously monitors newly created pods for dataset-related information.
The operator constantly monitors events on datasets and establishes the necessary connections whenever a matching reference is found (i.e., when a pod references an existing dataset). It relies on the Container Storage Interface (CSI) to implement the actual connection with the specific storage solution backing the datasets.
This allows Datashim to handle datasets for virtually any storage solution that provides a CSI plug-in implementation. A CSI plug-in instructs Kubernetes on how containers can physically access the storage medium and whether it supports dynamic provisioning (i.e., the automated creation of volumes dedicated to specific pods).
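As a sketch of how this looks in practice (following the labelling convention described in the Datashim documentation; the dataset name example-dataset is just a placeholder), the admission controller picks up labels like these on a new pod and wires in the corresponding volume:
apiVersion: v1
kind: Pod
metadata:
  name: dataset-labelled-pod
  labels:
    # tell Datashim which dataset this pod needs and how to expose it
    dataset.0.id: "example-dataset"
    dataset.0.useas: "mount"
spec:
  containers:
    - name: app
      image: alpine
      command: ["sleep", "infinity"]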
Installing using Helm (Broken right now)
Add the Datashim Helm repository with the following commands:
helm repo add datashim https://datashim-io.github.io/datashim/
helm repo update
Once this has completed, run:
helm search repo datashim --versions
to verify that the repository has been added correctly. You can now install
Datashim with:
helm install datashim datashim/datashim-charts
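If the install does succeed, a quick sanity check (a sketch; the release name datashim matches the install command above) is to confirm the release and the Dataset CRD are present:
# check the Helm release status
helm status datashim
# the Dataset CRD should now be registered
kubectl get crd | grep datashim.io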
Installing using K8s manifests
Start by creating the dlf namespace with:
kubectl create ns dlf
Then create the required pods in your local Kubernetes cluster:
kubectl apply -f https://raw.githubusercontent.com/datashim-io/datashim/master/release-tools/manifests/dlf.yaml
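Before creating any Datasets, it is worth waiting until the Datashim pods in the dlf namespace are ready, for example:
kubectl get pods -n dlf
# wait for the operator and CSI pods to become Ready
kubectl wait --for=condition=Ready pods --all -n dlf --timeout=300s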
Creating the credentials
apiVersion: v1
kind: Secret
metadata:
  name: minio-credentials
  namespace: ibm-hpc
stringData:
  accessKeyID: ACCESS_KEY_ID
  secretAccessKey: SECRET_ACCESSKEY
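Assuming the Secret above is saved in a file such as minio-credentials.yaml (the filename is arbitrary), apply and verify it with:
kubectl apply -f minio-credentials.yaml
kubectl get secret minio-credentials -n ibm-hpc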
Creating the CRD Dataset
A dataset specification contains all the details needed by Datashim for setting up data access: a reference to a storage source (potentially populated with pre-existing data), credentials whenever needed, and some access control information (some datasets can be read only). Even though a dataset can be created by a user, the preferred embodiment envisioned is for cluster administrators to create them and give access to specific users.
apiVersion: datashim.io/v1alpha1
kind: Dataset
metadata:
  name: minio-dataset
  namespace: ibm-hpc
spec:
  local:
    readonly: "false"
    bucket: clowder
    endpoint: http://clowder2-minio.ibm-hpc.svc.cluster.local:9000/
    secret-name: minio-credentials
    type: COS
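Assuming the spec above is saved as minio-dataset.yaml, applying it and watching for the PVC that Datashim generates (it gets the same name as the Dataset) looks roughly like this:
kubectl apply -f minio-dataset.yaml
kubectl get dataset minio-dataset -n ibm-hpc
# the PVC should eventually show STATUS Bound
kubectl get pvc minio-dataset -n ibm-hpc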
The toughest part of this was figuring out the endpoint. It follows the format
http://<service-name>.<namespace>.svc.cluster.local:<port>
(for example, http://minio-service.default.svc.cluster.local:9000). In this case the service name is clowder2-minio, the namespace is ibm-hpc, and the port is 9000. Once we do this, a PVC with the same name as the Dataset is created, and when you see it bound, you can attach it to a container like this:
apiVersion: v1
kind: Pod
metadata:
  name: minio-test-pod
spec:
  containers:
    - name: minio-test-pod
      image: alpine
      command: ["sleep", "infinity"]
      volumeMounts:
        - name: minio-dataset
          mountPath: /clowderfs
  volumes:
    - name: "minio-dataset"
      persistentVolumeClaim:
        claimName: "minio-dataset"
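Assuming the pod manifest is saved as minio-test-pod.yaml and created in the same ibm-hpc namespace as the PVC, a quick way to confirm the mount works is to list the bucket contents from inside the container:
kubectl apply -f minio-test-pod.yaml -n ibm-hpc
# list the mounted bucket contents
kubectl exec -it minio-test-pod -n ibm-hpc -- ls /clowderfs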
Notes in relation to Clowder
I had a tough time finding the endpoint, and even with the correct one it still did not work: because we are using Clowder, a bucket is not created until a file is uploaded, which is something that causes mounting issues here.
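One possible workaround (a sketch, assuming the MinIO client mc can reach the endpoint, e.g. from a pod inside the cluster or via port-forwarding, and using the credentials from above) is to create the bucket manually before mounting, or simply to upload a first file through Clowder so the bucket exists:
# register the MinIO endpoint under an alias (placeholder credentials)
mc alias set clowder-minio http://clowder2-minio.ibm-hpc.svc.cluster.local:9000 ACCESS_KEY_ID SECRET_ACCESSKEY
# create the bucket the Dataset points at
mc mb clowder-minio/clowder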
For running Datashim with an extractor
extractors:
  wordcount:
    enabled: true
    image: clowder/extractors-wordcount:latest
  rcnn-iwp-inference:
    enabled: true
    image: vismayak/rcnn_iwp_inference_extractor_k8:latest
    env:
      - name: MINIO_MOUNTED_PATH
        value: /clowderfs
    volumes:
      - name: minio-dataset
        persistentVolumeClaim:
          claimName: minio-dataset
    volumeMounts:
      - name: minio-dataset
        mountPath: /clowderfs
The volumes section defines the storage volumes that will be available to the pod, so here:
- name: minio-dataset: assigns a name to this volume that will be used to reference it within the pod
- persistentVolumeClaim: specifies that this volume is backed by a Persistent Volume Claim (PVC)
- claimName: minio-dataset: the name of the PVC that should be used (it must already exist in the same namespace)
The volumeMounts section defines how the volumes are mounted inside the container.
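To confirm the extractor actually sees the dataset, one can exec into its pod (the pod name below is a placeholder; look it up first) and inspect the path given by MINIO_MOUNTED_PATH:
kubectl get pods -n ibm-hpc
# replace the placeholder with the actual extractor pod name
kubectl exec -it <rcnn-iwp-inference-pod> -n ibm-hpc -- sh -c 'echo $MINIO_MOUNTED_PATH && ls $MINIO_MOUNTED_PATH'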