January 2nd, 2018
An easy way to run distributed TensorFlow
tensorflow/k8s: Tools for ML/TensorFlow on Kubernetes.
Architecture:
- Install the job operator with Helm.
- Submit your job as a simple YAML file.
- The job operator then distributes the job across the nodes to compute the result, running replicas such as:
  - PS (parameter server)
  - Worker
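To make the PS/Worker split concrete, here is a minimal sketch of how training code inside a replica could find out its own role. It assumes the operator hands the cluster layout to each container through a `TF_CONFIG` environment variable holding JSON with `cluster` and `task` keys; the host names below are hypothetical, and `parse_tf_config` is just an illustrative helper, not part of tensorflow/k8s.

```python
import json
import os


def parse_tf_config(raw=None):
    """Return (cluster, task_type, task_index) from a TF_CONFIG-style JSON string."""
    if raw is None:
        # Assumption: the operator injects TF_CONFIG into each replica.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), int(task.get("index", 0))


# Hypothetical value, shaped like what a worker replica might receive.
example = json.dumps({
    "cluster": {
        "master": ["example-job-master-0:2222"],
        "worker": ["example-job-worker-0:2222"],
        "ps": ["example-job-ps-0:2222", "example-job-ps-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

cluster, task_type, task_index = parse_tf_config(example)
print(task_type, task_index, sorted(cluster))
```

A replica whose task type is `ps` would typically just serve variables, while `worker` and `master` replicas run the training loop.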
Sample Job Kubernetes YAML file
apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 1
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - image: gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff
              name: tensorflow
          restartPolicy: OnFailure
    - replicas: 2
      tfReplicaType: PS
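If you want to vary the number of workers or parameter servers without editing the file by hand, the same manifest can be generated from a script. This is only a sketch: `replica` and `make_tfjob` are hypothetical helper names, and the field names are copied from the sample above. The output is JSON, which `kubectl` accepts in place of YAML.

```python
import json


def replica(replica_type, count, image=None):
    """Build one replicaSpecs entry; the pod template is added only when an image is given."""
    spec = {"replicas": count, "tfReplicaType": replica_type}
    if image is not None:
        spec["template"] = {
            "spec": {
                "containers": [{"image": image, "name": "tensorflow"}],
                "restartPolicy": "OnFailure",
            }
        }
    return spec


def make_tfjob(name, image, workers=1, ps=2):
    """Return a TfJob manifest with the same layout as the sample file."""
    return {
        "apiVersion": "tensorflow.org/v1alpha1",
        "kind": "TfJob",
        "metadata": {"name": name},
        "spec": {
            "replicaSpecs": [
                replica("MASTER", 1, image),
                replica("WORKER", workers, image),
                # As in the sample, the PS entry carries no template.
                replica("PS", ps),
            ]
        },
    }


manifest = make_tfjob("example-job", "gcr.io/tf-on-k8s-dogfood/tf_sample:dc944ff")
print(json.dumps(manifest, indent=2))
```

Saving the printed JSON to a file and posting it works the same way as posting the YAML version.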
Sample Distributed Tensorflow Code
Refer here.
Troubleshooting: Helm could not work with the default namespace. Giving Tiller its own service account with the cluster-admin role fixes this:
kubectl create serviceaccount --namespace kube-system tiller
kubectl create clusterrolebinding tiller-cluster-rule --clusterrole=cluster-admin --serviceaccount=kube-system:tiller
kubectl patch deploy --namespace kube-system tiller-deploy -p '{"spec":{"template":{"spec":{"serviceAccount":"tiller"}}}}'
helm init --service-account tiller --upgrade
GKE: Enable GPU on GKE
gcloud alpha container clusters create gpu-test \
  --project $PROJECT_ID \
  --zone $ZONE \
  --enable-kubernetes-alpha \
  --enable-cloud-logging \
  --enable-cloud-monitoring \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --machine-type n1-standard-1 \
  --cluster-version=1.8.4-gke.1 \
  --image-type $IMAGE_TYPE \
  --num-nodes 1 \
  --quiet
Troubleshooting:
- A Kubernetes alpha cluster does not support master version upgrades, so you need to specify Kubernetes 1.8 when you create the cluster (the default is 1.7.8).
- How do you get the versions GKE currently supports?
  - `gcloud container get-server-config`
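Once you have the list of supported versions, picking the newest 1.8 release for `--cluster-version` can be scripted. A small sketch follows; the version list is a hypothetical example rather than real command output, and `version_key` and `newest_matching` are illustrative helpers.

```python
import re


def version_key(version):
    """Turn '1.8.4-gke.1' into (1, 8, 4, 1) so versions compare numerically."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def newest_matching(versions, prefix):
    """Return the highest version starting with prefix, or None if there is none."""
    candidates = [v for v in versions if v.startswith(prefix)]
    return max(candidates, key=version_key) if candidates else None


# Hypothetical list of supported master versions.
sample = ["1.8.4-gke.1", "1.8.3-gke.0", "1.7.11-gke.1", "1.7.8-gke.0"]
print(newest_matching(sample, "1.8."))
```

Comparing the numeric parts rather than the raw strings matters here: as plain strings, "1.7.8" would sort after "1.7.11".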