The following tutorial gives an introduction to Spell via a step-by-step example of training a neural network to read handwritten digits using the MNIST dataset. It focuses on the most basic building block in Spell: runs. Future tutorials will cover more advanced topics, such as cloud workspaces, distributed training with Horovod, TensorBoard integration, and more!
This tutorial assumes you already have Spell installed; if not, follow the instructions here.
Training on MNIST
MNIST is a classic ML benchmark dataset consisting of 70,000 images of handwritten digits (60,000 for training and 10,000 for testing):
MNIST became a popular benchmark dataset in the 1990s, at a time when digit recognition was still considered a moderately difficult problem. With the advent of deep learning, machines are now able to "solve" MNIST with >99% accuracy. MNIST remains popular today as a teaching tool for demos like this one.
In this tutorial we will demonstrate how Spell runs work by training a simple convolutional neural network on the MNIST dataset. Conveniently, the PyTorch framework comes with MNIST example code built in.
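Before diving in, it helps to understand the shape arithmetic behind a convolutional network like the one in main.py. The standard formula for a conv (or pooling) layer's spatial output size is out = (in + 2*padding - kernel) // stride + 1. Here is a small helper of our own (not part of the example code) that traces a 28x28 MNIST image through one plausible stack of 5x5 convolutions and 2x2 max-pools:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

# A 28x28 MNIST image through a 5x5 conv, a 2x2 max-pool,
# another 5x5 conv, and another 2x2 max-pool:
size = conv2d_out(28, 5)              # 24
size = conv2d_out(size, 2, stride=2)  # 12 (pooling uses the same formula)
size = conv2d_out(size, 5)            # 8
size = conv2d_out(size, 2, stride=2)  # 4
print(size)
```

The exact layer sizes in main.py may differ between versions of the example, but the same arithmetic applies when reading any of them.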
To start, check out a copy of the pytorch/examples repo:
$ git clone https://github.com/pytorch/examples.git
Cloning into 'examples'...
remote: Counting objects: 1573, done.
remote: Total 1573 (delta 0), reused 0 (delta 0), pack-reused 1573
Receiving objects: 100% (1573/1573), 38.76 MiB | 17.40 MiB/s, done.
Resolving deltas: 100% (821/821), done.
The example code is in the mnist directory:
$ cd examples/mnist
$ ls
README.md  main.py  requirements.txt
Let's go ahead and run the example on Spell, using a GPU to make sure training is fast:
$ spell run -t K80 python main.py
Everything up-to-date
✨ Casting spell #1…
✨ Stop viewing logs with ^C
Run created -- waiting for a K80 machine
Machine acquired -- commencing run
✨ Run is building
Checking for cache of run image
Using cached run image
✨ Run is mounting
✨ Run is running
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Train Epoch: 1 [0/60000 (0%)]    Loss: 2.373651
Train Epoch: 1 [640/60000 (1%)]    Loss: 2.310517
...
Train Epoch: 10 [58880/60000 (98%)]    Loss: 0.369152
Train Epoch: 10 [59520/60000 (99%)]    Loss: 0.325808

Test set: Average loss: 0.0487, Accuracy: 9845/10000 (98%)

✨ Run is saving
Retrieving modified or new files from the run
Saving 'data'
Compressing saved files
✨ Total run time: 2m52.17616s
✨ Run complete
Success! Training the network to 98% accuracy took a bit less than 3 minutes. Note that we didn't mount the MNIST dataset; the code is set up to download it into the data directory if necessary. Because data was saved, we can reuse this copy in future runs with --mount runs/1/data, where 1 is our run ID.
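The files the run downloaded (train-images-idx3-ubyte.gz and friends) use the simple IDX binary format: a big-endian header followed by raw pixel bytes. The example's data loader handles this for you, but as a sketch of what actually lives inside a mounted data directory, here is a hand-rolled reader (read_idx_images is our own helper name, not part of the example), demonstrated on a tiny synthetic file:

```python
import gzip
import struct
import tempfile

def read_idx_images(path):
    # IDX image header: magic number 2051, image count, rows, cols
    # (all big-endian unsigned 32-bit ints), then count*rows*cols
    # raw pixel bytes.
    with gzip.open(path, "rb") as f:
        magic, n, rows, cols = struct.unpack(">IIII", f.read(16))
        if magic != 2051:
            raise ValueError("not an IDX image file")
        pixels = f.read(n * rows * cols)
    return n, rows, cols, pixels

# Demo on a tiny synthetic file (2 images of 2x2 pixels):
with tempfile.NamedTemporaryFile(suffix=".gz", delete=False) as tmp:
    with gzip.open(tmp.name, "wb") as f:
        f.write(struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8)))
    n, rows, cols, pixels = read_idx_images(tmp.name)
print(n, rows, cols, len(pixels))  # 2 2 2 8
```

In practice you would never parse these files by hand; the point is only that the saved data directory is ordinary files that any future run can read.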
Generating New Numbers
Also included in the PyTorch examples is an implementation of a variational autoencoder (VAE), a generative network that learns to produce new examples of handwritten digits from a set of training examples.
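Under the hood, the VAE's encoder outputs a mean and log-variance per latent dimension; sampling uses the reparameterization trick, and the loss adds a KL term that pushes the latent distribution toward a standard normal (the example's loss_function combines this KL term with a binary cross-entropy reconstruction term). A minimal pure-Python sketch of the per-dimension math:

```python
import math
import random

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps with eps ~ N(0, 1), so the sampling
    # step stays differentiable with respect to mu and logvar.
    eps = random.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * logvar) * eps

def kl_divergence(mu, logvar):
    # KL(N(mu, sigma^2) || N(0, 1)) for one latent dimension:
    # -0.5 * (1 + logvar - mu^2 - exp(logvar))
    return -0.5 * (1.0 + logvar - mu * mu - math.exp(logvar))
```

For example, kl_divergence(0.0, 0.0) is 0: a latent dimension that already matches the standard normal pays no penalty.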
$ cd ../vae
$ ls
README.md  main.py  requirements.txt  results/
Let's run it using the data directory from the run above. Remember to replace 1 with the run ID from your MNIST training run:
$ spell run -t K80 -m runs/1/data python main.py
Everything up-to-date
✨ Casting spell #2…
✨ Stop viewing logs with ^C
Run created -- waiting for a K80 machine
✨ Run is building
Machine acquired -- commencing run
Checking for cache of run image
Using cached run image
✨ Run is mounting
Successfully added mount: runs/1/data:/spell/data
✨ Run is running
Train Epoch: 1 [0/60000 (0%)]    Loss: 550.824402
Train Epoch: 1 [1280/60000 (2%)]    Loss: 321.067963
...
Train Epoch: 10 [57600/60000 (96%)]    Loss: 107.122063
Train Epoch: 10 [58880/60000 (98%)]    Loss: 105.988289
====> Epoch: 10 Average loss: 106.3185
====> Test set loss: 97.5021

✨ Run is saving
Retrieving modified or new files from the run
Saving 'vae/results/reconstruction_1.png'
Saving 'vae/results/reconstruction_10.png'
Saving 'vae/results/reconstruction_2.png'
Saving 'vae/results/reconstruction_3.png'
Saving 'vae/results/reconstruction_4.png'
Saving 'vae/results/reconstruction_5.png'
Saving 'vae/results/reconstruction_6.png'
Saving 'vae/results/reconstruction_7.png'
Saving 'vae/results/reconstruction_8.png'
Saving 'vae/results/reconstruction_9.png'
Saving 'vae/results/sample_1.png'
Saving 'vae/results/sample_10.png'
Saving 'vae/results/sample_2.png'
Saving 'vae/results/sample_3.png'
Saving 'vae/results/sample_4.png'
Saving 'vae/results/sample_5.png'
Saving 'vae/results/sample_6.png'
Saving 'vae/results/sample_7.png'
Saving 'vae/results/sample_8.png'
Saving 'vae/results/sample_9.png'
Compressing saved files
✨ Run complete
✨ Total run time: 1m23.001697s
Success again! This code had no need to download the data, because we had already mounted data from the previous run. We can also see that it wrote some new files into the results directory. Let's use spell ls to see what's there:
$ spell ls runs/2/vae/results
9802   Mar 16 13:54  reconstruction_1.png
8928   Mar 16 13:55  reconstruction_10.png
8823   Mar 16 13:54  reconstruction_2.png
9180   Mar 16 13:54  reconstruction_3.png
8376   Mar 16 13:54  reconstruction_4.png
9465   Mar 16 13:54  reconstruction_5.png
8632   Mar 16 13:54  reconstruction_6.png
8940   Mar 16 13:54  reconstruction_7.png
9358   Mar 16 13:55  reconstruction_8.png
7393   Mar 16 13:55  reconstruction_9.png
48055  Mar 16 13:54  sample_1.png
41635  Mar 16 13:55  sample_10.png
44461  Mar 16 13:54  sample_2.png
43359  Mar 16 13:54  sample_3.png
44366  Mar 16 13:54  sample_4.png
43953  Mar 16 13:54  sample_5.png
42565  Mar 16 13:54  sample_6.png
42661  Mar 16 13:55  sample_7.png
42170  Mar 16 13:55  sample_8.png
42848  Mar 16 13:55  sample_9.png
We can use spell cp to download a single generated sample to our local directory and open it:
$ spell cp runs/2/vae/results/sample_10.png
Copying file to ./sample_10.png [####################################] 100%
$ open ./sample_10.png
To grab everything at once, we can instead download the whole vae directory from the run and open the results folder:
$ spell cp runs/2/vae
$ open results
You should see a window appear with a set of generated examples of handwritten digits. Congratulations - you've just trained a generative neural network!