Not being much of a mathematician at heart, and generally spending time on logic problems, application testing, or new HTML5 & browser paradigms rather than crunching big data, I was never really inspired to do much with these cores. This all changed while I was watching the Google I/O 2015 keynote address, where they showed off the ability to draw an emoji (as best you can) and have Google's engine try to recognize your scrawl and offer up several professionally drawn emojis to represent whatever it is you're trying to express. With recent changes in my life that have augmented my ability to "Go Get 'Em" and increased the likelihood that my ideas will actually reach customers, I immediately began scheming to learn how they set out doing this. Obviously image analysis was involved, but what algorithms did they use? Thinking back to my Digital Image Analysis class, I began researching how applicable Hough transforms would be to my problem. I would need to teach the computer what certain symbols looked like in that particular mathematical space, which would probably take me a while since it's not really one of my strong points. Another discouraging bit of trivia is that Hough transforms can be difficult to apply to complex shapes, because the margin for error becomes very small. Well, scratch that; back to the drawing board.
Then, thinking back to Machine Learning class, I recalled one algorithm in particular that seemed adaptable to all sorts of problems, and is even designed with the same (or very similar) scientific principles as human thought. This particular learning algorithm has received quite a bit of buzz lately, with projects such as MarI/O and Google's "Inceptionism" experiments: neural networks. With neural networks, you ultimately end up with (through some sort of black magic that occurs through repetitive training exercises) a series of very simple algebraic equations that will help you arrive at an answer given one or more inputs (it usually helps to have at least two inputs to make things at all interesting). Through stacked layers of various sizes, each composed of units called "perceptrons" (which fulfill a role very similar to that of neurons), the neural network will begin to perceive features in a set of data in much the same way a human will analyze a visual scene and pick out all the items they can see. There are many variables involved in coming up with a good neural network for a specific problem; for instance, the number of iterations you run training on the network, and the functions your perceptrons use when weighing inputs to make the final decision. The neural network can also, unfortunately, be easily biased by the training data it sees during formation, so sometimes it can perceive things that aren't really there.
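To make the "very simple algebraic equations" idea a bit more concrete, here's a minimal sketch of a forward pass through a tiny network in plain Python. Everything in it (the function names, the sigmoid activation, the hand-picked weights, the 2-2-1 layer sizes) is made up for illustration; training would normally find the weights for you, and this is not how Caffe implements anything internally.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Weigh each input, add a bias, then apply the activation function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)

def layer(inputs, weight_rows, biases):
    """A layer is just several perceptrons looking at the same inputs."""
    return [perceptron(inputs, w, b) for w, b in zip(weight_rows, biases)]

def forward(inputs, layers):
    """Feed the outputs of each layer into the next one."""
    values = inputs
    for weight_rows, biases in layers:
        values = layer(values, weight_rows, biases)
    return values

# Hypothetical, hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
NETWORK = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[2.0, 2.0]], [-2.0]),
]

if __name__ == "__main__":
    # The single output is a number between 0 and 1.
    print(forward([0.5, 0.25], NETWORK))
```

The "variables" mentioned above map directly onto this sketch: the activation argument is the function a perceptron uses when weighing its inputs, and the number and size of the entries in NETWORK are the layer structure.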
Given a set of data that could end up being very large, it became desirable to find a way to train the neural network using some sort of parallel framework, if possible. Luckily, people have already solved this problem: NVIDIA has devised a library of primitives for neural networks (including Deep Neural Networks and Convolutional Neural Networks) called cuDNN. Computer scientists at UC Berkeley have developed a DNN framework called Caffe, a highly optimized neural network creator; it can be built with cuDNN support (you opt in at build time), which takes its existing capabilities to a whole new, much faster level.
Getting My Caffe to Brew
Important note: This is all cutting-edge information, and is subject to change over time. Some of the sources I used to put this article together are already slightly out of date, and so I expect this post will eventually go out of date too. You've been warned!
Unfortunately, Caffe with cuDNN requires quite a few dependencies; these are all called out in this particular introductory post. I chose to install some directly from source (by downloading the source or cloning from GitHub), and others through Synaptic Package Manager on Ubuntu. For this particular project, I installed the following binaries from the following sources:
| Dependency | Version installed | Source |
|---|---|---|
| BLAS | OpenBLAS 0.2.14 | Direct download (see Note 2) |
| OpenCV | OpenCV 3.0.0 | Direct download |
| protobuf (see Note 3) | protobuf | Direct download |
| glog | glog 0.3.3 | Direct download |
| gflags (see Note 1) | gflags 2.1.2 | Direct download |
| lmdb | liblmdb0, liblmdb-dev 0.9.10-1 | Synaptic |
| Caffe | Merge 805a995 7d3a8e9, 7/3/15 | Git clone |
Note 1: When making gflags, take a moment to go into the Advanced options of ccmake and set the CMAKE_CXX_FLAGS variable (how, you ask? read the paragraph on cmake below). This variable needs to contain the compilation flag -fPIC, or else later on, when you try to build Caffe, it will complain that the files you built for gflags aren't suitable to be used as shared objects by Caffe.
Note 2: I first tried to install OpenBLAS from a Git clone, but for reasons unknown that didn't work out, so I ended up downloading this version directly and installing it successfully.
Note 3: At the time of this writing, you will run into trouble if you try to use the Python wrapper for exploring Caffe models if you build Caffe with protobuf 3.0. Until this is fixed, use protobuf 2.6.1.
If you've never used cmake before, it's not very difficult at all. At its heart, cmake facilitates making build instructions for multiple platforms in one convenient place, so that users of Windows, Linux, and Mac only need to tell it about certain paths to libraries and include files that don't already exist on their PATH or in some environment variable. To set up your Makefile with cmake, the easiest thing to do is to go into the directory one level above cmake (e.g. caffe/, which contains caffe/cmake) and write ccmake . on the command line (note the two C's and the dot). If you're into isolating new work, you may wish to create a build directory inside the project root directory, then run ccmake .. so that it's easy to trash all temporary files.
However, setting up the configuration for Caffe itself was not so easy for me. After installing all the dependencies, the system just flat out refused to believe I wanted to use OpenBLAS rather than Atlas, so I ended up actually having to delete several lines of the Dependencies.cmake file -- specifically, the parts that specified which environment variables to read from if the user had specified Atlas or MKL -- as indicated by the "stack trace" being provided by ccmake. Ultimately, not too difficult an adjustment to make; I just never have too much fun adjusting Makefiles by hand, so if it can be done through the configuration tool, I'd much prefer that.
Building a Useful Data Model
Once you have done all these steps to make Caffe with cuDNN, a great real-world example to run through is the "mnist" example, which works through tens of thousands of samples of handwritten numeric digits (60,000 for training and 10,000 for testing) collected by the National Institute of Standards & Technology back in the early '90s (i.e. the MNIST database). These scans are very low-resolution by today's standards, but are still often used as a benchmark for the performance of neural networks on handwriting samples (just as the picture of Lena Soderberg from a 1972 Playboy centerfold is still used as a benchmark for image processing algorithms, except with far fewer sexist undertones :-P). My machine took just under 4 minutes and 17 seconds to crank through a 10,000-iteration training cycle for a neural network that will classify image input as a digit. The demo (linked to above) was very simple to run, as all of the work to create the neural network structure and the mechanism of the perceptrons was done for me in advance; all I had to do was kick off the script that iteratively runs the training so it drills down on the salient features distinguishing each digit from the others. The only hangup was that some of the scripts expected files to be located in the ./build/ directory, but my particular installation skipped ./build/ and went directly to the desired paths.
Putting the Model To Use: Classifying Hand-Drawn Numbers
After doing a bit of reading on how to extract the features from the neural network, I decided it'd be easiest to stick to the Python wrapper until I get some more experience with exactly which operations get run where, which is highly dependent on the way your deployment prototxt file is set up. One thing that would have been nice to know is that the link given in many places in the Caffe documentation as the guide to using the Python module is wrong; it omits a "00-", so it should really be http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb. In my environment, some Python dependencies also needed to be installed before the Python wrapper would run properly. Here's what I had to do:
- for req in $(cat requirements.txt); do sudo pip install $req; done -- Installs many of the Python modules required, but leaves a little bit to be desired (which is accounted for in the next steps)
- Install python-scipy and python-skimage using Synaptic
- Uninstall protobuf-3.0.0-alpha3, and install an older version (in accordance with Caffe issue #2092 on GitHub)... would have been nice to know this ahead of time. (Don't forget to run sudo ldconfig so you can verify the installation by running protoc --version).
- Rebuild caffe so it knows where to find my "new (old)" version of protobuf
Once my dependency issues were sorted, I managed to find the deployment prototxt file for this particular neural net in caffe/examples/mnist/lenet.prototxt. Now, I can run the model simply by issuing the following Terminal command:
caffe/python$ python classify.py --model_def=../examples/mnist/lenet.prototxt --pretrained_model=../examples/mnist/lenet_iter_10000.caffemodel --gpu --center_only --channel_swap='0' --images_dim='28,28' --mean_file='' ../examples/images/inverted2.jpg ../lenet-output.txt
lenet_iter_10000.caffemodel is the trained model from the training exercise performed earlier from the Caffe instructions. inverted2.jpg is literally a 28x28 image of a hand-drawn number 2, and lenet-output.txt.npy is where I expect to see the classification as proposed by the model (it tacks on .npy). The channel swap argument relates to how OpenCV handles RGB images (really as BGR), so by default, the value is "2,1,0". By carefully scrutinizing this command, you may notice two things:
- The input image should be inverted -- i.e. white number on black background.
- The input image should only have one channel.
Thus, before running my model, I need to make sure the image I'm classifying is compliant with the format required for this classifier. For further confirmation, take a look at the top of lenet.prototxt:
input_dim: 64 # number of pictures to send to the GPU at a time -- increase this to really take advantage of your GPU if you have tons of pictures...
input_dim: 1 # number of channels in your image
input_dim: 28 # size of the image along a dimension
input_dim: 28 # size of the image along another dimension
You may be tempted to change the second input_dim to 3 in order to use images saved in the standard 3-channel RGB format, or even to 4 for 4-channel RGBA. However, since this neural network was trained on grayscale images, doing so will give you a Check failed: ShapeEquals(proto) shape mismatch (reshape not set) error. Thus, it's important that the image is in single-channel format and inverted, as mentioned above.
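As a rough illustration of those two requirements (one channel, white-on-black), here's a small sketch in plain Python that takes a 28x28 grid of RGB pixels and returns the inverted single-channel version. The function names are my own invention, and I'm assuming the common 0.299/0.587/0.114 luminance weights for the grayscale conversion; in practice you'd load and save the actual JPEG with whatever image library or editor you prefer, which isn't shown here.

```python
def to_grayscale(pixel):
    """Collapse one (R, G, B) or (R, G, B, A) pixel to a 0-255 luminance value."""
    r, g, b = pixel[:3]  # any alpha channel is simply ignored
    return min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b)))

def prepare_for_lenet(rgb_rows, size=28):
    """Return a single-channel, inverted copy of a size x size RGB image.

    rgb_rows is a list of rows, each row a list of pixel tuples.
    Dark ink on a light page becomes a light digit on a dark background.
    """
    if len(rgb_rows) != size or any(len(row) != size for row in rgb_rows):
        raise ValueError("expected a %dx%d image" % (size, size))
    return [[255 - to_grayscale(p) for p in row] for row in rgb_rows]

if __name__ == "__main__":
    # Hypothetical input: a black pen stroke down the middle of a white page.
    page = [[(255, 255, 255)] * 28 for _ in range(28)]
    for row in page:
        row[14] = (0, 0, 0)
    prepared = prepare_for_lenet(page)
    print(prepared[0][13], prepared[0][14])  # prints: 0 255
```

The size check mirrors the two 28s in the prototxt header: the sketch rejects anything that isn't already 28x28 rather than trying to resize it.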
Finally, so that classify.py properly handles the single-channel image, you need to make some amendments to it. Take a look at this comment on the Caffe GitHub page for an explanation of exactly what you need to do; in short, change the two calls of type caffe.io.load_image(fname) to caffe.io.load_image(fname, False), and then use the channel_swap argument as shown in the command above. Alternatively, you may wish to hold out for Caffe Pull Request #2359 (or incorporate it yourself, or check out the Git branch that contains it), as it contains some code that cleans up classify.py so you can simply use one convenient command-line flag, --force_grayscale, instead of having to specify --mean_file and --channel_swap and rewrite code to handle single-channel images. It'll also allow you to conveniently print out labels along with the probability of the image belonging to each category.
Now that you've been exposed to the deployment prototxt file and have an idea of what layers are present in the system, you can start extracting them by using this straightforward guide, or possibly this other guide if you're interested in making HDF5 and Mocha models.
Before discovering lenet.prototxt, I tried to make my own deploy.prototxt. First, I utilized lenet_train_test.prototxt as my baseline.
- If you leave the file as it is but do not initialize the database properly, you will see Check failed: mdb_status == 0
- I deleted the "Data" layers that are included on phase TRAIN and phase TEST. I am not using LMDB as my picture source; I'm using an actual JPEG, so I need to follow something along this file format:
name: "LeNet" # this line stays unchanged
input: "data" # specify your "layer" name
input_dim: 1 # number of pictures to send to the GPU at a time -- increase this to really take advantage of your GPU if you have tons of pictures...
input_dim: 1 # number of channels in your image
input_dim: 28 # size of the image along a dimension
input_dim: 28 # size of the image along another dimension
name: "conv1" # continue with this layer, make sure to delete other data layers
- Delete the "accuracy" layer, since it's used in TEST only, and protobuf doesn't like barewords like TEST in the syntax anyway.
- Replace the "loss" layer with a "prob" layer. It should look like:
If you're simply editing the loss layer in place, rather than removing it and adding a new one, it's important to take out the bottom: "label" part, or else you'll probably get an error along the lines of Unknown blob input label to layer 1. Also, just use plain Softmax as the type of this layer; nothing else.
- Make sure you don't have any string values (barewords) that don't have quotes around them, such as type: SOFTMAX or phase: TEST.
- If you have both the "loss" layer and the "prob" layer in place in deploy.prototxt, you will see Failed to parse NetParameter. Again, be sure you replaced the "loss" layer with the "prob" layer.
- If you forget the --channel_swap="0" argument on a single-channel image, and you don't have something in your code to the effect of Git pull #2359 mentioned above, you will see the message "Channel swap needs to have the same number of dimensions as the input channels."
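Since most of the pitfalls in this list come down to a few telltale strings in deploy.prototxt, they can be checked for mechanically. Below is a rough sketch of such a checker. To be clear, lint_deploy_prototxt is a hypothetical helper of my own, not part of Caffe, and it only does naive text matching rather than real protobuf parsing. The SAMPLE text is also an assumption for illustration (in particular the "ip2" blob name, which I'm taking from the stock LeNet definition), and it deliberately leaves bottom: "label" in place to trigger a warning.

```python
import re

def lint_deploy_prototxt(text):
    """Return a list of warnings for the deploy.prototxt mistakes listed above."""
    warnings = []
    # Unquoted values such as `type: SOFTMAX` or `phase: TEST`.
    for match in re.finditer(r'\b(type|phase)\s*:\s*([A-Za-z_]\w*)', text):
        warnings.append("bareword %s: %s (quote it or remove it)"
                        % (match.group(1), match.group(2)))
    # A leftover label input has nothing feeding it at deployment time.
    if re.search(r'bottom\s*:\s*"label"', text):
        warnings.append('bottom: "label" is still present')
    # The loss layer should be replaced by prob, not kept alongside it.
    has_loss = re.search(r'name\s*:\s*"loss"', text)
    has_prob = re.search(r'name\s*:\s*"prob"', text)
    if has_loss and has_prob:
        warnings.append('both "loss" and "prob" layers are present')
    # Data layers expect an initialized LMDB database.
    if re.search(r'type\s*:\s*"Data"', text):
        warnings.append('a "Data" layer is still present (expects LMDB)')
    return warnings

# Hypothetical deploy file fragment; the blob name "ip2" is an assumption.
SAMPLE = '''
name: "LeNet"
input: "data"
input_dim: 1
input_dim: 1
input_dim: 28
input_dim: 28
layer {
  name: "prob"
  type: "Softmax"
  bottom: "ip2"
  bottom: "label"
  top: "prob"
}
'''

if __name__ == "__main__":
    for warning in lint_deploy_prototxt(SAMPLE):
        print(warning)
```

An empty list doesn't mean the file is valid, only that none of these particular mistakes were spotted.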
Later on, as this algorithm gets closer to deployment in a large production setting, it could be nice to tweak it in order to get the best success rate on the test data. There are some neural networks developed to classify the MNIST data so well that they have actually scored higher than their well-trained human counterparts on recognizing even the most chicken-scratch of handwritten digits. It has also been noted that some algorithms end up getting significantly weaker performance on other datasets such as the USPS handwritten digit dataset.
- http://devblogs.nvidia.com/parallelforall/deep-learning-computer-vision-caffe-cudnn/ - Article from NVIDIA describing fusing cuDNN + Caffe
- http://caffe.berkeleyvision.org/installation.html - Guide to required dependencies and installation steps for UC Berkeley's Caffe framework for neural networks. Also links to their GitHub repository, which you should definitely clone once you're ready to build.
- http://nbviewer.ipython.org/github/BVLC/caffe/blob/master/examples/00-classification.ipynb - Instructions on how to start classifying items with your trained Caffe neural network model.
- Various other links embedded in previous sections.