My TensorFlow Project Isn't Saving the World

Among all the hype around the latest and greatest technologies, so much publicity is devoted to how they are being used in grand schemes to cure cancer, reduce energy waste, conserve water, solve poverty, and so forth.  While all these things are wonderful for humanity, there has to be someone left in the background who helps all the do-gooders unwind when it's time to take a break!


The TL;DR Version: Get To the Point!


Pass the right arguments when starting your Docker container so you don't have to shut it down and restart it later just to mount external directories from the host filesystem or expose the port for the TensorBoard server.  There is also nvidia-docker available if you want to use your CUDA cores.

sudo nvidia-docker run -it -p 6006:6006 -v ~/Pictures/video-game-training/:/video-game-training gcr.io/tensorflow/tensorflow:latest-devel-gpu bash

Use the --output_user_root option in your Bazel builds so the build output is saved to the external directory on the host that you mounted earlier.  This way, when you have to shut down your Docker instance, your Bazel build will still be there (though you will have to recreate some symlinks in the Bazel project directory).

bazel --output_user_root=/video-game-training/bazel-build build tensorflow/examples/image_retraining:retrain

Don't forget to store your image category directories within a "training image root" directory at the same level as the bazel-build directory, or else the retraining step might try to treat Bazel's own build files as training data.

Also, if you export the trained model to somewhere outside /tmp and then iterate on that model, don't forget to pass the location of the correct model to the classification step.  Otherwise, you might classify with the wrong model, which could lead to confusion and frustration.

Use my fork of the Imker repo (maybe someday I'll make a pull request to put it in the mainstream code) if you want to download only a portion of the images in a particular category from any Wiki site such as Wikimedia Commons.  This could be built upon so you can segregate training and test data.

Just Use the Devel Docker Image; CUDA Optional


Setting aside my original plans for TensorFlow, it struck me one night to build a classifier that could recognize different game cartridges for the Nintendo Entertainment System (NES).  I had a lot of pre-work to do because it had been a long time since my system had been updated with the latest supporting packages.  However, it all ended up being for naught; I found the "virtualenv" approach for installing TensorFlow to be so fraught with tedium that I ended up going for the simple Docker approach.  This is the TensorFlow installation approach I've been recommending since November, and it still seems to be worth sticking to.

I have a pretty old NVIDIA graphics card (a GeForce 650 Ti) in my (mostly even older) desktop running Linux (and Windows at times, mostly during tax season).  It supports NVIDIA Compute Capability 3.0, which is just barely enough to run what I need for machine learning, playing with the blockchain, and so forth.  To make TensorFlow performant inside Docker, there is a special add-on called nvidia-docker that allows access to your CUDA cores from inside your Docker container, so I can still get blazing fast performance from my own hardware without needing to install everything in my primary environment (which is evidently too jacked up to support the TensorFlow installation).  Docker is great for providing a uniform, trouble-free experience when running apps anyway, because it provides an isolated environment not subject to your system's specific configuration.  However, the version of Docker originally on my system was so old that the libraries required for nvidia-docker were not present; luckily, the upgrade path was simple thanks to their clear instructions.

In fact, thanks in part to my earlier pre-work and the many good Internet guides on this topic, getting TensorFlow working on my desktop in this manner went smoothly, apart from some early trial and error and, of course, the usual long waits for compilations to finish.  As I've often said, just use Docker.

Once you have Docker and nvidia-docker installed, here is the best way to run the TensorFlow image.  Note that if you don't have the image already, Docker will automatically download it:

sudo nvidia-docker run -it -p 6006:6006 -v ~/Pictures/video-game-training/:/video-game-training gcr.io/tensorflow/tensorflow:latest-devel-gpu bash

Let's break this down:

  • There's a way to avoid running docker with sudo, but it removes any semblance of auditability or traceability for when users go beyond their expected behaviors and start to get mischievous.
  • nvidia-docker is the binary that supports Docker instances accessing CUDA cores.
  • run tells Docker to launch the specified image in its own isolated environment, with its own filesystem and process tree.
  • -it (or -i -t) does two things.  First, -i runs the container in interactive mode, keeping stdin (standard input) open even if nothing is attached.  Second, -t allocates a pseudo-TTY so the user can actually send input to the container.
  • -p 6006:6006 publishes the TensorBoard port inside the container to the same port on the host.  When you start the server, you can access it through localhost:6006 in a browser on your host machine.  TensorBoard is a great way to visualize what is going on inside your training algorithm, both in terms of how the model is constructed and, in simple representations, how the data sits in the classification space (as simple as you can make it in as few dimensions as we humans can easily perceive).
  • The -v option mounts a directory (not an entire filesystem; there's a different way to do that) from your native filesystem into your Docker container as it runs.  In this case, I wanted to expose the video-game-training directory from my user account's Pictures folder inside the container as /video-game-training so that the algorithm would have access to all my training data.
  • gcr.io/tensorflow/tensorflow:latest-devel-gpu is the Docker image name.
  • bash is the command to run in the container once it starts.  You can run any executable you want, but it is easiest to start a shell.

First Crack At Building a Classifier: Aligning Pictures And Commands


For object classifiers, good training data comes from as many images as you can get of the subject material.  To support this, I took videos of various NES game cartridges while moving the camera around so as to film them from various angles.  Depending on the lighting, the sun or lights would also reflect back into the camera and cause slight imperfections in how the cartridge label appears.  I labored for quite a while in the hot Texas sun taking videos of these games with different backgrounds behind the cartridges so that the classifier would learn how to focus on what is important.

Once my environment was all set up and ready to go, I ran this TensorFlow example pretty much verbatim.  It took approximately 24 minutes to run the first step, which sets up the Bazel build to run the training task.  However, as my Docker instance did not have any training data loaded into it, I had to exit it in order to add the file mount as described above.  Unfortunately, upon getting back into Docker, all this pre-work had been wiped out, because it had all been built in a temporary .cache directory under root's home directory inside the container.  And, to add insult to injury, running that Bazel setup command the second time took more than twice as long -- clocking in at just short of 50 minutes!

Lesson Learned


One easy way to avoid losing your entire Bazel build when you start a fresh container (each docker run begins again from the image's original filesystem) is to set Bazel's --output_user_root option to a location inside the host directory that you mounted into Docker.  Note that it is a startup option, so it goes before the build command.  In my case, this meant specifying the following setting for my build:

bazel --output_user_root=/video-game-training/bazel-build build tensorflow/examples/image_retraining:retrain


Continuing With Trying To Break Bazel And My Docker Instance


Now, this meant I had to put my training examples one level deeper in this directory, or else the next step would possibly try to train on whatever output is in the Bazel build directory itself.  After running the Bazel build, I exited my Docker instance to see what would happen.  When I reopened it, I found that the symlinks in the /tensorflow folder had been changed to point to /root/.cache/bazel, which did not exist (and never existed, because I made the build in another folder); presumably this is because each new container starts from the image's original /tensorflow directory.  It took just a bit of manual tedium to point the symlinks back to the right place, but upon doing so, the bazel-bin "retrain" command specified in the Google example to actually perform training worked without a hitch.  With everything in place, this command took less than 15 minutes to perform 4,000 training steps using my approximately 800 pictures each of the MegaMan and MegaMan 2 cartridges.  The exact syntax looks like this:

bazel-bin/tensorflow/examples/image_retraining/retrain --image_dir /video-game-training/pictures

This step produces two files in the /tmp/ directory: output_graph.pb and output_labels.txt (/tmp/retrain_logs/ is also important if you want to look at TensorBoard at any point).  I moved these files into a model/ directory inside the directory exposed to Docker from my host system.

As for classification, I used the same strategy, passing the --output_user_root option to the bazel build of "label_image" (and leaving out the bazel-bin command that the example chains onto it, thus stopping short of image classification for the time being).  This Bazel build took about 20 minutes:
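To make the directory arrangement concrete, here is a minimal sketch of a helper that creates it.  The function name make_training_layout is my own invention for illustration; the bazel-build, model, and pictures directory names and the category names in the usage line come from the commands and labels in this post.

```shell
# Sketch only: create the directory layout described in this post under a
# given root.  The function name is hypothetical; adjust names to taste.
#
#   <root>/bazel-build/          Bazel's --output_user_root
#   <root>/model/                exported output_graph.pb and output_labels.txt
#   <root>/pictures/<category>/  one directory of images per category
make_training_layout() {
  root="$1"
  shift
  # Bazel output and exported models sit next to, not inside, the image root.
  mkdir -p "$root/bazel-build" "$root/model"
  for category in "$@"; do
    mkdir -p "$root/pictures/$category"
  done
}
```

On the host, this could be called as: make_training_layout ~/Pictures/video-game-training megaman1 megaman2 not-games.  Keeping pictures/ as its own subdirectory is what lets --image_dir point at the categories without also sweeping up the Bazel build.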

bazel --output_user_root=/video-game-training/bazel-build build tensorflow/examples/label_image:label_image

Once this step was complete, I exited and re-entered Docker once again, and my symlinks had been similarly screwed up.  Upon restoring them (like last time), I found a picture of the MegaMan 2 cartridge on the Internet and ran it through the classifier like this:

bazel-bin/tensorflow/examples/label_image/label_image \
--graph=/video-game-training/model/output_graph.pb \
--labels=/video-game-training/model/output_labels.txt \
--output_layer=final_result \
--image=/video-game-training/megaman2-ex-01.jpg \
--input_layer=Mul


And voila, a reproducible classification each time, without having to leave my Docker instance open, simply by reconstructing those symlinks!  (That part could easily be automated with a shell script, in fact.)

Note: Without the --input_layer=Mul argument at the end of the classification command, you will probably stumble into an error saying "Running model failed: Not found: FeedInputs: unable to find feed output input".  As it turns out, Google's example command is a little bit deficient, but fortunately some forum posts succinctly clarified the issue and offered the solution.
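As a rough sketch of what scripting the symlink repair might look like, the function below rewrites every bazel-* symlink in a project directory whose target starts with one prefix so that it starts with another.  The function name and all three arguments are hypothetical, and the prefixes in the usage line are assumptions about how Bazel laid out its output; check ls -l /tensorflow/bazel-* and the contents of your build directory before trusting them.

```shell
# Sketch only: re-point Bazel's convenience symlinks (bazel-bin, bazel-out,
# and so on) from a stale prefix to the directory where the build really is.
# Usage: relink_bazel_symlinks PROJECT_DIR OLD_PREFIX NEW_PREFIX
relink_bazel_symlinks() {
  local project_dir="$1"
  local old_prefix="$2"
  local new_prefix="$3"
  local link target new_target
  for link in "$project_dir"/bazel-*; do
    # Skip anything that is not a symlink (including an unmatched glob).
    [ -L "$link" ] || continue
    target=$(readlink "$link")
    case "$target" in
      "$old_prefix"*)
        # Swap the leading old prefix for the new one; keep the rest as-is.
        new_target="$new_prefix${target#"$old_prefix"}"
        # -n replaces the link itself rather than following it.
        ln -sfn "$new_target" "$link"
        echo "relinked $link -> $new_target"
        ;;
    esac
  done
}
```

In my setup, the call would presumably look something like: relink_bazel_symlinks /tensorflow /root/.cache/bazel/_bazel_root /video-game-training/bazel-build.  Symlinks whose targets don't begin with the old prefix are left alone, so running it twice is harmless.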

Because the Whole World Isn't Video Game Artwork


My training data consisted of only pictures of the label up-close, and mostly ignored the rest of the cartridge.  However, my first classification picture was in fact of the entire cartridge.  I was astounded at the results, because even considering this difference, the algorithm was 96% certain that my picture of the MegaMan 2 cartridge was in fact MegaMan 2; the 4% remainder was its (very weak) confidence that it was the original MegaMan cartridge.  Now, having spent most of my professional career up until now as a tester, I immediately wanted to see how it would perform on junk input.  I fed it an old picture of one of my pinball machines (Gold Wings, of no relation whatsoever to MegaMan), but the algorithm was 86% confident that what I just showed it was in fact MegaMan, and only 14% confident that it was MegaMan 2.  This was amusing to me, because I suppose in the algorithm's limited worldview of only having been trained on examples of MegaMan or MegaMan 2, it was in no position to say with any authority that anything was in fact neither!
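The arithmetic behind that is worth spelling out.  Assuming the retrained model's final_result layer is a softmax over the trained classes (that is my understanding of the retrain example, not something I verified in the code), the confidence for class i out of K classes, given raw scores z, is:

```latex
p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},
\qquad
\sum_{i=1}^{K} p_i = 1
```

With K = 2, the two confidences always add up to 100% (86% + 14% for the pinball machine), so there is simply no way for the model to express "neither" until a third class gives it somewhere else to put that probability.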

Wikimedia Commons appealed to me as a good location to get quality public-domain photos to use as "negative" training examples (though I suppose I could have used private images with rights held by the authors, and since their data is buried deep within a machine learning model, you would never be the wiser!).  The only downside is their site offers only 200 photos at a time for a given category, and it would be a huge waste to sit there, expand each one, and manually click Save.  Fortunately, Wikimedia Commons supports API calls that will allow you to download all the media for a given category.  Better yet, there is already a Java program called Imker that offers a CLI and GUI wrapper around the API calls.

The only problem with Imker is that its current UI only offers you the ability to download every single file within a given category, not just a fraction of randomly-selected images.  Nevertheless, Imker is open-sourced, so I forked the Git repo and began hacking away at the Java code so that I could download just 10,000 of the 272,812 images currently in the "PD-user" category on Wikimedia Commons.  After sorting out a lingering issue, and waiting a few hours (thanks in large part to my crude rate limiter), I have 10,000 images from A-Z, not to mention A-Z in other languages, consisting of roughly 75% JPEGs, 18% PNGs, 5% SVGs, 1% GIFs (even animated), and some TIFFs thrown in for good measure.  Not only that, but the images consist of things like maps, diagrams of all sorts of things in many different languages, road signs, cars, street scenes, landmarks, molecular diagrams, and all sorts of other random stuff only a small percentage of the population could possibly care about. :-P

The beautiful part about using the pre-trained, robust Inception model is that you don't have to worry about scaling your input data to a particular size.  I was able to use these images exactly as they came, and I only had trouble with two images that apparently contained bad data and failed to download properly (had Imker not stopped due to some exceptions regarding unhealthy API responses, this might have been avoided).  Apparently, it even dealt with all these file formats adeptly too.
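My fork does the sampling in Java inside Imker, and I won't reproduce that here.  As a rough shell illustration of the same idea, assuming you already have the category's file titles saved one per line in a text file (the function name and the titles.txt file in the usage line are hypothetical), GNU shuf can pick a random subset without repeats:

```shell
# Sketch only: print COUNT randomly chosen lines from LIST_FILE, each at most
# once.  Relies on shuf from GNU coreutils.
# Usage: sample_titles LIST_FILE COUNT
sample_titles() {
  list_file="$1"
  count="$2"
  shuf -n "$count" "$list_file"
}
```

For example, sample_titles titles.txt 10000 > sample.txt would produce the list to hand to a downloader.  If the file has fewer lines than requested, shuf just returns all of them in random order.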

Important Note: It stumped me for a while why my model was still only showing "megaman1" and "megaman2" after I had trained "not-games".  The reason was that I was passing an old copy of the model to the classification command.  Make sure you set the correct path to your model!

In any event, the TensorFlow model retrained to distinguish between Mega Man 1, Mega Man 2, and "Not a game" performed successfully in my two trials thus far.



                Trained on MM1 or MM2            Trained on MM1, MM2, or Not a game
Confidence      Mega Man 2    Pinball machine    Mega Man 2    Pinball machine
Mega Man 1      3.9%          86%                4.0%          9.3%
Mega Man 2      96.1%         14%                56.8%         1.6%
Not a Game      N/A           N/A                39.2%         89.1%
