Tensorflow, from Scratch to the Cloud

Having used plenty of pre-built machine learning models through Tensorflow and the GCP APIs, and having gone through the pains of setting up Tensorflow on plain vanilla Amazon EC2 AMIs (not even the pre-configured ones with all the goodies already installed) and getting it to run classifications on the GPU and through Tensorflow Serving, I thought it was high time I tried coding my own machine learning model in Tensorflow.

Of course, the thing most folks aspire to do with Tensorflow when starting out is to build a neural network.  I wanted to base my basic neural network on the MNIST examples just to get my feet wet, but use a dataset different from MNIST.  There are many datasets on Kaggle that could fit the bill, but I decided to use one of my own from a while back.  It consists of skin tones found in pictures, cropped carefully and aggregated into a dozen BMP files.  Don’t question where I got these skin tone pixels from, but rest assured that a wide variety of skin colors were covered, captured mostly by nice cameras under ideal lighting conditions.  To be honest, it’s a little less interesting than MNIST, because instead of 10 distinct classes, there are only two: skin and not-skin.

Performance from All Angles


Previous research shows that converting the pixels from the RGB space into the HSV space leads to better neural network performance.  Luckily, this is easy to do using the Python Imaging Library.
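Here is a minimal sketch of that conversion, assuming Pillow is installed; the filename is a hypothetical placeholder for one of the aggregated BMPs, and the 0-99 quantization matches the input space described below.

from PIL import Image

# Load one of the aggregated BMPs of cropped skin pixels and convert RGB -> HSV.
im = Image.open('skin_tones.bmp').convert('HSV')  # hypothetical filename

# Each pixel is now an (H, S, V) tuple with values 0-255; quantize each
# channel down to 0-99 to match the net's input space.
hsv_pixels = [tuple(c * 100 // 256 for c in px) for px in im.getdata()]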

Meanwhile, the performance of neural net computation itself (i.e. the speed of the training phase) has improved dramatically.  Just seven years ago, I made a neural net in Weka to train on skin tones.  Back then, in late 2010, it was fancy if you had a system with, say, 200 CUDA cores; my Lenovo W510 laptop shipped with 48.  Now, I have a roughly 18-month-old Aorus X3 laptop that boasts 1280 CUDA cores.  Note that there are orders of magnitude more nodes in the Tensorflow neural net than in that old Weka net, and that the Tensorflow net also considers all 3 values in the HSV space (quantized 0-99, thus 1 million possible values), not just the best 2 out of 3 as in the previous research (quantized 0-255, thus 65,536 possible values).  The performance comparison is as follows:

Model Type              Time
Tensorflow (GPU 256)    90.9
Weka (CPU 5)            17
Weka (CPU 256)          750


Building the Network


Ultimately, the trickiest part of building the network was formatting the data in a way that would be meaningful to train on, and getting the data types of the training data and variables to match so the training functions stayed happy.

Along the way, I ran into four pitfalls: one operational, one in code, one with clashing resources, and one out of stupidity or blindness:

  1. Can’t pass in a tensor of (1000000, ?) into a variable expecting (1000000, ?).  This one eluded me for the longest time because I was positive the correct tensor was being fed into the training function.  I finally saw the light when I decided to split up the training data so that training is not run on the entire dataset at once; I had it train on 2,000 examples at a time rather than the whole million (see the sketch after this list).  Then the error message changed, and it immediately became apparent that the error was coming from the computation that counts how many HSV values have [0, 1, 2, …] appearances in the dataset, not from the neural network itself.  It was frustrating to waste so much time on an issue ultimately caused by code that merely calculated metrics I had already examined earlier and had since moved on from.
  2. Can’t allocate 0B of memory to the cross_entropy or optimizer or whatever it was.  My Tensorflow is compiled to run only on the GPU, I’m pretty sure, and doesn’t handle CPU computations.  I had a separate terminal window open with a Python shell to quickly run experiments before putting them into the real training code.  Having that shell open seemed to tie up resources needed by the program, so closing the extra terminal eliminated the memory allocation error.
  3. Uninitialized variable errors in the computations of precision & accuracy.  For this, I had to run tf.local_variables_initializer() right before either defining those computations in the graph or running them.  It was not good enough to run it right after the line “with tf.Session() as sess”.
  4. Weird classifications.  I was expecting a binary classification to behave like a logistic regression and simply output “0” if it’s not skin and “1” if it is.  Unfortunately, the neural network tended to return results more like a linear regression, giving me nonsensical values like 1.335 and -40.8987 for the single class.  I ended up changing the output layer of the neural network to reflect what I had originally, which called for two classes, “skin” and “not-skin”.  With this change, the values calculated for any given valid HSV input could still be totally out of line with the “0” and “1” I would expect (and don’t form probabilities summing to 1), but by taking np.argmax() of the output layer, I can turn them into the outcome I was expecting.  And it actually works very well, with precision and recall exceeding 97 or 98% without a whole lot of effort.
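Here is the sketch promised above, pulling together the fixes for pitfalls 1, 3, and 4.  This is a toy version rather than my actual model: the layer sizes are arbitrary, and train_x, train_y, test_x, and test_y are assumed to be NumPy arrays of quantized HSV values and 0/1 labels.

import numpy as np
import tensorflow as tf

# Toy two-class network over 3 HSV inputs; sizes are arbitrary placeholders.
x = tf.placeholder(tf.float32, [None, 3])
y_ = tf.placeholder(tf.int64, [None])
hidden = tf.layers.dense(x, 64, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 2)  # two output classes: skin and not-skin

loss = tf.losses.sparse_softmax_cross_entropy(labels=y_, logits=logits)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
predictions = tf.argmax(logits, axis=1)
precision, precision_op = tf.metrics.precision(labels=y_, predictions=predictions)

BATCH_SIZE = 2000  # pitfall 1: feed 2,000 examples at a time, not the whole million

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for start in range(0, len(train_x), BATCH_SIZE):
        sess.run(train_step, feed_dict={x: train_x[start:start + BATCH_SIZE],
                                        y_: train_y[start:start + BATCH_SIZE]})

    # Pitfall 3: tf.metrics.* ops create local variables; initialize them here,
    # not immediately after "with tf.Session() as sess".
    sess.run(tf.local_variables_initializer())
    print(sess.run(precision_op, feed_dict={x: test_x, y_: test_y}))

    # Pitfall 4: raw logits are unbounded scores, not 0/1 labels; argmax over
    # the two-node output layer recovers the class.
    print(sess.run(predictions, feed_dict={x: test_x}))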

All This Effort Just To Find a Better Way


Now, after having written my model with low-level Tensorflow routines and trained it to the point where I thought it would return really good results, it was time to put it on the cloud.  For this, the output from Saver (a “checkpoint”) basically needs to be converted into a SavedModel.  One can do this by writing a whole lot of code re-defining the exact behavior of the neural network, but it seems to be much more efficient (when the end goal is exporting the model to the cloud) to use an Estimator object to define the neural network.

And this doesn’t even take into account what kind of goodness lies before me if I simply decide to use Keras…

However, I wanted to see how feasible it was to write a converter from checkpoint to SavedModel using the standard Tensorflow APIs.  Scouring examples and tutorials, and diving in to find out exactly what all this code was doing so I could shorten it into the most concise possible example, I realized there were new APIs that these examples weren't leveraging.

Basically, the two main calls involved are:

tf.saved_model.signature_def_utils.predict_signature_def(
    inputs,
    outputs
)
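(Both inputs and outputs are dicts mapping string keys to tensors; the keys you choose here become the field names you reference at inference time.)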

And:

builder.add_meta_graph_and_variables(...)

There are several signature_defs you can choose from:

predict_signature_def(...)
regression_signature_def(...)
classification_signature_def(...)

It makes the most sense to me to use "predict" when you're looking to perform inference on a new example using your pre-trained SavedModel, "regression" when you want to solve an equation based on inputs and outputs you provide directly to the function (and not through a SavedModel), and "classification" for learning and inferring the class something belongs to, once again using examples passed directly into this function and not through the SavedModel.  I haven't actually used the latter two functions yet, so these are purely assumptions; I'll update this if I find them to be false later.

As for the meta-graph and variables, all you need to do to get a prediction-flavored SavedModel working on the cloud is populate these three arguments with the correct values (in addition to your Tensorflow session variable):

tags: just set to a list containing tf.saved_model.tag_constants.SERVING.

signature_def_map: This is a dict consisting of mappings between one or more signature constants, representing various entry points, and the signatures defined above.  If you assigned the output of predict_signature_def(...) to prediction_signature, then you would probably want to use this:

{
    tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
        prediction_signature
},


Finally, define in your SavedModelBuilder a main_op, an operation that (re)initializes the graph’s variables and tables whenever the SavedModel is loaded.  It is not a problem to run this op repeatedly, even if you are moving the model around a lot.

main_op=tf.saved_model.main_op.main_op()
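Putting those pieces together, here is a minimal sketch of a checkpoint-to-SavedModel converter.  The checkpoint path, export directory, and tensor names ('hsv_input:0', 'output_layer:0') are hypothetical placeholders; substitute the names from your own graph.

import tensorflow as tf

# Rebuild the trained graph from the checkpoint's .meta file, then restore weights.
saver = tf.train.import_meta_graph('skin_model.ckpt.meta')  # hypothetical path

with tf.Session() as sess:
    saver.restore(sess, 'skin_model.ckpt')
    graph = tf.get_default_graph()
    x = graph.get_tensor_by_name('hsv_input:0')      # hypothetical input tensor name
    y = graph.get_tensor_by_name('output_layer:0')   # hypothetical output tensor name

    prediction_signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'hsv': x},
        outputs={'class': y})

    builder = tf.saved_model.builder.SavedModelBuilder('export_dir')
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY:
                prediction_signature
        },
        main_op=tf.saved_model.main_op.main_op())
    builder.save()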

From here on out, you just need to run the Python scripts to train the model and build it into the correct format.  For the specific code I used to build this skin tone detector and convert it to a format usable by Google Cloud ML Engine, see my GitHub repo at https://github.com/mrcity/mlworkshop/tree/master/google-apis/cloud-ml.

Making Inferences With Your Model


Assuming you have gone through the steps of uploading the contents of your entire SavedModel directory to a bucket in Google Cloud, and of initializing a Cloud ML Engine "Model" pointing to this bucket, you now need to create a script that can call Cloud ML Engine to make predictions.  The Python code to do this is also very simple.  The main gist is that, unlike for other GCP API accesses from Python, you must configure the path to your service account key using an environment variable (GOOGLE_APPLICATION_CREDENTIALS) rather than with the Python libraries.
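Here is a minimal sketch of that setup, with a hypothetical path to the service account key:

import os
import googleapiclient.discovery

# Unlike most GCP access from Python, credentials come from an environment
# variable pointing at the service account key, not from a library call.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'

# Build the Cloud ML Engine (v1) API client.
service = googleapiclient.discovery.build('ml', 'v1')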

Once you have done this, there is some very simple code from Google that demonstrates making an inference with your cloud model.  However, one thing they don't make clear is how to make multiple inferences at once.  If you do this incorrectly, you may run into an error such as “Online prediction is not a matrix”.  If this happens to you, study how the instances variable you pass into predict_json() must be an array of dictionaries:

instances = [{"hsv": [5., 49., 79.]},
             {"hsv": [4., 59., 63.]},
             {"hsv": [21., 67., 2.]},
             {"hsv": [99., 99., 99.]},
             {"hsv": [5., 35., 83.]},
             {"hsv": [5., 29., 71.]}]

One dictionary key must be defined for each expected input to the graph.  Recall that you defined the expected inputs in the inputs field of the call to predict_signature_def() earlier; the keys required here are the same as the keys you defined in that field.  One bit of good news about Cloud ML Engine is that it seems to count only one usage even if you specify multiple input values to the API (i.e. multiple elements in the instances array) in a single request.
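For completeness, here is a sketch of predict_json() itself, closely following Google's published sample and reusing the service client built above; the project and model names in the last line are hypothetical:

def predict_json(project, model, instances, version=None):
    """Sends a list of instances to Cloud ML Engine and returns the predictions."""
    name = 'projects/{}/models/{}'.format(project, model)
    if version is not None:
        name += '/versions/{}'.format(version)
    response = service.projects().predict(
        name=name, body={'instances': instances}).execute()
    if 'error' in response:
        raise RuntimeError(response['error'])
    return response['predictions']

# All six HSV instances above go out in a single request (and count as one usage).
predictions = predict_json('my-project', 'skin-model', instances)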

Again, here is the pointer to my GitHub repo where you can see how all this fits together: https://github.com/mrcity/mlworkshop/tree/master/google-apis/cloud-ml
