Thursday, June 28, 2018

Validating Pre-Made TensorFlow Estimators Mid-Stream

In Francois Chollet’s book Deep Learning with Python, he stresses the importance of holding out a separate validation set while training a machine learning model, so that you can check periodically (say, after every epoch) that accuracy is in fact improving on data other than the training set itself.

Machine learning models are prone to learning relationships that have nothing to do with the problem at hand.  For instance, a model tasked with determining which way a military tank is facing might end up making assumptions based on whether it is day or night.  This is often a result of trying to eke out the model’s maximum performance, say by optimizing for the smallest value of a loss function.  What ends up happening is that the model overfits the training data, which means it loses its ability to generalize: to predict the correct outcome for new examples that we as humans would intend for it to classify.

One way to compensate for this is to withhold a separate set of data, known as a validation set, so that we monitor not just the loss we are trying to optimize, but also the performance of the model on data it hasn’t seen.  While the training loss may be decreasing, suggesting the model is getting more accurate, you might see that the performance on the validation dataset stops improving, or even reverses course and gets worse.  By evaluating a validation dataset throughout training, we can figure out when to terminate training for the best results.
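
To make the "when to terminate" idea concrete, here is a minimal sketch in plain Python (no TensorFlow involved).  The function name best_epoch, the patience rule, and all the loss values are my own illustrative assumptions, not something taken from Chollet's book or from TensorFlow.

```python
def best_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based, using a simple patience rule.

    val_losses: validation loss measured after each epoch.
    patience:   how many epochs without improvement we tolerate before stopping.
    """
    best_idx = 0
    for idx, loss in enumerate(val_losses):
        if loss < val_losses[best_idx]:
            best_idx = idx  # new best validation loss
        elif idx - best_idx >= patience:
            # No improvement for `patience` epochs: stop here.
            return idx + 1, best_idx + 1
    return len(val_losses), best_idx + 1


# Hypothetical numbers: the training loss keeps falling while the
# validation loss bottoms out and then climbs again (overfitting).
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22]
val_losses = [0.95, 0.70, 0.62, 0.64, 0.69, 0.75]

stop_epoch, best = best_epoch(val_losses, patience=2)
print("stop after epoch", stop_epoch, "and keep the model from epoch", best)
```

Looking only at train_losses, you would keep training; the validation numbers are what reveal that epoch 3 was the best one.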

Pre-made (Canned) TensorFlow Estimators

As each day goes by, there are more and more benefits to using canned estimators.  For instance:
  • They manage for you which parts are best run distributed vs. a single machine
  • They can stand on the shoulders of TensorFlow Hub modules, allowing you to focus on adding just one or more instances of a simple type of layer to much more complicated models pre-filled with highly trained and optimized parameters
  • You focus on the configuration of the model as a whole, not configuring so many details of every specific layer

However, pre-made Estimators do not offer many functions other than train, evaluate, predict, and exporting as a SavedModel.  And extending pre-made Estimators can be tricky: neither the train function nor a pre-made Estimator object offers a direct, apparent way to evaluate a validation dataset mid-stream during training.

But Chollet said we need to validate frequently!

As it turns out, TensorFlow allows you to run side functions (invoked by callbacks) during train, evaluate, and predict.  These can serve any purpose, such as hooking up to the Twitter API to post memes at certain intervals during the underlying operation (measured in either time or steps), or, as we are most interested in doing, running an evaluation alongside training.

The feature is specified in these three functions by the hooks parameter.  Hooks are instances of the SessionRunHook class that allow you to define custom code that runs at the start and end of each Session, as well as before and after each step.  (There are built-in timer classes, such as tf.train.SecondOrStepTimer, that will help you count steps or time so you’re not executing your desired action at every step.)

Let’s take a look at a complete SessionRunHook example for a pre-made Estimator.

class ValidationHook(tf.train.SessionRunHook):
    def __init__(self, parent_estimator, input_fn,
                 every_n_secs=None, every_n_steps=None):
        print("ValidationHook was initialized")
        self._parent_estimator = parent_estimator
        self._input_fn = input_fn
        self._iter_count = 0
        self._timer = tf.train.SecondOrStepTimer(every_n_secs, every_n_steps)
        self._should_trigger = False

    def begin(self):
        self._iter_count = 0

    def before_run(self, run_context):
        self._should_trigger = self._timer.should_trigger_for_step(self._iter_count)

    def after_run(self, run_context, run_values):
        if self._should_trigger:
            print("Hook is running")
            validation_eval_accuracy = self._parent_estimator.evaluate(input_fn=self._input_fn)
            print("Hook is done running. Validation set accuracy: {accuracy}".format(**validation_eval_accuracy))
            # Record the trigger; otherwise the timer reports True on every step
            self._timer.update_last_triggered_step(self._iter_count)
        self._iter_count += 1

Our purpose above is to run evaluate on the Estimator during training after a predefined number of seconds (every_n_secs) or steps (every_n_steps) elapses.  To this end, we can actually pass in the Estimator itself as an instantiation argument to our ValidationHook object, rather than trying to create a new Estimator in this scope.  You will see in begin() that some variables get initialized just at session runtime.  The before_run() function defers to a Timer initialized in the constructor that dictates whether or not to run our desired operation (the validation evaluation) in the after_run() function.  Without asking the timer if it’s time, the evaluation in after_run() would run after each step, and that would waste a substantial amount of time.
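
The ask-the-timer pattern is easier to see without a training session in the way.  The sketch below is plain Python, and StepTimer is a hypothetical stand-in I wrote for illustration, not a TensorFlow class.  It assumes the step-based behavior I understand tf.train.SecondOrStepTimer to have: it triggers on the first call, and after that only once the configured number of steps has passed since the last recorded trigger.

```python
class StepTimer:
    """Hypothetical stand-in for a step-based trigger timer (not a TensorFlow class)."""

    def __init__(self, every_steps):
        self._every_steps = every_steps
        self._last_triggered_step = None

    def should_trigger_for_step(self, step):
        # Nothing has been recorded yet, so trigger right away.
        if self._last_triggered_step is None:
            return True
        return step >= self._last_triggered_step + self._every_steps

    def update_last_triggered_step(self, step):
        self._last_triggered_step = step


def triggered_steps(total_steps, every_steps, record_trigger):
    """Return the steps at which the (pretend) evaluation would run."""
    timer = StepTimer(every_steps)
    fired = []
    for step in range(total_steps):
        if timer.should_trigger_for_step(step):
            fired.append(step)  # this is where evaluate() would be called
            if record_trigger:
                timer.update_last_triggered_step(step)
    return fired


print(triggered_steps(10, 4, record_trigger=True))
print(triggered_steps(10, 4, record_trigger=False))
```

With record_trigger=True the pretend evaluation runs every fourth step; with record_trigger=False the timer never learns that it fired and says yes on every step, which is exactly the waste of time we are trying to avoid.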

Now, actually running the Estimator with intermediate validation steps is a simple matter of defining the correct parameters in train().

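settings = tf.estimator.RunConfig(keep_checkpoint_max=2, save_checkpoints_steps=STEPS_PER_EPOCH, save_checkpoints_secs=None)

estimator = tf.estimator.DNNClassifier(
    # feature_columns, hidden_units, model_dir, etc. go here (not shown in the original)
    config=settings)

estimator.train(
    input_fn=train_input_fn,
    hooks=[ValidationHook(estimator, validation_input_fn, None, STEPS_PER_EPOCH)],
    steps=STEPS_PER_EPOCH * 40)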

What’s happening here is that I’m writing a RunConfig that reduces the space consumed on the hard drive by keeping only a small number of recent checkpoints on hand.  Then, I configure it to save a checkpoint every STEPS_PER_EPOCH steps and explicitly disable time-based checkpointing (save_checkpoints_secs=None), since setting both parameters is disallowed and the default is to save a checkpoint every 600 seconds.  Then, pass the config you defined to the Estimator's constructor, and in train(), just make sure to specify the hook you built.

NOTA BENE: evaluate() restores the model from the most recent checkpoint in model_dir, not from the live training session.  If you do not configure the RunConfig to save a checkpoint at your desired interval, the weights will not be updated when you run evaluate() inside the hook, and thus your validation performance will not appear to change until another checkpoint is written.
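
STEPS_PER_EPOCH is not defined in the snippet above.  One plausible way to derive it, assuming you know the number of training examples and the batch size your input function uses, is sketched below; the two numbers are hypothetical placeholders, not values from my experiments.

```python
def steps_per_epoch(num_examples, batch_size):
    """Number of training steps needed to see every example once.

    Rounds up so that a final, partially filled batch still counts as a step.
    """
    return (num_examples + batch_size - 1) // batch_size


NUM_TRAIN_EXAMPLES = 50000  # hypothetical dataset size
BATCH_SIZE = 128            # hypothetical batch size used by the input function

STEPS_PER_EPOCH = steps_per_epoch(NUM_TRAIN_EXAMPLES, BATCH_SIZE)
print(STEPS_PER_EPOCH)       # steps between checkpoints and validations
print(STEPS_PER_EPOCH * 40)  # total steps for 40 epochs
```

Using the same constant for save_checkpoints_steps and for the hook's every_n_steps is one way to keep a fresh checkpoint available each time the hook evaluates.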

Visualizing Validation Performance with TensorBoard

TensorBoard can easily graph the loss over time as the model gets trained.  This is output by default into the events file that TensorBoard monitors.  However, there are a couple things you might wonder about:
  • How can I show other metrics, such as accuracy, or precision and recall, over time?
  • If I’m showing loss every 100 steps, and TensorBoard is picking this up to graph it, how do I get it to only convey my desired performance metrics at the times when they’re actually calculated?

As it turns out, the validation accuracy should already be available for you in TensorBoard.  Under your main model's output directory (defined by model_dir), there will be another directory called eval where the validation accuracy metric consumed by TensorBoard will be placed.  You can overlay the validation accuracy with the graph of loss, and/or with any other such tf.metrics collected in other log directories.

But if you want additional metrics, especially the ones defined in tf.metrics such as precision and recall at top k, there is a function called add_metrics() that actually comes from the tf.contrib.estimator module (rather than tf.estimator).  This allows you to define a metrics function returning a dictionary of your calculated metric results.  The good news is this function returns a brand new instance of Estimator, so you don’t have to worry about your original Estimator in training trying to constantly run these evaluation functions.  But the best part is even though this is now a separate Estimator object, all the parameters of the original Estimator are conveyed to this new one as they are updated by the checkpoint writing process.

To add additional metrics to TensorBoard, add a function to your code that complies with metric_fn as such:

# This is the function that meets the specs of metric_fn
def evaluation_metrics(labels, predictions):
    probabilities = predictions['probabilities']
    return {'auc': tf.metrics.auc(labels, probabilities)}

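# And note the modifications below: __init__ wraps the parent Estimator with
# add_metrics(), and after_run() evaluates with the wrapped Estimator.

class ValidationHook(tf.train.SessionRunHook):
    def __init__(self, parent_estimator, input_fn,
                 every_n_secs=None, every_n_steps=None):
        print("ValidationHook was initialized")
        # add_metrics() returns a new Estimator that also reports evaluation_metrics
        self._estimator = tf.contrib.estimator.add_metrics(parent_estimator, evaluation_metrics)
        self._input_fn = input_fn
        # (the counter and timer setup from the first version of the hook go here)

    # begin() and before_run() are the same as in the first version of the hook

    def after_run(self, run_context, run_values):
        if self._should_trigger:
            validation_eval_accuracy = self._estimator.evaluate(input_fn=self._input_fn)
            print("Hook is done running. Validation set accuracy: {accuracy}".format(**validation_eval_accuracy))
            # Record the trigger so the timer waits for the next interval
            self._timer.update_last_triggered_step(self._iter_count)
        self._iter_count += 1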

NOTA BENE: While it is convenient that you only need to add a couple lines in your subclass of SessionRunHook, don't forget to add the model_dir to the parameters you use to initialize the parent Estimator object, or else TensorBoard might not be able to pick up on any of your metrics at all for both training and evaluation.

What about tf.estimator.train_and_evaluate()?

This is a function provided in the estimator module itself, and is not exposed in pre-made Estimators.  However, it does take your Estimator as an argument.  And so it is, with very few lines of code, that you can run interleaved training and evaluation.  After defining your Estimator (omitting the hooks parameter this time), don't bother writing any hooks at all.  Just do this:

estimator = tf.contrib.estimator.add_metrics(estimator, evaluation_metrics)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=5000)
eval_spec = tf.estimator.EvalSpec(input_fn=validation_input_fn, start_delay_secs=60, throttle_secs=60)

tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

evaluation_metrics is the same function as defined earlier.  start_delay_secs defines the delay from the very instant your model begins training.  If you put even 1 here as the value, training probably won't have even made it beyond the first step before an evaluation is performed.  And throttle_secs defines the minimum delay between evaluations.  If you put 1 here, chances are you will perform just a single step of training in between evaluations.  The default values are 120 and 600, respectively.

Unfortunately, the implementation to date of train_and_evaluate() seems incapable of counting steps rather than time, so it will take some empirical measurement to find out exactly how much time an entire epoch takes to run if you care to line up evaluations with epochs.  However, this approach should be very scalable onto distributed systems, for those looking for very fast training.
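
The empirical measurement can be turned into a throttle_secs value with a little arithmetic.  The sketch below is plain Python, and every number in it is a hypothetical placeholder: it assumes you have timed some number of training steps yourself (the global_step/sec figure that TensorFlow writes to the training log is one source for this).

```python
def seconds_per_epoch(steps_per_epoch, measured_steps, measured_secs):
    """Estimate how long one epoch takes from a timed sample of training steps."""
    steps_per_sec = measured_steps / measured_secs
    return steps_per_epoch / steps_per_sec


# Hypothetical measurement: 1,000 steps took 80 seconds,
# and one epoch is 391 steps.
estimate = seconds_per_epoch(391, measured_steps=1000, measured_secs=80.0)

# EvalSpec takes whole seconds, so round the estimate.
throttle_secs = int(round(estimate))
print(throttle_secs)
```

Since training speed varies from run to run, evaluations scheduled this way will only roughly line up with epoch boundaries.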

One other thing that seemed a bit odd to me is that, in the course of running all these experiments, I would often delete the log directory and reuse the same name.  Most times, with the hooks method, TensorBoard would pick up on the new run and work just fine.  However, when using train_and_evaluate(), TensorBoard only seems to pick up on the new run about half the time.


And, of course, the Tensorflow documentation