Google runs TensorFlow across hundreds of machines, and now you can spread it across as many machines as you like in your own office
Google has introduced distributed computing support to the latest version of its machine learning software TensorFlow.
TensorFlow 0.8 can now train models across small clusters of machines, such as those in your office, with Google’s Inception image classification neural network serving as the showcase workload.
Writing on Google’s blog, software engineer Derek Murray says the new release makes it easier to scale a single-process job up to run on a cluster, and to experiment with novel architectures for distributed training.
“To coincide with the TensorFlow 0.8 release, we have published a distributed trainer for the Inception image classification neural network in the TensorFlow models repository,” writes Murray.
“Using the distributed trainer, we trained the Inception network to 78 per cent accuracy in less than 65 hours using 100 GPUs. Even small clusters – or a couple of machines under your desk – can benefit from distributed TensorFlow, since adding more GPUs improves the overall throughput, and produces accurate results sooner.”
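For a sense of what running TensorFlow across a couple of machines involves, here is a minimal sketch using the cluster API that shipped with 0.8. The host addresses are hypothetical; a real deployment would substitute its own machines.

```python
import tensorflow as tf

# Hypothetical two-machine cluster: one parameter server ("ps") task
# holds the model variables, one "worker" task does the computation.
cluster = tf.train.ClusterSpec({
    "ps": ["192.168.0.1:2222"],
    "worker": ["192.168.0.2:2222"],
})

# Each process in the cluster starts a server for its own task;
# this is the worker's copy.
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Pin the variables to the parameter server and the maths to the worker.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

with tf.device("/job:worker/task:0"):
    inputs = tf.placeholder(tf.float32, shape=[None, 784])
    logits = tf.matmul(inputs, weights)

# Connecting a session to the server's target runs the graph
# across both machines.
with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
```

The same script, pointed at more addresses, scales from two boxes under a desk to a rack of GPU machines.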
Murray adds that the distributed trainer also enables you to scale out training using a cluster management system like Kubernetes. “Furthermore, once you have trained your model, you can deploy to production and speed up inference using TensorFlow Serving on Kubernetes.”
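The cluster-manager workflow Murray describes comes down to launching one such process per container and telling each one which task it is. A common pattern, sketched below with hypothetical flag names and addresses, is to pass the job name and task index in on the command line:

```python
import tensorflow as tf

# Hypothetical flags; a cluster manager such as Kubernetes would set
# these per container when it launches each task.
tf.app.flags.DEFINE_string("job_name", "worker", "Either 'ps' or 'worker'.")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of this task within its job.")
FLAGS = tf.app.flags.FLAGS


def main(_):
    # Hypothetical DNS names, e.g. Kubernetes service names.
    cluster = tf.train.ClusterSpec({
        "ps": ["ps-0:2222"],
        "worker": ["worker-0:2222", "worker-1:2222"],
    })
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)
    if FLAGS.job_name == "ps":
        server.join()  # Parameter servers block, serving variables.
    else:
        # Workers would build the training graph here and drive it
        # via tf.Session(server.target).
        pass


if __name__ == "__main__":
    tf.app.run()
```

Every replica runs the same binary; only the flags differ, which is exactly the shape of workload a cluster scheduler is built to stamp out.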
Beyond distributed Inception, the 0.8 release includes new libraries for defining your own distributed models. TensorFlow’s distributed architecture permits “a great deal of flexibility in defining your model, because every process in the cluster can perform general-purpose computation”, writes Murray.
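One helper along these lines is tf.train.replica_device_setter, which automates the parameter-server placement pattern. As a hedged sketch: with one "ps" task, it pins each new variable to the parameter server while leaving the rest of the graph on the local worker, so the same model code runs unchanged on any cluster shape.

```python
import tensorflow as tf

# replica_device_setter assigns variables to /job:ps (round-robin if
# there are several ps tasks) and leaves other ops on the worker.
with tf.device(tf.train.replica_device_setter(ps_tasks=1)):
    weights = tf.Variable(tf.zeros([784, 10]))    # placed on /job:ps/task:0
    biases = tf.Variable(tf.zeros([10]))          # placed on /job:ps/task:0
    inputs = tf.placeholder(tf.float32, [None, 784])
    logits = tf.matmul(inputs, weights) + biases  # stays on the worker
```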