Horovod
Distributed TensorFlow Made Easy
Alex Sergeev, Machine Learning Platform, Uber Engineering
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and many more!
TensorFlow
● Most popular open source framework for Deep Learning
● Combines high performance with ability to tinker with low
level model details
● Has end-to-end support from research to production
Going Distributed
● Speed up model training
● Train very large models
● Vast majority of use cases are
data-parallel
● Facebook demonstrated
training ResNet on ImageNet
in 1 hour
Parameter Server Technique
tf.Server()
tf.ClusterSpec()
tf.train.replica_device_setter()
tf.train.SyncReplicasOptimizer()
Parameter Server
Worker GPU Towers
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
Parameter Server Technique - Performance
Considering the ImageNet dataset of 1.3M images, this allows training ResNet-101 for one
epoch in 3.5 minutes. However, scaling efficiency on 128 GPUs is only 42%.
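The figures above can be turned into throughput numbers with a little arithmetic. The sketch below assumes the common definition of scaling efficiency (measured cluster throughput divided by N times single-GPU throughput); the slide does not spell out its definition, and the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic for the parameter server numbers.
# Assumption: efficiency = cluster_throughput / (num_gpus * single_gpu_throughput).

def cluster_throughput(images_per_epoch, epoch_minutes):
    """Images processed per second across the whole cluster."""
    return images_per_epoch / (epoch_minutes * 60.0)


def implied_single_gpu_throughput(throughput, num_gpus, efficiency):
    """Single-GPU images/sec implied by a cluster throughput and efficiency."""
    return throughput / (num_gpus * efficiency)


def scaling_efficiency(throughput, num_gpus, single_gpu_throughput):
    """Fraction of ideal linear scaling that was actually achieved."""
    return throughput / (num_gpus * single_gpu_throughput)


IMAGES = 1_300_000   # ImageNet size quoted on the slide
NUM_GPUS = 128

ps_throughput = cluster_throughput(IMAGES, 3.5)
single_gpu = implied_single_gpu_throughput(ps_throughput, NUM_GPUS, 0.42)
ideal = single_gpu * NUM_GPUS

print(f"cluster throughput: {ps_throughput:.0f} images/sec")
print(f"implied single GPU: {single_gpu:.0f} images/sec")
print(f"ideal at 128 GPUs:  {ideal:.0f} images/sec")
```

The single-GPU figure is derived from the rounded slide numbers, so it is an estimate rather than a measured value.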
How Can We Do Better?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (RoCE, InfiniBand)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by work of Baidu, Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (RoCE, InfiniBand) if available
● Seamlessly installs on top of TensorFlow via
pip install horovod
● Named after traditional Russian folk dance where
participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
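The cited paper describes a ring-based all-reduce. The sketch below simulates it in a single Python process, assuming each worker holds a list of numbers of equal length; it is an illustration of the idea, not Horovod's implementation, and `ring_allreduce` is a hypothetical name.

```python
def ring_allreduce(worker_data):
    """Simulate a ring all-reduce (sum) over equal-length lists, one per worker.

    Phase 1 (scatter-reduce): after n-1 steps, worker i holds the fully
    summed chunk (i + 1) % n.
    Phase 2 (all-gather): after n-1 more steps, every worker holds every
    summed chunk.
    """
    n = len(worker_data)
    length = len(worker_data[0])
    # Split the vector into n contiguous chunks (some may be empty).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]
    data = [list(v) for v in worker_data]

    def exchange(step, offset, combine):
        # All sends in a step happen "simultaneously": snapshot first.
        messages = []
        for rank in range(n):
            chunk = (rank + offset - step) % n
            lo, hi = bounds[chunk]
            messages.append((chunk, data[rank][lo:hi]))
        for rank, (chunk, payload) in enumerate(messages):
            dst = (rank + 1) % n  # each worker sends to its right neighbour
            lo, _ = bounds[chunk]
            for k, value in enumerate(payload):
                data[dst][lo + k] = combine(data[dst][lo + k], value)

    for step in range(n - 1):  # scatter-reduce: accumulate partial sums
        exchange(step, 0, lambda mine, incoming: mine + incoming)
    for step in range(n - 1):  # all-gather: overwrite with finished chunks
        exchange(step, 1, lambda mine, incoming: incoming)
    return data


gradients = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
    [1000, 2000, 3000, 4000, 5000],
]
reduced = ring_allreduce(gradients)
print(reduced[0])
```

Each worker sends one chunk per step for 2(N-1) steps, so the data sent per worker is about 2(N-1)/N times the vector size and stays nearly constant as workers are added.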
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
                                       config=config, hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)
Horovod Example Cont.
● Run on a 4 GPU machine:
○ $ mpirun -np 4 python train.py
● Run on 4 machines with 4 GPUs each using Open MPI:
○ $ mpirun -np 16 -x LD_LIBRARY_PATH \
      -H server1:4,server2:4,server3:4,server4:4 \
      python train.py
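The multi-machine command starts 16 processes, and each one pins a GPU by its local rank. A minimal sketch of how global ranks could map to hosts and local ranks, assuming ranks are assigned in order and each host's slots are filled before moving to the next (the launcher's actual placement depends on its mapping policy); `parse_host_list` and `rank_layout` are hypothetical helpers, not part of MPI or Horovod:

```python
def parse_host_list(spec):
    """Parse an 'host:slots,host:slots' string as passed to mpirun -H."""
    hosts = []
    for entry in spec.split(","):
        name, slots = entry.split(":")
        hosts.append((name, int(slots)))
    return hosts


def rank_layout(hosts):
    """Return (global_rank, host, local_rank) tuples, filling hosts in order."""
    layout = []
    rank = 0
    for host, slots in hosts:
        for local_rank in range(slots):
            layout.append((rank, host, local_rank))
            rank += 1
    return layout


layout = rank_layout(parse_host_list("server1:4,server2:4,server3:4,server4:4"))
for rank, host, local_rank in layout:
    print(f"rank {rank:2d} -> {host}, GPU {local_rank}")
```

Under this assumption, the value returned by hvd.local_rank() in the example script corresponds to the last column, which is why it can be used directly as the visible GPU index.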
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching that gives large
gains (bigger gain on less optimized networks)
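The batching idea behind Tensor Fusion can be sketched as greedy packing: consecutive small tensors are grouped until a buffer is full, and each group is reduced in one operation. This is an illustration under assumed tensor names, sizes, and buffer size, not Horovod's actual implementation.

```python
def fuse_tensors(tensor_sizes, buffer_bytes):
    """Greedily group consecutive tensors so each group fits in buffer_bytes.

    tensor_sizes: list of (name, size_in_bytes) in the order they become ready.
    A tensor larger than the buffer ends up in a group of its own.
    """
    batches, current, used = [], [], 0
    for name, size in tensor_sizes:
        # Flush the current group if this tensor would overflow the buffer.
        if current and used + size > buffer_bytes:
            batches.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        batches.append(current)
    return batches


# Hypothetical gradient tensors: several tiny ones and a few larger ones.
ready = [("bn1", 4), ("bn2", 4), ("conv1", 60), ("bn3", 4), ("bn4", 4), ("fc", 200)]
batches = fuse_tensors(ready, buffer_bytes=64)
print(f"{len(ready)} tensors -> {len(batches)} reductions: {batches}")
```

Fewer, larger reductions amortize the fixed per-operation cost, which is why models with many tiny tensors benefit the most.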
Horovod Performance
With Horovod, the same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency improves to 88%, making it about twice as efficient as standard distributed TF.
Horovod Performance Cont.
RDMA further helps to improve efficiency - by 30% for VGG-16.
Practical Results
● Used learning rate adjustment technique described in the
Facebook paper “Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour”
● Trained convolutional networks and LSTMs in hours
instead of days or weeks with the same final accuracy
● You can do that, too!
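The learning rate adjustment from the Facebook paper combines a linear scaling rule (multiply the learning rate by k when the total minibatch grows k times) with a gradual warmup over the first few epochs. A minimal sketch, assuming each worker keeps its per-worker batch size so that k equals the number of workers; the function name and defaults are illustrative, and later learning rate decay is left out.

```python
def learning_rate(epoch, base_lr, num_workers, warmup_epochs=5.0):
    """Linear scaling rule with gradual warmup.

    epoch may be fractional (e.g. 0.5 for halfway through the first epoch).
    The rate ramps linearly from base_lr to base_lr * num_workers during
    warmup, then stays at the scaled value.
    """
    target = base_lr * num_workers
    if epoch < warmup_epochs:
        return base_lr + (target - base_lr) * epoch / warmup_epochs
    return target


for epoch in [0, 1, 2.5, 5, 10]:
    print(f"epoch {epoch}: lr = {learning_rate(epoch, 0.1, 8):.3f}")
```

In the script from the Horovod Example slide, a value like this would be passed to the optimizer in place of the fixed 0.01.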
Giving Back
Horovod is available on GitHub today
https://github.com/uber/horovod
Thank you!
Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod
Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
