SlideShare a Scribd company logo
6
Most read
8
Most read
19
Most read
Horovod
Distributed TensorFlow Made Easy
Alex Sergeev, Machine Learning Platform, Uber Engineering
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and many more!
TensorFlow
● Most popular open source framework for Deep Learning
● Combines high performance with ability to tinker with low
level model details
● Has end-to-end support from research to production
Going Distributed
● Speed up model training
● Train very large models
● Vast majority of use cases are
data-parallel
● Facebook demonstrated
training ResNet on ImageNet
in 1 hour
Parameter Server Technique
tf.Server()
tf.ClusterSpec()
tf.train.replicas_device_setter()
tf.train.SyncReplicasOptimizer()
Parameter Server
Worker GPU Towers
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
Parameter Server Technique - Performance
Considering ImageNet dataset of 1.3M images, this allows to train ResNet-101 for one
epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
How Can We Do Better?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (RoCE, InfiniBand)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by work of Baidu, Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (RoCE, InfiniBand) if available
● Seamlessly installs on top of TensorFlow via
pip install horovod
● Named after traditional Russian folk dance where
participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
config=config, hooks=hooks) as mon_sess:
while not mon_sess.should_stop():
# Perform synchronous training.
mon_sess.run(train_op)
Horovod Example Cont.
● Run on a 4 GPU machine:
○ $ mpirun -np 4 python train.py
● Run on 4 machines with 4 GPUs each using Open MPI:
○ $ mpirun -np 16 -x LD_LIBRARY_PATH 
-H server1:4,server2:4,server3:4,server4:4 
python train.py
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching that gives large
gains (bigger gain on less optimized networks)
Horovod Performance
With Horovod, same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency is improved to 88%, making it twice as efficient as standard distributed TF.
Horovod Performance Cont.
RDMA further helps to improve efficiency - by 30% for VGG-16.
Practical Results
● Used learning rate adjustment technique described in the
Facebook paper “Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour”
● Trained convolutional networks and LSTMs in hours
instead of days or weeks with the same final accuracy
● You can do that, too!
Giving Back
Horovod is available on GitHub today
https://github.com/uber/horovod
Thank you!
Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod
Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

More Related Content

PDF
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
PPTX
Empowering developers and operators through Gitlab and HashiCorp
Mitchell Pronschinske
 
PDF
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
PDF
Pulsar Storage on BookKeeper _Seamless Evolution
StreamNative
 
PDF
Data Parallel Deep Learning
inside-BigData.com
 
PPTX
Information Extraction
Ignacio Delgado
 
PDF
Arquitectura en Alfresco
Antonio de la Torre Fernández
 
PDF
Introduction to Databases - query optimizations for MySQL
Márton Kodok
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Empowering developers and operators through Gitlab and HashiCorp
Mitchell Pronschinske
 
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Pulsar Storage on BookKeeper _Seamless Evolution
StreamNative
 
Data Parallel Deep Learning
inside-BigData.com
 
Information Extraction
Ignacio Delgado
 
Arquitectura en Alfresco
Antonio de la Torre Fernández
 
Introduction to Databases - query optimizations for MySQL
Márton Kodok
 

What's hot (20)

PDF
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
PDF
Grokking TechTalk #31: Asynchronous Communications
Grokking VN
 
PPTX
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
PDF
Mainframe Integration, Offloading and Replacement with Apache Kafka
Kai Wähner
 
PPTX
Data Parallel and Object Oriented Model
Nikhil Sharma
 
PPTX
6.hive
Prashant Gupta
 
PPTX
Challenges in Building a Data Pipeline
Manish Kumar
 
PDF
Apache spark
shima jafari
 
PDF
Hyper threading technology
Nikhil Venugopal
 
PPTX
Introduction to GCP presentation
Mohit Kachhwani
 
PDF
How Apache Kafka® Works
confluent
 
PDF
Productionzing ML Model Using MLflow Model Serving
Databricks
 
PPTX
Actor Model & Reactive Manifesto
Angelo Simone Scotto
 
PPTX
PARALLELISM IN MULTICORE PROCESSORS
Amirthavalli Senthil
 
PPT
Google Megastore
bergwolf
 
PDF
Our answer to Uber
Alexander Korotkov
 
PPT
Pipeline parallelism
Dr. C.V. Suresh Babu
 
PDF
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
PDF
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Grokking TechTalk #31: Asynchronous Communications
Grokking VN
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
Mainframe Integration, Offloading and Replacement with Apache Kafka
Kai Wähner
 
Data Parallel and Object Oriented Model
Nikhil Sharma
 
Challenges in Building a Data Pipeline
Manish Kumar
 
Apache spark
shima jafari
 
Hyper threading technology
Nikhil Venugopal
 
Introduction to GCP presentation
Mohit Kachhwani
 
How Apache Kafka® Works
confluent
 
Productionzing ML Model Using MLflow Model Serving
Databricks
 
Actor Model & Reactive Manifesto
Angelo Simone Scotto
 
PARALLELISM IN MULTICORE PROCESSORS
Amirthavalli Senthil
 
Google Megastore
bergwolf
 
Our answer to Uber
Alexander Korotkov
 
Pipeline parallelism
Dr. C.V. Suresh Babu
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Ad

Similar to Horovod - Distributed TensorFlow Made Easy (20)

PDF
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
Bill Liu
 
PDF
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Databricks
 
PDF
Uber's Journey in Distributed Deep Learning
inside-BigData.com
 
PPTX
PR-129: Horovod: fast and easy distributed deep learning in TensorFlow
Seoul National University
 
PPTX
Distributed Model Training using MXNet with Horovod
Lin Yuan
 
PDF
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Databricks
 
PDF
Democratizing machine learning on kubernetes
Docker, Inc.
 
PDF
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Amazon Web Services Korea
 
PDF
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
Databricks
 
PDF
End-to-End Deep Learning with Horovod on Apache Spark
Databricks
 
PDF
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
Chris Fregly
 
PPTX
Distributed Deep learning Training.
Umang Sharma
 
PDF
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 
PDF
TensorFlow example for AI Ukraine2016
Andrii Babii
 
PDF
Tensorflow 2.0 and Coral Edge TPU
Andrés Leonardo Martinez Ortiz
 
PPTX
Deep cv 101
Xiaohu ZHU
 
PDF
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Edureka!
 
PDF
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
PDF
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Chris Fregly
 
PDF
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
DataWorks Summit
 
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
Bill Liu
 
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Databricks
 
Uber's Journey in Distributed Deep Learning
inside-BigData.com
 
PR-129: Horovod: fast and easy distributed deep learning in TensorFlow
Seoul National University
 
Distributed Model Training using MXNet with Horovod
Lin Yuan
 
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ...
Databricks
 
Democratizing machine learning on kubernetes
Docker, Inc.
 
Deep Learning 모델의 효과적인 분산 트레이닝과 모델 최적화 방법 - 김무현 데이터 사이언티스트, AWS :: AWS Summit...
Amazon Web Services Korea
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
Databricks
 
End-to-End Deep Learning with Horovod on Apache Spark
Databricks
 
High Performance Distributed TensorFlow with GPUs - TensorFlow Chicago Meetup...
Chris Fregly
 
Distributed Deep learning Training.
Umang Sharma
 
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -...
Chris Fregly
 
TensorFlow example for AI Ukraine2016
Andrii Babii
 
Tensorflow 2.0 and Coral Edge TPU
Andrés Leonardo Martinez Ortiz
 
Deep cv 101
Xiaohu ZHU
 
Introduction To TensorFlow | Deep Learning with TensorFlow | TensorFlow For B...
Edureka!
 
Resource-Efficient Deep Learning Model Selection on Apache Spark
Databricks
 
Optimize + Deploy Distributed Tensorflow, Spark, and Scikit-Learn Models on GPUs
Chris Fregly
 
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ...
DataWorks Summit
 
Ad

Recently uploaded (20)

PDF
Enhancing Healthcare RPM Platforms with Contextual AI Integration
Cadabra Studio
 
PDF
What to consider before purchasing Microsoft 365 Business Premium_PDF.pdf
Q-Advise
 
PDF
Applitools Platform Pulse: What's New and What's Coming - July 2025
Applitools
 
PDF
New Download MiniTool Partition Wizard Crack Latest Version 2025
imang66g
 
PPTX
Contractor Management Platform and Software Solution for Compliance
SHEQ Network Limited
 
PPTX
Can You Build Dashboards Using Open Source Visualization Tool.pptx
Varsha Nayak
 
DOCX
Can You Build Dashboards Using Open Source Visualization Tool.docx
Varsha Nayak
 
PPT
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
PDF
Generating Union types w/ Static Analysis
K. Matthew Dupree
 
PDF
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
PPTX
Visualising Data with Scatterplots in IBM SPSS Statistics.pptx
Version 1 Analytics
 
PDF
Key Features to Look for in Arizona App Development Services
Net-Craft.com
 
PDF
Download iTop VPN Free 6.1.0.5882 Crack Full Activated Pre Latest 2025
imang66g
 
PPTX
AI-Ready Handoff: Auto-Summaries & Draft Emails from MQL to Slack in One Flow
bbedford2
 
PPTX
oapresentation.pptx
mehatdhavalrajubhai
 
PPTX
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
PDF
49785682629390197565_LRN3014_Migrating_the_Beast.pdf
Abilash868456
 
PDF
Bandai Playdia The Book - David Glotz
BluePanther6
 
PDF
An Experience-Based Look at AI Lead Generation Pricing, Features & B2B Results
Thomas albart
 
PPTX
ASSIGNMENT_1[1][1][1][1][1] (1) variables.pptx
kr2589474
 
Enhancing Healthcare RPM Platforms with Contextual AI Integration
Cadabra Studio
 
What to consider before purchasing Microsoft 365 Business Premium_PDF.pdf
Q-Advise
 
Applitools Platform Pulse: What's New and What's Coming - July 2025
Applitools
 
New Download MiniTool Partition Wizard Crack Latest Version 2025
imang66g
 
Contractor Management Platform and Software Solution for Compliance
SHEQ Network Limited
 
Can You Build Dashboards Using Open Source Visualization Tool.pptx
Varsha Nayak
 
Can You Build Dashboards Using Open Source Visualization Tool.docx
Varsha Nayak
 
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
Generating Union types w/ Static Analysis
K. Matthew Dupree
 
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
Visualising Data with Scatterplots in IBM SPSS Statistics.pptx
Version 1 Analytics
 
Key Features to Look for in Arizona App Development Services
Net-Craft.com
 
Download iTop VPN Free 6.1.0.5882 Crack Full Activated Pre Latest 2025
imang66g
 
AI-Ready Handoff: Auto-Summaries & Draft Emails from MQL to Slack in One Flow
bbedford2
 
oapresentation.pptx
mehatdhavalrajubhai
 
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
49785682629390197565_LRN3014_Migrating_the_Beast.pdf
Abilash868456
 
Bandai Playdia The Book - David Glotz
BluePanther6
 
An Experience-Based Look at AI Lead Generation Pricing, Features & B2B Results
Thomas albart
 
ASSIGNMENT_1[1][1][1][1][1] (1) variables.pptx
kr2589474
 

Horovod - Distributed TensorFlow Made Easy

  • 1. Horovod Distributed TensorFlow Made Easy Alex Sergeev, Machine Learning Platform, Uber Engineering
  • 2. Deep Learning @ Uber ● Self-Driving Vehicles ● Trip Forecasting ● Fraud Detection ● … and many more!
  • 3. TensorFlow ● Most popular open source framework for Deep Learning ● Combines high performance with ability to tinker with low level model details ● Has end-to-end support from research to production
  • 4. Going Distributed ● Speed up model training ● Train very large models ● Vast majority of use cases are data-parallel ● Facebook demonstrated training ResNet on ImageNet in 1 hour
  • 6. Parameter Server Technique - Example Script Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
  • 7. Parameter Server Technique - Performance Considering ImageNet dataset of 1.3M images, this allows to train ResNet-101 for one epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
  • 8. How Can We Do Better? ● Re-think necessary complexity for data-parallel case ● Improve communication algorithm ● Use RDMA-capable networking (RoCE, InfiniBand)
  • 9. Meet Horovod ● Distributed training framework for TensorFlow ● Inspired by work of Baidu, Facebook, et al. ● Uses bandwidth-optimal communication protocols ○ Makes use of RDMA (RoCE, InfiniBand) if available ● Seamlessly installs on top of TensorFlow via pip install horovod ● Named after traditional Russian folk dance where participants dance in a circle with linked hands
  • 10. Horovod Technique Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
  • 11. Horovod Stack ● Plugs into TensorFlow via custom op mechanism ● Uses MPI for worker discovery and reduction coordination ● Uses NVIDIA NCCL for actual reduction on the server and across servers
  • 12. Horovod Example import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... loss = ... opt = tf.train.AdagradOptimizer(0.01) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) # Add hook to broadcast variables from rank 0 to all other processes during initialization. hooks = [hvd.BroadcastGlobalVariablesHook(0)] # Make training operation train_op = opt.minimize(loss) # The MonitoredTrainingSession takes care of session initialization, # restoring from a checkpoint, saving to a checkpoint, and closing when done # or an error occurs. with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs", config=config, hooks=hooks) as mon_sess: while not mon_sess.should_stop(): # Perform synchronous training. mon_sess.run(train_op)
  • 13. Horovod Example Cont. ● Run on a 4 GPU machine: ○ $ mpirun -np 4 python train.py ● Run on 4 machines with 4 GPUs each using Open MPI: ○ $ mpirun -np 16 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py
  • 14. Debugging - Horovod Timeline ● Discovered that ResNet-152 has a lot of tiny tensors ● Added Tensor Fusion - smart batching that gives large gains (bigger gain on less optimized networks)
  • 15. Horovod Performance With Horovod, same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes. Scaling efficiency is improved to 88%, making it twice as efficient as standard distributed TF.
  • 16. Horovod Performance Cont. RDMA further helps to improve efficiency - by 30% for VGG-16.
  • 17. Practical Results ● Used learning rate adjustment technique described in the Facebook paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” ● Trained convolutional networks and LSTMs in hours instead of days or weeks with the same final accuracy ● You can do that, too!
  • 18. Giving Back Horovod is available on GitHub today https://github.com/uber/horovod
  • 19. Thank you! Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
  • 20. Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.