SlideShare a Scribd company logo
6
Most read
8
Most read
19
Most read
Horovod
Distributed TensorFlow Made Easy
Alex Sergeev, Machine Learning Platform, Uber Engineering
Deep Learning @ Uber
● Self-Driving Vehicles
● Trip Forecasting
● Fraud Detection
● … and many more!
TensorFlow
● Most popular open source framework for Deep Learning
● Combines high performance with ability to tinker with low
level model details
● Has end-to-end support from research to production
Going Distributed
● Speed up model training
● Train very large models
● Vast majority of use cases are
data-parallel
● Facebook demonstrated
training ResNet on ImageNet
in 1 hour
Parameter Server Technique
tf.Server()
tf.ClusterSpec()
tf.train.replicas_device_setter()
tf.train.SyncReplicasOptimizer()
Parameter Server
Worker GPU Towers
Parameter Server Technique - Example Script
Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
Parameter Server Technique - Performance
Considering ImageNet dataset of 1.3M images, this allows to train ResNet-101 for one
epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
How Can We Do Better?
● Re-think necessary complexity for data-parallel case
● Improve communication algorithm
● Use RDMA-capable networking (RoCE, InfiniBand)
Meet Horovod
● Distributed training framework for TensorFlow
● Inspired by work of Baidu, Facebook, et al.
● Uses bandwidth-optimal communication protocols
○ Makes use of RDMA (RoCE, InfiniBand) if available
● Seamlessly installs on top of TensorFlow via
pip install horovod
● Named after traditional Russian folk dance where
participants dance in a circle with linked hands
Horovod Technique
Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations.
Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
Horovod Stack
● Plugs into TensorFlow via custom op mechanism
● Uses MPI for worker discovery and reduction coordination
● Uses NVIDIA NCCL for actual reduction on the server and across servers
Horovod Example
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
# Pin GPU to be used
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Add hook to broadcast variables from rank 0 to all other processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# Make training operation
train_op = opt.minimize(loss)
# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs",
config=config, hooks=hooks) as mon_sess:
while not mon_sess.should_stop():
# Perform synchronous training.
mon_sess.run(train_op)
Horovod Example Cont.
● Run on a 4 GPU machine:
○ $ mpirun -np 4 python train.py
● Run on 4 machines with 4 GPUs each using Open MPI:
○ $ mpirun -np 16 -x LD_LIBRARY_PATH 
-H server1:4,server2:4,server3:4,server4:4 
python train.py
Debugging - Horovod Timeline
● Discovered that ResNet-152 has a lot of tiny tensors
● Added Tensor Fusion - smart batching that gives large
gains (bigger gain on less optimized networks)
Horovod Performance
With Horovod, same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes.
Scaling efficiency is improved to 88%, making it twice as efficient as standard distributed TF.
Horovod Performance Cont.
RDMA further helps to improve efficiency - by 30% for VGG-16.
Practical Results
● Used learning rate adjustment technique described in the
Facebook paper “Accurate, Large Minibatch SGD: Training
ImageNet in 1 Hour”
● Trained convolutional networks and LSTMs in hours
instead of days or weeks with the same final accuracy
● You can do that, too!
Giving Back
Horovod is available on GitHub today
https://github.com/uber/horovod
Thank you!
Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod
Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information to
any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.

More Related Content

PPTX
Empowering developers and operators through Gitlab and HashiCorp
Mitchell Pronschinske
 
PDF
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
PDF
Pulsar Storage on BookKeeper _Seamless Evolution
StreamNative
 
PDF
Data Parallel Deep Learning
inside-BigData.com
 
PDF
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
PPTX
Information Extraction
Ignacio Delgado
 
PDF
Arquitectura en Alfresco
Antonio de la Torre Fernández
 
PDF
Introduction to Databases - query optimizations for MySQL
Márton Kodok
 
Empowering developers and operators through Gitlab and HashiCorp
Mitchell Pronschinske
 
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Pulsar Storage on BookKeeper _Seamless Evolution
StreamNative
 
Data Parallel Deep Learning
inside-BigData.com
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Information Extraction
Ignacio Delgado
 
Arquitectura en Alfresco
Antonio de la Torre Fernández
 
Introduction to Databases - query optimizations for MySQL
Márton Kodok
 

What's hot (20)

PDF
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
PDF
Grokking TechTalk #31: Asynchronous Communications
Grokking VN
 
PPTX
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
PDF
Mainframe Integration, Offloading and Replacement with Apache Kafka
Kai Wähner
 
PPTX
Data Parallel and Object Oriented Model
Nikhil Sharma
 
PPTX
6.hive
Prashant Gupta
 
PPTX
Challenges in Building a Data Pipeline
Manish Kumar
 
PDF
Apache spark
shima jafari
 
PDF
Hyper threading technology
Nikhil Venugopal
 
PPTX
Introduction to GCP presentation
Mohit Kachhwani
 
PDF
How Apache Kafka® Works
confluent
 
PDF
Productionzing ML Model Using MLflow Model Serving
Databricks
 
PPTX
Actor Model & Reactive Manifesto
Angelo Simone Scotto
 
PPTX
PARALLELISM IN MULTICORE PROCESSORS
Amirthavalli Senthil
 
PPT
Google Megastore
bergwolf
 
PDF
Our answer to Uber
Alexander Korotkov
 
PPT
Pipeline parallelism
Dr. C.V. Suresh Babu
 
PDF
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
PDF
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
PPTX
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kai Wähner
 
Grokking TechTalk #31: Asynchronous Communications
Grokking VN
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
Mainframe Integration, Offloading and Replacement with Apache Kafka
Kai Wähner
 
Data Parallel and Object Oriented Model
Nikhil Sharma
 
Challenges in Building a Data Pipeline
Manish Kumar
 
Apache spark
shima jafari
 
Hyper threading technology
Nikhil Venugopal
 
Introduction to GCP presentation
Mohit Kachhwani
 
How Apache Kafka® Works
confluent
 
Productionzing ML Model Using MLflow Model Serving
Databricks
 
Actor Model & Reactive Manifesto
Angelo Simone Scotto
 
PARALLELISM IN MULTICORE PROCESSORS
Amirthavalli Senthil
 
Google Megastore
bergwolf
 
Our answer to Uber
Alexander Korotkov
 
Pipeline parallelism
Dr. C.V. Suresh Babu
 
[OpenStack Days Korea 2016] Track1 - All flash CEPH 구성 및 최적화
OpenStack Korea Community
 
Migrating Apache Hive Workload to Apache Spark: Bridge the Gap with Zhan Zhan...
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Ad

Similar to Horovod - Distributed TensorFlow Made Easy (20)

PDF
Uber's Journey in Distributed Deep Learning
inside-BigData.com
 
PDF
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
Bill Liu
 
PDF
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Databricks
 
PDF
2018 TensorFlow Summit Recap (GDG Shanghai)
Jiang Jun
 
PDF
TensorFlow GPU_ A Comprehensive Guide to Boosting AI Tasks.pdf
GPU SERVER
 
PDF
C3 w3
Ajay Taneja
 
PPTX
Leonid Kuligin "Training ML models with Cloud"
Lviv Startup Club
 
PDF
TensorFlow Lite for mobile & IoT
Mia Chang
 
PPTX
GPU and Deep learning best practices
Lior Sidi
 
PPTX
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
Provectus
 
PPTX
Distributed Deep learning Training.
Umang Sharma
 
PPTX
PR-129: Horovod: fast and easy distributed deep learning in TensorFlow
Seoul National University
 
PDF
Toronto meetup 20190917
Bill Liu
 
PDF
running Tensorflow in Production
Matthias Feys
 
PPTX
Getting Started with TensorFlow on Google Cloud
Mariam Aslam
 
PDF
Large Scale Deep Learning with TensorFlow
Jen Aman
 
PDF
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
Altoros
 
PDF
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
PDF
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio, Inc.
 
PPTX
Deep Learning with Spark and GPUs
DataWorks Summit
 
Uber's Journey in Distributed Deep Learning
inside-BigData.com
 
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
Bill Liu
 
Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
Databricks
 
2018 TensorFlow Summit Recap (GDG Shanghai)
Jiang Jun
 
TensorFlow GPU_ A Comprehensive Guide to Boosting AI Tasks.pdf
GPU SERVER
 
Leonid Kuligin "Training ML models with Cloud"
Lviv Startup Club
 
TensorFlow Lite for mobile & IoT
Mia Chang
 
GPU and Deep learning best practices
Lior Sidi
 
«Training Deep Learning Models on Multi-GPUs Systems», Dmitry Spodarets.
Provectus
 
Distributed Deep learning Training.
Umang Sharma
 
PR-129: Horovod: fast and easy distributed deep learning in TensorFlow
Seoul National University
 
Toronto meetup 20190917
Bill Liu
 
running Tensorflow in Production
Matthias Feys
 
Getting Started with TensorFlow on Google Cloud
Mariam Aslam
 
Large Scale Deep Learning with TensorFlow
Jen Aman
 
How to Run TensorFlow Cheaper in the Cloud Using Elastic GPUs
Altoros
 
Deep Learning with Apache Spark and GPUs with Pierce Spitler
Databricks
 
Alluxio Webinar - Maximize GPU Utilization for Model Training
Alluxio, Inc.
 
Deep Learning with Spark and GPUs
DataWorks Summit
 
Ad

Recently uploaded (20)

PDF
ShowUs: Pharo Stream Deck (ESUG 2025, Gdansk)
ESUG
 
PPTX
The-Dawn-of-AI-Reshaping-Our-World.pptxx
parthbhanushali307
 
PDF
Enhancing Healthcare RPM Platforms with Contextual AI Integration
Cadabra Studio
 
PDF
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
PDF
Bandai Playdia The Book - David Glotz
BluePanther6
 
PPT
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
PDF
vAdobe Premiere Pro 2025 (v25.2.3.004) Crack Pre-Activated Latest
imang66g
 
PDF
Balancing Resource Capacity and Workloads with OnePlan – Avoid Overloading Te...
OnePlan Solutions
 
PDF
advancepresentationskillshdhdhhdhdhdhhfhf
jasmenrojas249
 
PPTX
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
PPTX
ConcordeApp: Engineering Global Impact & Unlocking Billions in Event ROI with AI
chastechaste14
 
PPTX
slidesgo-unlocking-the-code-the-dynamic-dance-of-variables-and-constants-2024...
kr2589474
 
PDF
WatchTraderHub - Watch Dealer software with inventory management and multi-ch...
WatchDealer Pavel
 
PDF
On Software Engineers' Productivity - Beyond Misleading Metrics
Romén Rodríguez-Gil
 
PDF
New Download MiniTool Partition Wizard Crack Latest Version 2025
imang66g
 
PDF
49785682629390197565_LRN3014_Migrating_the_Beast.pdf
Abilash868456
 
PPTX
oapresentation.pptx
mehatdhavalrajubhai
 
PDF
Generating Union types w/ Static Analysis
K. Matthew Dupree
 
PDF
lesson-2-rules-of-netiquette.pdf.bshhsjdj
jasmenrojas249
 
PDF
Teaching Reproducibility and Embracing Variability: From Floating-Point Exper...
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
ShowUs: Pharo Stream Deck (ESUG 2025, Gdansk)
ESUG
 
The-Dawn-of-AI-Reshaping-Our-World.pptxx
parthbhanushali307
 
Enhancing Healthcare RPM Platforms with Contextual AI Integration
Cadabra Studio
 
Adobe Illustrator Crack Full Download (Latest Version 2025) Pre-Activated
imang66g
 
Bandai Playdia The Book - David Glotz
BluePanther6
 
Why Reliable Server Maintenance Service in New York is Crucial for Your Business
Sam Vohra
 
vAdobe Premiere Pro 2025 (v25.2.3.004) Crack Pre-Activated Latest
imang66g
 
Balancing Resource Capacity and Workloads with OnePlan – Avoid Overloading Te...
OnePlan Solutions
 
advancepresentationskillshdhdhhdhdhdhhfhf
jasmenrojas249
 
GALILEO CRS SYSTEM | GALILEO TRAVEL SOFTWARE
philipnathen82
 
ConcordeApp: Engineering Global Impact & Unlocking Billions in Event ROI with AI
chastechaste14
 
slidesgo-unlocking-the-code-the-dynamic-dance-of-variables-and-constants-2024...
kr2589474
 
WatchTraderHub - Watch Dealer software with inventory management and multi-ch...
WatchDealer Pavel
 
On Software Engineers' Productivity - Beyond Misleading Metrics
Romén Rodríguez-Gil
 
New Download MiniTool Partition Wizard Crack Latest Version 2025
imang66g
 
49785682629390197565_LRN3014_Migrating_the_Beast.pdf
Abilash868456
 
oapresentation.pptx
mehatdhavalrajubhai
 
Generating Union types w/ Static Analysis
K. Matthew Dupree
 
lesson-2-rules-of-netiquette.pdf.bshhsjdj
jasmenrojas249
 
Teaching Reproducibility and Embracing Variability: From Floating-Point Exper...
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 

Horovod - Distributed TensorFlow Made Easy

  • 1. Horovod Distributed TensorFlow Made Easy Alex Sergeev, Machine Learning Platform, Uber Engineering
  • 2. Deep Learning @ Uber ● Self-Driving Vehicles ● Trip Forecasting ● Fraud Detection ● … and many more!
  • 3. TensorFlow ● Most popular open source framework for Deep Learning ● Combines high performance with ability to tinker with low level model details ● Has end-to-end support from research to production
  • 4. Going Distributed ● Speed up model training ● Train very large models ● Vast majority of use cases are data-parallel ● Facebook demonstrated training ResNet on ImageNet in 1 hour
  • 6. Parameter Server Technique - Example Script Image Source: TensorFlow -- https://www.tensorflow.org/deploy/distributed
  • 7. Parameter Server Technique - Performance Considering ImageNet dataset of 1.3M images, this allows to train ResNet-101 for one epoch in 3.5 minutes. Scaling efficiency on 128 GPUs is only 42%, however.
  • 8. How Can We Do Better? ● Re-think necessary complexity for data-parallel case ● Improve communication algorithm ● Use RDMA-capable networking (RoCE, InfiniBand)
  • 9. Meet Horovod ● Distributed training framework for TensorFlow ● Inspired by work of Baidu, Facebook, et al. ● Uses bandwidth-optimal communication protocols ○ Makes use of RDMA (RoCE, InfiniBand) if available ● Seamlessly installs on top of TensorFlow via pip install horovod ● Named after traditional Russian folk dance where participants dance in a circle with linked hands
  • 10. Horovod Technique Patarasuk, P., & Yuan, X. (2009). Bandwidth optimal all-reduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing, 69(2), 117-124. doi:10.1016/j.jpdc.2008.09.002
  • 11. Horovod Stack ● Plugs into TensorFlow via custom op mechanism ● Uses MPI for worker discovery and reduction coordination ● Uses NVIDIA NCCL for actual reduction on the server and across servers
  • 12. Horovod Example import tensorflow as tf import horovod.tensorflow as hvd # Initialize Horovod hvd.init() # Pin GPU to be used config = tf.ConfigProto() config.gpu_options.visible_device_list = str(hvd.local_rank()) # Build model... loss = ... opt = tf.train.AdagradOptimizer(0.01) # Add Horovod Distributed Optimizer opt = hvd.DistributedOptimizer(opt) # Add hook to broadcast variables from rank 0 to all other processes during initialization. hooks = [hvd.BroadcastGlobalVariablesHook(0)] # Make training operation train_op = opt.minimize(loss) # The MonitoredTrainingSession takes care of session initialization, # restoring from a checkpoint, saving to a checkpoint, and closing when done # or an error occurs. with tf.train.MonitoredTrainingSession(checkpoint_dir="/tmp/train_logs", config=config, hooks=hooks) as mon_sess: while not mon_sess.should_stop(): # Perform synchronous training. mon_sess.run(train_op)
  • 13. Horovod Example Cont. ● Run on a 4 GPU machine: ○ $ mpirun -np 4 python train.py ● Run on 4 machines with 4 GPUs each using Open MPI: ○ $ mpirun -np 16 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py
  • 14. Debugging - Horovod Timeline ● Discovered that ResNet-152 has a lot of tiny tensors ● Added Tensor Fusion - smart batching that gives large gains (bigger gain on less optimized networks)
  • 15. Horovod Performance With Horovod, same ResNet-101 can be trained for one epoch on ImageNet in 1.5 minutes. Scaling efficiency is improved to 88%, making it twice as efficient as standard distributed TF.
  • 16. Horovod Performance Cont. RDMA further helps to improve efficiency - by 30% for VGG-16.
  • 17. Practical Results ● Used learning rate adjustment technique described in the Facebook paper “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour” ● Trained convolutional networks and LSTMs in hours instead of days or weeks with the same final accuracy ● You can do that, too!
  • 18. Giving Back Horovod is available on GitHub today https://github.com/uber/horovod
  • 19. Thank you! Learn more about Horovod on our Eng Blog: https://eng.uber.com/horovod Learn more about ML at Uber on YouTube: http://t.uber.com/ml-meetup
  • 20. Proprietary and confidential © 2017 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber.