Advanced visualization of
Spark jobs
Zoltán Zvara
zoltan.zvara@sztaki.hu
Márton Balassi
marton.balassi@sztaki.hu
Agenda
• Introduction
• Motivation and the challenges
• The Spark execution visualization
• An elegant way to tackle data skew
• The visualizer extended to Flink
• Future plans
Introduction
• Hungarian Academy of Sciences (MTA SZTAKI)
• Research institute with strong industry ties
• Big Data projects using Spark, Flink, Cassandra, Hadoop etc.
• Multiple telco use cases lately, with challenging data volume and
distribution
Motivation
• We developed an application that aggregates telco data; it tested
well on toy data
• When deployed against the real dataset, the application seemed
healthy
• However, it could become surprisingly slow or even crash
• What went wrong?
Data skew
• When some data-points are very frequent, and we need to process
them on a single machine – aka the Bieber-effect
• Power-law distributions are very common: from social-media to
telecommunication, even in our beloved WordCount
• Default hashing puts these into "random" buckets
• We can do better than that
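A minimal sketch of the effect, assuming a made-up power-law-like workload and using `zlib.crc32` as a stable stand-in for a default hash partitioner (the key names, counts and partition count are illustrative and not taken from the talk):

```python
import zlib


def hash_partition(key: str, num_partitions: int) -> int:
    # crc32 is used instead of Python's built-in hash(), which is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_loads(key_counts: dict, num_partitions: int) -> list:
    # Every record of a key lands in the same partition, so a frequent key
    # drags its whole record count onto a single partition.
    loads = [0] * num_partitions
    for key, count in key_counts.items():
        loads[hash_partition(key, num_partitions)] += count
    return loads


# Hypothetical workload: key number i appears about 10000 / i times.
key_counts = {f"user{i}": 10000 // i for i in range(1, 1001)}
loads = partition_loads(key_counts, num_partitions=8)
average = sum(loads) / len(loads)
print("records per partition:", loads)
print("heaviest partition / average:", max(loads) / average)
```

Whichever bucket receives the most frequent key holds at least that key's 10000 records, which is already more than the average load per partition in this example.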
Aim
1. Most of the telco data-processing workloads suffer from inefficient
communication patterns, introduced by skewed data
2. Help developers write better applications by making issues in
logical and physical execution plans easier to detect
3. Help newcomers understand a distributed data-processing
system faster
4. Demonstrate and guide the testing of adaptive (re)partitioning
techniques
Spark’s execution model
• Extended MapReduce model, where a previous stage has to complete
all its tasks before the next stage can be scheduled (global
synchronization)
• Data skew can introduce slow tasks
• In most cases, the data distribution
is not known in advance
• "Concept drifts" are common
[Figure: a stage boundary & shuffle, comparing even partitioning with skewed data; the skewed partitions result in slow tasks]
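The cost of global synchronization can be illustrated with a toy model. This is a simplification that assumes a linear chain of stages and one task per execution slot; the function names and durations are hypothetical:

```python
def stage_duration(task_durations):
    # A stage is complete only when its slowest task is complete.
    return max(task_durations)


def job_duration(stages):
    # In this simplified model, stages run strictly one after another.
    return sum(stage_duration(tasks) for tasks in stages)


# Both jobs do 40 units of work in the second stage; only the split differs.
even = [[10, 10, 10, 10], [10, 10, 10, 10]]
skewed = [[10, 10, 10, 10], [4, 4, 4, 28]]

print("even partitioning:", job_duration(even))
print("skewed data:", job_duration(skewed))
```

With the same total work, the skewed job takes 38 time units instead of 20, because three slots sit idle while the single slow task finishes.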
Physical execution plan
• Understanding the physical plan
• can lead to designing a better application
• can uncover bottlenecks that are hard to identify otherwise
• can help to pick the right tool for the job at hand
• can be quite difficult for newcomers
• can give insight into new, adaptive, on-the-fly partitioning & optimization
strategies
Current visualizations in Spark
• Current visualization tools
• lack fine-grained input & output metrics, which are also unavailable in
most systems, including Spark
• do not capture data characteristics, which is hard to do with a lightweight
and efficient solution
• do not visualize the physical plan
DAG visualization in Spark
Collecting information during execution
• Data transfer between tasks is accomplished in
the shuffle phase, in the form of shuffle block fetches
• We distinguish local & remote block fetches
• Data characteristics collected on shuffle write & read
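As a rough illustration of the local/remote distinction, the sketch below aggregates fetched bytes by whether the block came from the reader's own host. The `BlockFetch` record and its fields are hypothetical and do not reflect Spark's internal metric classes:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockFetch:
    # Hypothetical record of one shuffle block fetch.
    source_host: str  # host that wrote the shuffle block
    reader_host: str  # host of the task that reads it
    num_bytes: int


def summarize_fetches(fetches):
    # A fetch is treated as local when writer and reader share a host.
    totals = {"local": 0, "remote": 0}
    for fetch in fetches:
        kind = "local" if fetch.source_host == fetch.reader_host else "remote"
        totals[kind] += fetch.num_bytes
    return totals


fetches = [
    BlockFetch("node1", "node1", 512),
    BlockFetch("node2", "node1", 2048),
    BlockFetch("node3", "node1", 1024),
]
print(summarize_fetches(fetches))
```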
A scalable sampling
• Key distributions are approximated with a strategy that
• is not sensitive to early or late concept drifts,
• is lightweight and efficient,
• scales by using a backoff strategy
[Figure: three keys-versus-frequency histograms of the counters, labeled with sampling rates 𝑠𝑖, 𝑠𝑖/𝑏 and 𝑠𝑖+𝑗/𝑏; further labels: truncate, increase by 𝑠𝑖, 𝑇𝐵]
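The slides do not spell out the sampling algorithm, so the following is only one possible reading of the figure labels, not the authors' implementation. It assumes that keys are sampled at a rate that is divided by a backoff factor whenever the number of counters exceeds a bound, at which point the counters are truncated to the most frequent keys. The class name, `max_counters` (standing in for the bound), `backoff` (standing in for 𝑏) and the 1/rate weighting are all assumptions:

```python
import random


class BackoffSampler:
    """Approximate key frequencies with bounded memory (illustrative sketch)."""

    def __init__(self, max_counters=100, backoff=2.0, seed=0):
        self.rate = 1.0                   # current sampling rate
        self.max_counters = max_counters  # assumed bound on the number of counters
        self.backoff = backoff            # assumed backoff factor
        self.counters = {}
        self.rng = random.Random(seed)

    def observe(self, key):
        # Skip the record with probability 1 - rate.
        if self.rng.random() >= self.rate:
            return
        # Weight by 1/rate so counts stay comparable after the rate changes.
        self.counters[key] = self.counters.get(key, 0.0) + 1.0 / self.rate
        if len(self.counters) > self.max_counters:
            self._truncate_and_back_off()

    def _truncate_and_back_off(self):
        # Keep only the heaviest half of the counters, then sample less often.
        keep = self.max_counters // 2
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        self.counters = dict(ranked[:keep])
        self.rate /= self.backoff

    def heavy_keys(self, n):
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        return [key for key, _ in ranked[:n]]


# Hypothetical stream: one very frequent key mixed with 5000 one-off keys.
sampler = BackoffSampler(max_counters=100, backoff=2.0, seed=42)
for i in range(5000):
    sampler.observe("bieber")
    sampler.observe(f"user{i}")
print(sampler.heavy_keys(1), sampler.rate)
```

Memory stays bounded by `max_counters` regardless of how many distinct keys the stream contains, while the frequent key keeps its counter through every truncation.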
Availability through the Spark REST API
• Enhanced Spark REST API to provide block fetch & data characteristics
information of tasks
• New queries to the REST API, for example: "what happened in the last
3 seconds?"
• /api/v1/applications/{appId}/{jobId}/tasks/{timestamp}
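A small sketch of how a client might build such a query. The path is the extended endpoint shown above, which is the authors' addition rather than part of stock Spark; the helper name, the base URL, and the interpretation of `{timestamp}` as epoch milliseconds are assumptions:

```python
import time
from urllib.parse import quote


def tasks_since_url(base_url, app_id, job_id, timestamp_ms):
    # Assumption: {timestamp} is a lower bound given in epoch milliseconds.
    return (
        f"{base_url.rstrip('/')}/api/v1/applications/"
        f"{quote(str(app_id), safe='')}/{quote(str(job_id), safe='')}"
        f"/tasks/{int(timestamp_ms)}"
    )


# "What happened in the last 3 seconds?" for a hypothetical application and job.
three_seconds_ago = int(time.time() * 1000) - 3000
print(tasks_since_url("http://localhost:4040", "app-1", 3, three_seconds_ago))
```

The visualization could poll a URL like this at a fixed interval and render only the tasks returned since the previous poll.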
THE VISUALIZATION
Our way of tackling data skew
• Goal: use the collected information to handle data skew dynamically,
on-the-fly on any workload (batch or streaming)
• Paper currently in progress
[Figure: heavy keys create heavy partitions (slow tasks), compared with a partitioning where the Bieber-key is alone and the size of the biggest partition is minimized]
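The authors' repartitioning method is not described in the slides. As a stand-in, the sketch below uses a simple greedy heuristic (heaviest key first, each placed on the currently least-loaded partition) for the known heavy keys and plain hashing for the rest. All names and counts are hypothetical:

```python
import zlib


def hash_partition(key, num_partitions):
    # Stable stand-in for a default hash partitioner.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skew_aware_assignment(key_counts, num_partitions, num_heavy):
    """Place the num_heavy most frequent keys greedily, hash the remaining keys."""
    loads = [0] * num_partitions
    assignment = {}
    by_weight = sorted(key_counts, key=key_counts.get, reverse=True)
    for key in by_weight[:num_heavy]:
        target = loads.index(min(loads))  # least-loaded partition so far
        assignment[key] = target
        loads[target] += key_counts[key]
    for key in by_weight[num_heavy:]:
        target = hash_partition(key, num_partitions)
        assignment[key] = target
        loads[target] += key_counts[key]
    return assignment, loads


# Hypothetical key histogram, e.g. as estimated by sampling.
key_counts = {"bieber": 1000, "gaga": 400}
key_counts.update({f"user{i}": 20 for i in range(30)})

assignment, loads = skew_aware_assignment(key_counts, 4, num_heavy=2)
print("heavy keys routed explicitly:", loads)

_, greedy_loads = skew_aware_assignment(key_counts, 4, num_heavy=len(key_counts))
print("all keys placed greedily:", greedy_loads)
```

When every key is placed greedily in this example, the heaviest key ends up alone on its partition, so the biggest partition is exactly as large as that one key, which is the lower bound for any key-based partitioning.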
Making visualization pluggable
• Let us also support Flink
• The execution model is quite different from Spark
• Minimal code change on the data reader module of the visualization
• Expose the necessary metrics through the Flink JM REST API
• Map Flink’s model to Spark’s
Computational models
[Figure: batch data sources (HDFS, JDBC ...) and stream data sources (Kafka, RabbitMQ ...)]
• Spark RDDs break down the computation into stages
• Flink computation is fully pipelined by default
The currently available Flink visualizer
Granular visualization of Flink jobs
Future plans
• Open-sourcing of the visualization framework
• Dynamic repartitioning paper is in progress
• Task-classification to make the visualization more compact
• Visual improvements, more features
• Opening PRs against Spark & Flink with the suggested metrics
• A rewrite is coming for the Flink metrics, track it at FLINK-1504
Conclusions
• The devil is in the details
• Visualizations can help developers better understand issues and
bottlenecks of certain workloads
• In particular, data skew can delay the completion of the whole
stage in a batch engine
• Adaptive repartitioning strategies can be surprisingly efficient
Thank you for your attention
Q&A
