The 7 quests of resilient software design
A guide for the adventurous software engineer

Uwe Friedrichsen – codecentric AG – 2012-2017
Uwe Friedrichsen

IT traveller.
Connecting the dots.
Attracted by uncharted territory.
CTO at codecentric.

https://www.slideshare.net/ufried
https://medium.com/@ufried
 @ufried
You want to do resilient software design ...
... and you expect everything to be like this
But somehow it feels more like that ...
... or even that
What the **** went wrong?
The road to resilience is a twisted one
“7 quests you must complete!”
Quest #1
Understand the business case
“How much money will we earn with it?”
“Does it improve our velocity?”
Resilience is not about making money
Resilience is about not losing money
Lack of resilient software design
Reduced system availability
Users cannot do what they intend to do
Fewer transactions per time period
Immediate lost revenue
Users get annoyed
Churn rate increases
Delayed lost revenue
Due to non-determinism
of distributed systems
This is at most your
resilience budget
Quest #2
Embrace distributed systems
Everything fails, all the time.
-- Werner Vogels
What we learned in our IT education: If X then Y

•  Inside process thinking
•  Reasoning about deterministic behavior
•  Designing a complicated system
•  We are good at this (due to how our brains work)

What we need for distributed systems: If X then maybe Y

•  Across process thinking
•  Reasoning about non-deterministic behavior
•  Designing a complex system
•  We are poor at that (due to how our brains work)

This changes everything!
Yet, we usually use deterministic thinking
to reason about distributed systems
Failures in distributed systems ...

•  Crash failure
•  Omission failure
•  Timing failure
•  Response failure
•  Byzantine failure
... turn seemingly simple issues into very hard ones

•  Time & Ordering: Leslie Lamport, "Time, clocks, and the ordering of events in distributed systems"
•  Consensus: Leslie Lamport, "The part-time parliament" (Paxos)
•  CAP: Eric A. Brewer, "Towards robust distributed systems"
•  Faulty processes: Leslie Lamport, Robert Shostak, Marshall Pease, "The Byzantine generals problem"
•  Consensus: Michael J. Fischer, Nancy A. Lynch, Michael S. Paterson, "Impossibility of distributed consensus with one faulty process" (FLP)
•  Impossibility: Nancy A. Lynch, "A hundred impossibility proofs for distributed computing"
Embrace distributed systems

•  Distributed systems introduce non-determinism regarding
    •  Execution completeness
    •  Message ordering
    •  Communication timing
•  You will be affected by this at the application level
•  Don’t expect your infrastructure to hide all effects from you
•  Better have a plan to detect and recover from inconsistencies
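
One possible way to "detect and recover from inconsistencies" at the application level is sketched below. It assumes that each sender numbers its messages consecutively starting at 0; the class `SequenceTracker` and its methods are hypothetical names chosen for this illustration and do not come from the talk. The receiver ignores duplicates, buffers messages that arrive early, and can report which sequence numbers are still outstanding.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceTracker:
    """Receiver-side bookkeeping for one sender (hypothetical helper)."""

    next_expected: int = 0
    pending: dict = field(default_factory=dict)  # early arrivals, keyed by seq

    def receive(self, seq: int, payload: str) -> list:
        """Return the payloads that can now be delivered in order."""
        if seq < self.next_expected or seq in self.pending:
            return []  # duplicate (e.g. a retried send): ignore it
        self.pending[seq] = payload
        deliverable = []
        # Deliver the longest gap-free run starting at next_expected.
        while self.next_expected in self.pending:
            deliverable.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return deliverable

    def missing(self) -> list:
        """Sequence numbers known to be outstanding (candidates to re-request)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_expected, max(self.pending))
                if s not in self.pending]


if __name__ == "__main__":
    tracker = SequenceTracker()
    print(tracker.receive(0, "a"))  # ['a']
    print(tracker.receive(2, "c"))  # [] because message 1 has not arrived yet
    print(tracker.missing())        # [1]
    print(tracker.receive(1, "b"))  # ['b', 'c']
```

The sketch only detects gaps and duplicates; what to do about a gap that never closes (re-request, compensate, alert) remains an application decision.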
But do I really need to care?
(The system I am working on is not a distributed system)
(Almost) every system is a distributed system
-- Chas Emerick


http://www.infoq.com/presentations/problems-distributed-systems
… and it’s getting “worse”


•  Cloud-based systems
•  Microservices
•  Zero Downtime
•  Mobile & IoT
•  Social Web
Quest #3
Avoid the “100% available” trap
The “100% available” trap, version #1

You: “How should the application respond if a technical failure occurs?”

Business owner: “This must not happen! It is your responsibility to make sure that this will not happen.”
The “100% available” trap, version #2

You: “How do you handle the situation if the service you call does not respond (or does not respond in time)?”

Developer 1: “We did not implement any extra measures. The other service is so important and thus needs to be so highly available that it is not worth any extra effort.”

Developer 2: “Actually, if that service should be down, we would not be able to do anything useful anyway. Thus, it just needs to be up.”
The question is not if a failure will happen
The question is when a failure will happen
A short note about availability

Assume a service availability of 99.5% (incl. planned downtimes)

•  10 services involved in a request → 95.1% probability of success
•  50 services involved in a request → 77.8% probability of success
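
The figures above can be reproduced with a few lines of Python. This is a minimal sketch under two assumptions that the slide does not spell out: every involved service must respond for the request to succeed, and the services fail independently of each other. The function name `request_success_probability` is made up for this example.

```python
def request_success_probability(service_availability: float, services: int) -> float:
    """Probability that all services respond, assuming independent failures."""
    return service_availability ** services


if __name__ == "__main__":
    for n in (1, 10, 50):
        p = request_success_probability(0.995, n)
        print(f"{n:>2} services: {p:.1%}")
    #  1 services: 99.5%
    # 10 services: 95.1%
    # 50 services: 77.8%
```

If failures are correlated (for example, shared infrastructure), the real numbers will differ, but the trend stays the same: every additional service in the request path lowers the probability of success.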
Quest #4
Establish the ops-dev feedback loop
The big wall between Dev and Ops
In a distributed environment, you cannot solve
availability issues on an infrastructure level only
Continuous improvement cycle of resilient software design: Build → Measure → Learn

•  Dev: “I implemented something to improve production availability”
•  Ops: “Here are the figures on how it worked”
•  Dev is where you implement your resilience measures
•  Ops is where your resilience measures take effect
All developer activities towards
improving robustness are basically
“shooting in the dark”, which is neither
effective nor sustainable
Having a wall between Dev and Ops
breaks the cycle required to implement
effective robustness measures
For effective resilient software design
you need a working ops-dev feedback loop
Establishing the feedback loop


•  Adopt DevOps
•  Adopt Site Reliability Engineering (SRE)
•  Or do it your own way if you know a better way ...
•  ... but make sure you establish the required feedback loops!
Quest #5
Master functional design
Without proper functional design
nothing else matters
Isolation

•  System must not fail as a whole
•  Split the system into parts and isolate the parts from each other
•  Avoid cascading failures
•  Foundation of resilient software design
Bulkhead

•  Bulkheads implement the “parts” that need to be isolated
•  Core isolation pattern (a.k.a. “failure units” or “units of mitigation”)
•  Diverse implementation choices available, e.g., (micro)services, actors, SCS, ...
•  Shaping good bulkheads is a pure functional design issue (and extremely hard)
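
The slides treat bulkheads mainly as a functional design question (services, actors, SCS). As a much smaller, code-level illustration of the same isolation idea, the sketch below caps the number of concurrent calls into one dependency so that a slow dependency cannot use up all threads of the caller. The names `Bulkhead` and `BulkheadFullError` are hypothetical, and this in-process variant is only one of many implementation choices, not the one proposed in the talk.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a bulkhead has no free slot left."""


class Bulkhead:
    """Caps concurrent calls into one downstream dependency (illustrative)."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Do not wait for a slot: reject immediately so callers fail fast
        # instead of piling up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each dependency its own `Bulkhead` instance means that exhausting one of them leaves calls to the others unaffected, which is the "avoid cascading failures" goal from the Isolation slide on a small scale.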
Hmm, sounds easy. Why should that be hard?
How do we avoid this …

[Diagram: Request → Service A → Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design

... and this ...

[Diagram: Request → Service → many further Services]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design

... without ending up with this?

[Diagram: Request → Mothership Service (a.k.a. Monolith)]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e. the isolation is broken by design
Let us apply our well-known best practices

•  Divide & conquer a.k.a. functional decomposition
•  DRY (Don’t Repeat Yourself)
•  Design for reusability
•  Layered architecture
•  …
Unfortunately ...
... this usually leads to this …

[Diagram: Request → Service A → Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design

... and this ...

[Diagram: Request → Service → many further Services]
Due to functional design we need to call a lot of services to be able to answer a client request, i.e. availability is broken by design

... and in the end also often to this.

[Diagram: Request → Mothership Service (a.k.a. Monolith)]
By trying to avoid the aforementioned issues we ended up cramming all required functionality into one big service, i.e. the isolation is broken by design
Welcome to distributed hell!
Caches to the rescue!
[Diagram: Request → Service A (with cache of B) → Service B]
Due to functional design, Service A always needs backing from Service B to be able to answer a client request, i.e. the isolation is broken by design
Break tight service coupling by caching data/responses of downstream service
Caches to the rescue?
Do you really think
that copying stale data all over your system
is a suitable measure
to fix an inherently broken design? *

* Side note: Caches are a very important and powerful measure in many places. But they are not suitable as a cheap fix for a broken functional design
We have to re-learn design
for distributed systems
No silver bullet
Yet, a few guiding thoughts ...
Foundations of design

•  “High cohesion, low coupling” & “separation of concerns”
    •  Crucial across process boundaries
    •  Still poorly understood issue
•  Start with
    •  Understanding organizational boundaries
    •  Understanding use cases and flows
    •  Identifying functional domains (→ DDD)
    •  Finding areas that change independently
•  Do not start with a data model!
Short activation paths

•  Long activation paths affect availability
•  Increase likelihood of failures
•  Minimize remote calls per request
•  Need to balance opposing forces
    •  Avoid monolith → clear separation of concerns
    •  Minimize requests → cluster functionality & data
•  Caches can sometimes help, but stale data as trade-off
Be (extremely) wary of reusability

•  Reusability increases coupling
•  Reusability usually leads to bad service design
•  Reusability compromises availability
•  Reusability rarely pays
•  Do not strive for reusable services
•  Strive for replaceable services instead
•  Try to tackle reusability issues with libraries
Quest #6
Know your toolbox
Resilience pattern toolbox (diagram)

•  Areas: Core, Detect, Treat, Prevent, Recover, Mitigate, Complement, Supporting patterns
•  Levels: Node level, System level, Either level
•  Patterns: Isolation, Bulkhead, Communication paradigm, Redundancy, Stateless, Idempotency, Escalation, Zero downtime deployment, Location transparency, Relaxed temporal constraints, Fallback, Shed load, Share load, Marked data, Queue for resources, Bounded queue, Finish work in progress, Fresh work before stale, Deferrable work, Monitor, Watchdog, Heartbeat, Voting, Synthetic transaction, Leaky bucket, Routine checks, Health check, Fail fast, Let sleeping dogs lie, Small releases, Hot deployments, Routine maintenance, Backup request, Anti-fragility, Diversity, Jitter, Error injection, Spread the news, Anti-entropy, Backpressure, Retry, Limit retries, Rollback, Roll-forward, Checkpoint, Safe point, Failover, Read repair, Error handler, Reset, Restart, Reconnect, Fail silently, Default value, Timeout, Circuit breaker, Complete parameter checking, Checksum, Statically, Dynamically, Confinement, Acknowledgement
Using resilience patterns

•  Patterns are options, not obligations
•  Don’t pick too many patterns
•  Each pattern increases complexity
•  Complexity is the enemy of robustness
•  Each pattern costs money in dev & ops
•  You only have a limited resilience budget
•  Look for complementary patterns
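
As an illustration of complementary patterns, the sketch below combines two entries from the toolbox, circuit breaker and fallback: after repeated failures the breaker stops calling the dependency for a while and answers with the fallback instead. The class `CircuitBreaker`, its parameters, and the injected `clock` are assumptions made for this example; it is a simplified, single-threaded sketch and not the implementation of any particular library.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock      # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: fail fast, do not touch the dependency
            # Timeout elapsed: half-open, let this one call probe the dependency.
        try:
            result = func()
        except Exception:  # broad on purpose in this sketch
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open the circuit
            return fallback()
        # Success: reset the breaker to the closed state.
        self._failures = 0
        self._opened_at = None
        return result
```

A call then looks like `breaker.call(fetch_price, lambda: last_known_price)`, where both callables are placeholders. Note that even this small sketch adds state, thresholds, and a timeout that need tuning and monitoring, which is the complexity cost the bullets above warn about.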
How other people did it

•  Erlang (Akka), core patterns: Isolation, Actor, Communication paradigm, Messaging, Escalation, Monitor, Heartbeat, Hot deployments, Restart, Let it crash!
Core patterns
•  Netflix, core patterns: Isolation, (Micro)service, Communication paradigm, Request/response, Redundancy, Several variants, Fallback, Share load, Bounded queue, Monitor, Error injection, Retry, Limit retries, Circuit breaker, Timeout, Zero downtime deployment, Canary releases
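
Two of the patterns named in this selection, "Retry" and "Limit retries", can be sketched together in a few lines. The example additionally uses "Jitter" from the larger toolbox to spread out retry attempts. The function `call_with_retries` and all of its parameters are hypothetical; this is a generic illustration and makes no claim about how Netflix or Erlang systems implement retries.

```python
import random
import time


def call_with_retries(func, max_attempts=3, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep, rng=random.random):
    """Call func(); on failure retry, using at most max_attempts calls in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:  # broad on purpose in this sketch
            if attempt == max_attempts:
                raise  # retry budget used up: surface the last error
            # Exponential backoff capped at max_delay, with "full jitter":
            # sleep a random fraction of the backoff to avoid retry bursts.
            backoff = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(rng() * backoff)
```

Retrying is only safe if the called operation is idempotent (another toolbox entry); otherwise a retry after a lost response may execute the operation twice.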
Quest #7
Preserve the collective memory
We face a new generation of developers
every 5 years
We lose our collective memory
every 5 years *

* Mean time until a topic discussion in the community starts over from scratch
[Chart: growth of knowledge and depth of insights over time working in IT]
What do we do to compensate for this effect?
We look for the new & shiny stuff ...
... as anything not new must be useless crap!
We need to rediscover our insights
every 5 years
In IT, we suffer from
continuous collective amnesia
and we are even proud of it
How can we become better?
Wrap-up
The 7 quests at a glance
Wrap-up

•  The road to resilient software design is a twisted one!
•  Most challenges are only indirectly related to RSD
•  Most challenges are not coding related
•  Mastering functional design is extremely hard ...
•  ... while learning the patterns is relatively easy
•  How do we preserve our collective memory?