Maintaining the Front Door to Netflix



Slide 0

Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson


Slide 1

There are copious notes attached to each slide in this presentation. Please read those notes to get the full context of the presentation.


Slide 2

Global Streaming Video for TV Shows and Movies


Slide 3

More than 44 Million Subscribers
More than 40 Countries


Slide 4

Netflix Accounts for ~33% of Peak Internet Traffic in North America
Netflix subscribers are watching more than 1 billion hours a month


Slide 5


Slide 6


Slide 7

Team Focus: Build the Best Global Streaming Product
Three aspects of the Streaming Product: Non-Member, Discovery, Streaming


Slide 8

Key Responsibilities
Broker data between services and UIs
Maintain a resilient front-door
Scale the system vertically and horizontally
Maintain high velocity


Slide 9

But Before Streaming…


Slide 10


Slide 11


Slide 12

Monolithic Application In Netflix Data Centers


Slide 13

The bigger the ship… the slower it turns


Slide 14

Distributed Architecture


Slide 15


Slide 16

1000+ Device Types


Slide 17

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies Reviews A/B Test Engine Dozens of Dependencies


Slide 18

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 19

Dependency Relationships


Slide 20

2,000,000,000 Requests Per Day to the Netflix API


Slide 21

30 Distinct Dependent Services for the Netflix API


Slide 22

~500 Dependency jars Slurped into the Netflix API


Slide 23

14,000,000,000 Netflix API Calls Per Day to those Dependent Services


Slide 24

0 Dependent Services with 100% SLA


Slide 25

99.99%^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month


Slide 26

99.99%^30 = 99.7%
0.3% of 2B = 6M failures per day
2+ Hours of Downtime Per Month


Slide 27

99.9%^30 = 97%
3% of 2B = 60M failures per day
20+ Hours of Downtime Per Month
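The arithmetic on the last few slides can be reproduced with a short sketch. It assumes the 30 dependencies fail independently and that every API request needs all of them, which is a simplification; the function names and the 30-day month are illustrative assumptions, not something stated in the slides.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month


def compound_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n services to succeed,
    assuming the services fail independently of each other."""
    return per_service ** n_services


def summarize(per_service: float, n_services: int, requests_per_day: float) -> dict:
    """Turn a per-service availability into overall failure figures."""
    availability = compound_availability(per_service, n_services)
    failure_rate = 1.0 - availability
    return {
        "availability": availability,
        "failures_per_day": failure_rate * requests_per_day,
        "downtime_hours_per_month": failure_rate * HOURS_PER_MONTH,
    }


if __name__ == "__main__":
    for sla in (0.9999, 0.999):
        s = summarize(sla, 30, 2_000_000_000)
        print(
            f"{sla:.2%} per service: {s['availability']:.2%} overall, "
            f"{s['failures_per_day'] / 1e6:.1f}M failures/day, "
            f"{s['downtime_hours_per_month']:.1f} h/month"
        )
```

The slide figures are the rounded versions of these results: roughly 6M failures per day and just over 2 hours per month at 99.99%, and a little under 60M failures per day and just over 20 hours per month at 99.9%.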


Slide 28

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 29

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 30

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 31

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 32

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 33


Slide 34

Circuit Breaker Dashboard


Slide 35


Slide 36

Call Volume and Health / Last 10 Seconds


Slide 37

Call Volume / Last 2 Minutes


Slide 38

Successful Requests


Slide 39

Successful, But Slower Than Expected


Slide 40

Short-Circuited Requests, Delivering Fallbacks


Slide 41

Timeouts, Delivering Fallbacks


Slide 42

Thread Pool & Task Queue Full, Delivering Fallbacks


Slide 43

Exceptions, Delivering Fallbacks


Slide 44

Error Rate
(# + # + # + #) / (# + # + # + # + #) = Error Rate
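A minimal sketch of the error-rate calculation follows. The mapping of the five "#" counters to categories is an assumption inferred from the preceding dashboard slides (successes, short-circuited, timeouts, thread-pool rejections, exceptions); the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request counts for one rolling window (field names are illustrative)."""
    success: int
    short_circuited: int
    timeout: int
    rejected: int   # thread pool and task queue full
    failure: int    # exceptions

    def error_rate(self) -> float:
        # Numerator: the four non-success outcomes.
        errors = self.short_circuited + self.timeout + self.rejected + self.failure
        # Denominator: all five outcomes.
        total = self.success + errors
        return errors / total if total else 0.0


if __name__ == "__main__":
    window = WindowCounts(success=90, short_circuited=2, timeout=3, rejected=1, failure=4)
    print(f"error rate: {window.error_rate():.1%}")
```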


Slide 45

Status of Fallback Circuit


Slide 46

Requests per Second, Over Last 10 Seconds


Slide 47

SLA Information


Slide 48

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 49

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 50

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine


Slide 51

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback


Slide 52

Personalization Engine User Info Movie Metadata Movie Ratings Similar Movies API Reviews A/B Test Engine Fallback
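The fallback behavior in these diagrams can be illustrated with a small circuit breaker. This is a simplified sketch, not the implementation Netflix uses: it trips on consecutive failures rather than on a windowed error rate, and it omits timeouts and thread-pool isolation. All names and thresholds are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a pause."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the pause has elapsed, let a trial request through.
        return self._clock() - self._opened_at < self.reset_after_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuit: do not touch the dependency
        try:
            result = primary()
        except Exception:
            self._consecutive_failures += 1
            if self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each dependency call, for example `breaker.call(fetch_ratings, lambda: DEFAULT_RATINGS)`, where both names are hypothetical.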


Slide 53

Scaling the Distributed System


Slide 54


Slide 55

AWS Cloud


Slide 56


Slide 57

Autoscaling


Slide 58

Autoscaling


Slide 59

Amazon Auto Scaling Limitations
Hard to fit policies to variable traffic patterns (weekday vs weekend)
Limited control over capacity adjustments (absolute value or %)


Slide 60

The Impact of AAS Limitations
Traffic drop can lead to scale downs during outage
Performance degradation between new instance launch and taking traffic
Excess capacity at peak and trough


Slide 61

Scryer: Predictive Auto Scaling
Not yet…


Slide 62

Typical Traffic Patterns Over Five Days


Slide 63

Predicted RPS Compared to Actual RPS


Slide 64

Scaling Plan for Predicted Workload


Slide 65

What is Scryer Doing?
Evaluating needs based on historical data
Week over week, month over month metrics
Adjusts instance minimums based on algorithms
Relies on Amazon Auto Scaling for unpredicted events
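The idea of setting instance minimums from historical traffic can be sketched as follows. The slides do not describe Scryer's actual algorithms, so the plain week-over-week average used here is a stand-in, and the function names, the per-instance capacity, and the 20% headroom are all assumptions.

```python
import math
from statistics import mean
from typing import List, Sequence


def predict_rps(history: Sequence[Sequence[float]], hour_of_week: int) -> float:
    """Predict requests per second for one hour of the week.

    `history` holds one list of 168 hourly RPS values per past week.
    The prediction is the week-over-week average for that hour.
    """
    return mean(week[hour_of_week] for week in history)


def min_instances(predicted_rps: float, rps_per_instance: float,
                  headroom: float = 0.2) -> int:
    """Instance minimum for the predicted load, with some spare capacity."""
    needed = predicted_rps * (1.0 + headroom) / rps_per_instance
    return max(1, math.ceil(needed))


def scaling_plan(history: Sequence[Sequence[float]],
                 rps_per_instance: float) -> List[int]:
    """Instance minimum for each of the 168 hours in the coming week."""
    return [min_instances(predict_rps(history, h), rps_per_instance)
            for h in range(168)]
```

The plan only sets the floor; reactive auto scaling would still add capacity above it when traffic departs from the prediction.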


Slide 66

Results


Slide 67

Results : Load Average Reactive Predictive


Slide 68

Results : Response Latencies Reactive Predictive


Slide 69

Results : Outage Recovery


Slide 70

Results : Outage Recovery


Slide 71

Results : AWS Costs


Slide 72

Scaling Globally


Slide 73

More than 44 Million Subscribers More than 40 Countries


Slide 74

Zuul Gatekeeper for the Netflix Streaming Application


Slide 75

Zuul*
Multi-Region Resiliency
Insights
Stress Testing
Canary Testing
Dynamic Routing
Load Shedding
Security
Static Response Handling
Authentication
* Most closely resembles an API proxy


Slide 76

Isthmus


Slide 77


Slide 78

All of these approaches are designed to prevent failures…


Slide 79

But sometimes the best way to prevent failures is to force them!


Slide 80


Slide 81

Chaos Monkey: I randomly terminate instances in production to identify dormant failures.
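The core of random instance termination fits in a few lines. The sketch below is illustrative only: it makes no cloud API calls, and the `terminate` callback, the group layout, and the `probability` parameter are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional


def unleash(groups: Dict[str, List[str]],
            terminate: Callable[[str], None],
            probability: float = 1.0,
            rng: Optional[random.Random] = None) -> List[str]:
    """Pick at most one random instance per group and terminate it.

    `groups` maps a group name to its instance ids. `terminate` is whatever
    actually kills an instance; here it is supplied by the caller.
    """
    rng = rng or random.Random()
    victims: List[str] = []
    for instances in groups.values():
        # Skip empty groups; otherwise act with the configured probability.
        if instances and rng.random() < probability:
            victim = rng.choice(instances)
            terminate(victim)
            victims.append(victim)
    return victims
```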


Slide 82

Chaos Gorilla: I simulate an outage of an entire Amazon availability zone.


Slide 83

Chaos Kong: I simulate an outage in an AWS region.


Slide 84

Conformity Monkey: I find instances that don’t adhere to best practices.


Slide 85

Security Monkey: I extend Conformity Monkey to find security violations.


Slide 86

Doctor Monkey: I detect unhealthy instances and remove them from service.


Slide 87

Janitor Monkey: I clean up the clutter and waste that runs in the cloud.


Slide 88

Latency Monkey: I induce artificial delays and errors into services to determine how upstream services will respond.
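Injecting delays and errors into a call can be sketched as a wrapper around the function that talks to a service. This is an illustration under assumed names (`with_chaos`, `delay_s`, `error_rate`), not the tool itself.

```python
import random
import time
from typing import Any, Callable, Optional


def with_chaos(fn: Callable[..., Any],
               delay_s: float = 0.0,
               error_rate: float = 0.0,
               rng: Optional[random.Random] = None,
               sleep: Callable[[float], None] = time.sleep) -> Callable[..., Any]:
    """Wrap `fn` so each call is delayed and fails with the given probability."""
    rng = rng or random.Random()

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        if delay_s > 0:
            sleep(delay_s)  # artificial latency
        if rng.random() < error_rate:
            raise RuntimeError("injected failure")  # artificial error
        return fn(*args, **kwargs)

    return wrapped
```

Callers of the wrapped function can then be observed to check that their timeouts and fallbacks behave as intended.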


Slide 89


Slide 90

Deployments in the Cloud


Slide 91

Dependency Relationships


Slide 92


Slide 93

Testing Philosophy: Act Fast, React Fast


Slide 94

That Doesn’t Mean We Don’t Test


Slide 95

Automated Delivery Pipeline


Slide 96

Cloud-Based Deployment Techniques


Slide 97

Current Code In Production API Requests from the Internet


Slide 98

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet


Slide 99

Canary Analysis Automation
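One simple form of automated canary analysis compares the canary's error rate against the baseline fleet. The sketch below is an assumption about how such a check could look; the thresholds and the function name are hypothetical and the slides do not say which metrics the real analysis uses.

```python
def canary_passes(baseline_errors: int, baseline_requests: int,
                  canary_errors: int, canary_requests: int,
                  max_ratio: float = 1.5,
                  min_allowed_rate: float = 0.002,
                  min_requests: int = 100) -> bool:
    """Return True when the canary's error rate is acceptably close to baseline."""
    # Not enough traffic on either side to judge: do not promote yet.
    if canary_requests < min_requests or baseline_requests < min_requests:
        return False
    baseline_rate = baseline_errors / baseline_requests
    canary_rate = canary_errors / canary_requests
    # Allow some multiple of the baseline rate, with a small absolute floor
    # so that a near-zero baseline does not fail the canary on one error.
    allowed = max(baseline_rate * max_ratio, min_allowed_rate)
    return canary_rate <= allowed
```

For example, `canary_passes(100, 100_000, 50, 1_000)` returns `False`, because a 5% canary error rate is far above the 0.1% baseline.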


Slide 100

Single Canary Instance To Test New Code with Production Traffic (around 1% or less of traffic) Current Code In Production API Requests from the Internet Error!


Slide 101

Current Code In Production API Requests from the Internet


Slide 102

Current Code In Production API Requests from the Internet


Slide 103

Current Code In Production API Requests from the Internet Perfect!


Slide 104

Stress Test with Zuul


Slide 105

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 106

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 107

Error! Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 108

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 109

Current Code In Production API Requests from the Internet Perfect!


Slide 110

Stress Test with Zuul


Slide 111

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 112

Current Code In Production API Requests from the Internet New Code Getting Prepared for Production


Slide 113

API Requests from the Internet New Code Getting Prepared for Production


Slide 114

Brokering Data to 1,000+ Device Types


Slide 115


Slide 116


Slide 117

Screen Real Estate


Slide 118

Controller


Slide 119

Technical Capabilities


Slide 120

One-Size-Fits-All API
[Diagram: repeated "Request" labels]


Slide 121

Courtesy of South Florida Classical Review


Slide 122


Slide 123

Resource-Based API vs. Experience-Based API


Slide 124

Resource-Based Requests
/users/<id>/ratings/title
/users/<id>/queues
/users/<id>/queues/instant
/users/<id>/recommendations
/catalog/titles/movie
/catalog/titles/series
/catalog/people


Slide 125

REST API RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Network Border Network Border


Slide 126

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border SERVER CODE CLIENT CODE


Slide 127

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS OSFA API Network Border Network Border DATA GATHERING, FORMATTING, AND DELIVERY USER INTERFACE RENDERING


Slide 128


Slide 129


Slide 130

Experience-Based Requests /ps3/homescreen
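An experience-based endpoint gathers data from several backend services on the server and returns one payload shaped for a single device. The sketch below illustrates that idea in Python; the deck itself shows a Java API with a Groovy layer, and every function name and return value here is a hypothetical stand-in.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


# Hypothetical stand-ins for calls to backend services.
def fetch_recommendations(user_id: str) -> List[str]:
    return ["title-1", "title-2"]


def fetch_ratings(user_id: str) -> Dict[str, int]:
    return {"title-1": 4}


def fetch_queue(user_id: str) -> List[str]:
    return ["title-3"]


def ps3_homescreen(user_id: str) -> dict:
    """Handle one /ps3/homescreen request with a single device-shaped response."""
    # Fan out to the backend services concurrently, on the server side.
    with ThreadPoolExecutor(max_workers=3) as pool:
        recs_f = pool.submit(fetch_recommendations, user_id)
        ratings_f = pool.submit(fetch_ratings, user_id)
        queue_f = pool.submit(fetch_queue, user_id)
        recs, ratings, queue = recs_f.result(), ratings_f.result(), queue_f.result()

    def row(name: str, titles: List[str]) -> dict:
        # Format the data the way this particular UI wants to render it.
        return {"name": name,
                "titles": [{"id": t, "rating": ratings.get(t)} for t in titles]}

    return {"rows": [row("Recommended", recs), row("Queue", queue)]}
```

With a resource-based API, the device would instead issue one request per resource and assemble the screen itself.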


Slide 131

JAVA API Network Border Network Border RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS Groovy Layer


Slide 132


Slide 133

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API SERVER CODE CLIENT CODE CLIENT ADAPTER CODE (WRITTEN BY CLIENT TEAMS, DYNAMICALLY UPLOADED TO SERVER) Network Border Network Border


Slide 134

RECOMMENDATIONS MOVIE DATA SIMILAR MOVIES AUTH MEMBER DATA A/B TESTS START-UP RATINGS JAVA API DATA GATHERING DATA FORMATTING AND DELIVERY USER INTERFACE RENDERING Network Border Network Border


Slide 135


Slide 136

https://www.github.com/Netflix


Slide 137

Maintaining the Front Door to Netflix
Daniel Jacobson
@daniel_jacobson
http://www.linkedin.com/in/danieljacobson
http://www.slideshare.net/danieljacobson

