Machine Learning @ Netflix (and some lessons learned) Yves Raimond (@moustaki) Research/Engineering Manager Search & Recommendations Algorithm Engineering
Netflix scale ● > 69M members ● > 50 countries ● > 1000 device types ● > 3B hours/month ● 36% of peak US downstream traffic
Recommendations @ Netflix ● Goal: Help members find content to watch and enjoy to maximize satisfaction and retention ● Over 80% of what people watch comes from our recommendations ● Top Picks, Because you Watched, Trending Now, Row Ordering, Evidence, Search, Search Recommendations, Personalized Genre Rows, ...
Models & Algorithms ▪ Regression (Linear, logistic, elastic net) ▪ SVD and other Matrix Factorizations ▪ Factorization Machines ▪ Restricted Boltzmann Machines ▪ Deep Neural Networks ▪ Markov Models and Graph Algorithms ▪ Clustering ▪ Latent Dirichlet Allocation ▪ Gradient Boosted Decision Trees/Random Forests ▪ Gaussian Processes ▪ …
Some lessons learned
Build the offline experimentation framework first
When tackling a new problem ● What offline metrics can we compute that capture what online improvements we’re actually trying to achieve? ● How should the input data to that evaluation be constructed (train, validation, test)? ● How fast and easy is it to run a full cycle of offline experimentations? ○ Minimize time to first metric ● How replicable is the evaluation? How shareable are the results? ○ Provenance (see Dagobah) ○ Notebooks (see Jupyter, Zeppelin, Spark Notebook)
When tackling an old problem ● Same… ○ Are the metrics that were designed when first running experimentation in that space still appropriate now?
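A minimal sketch of the two questions above (how the evaluation data is constructed, and which offline metric is computed), assuming interaction events are `(user, item, timestamp)` tuples. The function names, the time-based split, and the choice of recall@k are illustrative assumptions, not the framework used at Netflix.

```python
from collections import defaultdict


def temporal_split(events, train_end, valid_end):
    """Split (user, item, timestamp) events into train/validation/test by time.

    Splitting on time (rather than at random) is an assumption made here so
    that the model is always evaluated on events that happen after training.
    """
    train, valid, test = [], [], []
    for user, item, ts in events:
        if ts < train_end:
            train.append((user, item, ts))
        elif ts < valid_end:
            valid.append((user, item, ts))
        else:
            test.append((user, item, ts))
    return train, valid, test


def recall_at_k(ranked_items_by_user, held_out_events, k):
    """Mean over users of the fraction of held-out items found in the top k."""
    held_out = defaultdict(set)
    for user, item, _ in held_out_events:
        held_out[user].add(item)
    scores = []
    for user, relevant in held_out.items():
        top_k = set(ranked_items_by_user.get(user, [])[:k])
        scores.append(len(top_k & relevant) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data, purely illustrative.
events = [("u1", "a", 1), ("u1", "b", 5), ("u2", "a", 2), ("u2", "c", 9)]
train, valid, test = temporal_split(events, train_end=4, valid_end=8)
validation_recall = recall_at_k({"u1": ["b", "c"]}, valid, k=1)
```

Keeping the split and the metric as small pure functions is one way to keep "time to first metric" short: a new model only has to produce a ranking per user to be scored.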
Think about distribution from the outermost layers
1. For each combination of hyper-parameters (e.g. grid search, random search, Gaussian processes…) 2. For each subset of the training data a. Multi-core learning (e.g. HogWild) b. Distributed learning (e.g. ADMM, distributed L-BFGS, …)
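A sketch of distributing at the outermost layer: each hyper-parameter combination is an independent training run, so the runs can be farmed out without any communication between them. `train_and_evaluate` is a hypothetical stand-in with a toy objective, and the grid values are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_evaluate(params):
    """Hypothetical stand-in for one complete, single-machine training run.

    Returns a validation loss; here a toy function minimised at (0.1, 0.01).
    """
    learning_rate, l2 = params
    return (learning_rate - 0.1) ** 2 + (l2 - 0.01) ** 2


def grid_search(learning_rates, l2_values, executor):
    """Evaluate every hyper-parameter combination in parallel, keep the best."""
    grid = list(product(learning_rates, l2_values))
    # Runs are independent, so no synchronisation is needed between workers.
    losses = list(executor.map(train_and_evaluate, grid))
    best_loss, best_params = min(zip(losses, grid))
    return best_params, best_loss


# Threads keep this sketch portable; for pure-Python, CPU-bound training,
# a ProcessPoolExecutor (created under an `if __name__ == "__main__":` guard)
# or a cluster scheduler would take the same role.
with ThreadPoolExecutor(max_workers=4) as pool:
    best_params, best_loss = grid_search([0.01, 0.1, 1.0], [0.001, 0.01, 0.1], pool)
```

Only when this outer loop no longer provides enough parallelism does it make sense to move inward, to subsets of the training data and then to the learning algorithm itself.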
When to use distributed learning? ● The impact of communication overhead when building distributed ML algorithms is non-trivial ● Is your data big enough that the distribution offsets the communication overhead?
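A back-of-the-envelope way to ask the question above. This is a deliberately simplified toy model, not a measurement: it assumes compute divides perfectly across workers and that every synchronisation round costs a fixed number of seconds. All names and numbers are hypothetical.

```python
def distributed_time(compute_seconds, workers, sync_rounds, seconds_per_sync):
    """Toy wall-clock estimate: divisible compute plus fixed cost per sync round."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # A single worker never has to synchronise with anyone.
    communication = sync_rounds * seconds_per_sync if workers > 1 else 0.0
    return compute_seconds / workers + communication


def speedup(compute_seconds, workers, sync_rounds, seconds_per_sync):
    """Ratio of single-worker time to distributed time under the toy model."""
    return compute_seconds / distributed_time(
        compute_seconds, workers, sync_rounds, seconds_per_sync
    )


# Small job: communication dominates, 8 workers barely help.
small_job = speedup(compute_seconds=60.0, workers=8, sync_rounds=100, seconds_per_sync=0.5)
# Large job: the same overhead is amortised over much more compute.
large_job = speedup(compute_seconds=6000.0, workers=8, sync_rounds=100, seconds_per_sync=0.5)
```

Under these assumed numbers the small job gains almost nothing from 8 workers, while the large job gets close to linear speedup, which is the trade-off the slide is pointing at.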
Example: Uncollapsed Gibbs sampler for LDA (more details here)
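A minimal single-machine sketch of an uncollapsed Gibbs sampler for LDA, using only the Python standard library. It is not the implementation referenced on the slide; the function names, default priors, and data layout (documents as lists of integer word ids) are assumptions. It shows why this sampler is a natural fit for distribution: once `theta` and `phi` are fixed, every token's topic assignment is conditionally independent of the others.

```python
import random


def sample_dirichlet(concentrations, rng):
    """Draw from a Dirichlet distribution via normalised Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    return [d / total for d in draws]


def uncollapsed_gibbs_lda(docs, num_topics, vocab_size,
                          alpha=0.5, beta=0.1, num_iters=50, seed=0):
    """docs: list of documents, each a list of word ids in [0, vocab_size)."""
    rng = random.Random(seed)
    topics = list(range(num_topics))
    # Initialise the explicit parameters from their priors.
    theta = [sample_dirichlet([alpha] * num_topics, rng) for _ in docs]
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    z = [[0] * len(doc) for doc in docs]

    for _ in range(num_iters):
        doc_topic = [[0] * num_topics for _ in docs]
        topic_word = [[0] * vocab_size for _ in range(num_topics)]
        # Step 1: sample z | theta, phi. Tokens are conditionally independent
        # here, so this loop could be split across documents or machines.
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                weights = [theta[d][k] * phi[k][w] for k in topics]
                k = rng.choices(topics, weights=weights)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
        # Step 2: sample theta_d | z ~ Dirichlet(alpha + doc-topic counts).
        theta = [sample_dirichlet([alpha + n for n in counts], rng)
                 for counts in doc_topic]
        # Step 3: sample phi_k | z ~ Dirichlet(beta + topic-word counts).
        phi = [sample_dirichlet([beta + n for n in counts], rng)
               for counts in topic_word]
    return theta, phi, z


# Toy corpus with a 4-word vocabulary, purely illustrative.
toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
theta, phi, z = uncollapsed_gibbs_lda(toy_docs, num_topics=2, vocab_size=4)
```

A collapsed sampler integrates `theta` and `phi` out, which couples every token's update to global counts; keeping them explicit trades some statistical efficiency for updates that need far less coordination.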
Design production code to be experimentation-friendly
Example development process ● Idea + Data → Offline Modeling (R, Python, MATLAB, …) → Iterate → Final model ● Final model → Implement in production system (Java, C++, …) → Production environment (A/B test) → Actual output ● Issues that surface along the way: missing postprocessing logic, data discrepancies, code discrepancies, performance issues
Avoid dual implementations ● Dual: Experiment code and Production code are separate implementations ● Shared: Experiment and Production both run on a Shared Engine
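One way to picture the shared engine, as a hedged sketch: feature computation and scoring live in a single class, and both the offline experiment and the production request handler are thin wrappers around it. `LinearRanker`, its features, and both wrapper functions are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class LinearRanker:
    """Hypothetical shared engine: the only place feature and scoring logic lives."""
    weights: dict

    def featurize(self, raw):
        # Feature logic is defined once, so offline and online cannot drift apart.
        return {
            "popularity": raw["plays"] / (1.0 + raw["age_days"]),
            "is_new": 1.0 if raw["age_days"] < 30 else 0.0,
        }

    def score(self, raw):
        features = self.featurize(raw)
        return sum(self.weights.get(name, 0.0) * value
                   for name, value in features.items())

    def rank(self, candidates):
        return sorted(candidates, key=self.score, reverse=True)


def offline_experiment(engine, logged_candidates):
    """Experiment path: rank logged candidates for offline metric computation."""
    return [item["id"] for item in engine.rank(logged_candidates)]


def production_handler(engine, request):
    """Production path: same engine, wrapped in a request/response shape."""
    return {"ranking": [item["id"] for item in engine.rank(request["candidates"])]}


engine = LinearRanker(weights={"popularity": 1.0, "is_new": 2.0})
candidates = [
    {"id": "a", "plays": 100, "age_days": 99},
    {"id": "b", "plays": 10, "age_days": 0},
]
offline_ranking = offline_experiment(engine, candidates)
online_ranking = production_handler(engine, {"candidates": candidates})["ranking"]
```

Because both paths call the same `rank` method, a ranking validated offline is the ranking served online, which removes the code-discrepancy failure mode of a separate reimplementation.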
To be continued...
We’re hiring! Yves Raimond (@moustaki)