

Reliable Machine Learning



Reliable Machine Learning - Description
Whether you're part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run and establish ML reliably, effectively, and accountably within your organization. You'll gain insight into everything from how to do model monitoring in production to how to run a well-tuned model development team in a product organization.

By applying an SRE mindset to machine learning, authors and engineering professionals Cathy Chen, Kranti Parisa, Niall Richard Murphy, D. Sculley, Todd Underwood, and featured guest authors show you how to run an efficient and reliable ML system. Whether you want to increase revenue, optimize decision making, solve problems, or understand and influence customer behavior, you'll learn how to perform day-to-day ML tasks while keeping the bigger picture in mind.
You'll examine:
- What ML is: how it functions and what it relies on
- Conceptual frameworks for understanding how ML "loops" work
- How effective productionization can make your ML systems easily monitorable, deployable, and operable
- Why ML systems make production troubleshooting more difficult, and how to compensate accordingly
- How ML, product, and production teams can communicate effectively

Reliable Machine Learning - Table of Contents
Foreword
Preface
Why We Wrote This Book
SRE as the Lens on ML
Intended Audience
How This Book Is Organized
Our Approach
Let's Knit!
Navigating This Book
About the Authors
Conventions Used in This Book
O'Reilly Online Learning
How to Contact Us
Acknowledgments
Cathy Chen
Niall Richard Murphy
Kranti Parisa
D. Sculley
Todd Underwood
1. Introduction
The ML Lifecycle
Data Collection and Analysis
ML Training Pipelines
Build and Validate Applications
Quality and Performance Evaluation
Defining and Measuring SLOs
Launch
Models as code
Launch slowly
Release, not refactor
Isolate rollouts at the data layer
Measure SLOs during launch
Review the rollout
Monitoring and Feedback Loops
Lessons from the Loop
2. Data Management Principles
Data as Liability
The Data Sensitivity of ML Pipelines
Phases of Data
Creation
Ingestion
Processing
Validation
Cleaning and ensuring data consistency
Enriching and extending
Storage
Management
Analysis and Visualization
Data Reliability
Durability
Consistency
Version Control
Performance
Availability
Data Integrity
Security
Privacy
Policy and Compliance
Jurisdictional rules
Reporting requirements
Conclusion
3. Basic Introduction to Models
What Is a Model?
A Basic Model Creation Workflow
Model Architecture Versus Model Definition Versus Trained Model
Where Are the Vulnerabilities?
Training Data
Incomplete coverage
Spurious correlations
Cold start
Self-fulfilling prophecies and ML echo chambers
Changes in the world
Labels
Label noise
Wrong label objective
Fraud or malicious feedback
Training Methods
Overfitting
Lack of stability
Peculiarities of deep learning
Infrastructure and Pipelines
Platforms
Feature Generation
Upgrades and Fixes
A Set of Useful Questions to Ask About Any Model
An Example ML System
Yarn Product Click-Prediction Model
Features
Labels for Features
Model Updating
Model Serving
Common Failures
Conclusion
4. Feature and Training Data
Features
Feature Selection and Engineering
Lifecycle of a Feature
Feature Systems
Data ingestion system
Feature store
Feature quality evaluation system
Labels
Human-Generated Labels
Annotation Workforces
Measuring Human Annotation Quality
An Annotation Platform
Active Learning and AI-Assisted Labeling
Documentation and Training for Labelers
Metadata
Metadata Systems Overview
Dataset Metadata
Feature Metadata
Label Metadata
Pipeline Metadata
Data Privacy and Fairness
Privacy
PII data and features
Private data and labeling
Fairness
Conclusion
5. Evaluating Model Validity and Quality
Evaluating Model Validity
Evaluating Model Quality
Offline Evaluations
Evaluation Distributions
Held-out test data
Progressive validation
Golden sets
Stress-test distributions
Sliced analysis
Counterfactual testing
A Few Useful Metrics
Canary metrics
Bias
Calibration
Classification metrics
Accuracy
Precision and recall
AUC ROC
Precision/recall curves
Regression metrics
Mean squared error and mean absolute error
Log loss
Operationalizing Verification and Evaluation
Conclusion
6. Fairness, Privacy, and Ethical ML Systems
Fairness (a.k.a. Fighting Bias)
Definitions of Fairness
Reaching Fairness
Fairness as a Process Rather than an Endpoint
A Quick Legal Note
Privacy
Methods to Preserve Privacy
Technical measures
Institutional measures
A Quick Legal Note
Responsible AI
Explanation
Effectiveness
Social and Cultural Appropriateness
Responsible AI Along the ML Pipeline
Use Case Brainstorming
Data Collection and Cleaning
Model Creation and Training
Model Validation and Quality Assessment
Model Deployment
Products for the Market
Conclusion
7. Training Systems
Requirements
Basic Training System Implementation
Features
Feature Store
Model Management System
Orchestration
Job/process/resource scheduling system
ML framework
Quality Evaluation
Monitoring
General Reliability Principles
Most Failures Will Not Be ML Failures
Models Will Be Retrained
Models Will Have Multiple Versions (at the Same Time!)
Good Models Will Become Bad
Data Will Be Unavailable
Models Should Be Improvable
Features Will Be Added and Changed
Models Can Train Too Fast
Resource Utilization Matters
Utilization != Efficiency
Outages Include Recovery
Common Training Reliability Problems
Data Sensitivity
Example Data Problem at YarnIt
Reproducibility
Example Reproducibility Problem at YarnIt
Compute Resource Capacity
Example Capacity Problem at YarnIt
Structural Reliability
Organizational Challenges
Ethics and Fairness Considerations
Conclusion
8. Serving
Key Questions for Model Serving
What Will Be the Load to Our Model?
What Are the Prediction Latency Needs of Our Model?
Where Does the Model Need to Live?
On a local machine
On servers owned or managed by our organization
In the cloud
On-device
What Are the Hardware Needs for Our Model?
How Will the Serving Model Be Stored, Loaded, Versioned, and Updated?
What Will Our Feature Pipeline for Serving Look Like?
Model Serving Architectures
Offline Serving (Batch Inference)
Advantages
Disadvantages
Online Serving (Online Inference)
Advantages
Disadvantages
Model as a Service
Advantages
Disadvantages
Serving at the Edge
Advantages
Disadvantages
Choosing an Architecture
Model API Design
Testing
Serving for Accuracy or Resilience?
Scaling
Autoscaling
Caching
Disaster Recovery
Ethics and Fairness Considerations
Conclusion
9. Monitoring and Observability for Models
What Is Production Monitoring and Why Do It?
What Does It Look Like?
The Concerns That ML Brings to Monitoring
Reasons for Continual ML Observability in Production
Problems with ML Production Monitoring
Difficulties of Development Versus Serving
A Mindset Change Is Required
Best Practices for ML Model Monitoring
Generic Pre-serving Model Recommendations
Explainability and monitoring
Training and Retraining
Concrete recommendations
Model Validation (Before Rollout)
Fallbacks in validation
Call to action
Concrete recommendations
Serving
Model
Case 1: Real-time actuals
Case 2: Delayed actuals
Case 3: Biased actuals
Case 4: No/few actuals
Other approaches
Troubleshooting model performance metrics
Data
Drift
Measuring drift
Troubleshooting drift
Data quality
Categorical data
Numerical data
Measuring data quality
Service
Optimizing performance of the model
Optimizing performance of the service
Other Things to Consider
SLOs in ML monitoring
Monitoring across services
Fairness in monitoring
Privacy in monitoring
Business impact
Dense data types (image, video, text documents, audio, and so on)
High-Level Recommendations for Monitoring Strategy
Conclusion
10. Continuous ML
Anatomy of a Continuous ML System
Training Examples
Training Labels
Filtering Out Bad Data
Feature Stores and Data Management
Updating the Model
Pushing Updated Models to Serving
Observations About Continuous ML Systems
External World Events May Influence Our Systems
Models Can Influence Their Own Training Data
Temporal Effects Can Arise at Several Timescales
Emergency Response Must Be Done in Real Time
Stop training
Fall back
Roll back
Remove bad data
Roll through
Choosing a response strategy
Organizational considerations
New Launches Require Staged Ramp-ups and Stable Baselines
Models Must Be Managed Rather Than Shipped
Continuous Organizations
Rethinking Noncontinuous ML Systems
Conclusion
11. Incident Response
Incident Management Basics
Life of an Incident
Incident Response Roles
Anatomy of an ML-Centric Outage
Terminology Reminder: Model
Story Time
Story 1: Searching but Not Finding
Stages of ML incident response for story 1
Story 2: Suddenly Useless Partners
Stages of ML incident response for story 2
Story 3: Recommend You Find New Suppliers
Stages of ML incident response for story 3
ML Incident Management Principles
Guiding Principles
Model Developer or Data Scientist
Preparation
Incident handling
Continuous improvement
Software Engineer
Preparation
Incident handling
Continuous improvement
ML SRE or Production Engineer
Preparation
Incident handling
Continuous improvement
Product Manager or Business Leader
Preparation
Incident handling
Continuous improvement
Special Topics
Production Engineers and ML Engineering Versus Modeling
The Ethical On-Call Engineer Manifesto
Impact
Cause
Troubleshooting
Solutions and a call to action
Conclusion
12. How Product and ML Interact
Different Types of Products
Agile ML?
ML Product Development Phases
Discovery and Definition
Business Goal Setting
MVP Construction and Validation
Model and Product Development
Deployment
Support and Maintenance
Build Versus Buy
Models
Generic use cases
Company's data initiatives
Data Processing Infrastructure
End-to-End Platforms
Scoring Approach for Making the Decision
Making the Decision
Sample YarnIt Store Features Powered by ML
Showcasing Popular Yarns by Total Sales
Recommendations Based on Browsing History
Cross-selling and Upselling
Content-Based Filtering
Collaborative Filtering
Conclusion
13. Integrating ML into Your Organization
Chapter Assumptions
Leader-Based Viewpoint
Detail Matters
ML Needs to Know About the Business
The Most Important Assumption You Make
The Value of ML
Significant Organizational Risks
ML Is Not Magic
Mental (Way of Thinking) Model Inertia
Surfacing Risk Correctly in Different Cultures
Siloed Teams Don't Solve All Problems
Implementation Models
Remembering the Goal
Greenfield Versus Brownfield
ML Roles and Responsibilities
How to Hire ML Folks
Organizational Design and Incentives
Strategy
Structure
Processes
Rewards
People
A Note on Sequencing
Conclusion
14. Practical ML Org Implementation Examples
Scenario 1: A New Centralized ML Team
Background and Organizational Description
Process
Rewards
People
Default Implementation
Scenario 2: Decentralized ML Infrastructure and Expertise
Background and Organizational Description
Process
Rewards
People
Default Implementation
Scenario 3: Hybrid with Centralized Infrastructure/Decentralized Modeling
Background and Organizational Description
Process
Rewards
People
Default Implementation
Conclusion
15. Case Studies: MLOps in Practice
1. Accommodating Privacy and Data Retention Policies in ML Pipelines
Background
Problem and Resolution
Challenge 1: Which dialects?
Solution: Get rid of the concept of dialects!
Challenge 2: Racing the clock
Solutions (and new challenges!)
Takeaways
2. Continuous ML Model Impacting Traffic
Background
Problem and Resolution
Takeaways
3. Steel Inspection
Background
Problem and Resolution
Takeaways
4. NLP MLOps: Profiling and Staging Load Test
Background
Problem and Resolution
An improved process for benchmarking
Takeaways
5. Ad Click Prediction: Databases Versus Reality
Background
Problem and Resolution
Takeaways
6. Testing and Measuring Dependencies in ML Workflow
Background
Problem and Resolution
Building the regression-testing sandbox
Monitoring for regression
Takeaways
Index