

Designing Data-Intensive Applications. The Big



Designing Data-Intensive Applications. The Big - Description
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords?

In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. Software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.

Peer under the hood of the systems you already use, and learn how to use and operate them more effectively
Make informed decisions by identifying the strengths and weaknesses of different tools
Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity
Understand the distributed systems research upon which modern databases are built
Peek behind the scenes of major online services, and learn from their architectures

Table of contents:
Preface
Who Should Read This Book?
Scope of This Book
Outline of This Book
References and Further Reading
O'Reilly Safari
How to Contact Us
Acknowledgments
I. Foundations of Data Systems
1. Reliable, Scalable, and Maintainable Applications
Thinking About Data Systems
Reliability
Hardware Faults
Software Errors
Human Errors
How Important Is Reliability?
Scalability
Describing Load
Describing Performance
Approaches for Coping with Load
Maintainability
Operability: Making Life Easy for Operations
Simplicity: Managing Complexity
Evolvability: Making Change Easy
Summary
2. Data Models and Query Languages
Relational Model Versus Document Model
The Birth of NoSQL
The Object-Relational Mismatch
Many-to-One and Many-to-Many Relationships
Are Document Databases Repeating History?
The network model
The relational model
Comparison to document databases
Relational Versus Document Databases Today
Which data model leads to simpler application code?
Schema flexibility in the document model
Data locality for queries
Convergence of document and relational databases
Query Languages for Data
Declarative Queries on the Web
MapReduce Querying
Graph-Like Data Models
Property Graphs
The Cypher Query Language
Graph Queries in SQL
Triple-Stores and SPARQL
The semantic web
The RDF data model
The SPARQL query language
The Foundation: Datalog
Summary
3. Storage and Retrieval
Data Structures That Power Your Database
Hash Indexes
SSTables and LSM-Trees
Constructing and maintaining SSTables
Making an LSM-tree out of SSTables
Performance optimizations
B-Trees
Making B-trees reliable
B-tree optimizations
Comparing B-Trees and LSM-Trees
Advantages of LSM-trees
Downsides of LSM-trees
Other Indexing Structures
Storing values within the index
Multi-column indexes
Full-text search and fuzzy indexes
Keeping everything in memory
Transaction Processing or Analytics?
Data Warehousing
The divergence between OLTP databases and data warehouses
Stars and Snowflakes: Schemas for Analytics
Column-Oriented Storage
Column Compression
Memory bandwidth and vectorized processing
Sort Order in Column Storage
Several different sort orders
Writing to Column-Oriented Storage
Aggregation: Data Cubes and Materialized Views
Summary
4. Encoding and Evolution
Formats for Encoding Data
Language-Specific Formats
JSON, XML, and Binary Variants
Binary encoding
Thrift and Protocol Buffers
Field tags and schema evolution
Datatypes and schema evolution
Avro
The writer's schema and the reader's schema
Schema evolution rules
But what is the writer's schema?
Dynamically generated schemas
Code generation and dynamically typed languages
The Merits of Schemas
Modes of Dataflow
Dataflow Through Databases
Different values written at different times
Archival storage
Dataflow Through Services: REST and RPC
Web services
The problems with remote procedure calls (RPCs)
Current directions for RPC
Data encoding and evolution for RPC
Message-Passing Dataflow
Message brokers
Distributed actor frameworks
Summary
II. Distributed Data
5. Replication
Leaders and Followers
Synchronous Versus Asynchronous Replication
Setting Up New Followers
Handling Node Outages
Follower failure: Catch-up recovery
Leader failure: Failover
Implementation of Replication Logs
Statement-based replication
Write-ahead log (WAL) shipping
Logical (row-based) log replication
Trigger-based replication
Problems with Replication Lag
Reading Your Own Writes
Monotonic Reads
Consistent Prefix Reads
Solutions for Replication Lag
Multi-Leader Replication
Use Cases for Multi-Leader Replication
Multi-datacenter operation
Clients with offline operation
Collaborative editing
Handling Write Conflicts
Synchronous versus asynchronous conflict detection
Conflict avoidance
Converging toward a consistent state
Custom conflict resolution logic
What is a conflict?
Multi-Leader Replication Topologies
Leaderless Replication
Writing to the Database When a Node Is Down
Read repair and anti-entropy
Quorums for reading and writing
Limitations of Quorum Consistency
Monitoring staleness
Sloppy Quorums and Hinted Handoff
Multi-datacenter operation
Detecting Concurrent Writes
Last write wins (discarding concurrent writes)
The happens-before relationship and concurrency
Capturing the happens-before relationship
Merging concurrently written values
Version vectors
Summary
6. Partitioning
Partitioning and Replication
Partitioning of Key-Value Data
Partitioning by Key Range
Partitioning by Hash of Key
Skewed Workloads and Relieving Hot Spots
Partitioning and Secondary Indexes
Partitioning Secondary Indexes by Document
Partitioning Secondary Indexes by Term
Rebalancing Partitions
Strategies for Rebalancing
How not to do it: hash mod N
Fixed number of partitions
Dynamic partitioning
Partitioning proportionally to nodes
Operations: Automatic or Manual Rebalancing
Request Routing
Parallel Query Execution
Summary
7. Transactions
The Slippery Concept of a Transaction
The Meaning of ACID
Atomicity
Consistency
Isolation
Durability
Single-Object and Multi-Object Operations
Single-object writes
The need for multi-object transactions
Handling errors and aborts
Weak Isolation Levels
Read Committed
No dirty reads
No dirty writes
Implementing read committed
Snapshot Isolation and Repeatable Read
Implementing snapshot isolation
Visibility rules for observing a consistent snapshot
Indexes and snapshot isolation
Repeatable read and naming confusion
Preventing Lost Updates
Atomic write operations
Explicit locking
Automatically detecting lost updates
Compare-and-set
Conflict resolution and replication
Write Skew and Phantoms
Characterizing write skew
More examples of write skew
Phantoms causing write skew
Materializing conflicts
Serializability
Actual Serial Execution
Encapsulating transactions in stored procedures
Pros and cons of stored procedures
Partitioning
Summary of serial execution
Two-Phase Locking (2PL)
Implementation of two-phase locking
Performance of two-phase locking
Predicate locks
Index-range locks
Serializable Snapshot Isolation (SSI)
Pessimistic versus optimistic concurrency control
Decisions based on an outdated premise
Detecting stale MVCC reads
Detecting writes that affect prior reads
Performance of serializable snapshot isolation
Summary
8. The Trouble with Distributed Systems
Faults and Partial Failures
Cloud Computing and Supercomputing
Unreliable Networks
Network Faults in Practice
Detecting Faults
Timeouts and Unbounded Delays
Network congestion and queueing
Synchronous Versus Asynchronous Networks
Can we not simply make network delays predictable?
Unreliable Clocks
Monotonic Versus Time-of-Day Clocks
Time-of-day clocks
Monotonic clocks
Clock Synchronization and Accuracy
Relying on Synchronized Clocks
Timestamps for ordering events
Clock readings have a confidence interval
Synchronized clocks for global snapshots
Process Pauses
Response time guarantees
Limiting the impact of garbage collection
Knowledge, Truth, and Lies
The Truth Is Defined by the Majority
The leader and the lock
Fencing tokens
Byzantine Faults
Weak forms of lying
System Model and Reality
Correctness of an algorithm
Safety and liveness
Mapping system models to the real world
Summary
9. Consistency and Consensus
Consistency Guarantees
Linearizability
What Makes a System Linearizable?
Relying on Linearizability
Locking and leader election
Constraints and uniqueness guarantees
Cross-channel timing dependencies
Implementing Linearizable Systems
Linearizability and quorums
The Cost of Linearizability
The CAP theorem
Linearizability and network delays
Ordering Guarantees
Ordering and Causality
The causal order is not a total order
Linearizability is stronger than causal consistency
Capturing causal dependencies
Sequence Number Ordering
Noncausal sequence number generators
Lamport timestamps
Timestamp ordering is not sufficient
Total Order Broadcast
Using total order broadcast
Implementing linearizable storage using total order broadcast
Implementing total order broadcast using linearizable storage
Distributed Transactions and Consensus
Atomic Commit and Two-Phase Commit (2PC)
From single-node to distributed atomic commit
Introduction to two-phase commit
A system of promises
Coordinator failure
Three-phase commit
Distributed Transactions in Practice
Exactly-once message processing
XA transactions
Holding locks while in doubt
Recovering from coordinator failure
Limitations of distributed transactions
Fault-Tolerant Consensus
Consensus algorithms and total order broadcast
Single-leader replication and consensus
Epoch numbering and quorums
Limitations of consensus
Membership and Coordination Services
Allocating work to nodes
Service discovery
Membership services
Summary
III. Derived Data
10. Batch Processing
Batch Processing with Unix Tools
Simple Log Analysis
Chain of commands versus custom program
Sorting versus in-memory aggregation
The Unix Philosophy
A uniform interface
Separation of logic and wiring
Transparency and experimentation
MapReduce and Distributed Filesystems
MapReduce Job Execution
Distributed execution of MapReduce
MapReduce workflows
Reduce-Side Joins and Grouping
Example: analysis of user activity events
Sort-merge joins
Bringing related data together in the same place
GROUP BY
Handling skew
Map-Side Joins
Broadcast hash joins
Partitioned hash joins
Map-side merge joins
MapReduce workflows with map-side joins
The Output of Batch Workflows
Building search indexes
Key-value stores as batch process output
Philosophy of batch process outputs
Comparing Hadoop to Distributed Databases
Diversity of storage
Diversity of processing models
Designing for frequent faults
Beyond MapReduce
Materialization of Intermediate State
Dataflow engines
Fault tolerance
Discussion of materialization
Graphs and Iterative Processing
The Pregel processing model
Fault tolerance
Parallel execution
High-Level APIs and Languages
The move toward declarative query languages
Specialization for different domains
Summary
11. Stream Processing
Transmitting Event Streams
Messaging Systems
Direct messaging from producers to consumers
Message brokers
Message brokers compared to databases
Multiple consumers
Acknowledgments and redelivery
Partitioned Logs
Using logs for message storage
Logs compared to traditional messaging
Consumer offsets
Disk space usage
When consumers cannot keep up with producers
Replaying old messages
Databases and Streams
Keeping Systems in Sync
Change Data Capture
Implementing change data capture
Initial snapshot
Log compaction
API support for change streams
Event Sourcing
Deriving current state from the event log
Commands and events
State, Streams, and Immutability
Advantages of immutable events
Deriving several views from the same event log
Concurrency control
Limitations of immutability
Processing Streams
Uses of Stream Processing
Complex event processing
Stream analytics
Maintaining materialized views
Search on streams
Message passing and RPC
Reasoning About Time
Event time versus processing time
Knowing when you're ready
Whose clock are you using, anyway?
Types of windows
Stream Joins
Stream-stream join (window join)
Stream-table join (stream enrichment)
Table-table join (materialized view maintenance)
Time-dependence of joins
Fault Tolerance
Microbatching and checkpointing
Atomic commit revisited
Idempotence
Rebuilding state after a failure
Summary
12. The Future of Data Systems
Data Integration
Combining Specialized Tools by Deriving Data
Reasoning about dataflows
Derived data versus distributed transactions
The limits of total ordering
Ordering events to capture causality
Batch and Stream Processing
Maintaining derived state
Reprocessing data for application evolution
The lambda architecture
Unifying batch and stream processing
Unbundling Databases
Composing Data Storage Technologies
Creating an index
The meta-database of everything
Making unbundling work
Unbundled versus integrated systems
What's missing?
Designing Applications Around Dataflow
Application code as a derivation function
Separation of application code and state
Dataflow: Interplay between state changes and application code
Stream processors and services
Observing Derived State
Materialized views and caching
Stateful, offline-capable clients
Pushing state changes to clients
End-to-end event streams
Reads are events too
Multi-partition data processing
Aiming for Correctness
The End-to-End Argument for Databases
Exactly-once execution of an operation
Duplicate suppression
Operation identifiers
The end-to-end argument
Applying end-to-end thinking in data systems
Enforcing Constraints
Uniqueness constraints require consensus
Uniqueness in log-based messaging
Multi-partition request processing
Timeliness and Integrity
Correctness of dataflow systems
Loosely interpreted constraints
Coordination-avoiding data systems
Trust, but Verify
Maintaining integrity in the face of software bugs
Don't just blindly trust what they promise
A culture of verification
Designing for auditability
The end-to-end argument again
Tools for auditable data systems
Doing the Right Thing
Predictive Analytics
Bias and discrimination
Responsibility and accountability
Feedback loops
Privacy and Tracking
Surveillance
Consent and freedom of choice
Privacy and use of data
Data as assets and power
Remembering the Industrial Revolution
Legislation and self-regulation
Summary
Glossary
Index

About the author: Martin Kleppmann researches distributed systems. He works at the University of Cambridge in the United Kingdom. Previously he was a software engineer at companies such as LinkedIn and Rapportive, where he worked on large-scale data infrastructure. Kleppmann is a blogger, a frequent conference speaker, and an open source developer. He believes that important ideas in science and technology should be accessible to everyone, and that understanding them better will make it possible to build better software.