
Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale

Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale - Description
There's a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you'll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.
Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You'll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:
Infrastructure: Look at all component layers in a modern data platform, from the server to the data center, to establish a solid foundation for data in your enterprise
Platform: Understand aspects of deployment, operation, security, high availability, and disaster recovery, along with everything you need to know to integrate your platform with the rest of your enterprise IT
Taking Hadoop to the cloud: Learn the important architectural aspects of running a big data platform in the cloud while maintaining enterprise security and high availability
Table of contents:
Foreword
Preface
Some Misconceptions
Some General Trends
Horizontal Scaling
Adoption of Open Source
Embracing Cloud Compute
Decoupled Compute and Storage
What Is This Book About?
Who Should Read This Book?
The Road Ahead
Conventions Used in This Book
O'Reilly Safari
How to Contact Us
Acknowledgments
1. Big Data Technology Primer
A Tour of the Landscape
Core Components
HDFS
YARN
Apache ZooKeeper
Apache Hive Metastore
Going deeper
Computational Frameworks
Hadoop MapReduce
Apache Spark
Going deeper
Analytical SQL Engines
Apache Hive
Going deeper
Apache Impala
Going deeper
Also consider
Storage Engines
Apache HBase
Going deeper
Also consider
Apache Kudu
Going deeper
Apache Solr
Going deeper
Also consider
Apache Kafka
Going deeper
Ingestion
Orchestration
Apache Oozie
Also consider
Summary
I. Infrastructure
2. Clusters
Reasons for Multiple Clusters
Multiple Clusters for Resiliency
Sizing resilient clusters
Multiple Clusters for Software Development
Variation in cluster sizing
Multiple Clusters for Workload Isolation
Sizing multiple clusters for workload isolation
Multiple Clusters for Legal Separation
Multiple Clusters and Independent Storage and Compute
Multitenancy
Requirements for Multitenancy
Sizing Clusters
Sizing by Storage
Sizing HDFS by storage
Sizing Kafka by storage
Sizing Kudu by storage
Sizing by Ingest Rate
Sizing by Workload
Cluster Growth
The Drivers of Cluster Growth
Implementing Cluster Growth
Data Replication
Replication for Software Development
Replication and Workload Isolation
Summary
3. Compute and Storage
Computer Architecture for Hadoop
Commodity Servers
Server CPUs and RAM
The role of the x86 architecture
Threads and cores in Hadoop
Nonuniform Memory Access
Why is NUMA important for big data?
CPU Specifications
RAM
Commoditized Storage Meets the Enterprise
Modularity of Compute and Storage
Everything Is Java
Replication or Erasure Coding?
Alternatives
Hadoop and the Linux Storage Stack
User Space
Important System Calls
The Linux Page Cache
Short-Circuit and Zero-Copy Reads
Filesystems
Erasure Coding Versus Replication
Discussion
Network performance
Write performance
Locality optimization
Read performance
Guidance
Low-Level Storage
Storage Controllers
RAID?
Controller cache
Read-ahead caching
Write-back caching
Guidelines
Disk Layer
SAS, Nearline SAS, or SATA (or SSDs)?
Disk sizes
Disk cache
Server Form Factors
Form Factor Comparison
Guidance
Workload Profiles
Cluster Configurations and Node Types
Master Nodes
Worker Nodes
Utility Nodes
Edge Nodes
Small Cluster Configurations
Medium Cluster Configurations
Large Cluster Configurations
Summary
4. Networking
How Services Use a Network
Remote Procedure Calls (RPCs)
Implementations and architectures
Platform services and their RPCs
Process control
Latency
Latency and cluster services
Data Transfers
Replication
Shuffles
Monitoring
Backup
Consensus
Network Architectures
Small Cluster Architectures
Single switch
Implementation
Medium Cluster Architectures
Stacked networks
Resiliency
Performance
Determining oversubscription in stacked networks
Stacked network cabling considerations
Implementation
Fat-tree networks
Scalability
Resiliency
Implementation
Large Cluster Architectures
Modular switches
Spine-leaf networks
Scalability
Resilient spine-leaf networks
Implementation
Network Integration
Reusing an Existing Network
Creating an Additional Network
Edge-connected networks
Network Design Considerations
Layer 1 Recommendations
Layer 2 Recommendations
Layer 3 Recommendations
Summary
5. Organizational Challenges
Who Runs It?
Is It Infrastructure, Middleware, or an Application?
Case Study: A Typical Business Intelligence Project
The Traditional Approach
Typical Team Setup
Architect
Analyst
Software developer
Administrator
Systems engineer
Compartmentalization of IT
Revised Team Setup for Hadoop in the Enterprise
Big data architect
Data scientist
Big data engineer
Solution Overview with Hadoop
New Team Setup
Split Responsibilities
Do I Need DevOps?
Do I Need a Center of Excellence/Competence?
Summary
6. Datacenter Considerations
Why Does It Matter?
Basic Datacenter Concepts
Cooling
Power
Network
Rack Awareness and Rack Failures
Failure Domain Alignment
Space and Racking Constraints
Ingest and Intercluster Connectivity
Software
Hardware
Replacements and Repair
Operational Procedures
Typical Pitfalls
Networking
Cluster Spanning
Nonstandard use of rack awareness
Bandwidth impairment
Quorum spanning with two datacenters
Quorum spanning with three datacenters
Alternative solutions
Summary
II. Platform
7. Provisioning Clusters
Operating Systems
OS Choices
OS Configuration for Hadoop
Automated Configuration Example
Service Databases
Required Databases
Database Integration Options
Database Considerations
Hadoop Deployment
Hadoop Distributions
Installation Choices
Distribution Architecture
Installation Process
Summary
8. Platform Validation
Testing Methodology
Useful Tools
Hardware Validation
CPU
Validation approaches
Disks
Sequential I/O performance
Disk health
Network
Measuring latency
Latency under load
Measuring throughput
Throughput under load
Hadoop Validation
HDFS Validation
Single writes and reads
Distributed writes and reads
General Validation
TeraGen
Disk-only tests
Disk and network tests
TeraSort
Validating Other Components
Operations Validation
Summary
9. Security
In-Flight Encryption
TLS Encryption
TLS and Java
TLS and non-Java processes
X.509
SASL Quality of Protection
Enabling In-Flight Encryption
Authentication
Kerberos
Principals
Accessing services
Keytabs
Kerberos over HTTP
Cross-realm trusts
LDAP Authentication
Delegation Tokens
Impersonation
Authorization
Group Resolution
Superusers and Supergroups
Restricting superusers
Supergroups
Hadoop Service Level Authorization
Centralized Security Management
HDFS
YARN
ZooKeeper
Hive
Impala
HBase
Solr
Kudu
Oozie
Hue
Kafka
Sentry
At-Rest Encryption
Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
HDFS Transparent Data Encryption
Encrypting and decrypting files in encryption zones
Authorizing key operations
KMS implementations
Encrypting Temporary Files
Summary
10. Integration with Identity Management Providers
Integration Areas
Integration Scenarios
Scenario 1: Writing a File to HDFS
Scenario 2: Submitting a Hive Query
Scenario 3: Running a Spark Job
Integration Providers
LDAP Integration
Background
LDAP Security
Load Balancing
Application Integration
Linux Integration
SSSD
Kerberos Integration
Kerberos Clients
KDC Integration
Setting up cross-realm trusts
One-way trust between MIT KDC and AD
One-way trusts between MIT KDCs
Local cluster KDC
Local cluster KDC and corporate user KDC
Corporate KDC
Certificate Management
Signing Certificates
Converting Certificates
Wildcard Certificates
Automation
Summary
11. Accessing and Interacting with Clusters
Access Mechanisms
Programmatic Access
Command-Line Access
Web UIs
Access Topologies
Interaction Patterns
Proxy Access
HTTP proxies
SOCKS proxies
Service proxies
Load Balancing
Edge Node Interactions
HDFS
YARN
MapReduce
Spark
Hive
Impala
HBase
Solr
Oozie
Kudu
Access Security
Administration Gateways
Workbenches
Hue
Notebooks
Landing Zones
Summary
12. High Availability
High Availability Defined
Lateral/Service HA
Vertical/Systemic HA
Measuring Availability
Percentages
Percentiles
Operating for HA
Monitoring
Playbooks and Postmortems
HA Building Blocks
Quorums
Load Balancing
DNS round robin
Virtual IP
Dedicated load balancers
Session persistence
Hardware versus software
Security considerations
Database HA
Clustering
Replication
Supported databases
Ancillary Services
Essentials
Identity management providers
General Considerations
Separation of Master and Worker Processes
Separation of Identical Service Roles
Master Servers in Separate Failure Domains
Balanced Master Configurations
Optimized Server Configurations
High Availability of Cluster Services
ZooKeeper
Failover
Deployment considerations
HDFS
HA configurations
Manual failover
Automatic failover
Quorum Journal Manager mode
Security
Deployment recommendations
YARN
Manual failover
Automatic failover
Deployment recommendations
HBase
HMaster HA
Region replication
Deployment considerations
KMS
Deployment considerations
Hive
Metastore
HiveServer2
HA architecture
Deployment considerations
Impala
Impala daemons
Catalog server
Statestore
Architecting for HA
Deployment considerations
Solr
Deployment considerations
Kafka
Deployment considerations
Oozie
Deployment considerations
Hue
Deployment options
Other Services
Autoconfiguration
Summary
13. Backup and Disaster Recovery
Context
Many Distributed Systems
Policies and Objectives
Failure Scenarios
Suitable Data Sources
Strategies
Replication
Snapshots
Backups
Rack awareness and high availability
Data Types
Consistency
Validation
Summary
Data Replication
HBase
Cluster Management Tools
Kafka
Summary
Hadoop Cluster Backups
Databases
Subsystems
Cloudera Manager
Apache Ambari
HDFS
Hive Metastore
HBase
YARN
Oozie
Apache Sentry
Apache Ranger
Hue
Case Study: Automating Backups with Oozie
Introduction
Subflow: HDFS
Subflow: HBase
Subflow: Database
Backup workflow
Restore
Summary
III. Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop
Compute Virtualization
Virtual Machine Distribution
Anti-Affinity Groups
Storage Virtualization
Virtualizing Local Storage
SANs
Object Storage and Network-Attached Storage
Network-attached storage
Object storage
Network Virtualization
Cluster Life Cycle Models
Summary
15. Solutions for Private Clouds
OpenStack
Automation and Integration
Life Cycle and Storage
Isolation
Summary
OpenShift
Automation
Life Cycle and Storage
Isolation
Summary
VMware and Pivotal Cloud Foundry
Do It Yourself?
Automation
Isolation
Life Cycle Model
Summary
Object Storage for Private Clouds
EMC Isilon
Ceph
Object storage
CephFS
Remote block storage
Summary
Summary
16. Solutions in the Public Cloud
Key Things to Know
Cloud Providers
AWS
AWS instance types
AWS storage options
Amazon Elastic MapReduce
Caveats and service limits
Microsoft Azure
Azure instance types
Azure storage options
HDInsight
Caveats and service limits
Google Cloud Platform
Instance types
Storage options
Cloud Dataproc
Caveats and service limits
Implementing Clusters
Instances
CPU-heavy instances
Balanced instances
Memory-heavy instances
Instances summary
Storage and Life Cycle Models
Suspendable clusters
One-off clusters
Sticky clusters
Storage compatibility
Storage and life cycle summary
Network Architecture
High Availability
The requirement for HA
Compute availability
Cluster availability
Instance availability
Data availability
Network availability
Service availability
Databases
Load balancers
Summary
17. Automated Provisioning
Long-Lived Clusters
Configuration and Templating
Deployment Phases
Environment configuration
Instance provisioning
Instance configuration
Cluster installation and configuration
Post-install tasks
Vendor Solutions
Cloudera Director
Ongoing management
One-Click Deployments
Homegrown Automation
Hooking Into a Provisioning Life Cycle
Scaling Up and Down
Deploying with Security
Integrating with a Kerberos KDC
TLS
Transient Clusters
Sharing Metadata Services
Summary
18. Security in the Cloud
Assessing the Risk
Risk Model
Environmental Risks
Mitigation
Deployment Risks
Mitigation
Identity Provider Options for Hadoop
Option A: Cloud-Only Self-Contained ID Services
Option B: Cloud-Only Shared ID Services
Option C: On-Premises ID Services
Object Storage Security and Hadoop
Identity and Access Management
Amazon Simple Storage Service
Hadoop integration
Temporary security credentials
Persistent credentials
Environment variables
Instance roles
Anonymous access
Further information
GCP Cloud Storage
Hadoop integration
Service account
User account
Further information
Microsoft Azure
Disk storage
Blob storage
ADLS
Hadoop integration
Azure Blob storage
ADLS
Further information
Auditing
Encryption for Data at Rest
Requirements for Key Material
Options for Encryption in the Cloud
On-Premises Key Persistence
Encryption via the Cloud Provider
Cloud Key Management Services
Server-side and client-side encryption
BYOK
Encryption in AWS
Encryption in Microsoft Azure
Encryption in GCP
Encryption Feature and Interoperability Summary
Recommendations and Summary for Cloud Encryption
Encrypting Data in Flight in the Cloud
Perimeter Controls and Firewalling
GCP
Example implementation
AWS
Example implementation
Azure
Use case implementation
Summary
A. Backup Onboarding Checklist
Backup Onboarding Checklist
Backup
Services
Cloudera Manager
HDFS
HBase
Hive/Impala
Sqoop
Oozie
Hue
Sentry
Index