

Learning Spark - Jules S. Damji



Learning Spark - Jules S. Damji - Description
Data is bigger, arrives faster, and comes in a variety of formats, and it all needs to be processed at scale for analytics or machine learning. But how can you process such varied workloads efficiently? Enter Apache Spark.

Updated to include Spark 3.0, this second edition shows data engineers and data scientists why structure and unification in Spark matters. Specifically, this book explains how to perform simple and complex data analytics and employ machine learning algorithms. Through step-by-step walk-throughs, code snippets, and notebooks, you'll be able to:

- Learn Python, SQL, Scala, or Java high-level Structured APIs
- Understand Spark operations and the SQL engine
- Inspect, tune, and debug Spark operations with Spark configurations and the Spark UI
- Connect to data sources: JSON, Parquet, CSV, Avro, ORC, Hive, S3, or Kafka
- Perform analytics on batch and streaming data using Structured Streaming
- Build reliable data pipelines with open source Delta Lake and Spark
- Develop machine learning pipelines with MLlib and productionize models using MLflow

(A minimal PySpark sketch illustrating these Structured APIs appears after the table of contents below.)

Table of contents:

Foreword
Preface
Who This Book Is For
How the Book Is Organized
How to Use the Code Examples
Software and Configuration Used
Conventions Used in This Book
Using Code Examples
O'Reilly Online Learning
How to Contact Us
Acknowledgments
1. Introduction to Apache Spark: A Unified Analytics Engine
The Genesis of Spark
Big Data and Distributed Computing at Google
Hadoop at Yahoo!
Spark's Early Years at AMPLab
What Is Apache Spark?
Speed
Ease of Use
Modularity
Extensibility
Unified Analytics
Apache Spark Components as a Unified Stack
Spark SQL
Spark MLlib
Spark Structured Streaming
GraphX
Apache Spark's Distributed Execution
Spark driver
SparkSession
Cluster manager
Spark executor
Deployment modes
Distributed data and partitions
The Developer's Experience
Who Uses Spark, and for What?
Data science tasks
Data engineering tasks
Popular Spark use cases
Community Adoption and Expansion
2. Downloading Apache Spark and Getting Started
Step 1: Downloading Apache Spark
Spark's Directories and Files
Step 2: Using the Scala or PySpark Shell
Using the Local Machine
Step 3: Understanding Spark Application Concepts
Spark Application and SparkSession
Spark Jobs
Spark Stages
Spark Tasks
Transformations, Actions, and Lazy Evaluation
Narrow and Wide Transformations
The Spark UI
Your First Standalone Application
Counting M&Ms for the Cookie Monster
Building Standalone Applications in Scala
Summary
3. Apache Spark's Structured APIs
Spark: What's Underneath an RDD?
Structuring Spark
Key Merits and Benefits
The DataFrame API
Spark's Basic Data Types
Spark's Structured and Complex Data Types
Schemas and Creating DataFrames
Two ways to define a schema
Columns and Expressions
Rows
Common DataFrame Operations
Using DataFrameReader and DataFrameWriter
Saving a DataFrame as a Parquet file or SQL table
Transformations and actions
Projections and filters
Renaming, adding, and dropping columns
Aggregations
Other common DataFrame operations
End-to-End DataFrame Example
The Dataset API
Typed Objects, Untyped Objects, and Generic Rows
Creating Datasets
Scala: Case classes
Dataset Operations
End-to-End Dataset Example
DataFrames Versus Datasets
When to Use RDDs
Spark SQL and the Underlying Engine
The Catalyst Optimizer
Phase 1: Analysis
Phase 2: Logical optimization
Phase 3: Physical planning
Phase 4: Code generation
Summary
4. Spark SQL and DataFrames: Introduction to Built-in Data Sources
Using Spark SQL in Spark Applications
Basic Query Examples
SQL Tables and Views
Managed Versus Unmanaged Tables
Creating SQL Databases and Tables
Creating a managed table
Creating an unmanaged table
Creating Views
Temporary views versus global temporary views
Viewing the Metadata
Caching SQL Tables
Reading Tables into DataFrames
Data Sources for DataFrames and SQL Tables
DataFrameReader
DataFrameWriter
Parquet
Reading Parquet files into a DataFrame
Reading Parquet files into a Spark SQL table
Writing DataFrames to Parquet files
Writing DataFrames to Spark SQL tables
JSON
Reading a JSON file into a DataFrame
Reading a JSON file into a Spark SQL table
Writing DataFrames to JSON files
JSON data source options
CSV
Reading a CSV file into a DataFrame
Reading a CSV file into a Spark SQL table
Writing DataFrames to CSV files
CSV data source options
Avro
Reading an Avro file into a DataFrame
Reading an Avro file into a Spark SQL table
Writing DataFrames to Avro files
Avro data source options
ORC
Reading an ORC file into a DataFrame
Reading an ORC file into a Spark SQL table
Writing DataFrames to ORC files
Images
Reading an image file into a DataFrame
Binary Files
Reading a binary file into a DataFrame
Summary
5. Spark SQL and DataFrames: Interacting with External Data Sources
Spark SQL and Apache Hive
User-Defined Functions
Spark SQL UDFs
Evaluation order and null checking in Spark SQL
Speeding up and distributing PySpark UDFs with Pandas UDFs
Querying with the Spark SQL Shell, Beeline, and Tableau
Using the Spark SQL Shell
Create a table
Insert data into the table
Running a Spark SQL query
Working with Beeline
Start the Thrift server
Connect to the Thrift server via Beeline
Execute a Spark SQL query with Beeline
Stop the Thrift server
Working with Tableau
Start the Thrift server
Start Tableau
Stop the Thrift server
External Data Sources
JDBC and SQL Databases
The importance of partitioning
PostgreSQL
MySQL
Azure Cosmos DB
MS SQL Server
Other External Sources
Higher-Order Functions in DataFrames and Spark SQL
Option 1: Explode and Collect
Option 2: User-Defined Function
Built-in Functions for Complex Data Types
Higher-Order Functions
transform()
filter()
exists()
reduce()
Common DataFrames and Spark SQL Operations
Unions
Joins
Windowing
Modifications
Adding new columns
Dropping columns
Renaming columns
Pivoting
Summary
6. Spark SQL and Datasets
Single API for Java and Scala
Scala Case Classes and JavaBeans for Datasets
Working with Datasets
Creating Sample Data
Transforming Sample Data
Higher-order functions and functional programming
Converting DataFrames to Datasets
Memory Management for Datasets and DataFrames
Dataset Encoders
Spark's Internal Format Versus Java Object Format
Serialization and Deserialization (SerDe)
Costs of Using Datasets
Strategies to Mitigate Costs
Summary
7. Optimizing and Tuning Spark Applications
Optimizing and Tuning Spark for Efficiency
Viewing and Setting Apache Spark Configurations
Scaling Spark for Large Workloads
Static versus dynamic resource allocation
Configuring Spark executors' memory and the shuffle service
Maximizing Spark parallelism
How partitions are created
Caching and Persistence of Data
DataFrame.cache()
DataFrame.persist()
When to Cache and Persist
When Not to Cache and Persist
A Family of Spark Joins
Broadcast Hash Join
When to use a broadcast hash join
Shuffle Sort Merge Join
Optimizing the shuffle sort merge join
When to use a shuffle sort merge join
Inspecting the Spark UI
Journey Through the Spark UI Tabs
Jobs and Stages
Executors
Storage
SQL
Environment
Debugging Spark applications
Summary
8. Structured Streaming
Evolution of the Apache Spark Stream Processing Engine
The Advent of Micro-Batch Stream Processing
Lessons Learned from Spark Streaming (DStreams)
The Philosophy of Structured Streaming
The Programming Model of Structured Streaming
The Fundamentals of a Structured Streaming Query
Five Steps to Define a Streaming Query
Step 1: Define input sources
Step 2: Transform data
Step 3: Define output sink and output mode
Step 4: Specify processing details
Step 5: Start the query
Putting it all together
Under the Hood of an Active Streaming Query
Recovering from Failures with Exactly-Once Guarantees
Monitoring an Active Query
Querying current status using StreamingQuery
Get current metrics using StreamingQuery
Get current status using StreamingQuery.status()
Publishing metrics using Dropwizard Metrics
Publishing metrics using custom StreamingQueryListeners
Streaming Data Sources and Sinks
Files
Reading from files
Writing to files
Apache Kafka
Reading from Kafka
Writing to Kafka
Custom Streaming Sources and Sinks
Writing to any storage system
Using foreachBatch()
Using foreach()
Reading from any storage system
Data Transformations
Incremental Execution and Streaming State
Stateless Transformations
Stateful Transformations
Distributed and fault-tolerant state management
Types of stateful operations
Stateful Streaming Aggregations
Aggregations Not Based on Time
Aggregations with Event-Time Windows
Handling late data with watermarks
Semantic guarantees with watermarks
Supported output modes
Streaming Joins
Stream-Static Joins
Stream-Stream Joins
Inner joins with optional watermarking
Outer joins with watermarking
Arbitrary Stateful Computations
Modeling Arbitrary Stateful Operations with mapGroupsWithState()
Using Timeouts to Manage Inactive Groups
Processing-time timeouts
Event-time timeouts
Generalization with flatMapGroupsWithState()
Performance Tuning
Summary
9. Building Reliable Data Lakes with Apache Spark
The Importance of an Optimal Storage Solution
Databases
A Brief Introduction to Databases
Reading from and Writing to Databases Using Apache Spark
Limitations of Databases
Data Lakes
A Brief Introduction to Data Lakes
Reading from and Writing to Data Lakes using Apache Spark
Limitations of Data Lakes
Lakehouses: The Next Step in the Evolution of Storage Solutions
Apache Hudi
Apache Iceberg
Delta Lake
Building Lakehouses with Apache Spark and Delta Lake
Configuring Apache Spark with Delta Lake
Loading Data into a Delta Lake Table
Loading Data Streams into a Delta Lake Table
Enforcing Schema on Write to Prevent Data Corruption
Evolving Schemas to Accommodate Changing Data
Transforming Existing Data
Updating data to fix errors
Deleting user-related data
Upserting change data to a table using merge()
Deduplicating data while inserting using insert-only merge
Auditing Data Changes with Operation History
Querying Previous Snapshots of a Table with Time Travel
Summary
10. Machine Learning with MLlib
What Is Machine Learning?
Supervised Learning
Unsupervised Learning
Why Spark for Machine Learning?
Designing Machine Learning Pipelines
Data Ingestion and Exploration
Creating Training and Test Data Sets
Preparing Features with Transformers
Understanding Linear Regression
Using Estimators to Build Models
Creating a Pipeline
One-hot encoding
Evaluating Models
RMSE
Interpreting the value of RMSE
R²
Saving and Loading Models
Hyperparameter Tuning
Tree-Based Models
Decision trees
Random forests
k-Fold Cross-Validation
Optimizing Pipelines
Summary
11. Managing, Deploying, and Scaling Machine Learning Pipelines with Apache Spark
Model Management
MLflow
Tracking
Model Deployment Options with MLlib
Batch
Streaming
Model Export Patterns for Real-Time Inference
Leveraging Spark for Non-MLlib Models
Pandas UDFs
Spark for Distributed Hyperparameter Tuning
Joblib
Hyperopt
Summary
12. Epilogue: Apache Spark 3.0
Spark Core and Spark SQL
Dynamic Partition Pruning
Adaptive Query Execution
The AQE framework
SQL Join Hints
Shuffle sort merge join (SMJ)
Broadcast hash join (BHJ)
Shuffle hash join (SHJ)
Shuffle-and-replicate nested loop join (SNLJ)
Catalog Plugin API and DataSourceV2
Accelerator-Aware Scheduler
Structured Streaming
PySpark, Pandas UDFs, and Pandas Function APIs
Redesigned Pandas UDFs with Python Type Hints
Iterator Support in Pandas UDFs
New Pandas Function APIs
Changed Functionality
Languages Supported and Deprecated
Changes to the DataFrame and Dataset APIs
DataFrame and SQL Explain Commands
Summary
Index
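
To give a flavor of the high-level Structured APIs the book teaches, here is a minimal PySpark sketch in the spirit of Chapter 2's "Counting M&Ms for the Cookie Monster" walk-through. The file path and column names are illustrative assumptions, not copied from the book or its companion repository:

# Minimal sketch of Spark's Structured APIs, in the spirit of the book's
# M&M counting walk-through. Path and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count

# SparkSession is the single entry point to the Structured APIs (Chapter 2).
spark = (SparkSession.builder
         .appName("MnMCount")
         .getOrCreate())

# Read a CSV file into a DataFrame; Chapter 4 covers CSV, JSON, Parquet,
# Avro, and ORC sources. "data/mnm_dataset.csv" is a hypothetical path.
mnm_df = (spark.read.format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("data/mnm_dataset.csv"))

# A projection, a grouping, and an aggregation: the transformations
# Chapter 3 walks through step by step.
count_df = (mnm_df.select("State", "Color", "Count")
            .groupBy("State", "Color")
            .agg(count("Count").alias("Total"))
            .orderBy(col("Total"), ascending=False))

count_df.show(10)  # show() is an action that triggers the lazy transformations
spark.stop()

Submitted with spark-submit, as Chapter 2 describes, a script like this prints the aggregated counts per group; the same DataFrame code runs unchanged from a local shell or a cluster.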