
Learning Apache Drill. Query and Analyze



Learning Apache Drill. Query and Analyze - Description
Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3, and works with Hive metastores, distributed databases such as HBase and MongoDB, and relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data. With this book, you'll learn how Drill helps you analyze data more effectively to drive down time to insight.

- Use Drill to clean, prepare, and summarize delimited data for further analysis
- Query file types including logfiles, Parquet, JSON, and other complex formats
- Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
- Connect to Drill programmatically using a variety of languages
- Use Drill even with challenging or ambiguous file formats
- Perform sophisticated analysis by extending Drill's functionality with user-defined functions
- Facilitate data analysis for network security, image metadata, and machine learning

Table of contents:
Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Online Resources
Conventions Used in This Book
Using Code Examples
O'Reilly Safari
How to Contact Us
Acknowledgments
Special Thanks from Charles
Special Thanks from Paul
1. Introduction to Apache Drill
What Is Apache Drill?
Drill Is Versatile
Drill Is Easy to Use
Drill does not require you to define a schema
A Word About Drill's Performance
A Very Brief History of Big Data
Hadoop
Drill in the Big Data Ecosystem
Comparing Drill with Similar Tools
2. Installing and Running Drill
Preparing Your Machine for Drill
Special Configuration Instructions for Windows Installations
Installing Drill on Windows
Starting Drill on a Windows Machine
Installing Drill in Embedded Mode on macOS or Linux
Starting Drill on macOS or Linux in Embedded Mode
Installing Drill in Distributed Mode on macOS or Linux
Preparing Your Cluster for Drill
Starting Drill in Distributed Mode
Connecting to the Cluster
Conclusion
3. Overview of Apache Drill
The Apache Hadoop Ecosystem
Drill Is a Low-Latency Query Engine
Distributed Processing with HDFS
Elements of a Drill System
Drill Operation: The 30,000-Foot View
Drill Is a Query Engine, Not a Database
Drill Operation Overview
Drill Components
SQL Session State
Statement Preparation
Parsing and semantic analysis
Logical and physical plans
Distribution
Statement Execution
Data representation
Low-Latency Features
Long-lived Drillbits
Code generation
Network exchanges
Conclusion
4. Querying Delimited Data
Ways of Querying Data with Drill
Other Interfaces
Drill SQL Query Format
Choosing a Data Source
Defining a Workspace
Specifying a Default Data Source
Accessing Columns in a Query
Delimited Data with Column Headers
Table Functions
Querying Directories
Directory functions
Understanding Drill Data Types
Cleaning and Preparing Data Using String Manipulation Functions
Complex Data Conversion Functions
Reformatting numbers
Working with Dates and Times in Drill
Converting Strings to Dates
Reformatting Dates
Date Arithmetic and Manipulation
Date and Time Functions in Drill
Creating Views
Data Analysis Using Drill
Summarizing Data with Aggregate Functions
Other analytic functions: Window functions
Comparison of aggregate and window analytic functions
Common Problems in Querying Delimited Data
Spaces in Column Names
Illegal Characters in Column Headers
Reserved Words in Column Names
Conclusion
5. Analyzing Complex and Nested Data
Arrays and Maps
Arrays in Drill
Accessing Maps (Key-Value Pairs) in Drill
Querying Nested Data
Data types in JSON files
Formats of nested data
Querying record-oriented files
Using the FLATTEN() function to query split JSON files
Querying column-oriented JSON files with KVGEN()
Analyzing Log Files with Drill
Configuring Drill to Read HTTPD Web Server Logs
Querying Web Server Logs
Analyzing user agent strings
Analyzing URLs and query strings
Other Log Analysis with Drill
Conclusion
6. Connecting Drill to Data Sources
Querying Multiple Data Sources
Configuring a New Storage Plug-in
Connecting Drill to a Relational Database
Configuring Drill to query an RDBMS
Microsoft SQL Server
MySQL
Oracle
PostgreSQL
SQLite
Querying an RDBMS from Drill
Other uses of the Drill JDBC storage plug-in
Querying Data in Hadoop from Drill
Connecting to and Querying HBase from Drill
Querying data from HBase
Querying Hive Data from Drill
Connecting Drill to Hive
Connecting to Hive with a remote metastore
Connecting to and Querying Streaming Data with Drill and Kafka
Querying streaming data
Improving the performance of Kafka queries
Connecting to and Querying Kudu
Connecting to and Querying MongoDB from Drill
Connecting Drill to Cloud Storage
Querying data on Amazon S3
Getting access credentials for S3
Querying Minio datastores from Drill
Connecting to other cloud storage services
Querying Time Series Data from Drill and OpenTSDB
Special considerations for time series data
Conclusion
7. Connecting to Drill
Understanding Drill's Interfaces
JDBC and Drill
ODBC and Drill
Installing the ODBC driver
Configuring ODBC on Linux or macOS
Configuring ODBC on Windows
Drill's REST Interface
Connecting to Drill with Python
Using drillpy to Query Drill
Connecting to Drill Using pydrill
Other functionality of pydrill
Other Ways of Connecting to Drill from Python
Connecting to Drill Using R
Querying Drill from R Using sergeant
Accessing other functionality in R
Connecting to Drill Using Java
Querying Drill with PHP
Using the Connector
Querying Drill from PHP
Interacting with Drill from PHP
Querying Drill Using Node.js
Using Drill as a Data Source in BI Tools
Exploring Data with Apache Zeppelin and Drill
Configuring Zeppelin to query Drill
Querying Drill from a Zeppelin notebook
Adding interactivity in Zeppelin
Exploring Data with Apache Superset
Configuring Superset to work with Drill
Building a demonstration visualization using Drill and Superset
Conclusion
8. Data Engineering with Drill
Schema-on-Read
The SQL Relational Model
Data Life Cycle: Data Exploration to Production
Schema Inference
Data Source Inference
Storage Plug-ins
Storage Configurations
Workspaces
Querying Directories
Default Schema
File Type Inference
Format Plug-ins and Format Configuration
Format Inference
File Format Variations
Schema Inference Overview
Distributed File Scans
Schema Inference for Delimited Data
CSV with header
Explicit projection
TypeOf functions
Casts to specify types
CSV Summary
CSV without a header row
Explicit projection
Schema Inference for JSON
JSON column names
JSON scalar types
Ambiguous Numeric Schemas
Mixed string and number types
Missing values
Leading null values
Null versus missing values in JSON output
Aligning Schemas Across Files
JSON Objects
JSON Lists in Drill
JSON Summary
Using Drill with the Parquet File Format
Schema Evolution in Parquet
Partitioning Data Directories
Defining a Table Workspace
Working with Queries in Production
Capturing Schema Mapping in Views
Running Challenging Queries in Scripts
Conclusion
9. Deploying Drill in Production
Installing Drill
Prerequisites
Production Installation
Creating a Site Directory
Configuring ZooKeeper
Advanced ZooKeeper configuration
Configuring Memory
Configuring Logging
Testing the Installation
Distributing Drill Binaries and Configuration
Installing clush
Distributing Drill files
Starting the Drill Cluster
Configuring Storage
Working with Apache Hadoop HDFS
Simple HDFS integration
Full HDFS integration
Working with Amazon S3
Access keys with Hadoop
Standalone Drill
Distributing the configuration
Defining the Amazon S3 storage configuration
Troubleshooting
Admission Control
Additional Configuration
User-Defined Functions and Custom Plug-ins
Security
Logging Levels
Controlling CPU Usage
Monitoring
Monitoring the Drill Process
Monitoring JMX Metrics
Monitoring Queries
Other Deployment Options
MapR Installer
Drill-on-YARN
Docker
Conclusion
10. Setting Up Your Development Environment
Installing Maven
Creating the Drill Build Environment
Setting Up Git and Getting the Source Code
Building Drill from Source
Installing the IDE
Conclusion
11. Writing Drill User-Defined Functions
Use Case: Finding and Filtering Valid Credit Card Numbers
How User-Defined Functions Work in Drill
Structure of a Simple Drill UDF
The pom.xml File
Including dependencies
The Function File
Defining input parameters
Setting the output value
Accessing data in holder objects
The Simple Function API
Putting It All Together
Building and Installing Your UDF
Statically Installing a UDF
Dynamically Installing a UDF
Complex Functions: UDFs That Return Maps or Arrays
Example: Extracting User Agent Metadata
The ComplexWriter
Writing Aggregate User-Defined Functions
The Aggregate Function API
Example Aggregate UDF: Kendall's Rank Correlation Coefficient
Conclusion
12. Writing a Format Plug-in
The Example Regex Format Plug-in
Creating the Easy Format Plug-in
Creating the Maven pom.xml File
Creating the Plug-in Package
Drill Module Configuration
Format Plug-in Configuration
Cautions Before Getting Started
Creating the Regex Plug-in Configuration Class
Copyright Headers and Code Format
Testing the Configuration
Fixing Configuration Problems
Troubleshooting
Creating the Format Plug-in Class
Creating a Test File
Configuring RAT
Efficient Debugging
Creating the Unit Test
How Drill Finds Your Plug-in
The Record Reader
Testing the Reader Shell
Logging
Error Handling
Setup
Regex Parsing
Defining Column Names
Projection
Column Projection Accounting
Project None
Project All
Project Some
Opening the File
Record Batches
Drill's Columnar Structure
Defining Vectors
Reading Data
Loading Data into Vectors
Releasing Resources
Testing the Reader
Testing the Wildcard Case
Testing Explicit Projection
Testing Empty Projection
Scaling Up
Additional Details
File Chunks
Default Format Configuration
Next Steps
Production Build
Contributing to Drill: The Pull Request
Maintaining Your Branch
Create a Plug-In Project
Conclusion
13. Unique Uses of Drill
Finding Photos Taken Within a Geographic Region
Drilling Excel Files
The pom.xml File
The Excel Custom Record Reader
Using the Excel Format Plug-in
Network Packet Analysis (PCAP) with Drill
Examples of Queries Using PCAP Data Files
Automating the process using an aggregate function
Analyzing Twitter Data with Drill
Using Drill in a Machine Learning Pipeline
Making Predictions Within Drill
Building and Serializing a Model
Writing the UDF Wrapper
Making Predictions Using the UDF
Conclusion
A. List of Drill Functions
Aggregate and Window Functions
Window Functions
Cryptological and Hashing Functions
Data Conversion Functions
Geospatial Functions
Math and Trigonometric Functions
Networking Functions
Null Handling Functions
String Manipulation Functions
Approximate String Matching Functions
Phonetic Functions
String Distance Functions
B. Drill Formatting Strings
Index