

Pentaho 3.2 Data Integration: Beginner's Guide. Explore, transform, validate, and integrate your data with ease



Pentaho 3.2 Data Integration: Beginner's Guide. Explore, transform, validate, and integrate your data with ease - Description
Pentaho Data Integration (a.k.a. Kettle) is a full-featured open source ETL (Extract, Transform, and Load) solution. Although PDI is a feature-rich tool, effectively capturing, manipulating, cleansing, transferring, and loading data can get complicated.

This book is full of practical examples that will help you take advantage of Pentaho Data Integration's graphical, drag-and-drop design environment. You will quickly get started with Pentaho Data Integration by following the step-by-step guidance in this book. The useful tips in this book will encourage you to exploit powerful features of Pentaho Data Integration and perform ETL operations with ease.

Starting with the installation of the PDI software, this book will teach you all the key PDI concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to work with plain files and to do all kinds of data manipulation. Then, the book gives you a primer on databases and teaches you how to work with databases inside PDI. Not only that, you'll be given an introduction to data warehouse concepts and you will learn to load data into a data warehouse. After that, you will learn to implement simple and complex processes.

Once you've learned all the basics, you will build a simple datamart that will serve to reinforce all the concepts learned through the book.

Table of contents:
Pentaho 3.2 Data Integration Beginner's Guide
Table of Contents
Pentaho 3.2 Data Integration Beginner's Guide
Credits
Foreword
The Kettle Project
About the Author
About the Reviewers
Preface
How to read this book
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Getting Started with Pentaho Data Integration
Pentaho Data Integration and Pentaho BI Suite
Exploring the Pentaho Demo
Pentaho Data Integration
Using PDI in real world scenarios
Loading datawarehouses or datamarts
Integrating data
Data cleansing
Migrating information
Exporting data
Integrating PDI using Pentaho BI
Pop quiz PDI data sources
Installing PDI
Time for action installing PDI
What just happened?
Pop quiz PDI prerequisites
Launching the PDI graphical designer: Spoon
Time for action starting and customizing Spoon
What just happened?
Spoon
Setting preferences in the Options window
Storing transformations and jobs in a repository
Creating your first transformation
Time for action creating a hello world transformation
What just happened?
Directing the Kettle engine with transformations
Exploring the Spoon interface
Viewing the transformation structure
Running and previewing the transformation
Time for action running and previewing the hello_world transformation
What just happened?
Previewing the results in the Execution Results window
Pop quiz PDI basics
Installing MySQL
Time for action installing MySQL on Windows
What just happened?
Time for action installing MySQL on Ubuntu
What just happened?
Summary
2. Getting Started with Transformations
Reading data from files
Time for action reading results of football matches from files
What just happened?
Input files
Input steps
Reading several files at once
Time for action reading all your files at a time using a single Text file input step
What just happened?
Time for action reading all your files at a time using a single Text file input step and regular expressions
What just happened?
Regular expressions
Troubleshooting reading files
Grids
Have a go hero explore your own files
Sending data to files
Time for action sending the results of matches to a plain file
What just happened?
Output files
Output steps
Some data definitions
Rowset
Streams
The Select values step
Have a go hero extending your transformations by writing output files
Getting system information
Time for action updating a file with news about examinations
What just happened?
Getting information by using Get System Info step
Data types
Date fields
Numeric fields
Running transformations from a terminal window
Time for action running the examination transformation from a terminal window
What just happened?
Have a go hero using different date formats
Go for a hero formatting 99.55
Pop quiz formatting data
XML files
Time for action getting data from an XML file with information about countries
What just happened?
What is XML
PDI transformation files
Getting data from XML files
XPath
Configuring the Get data from XML step
Kettle variables
How and when you can use variables
Have a go hero exploring XML files
Have a go hero enhancing the output countries file
Have a go hero documenting your work
Summary
3. Basic Data Manipulation
Basic calculations
Time for action reviewing examinations by using the Calculator step
What just happened?
Adding or modifying fields by using different PDI steps
The Calculator step
The Formula step
Time for action reviewing examinations by using the Formula step
What just happened?
Have a go hero listing students and their examinations results
Pop quiz concatenating strings
Calculations on groups of rows
Time for action calculating World Cup statistics by grouping data
What just happened?
Group by step
Have a go hero calculating statistics for the examinations
Have a go hero listing the languages spoken by country
Filtering
Time for action counting frequent words by filtering
What just happened?
Filtering rows using the Filter rows step
Have a go hero playing with filters
Have a go hero counting words and discarding those that are commonly used
Looking up data
Time for action finding out which language people speak
What just happened?
The Stream lookup step
Have a go hero counting words more precisely
Summary
4. Controlling the Flow of Data
Splitting streams
Time for action browsing new PDI features by copying a dataset
What just happened?
Copying rows
Have a go hero recalculating statistics
Distributing rows
Time for action assigning tasks by distributing
What just happened?
Pop quiz data movement (copying and distributing)
Splitting the stream based on conditions
Time for action assigning tasks by filtering priorities with the Filter rows step
What just happened?
PDI steps for splitting the stream based on conditions
Time for action assigning tasks by filtering priorities with the Switch / Case step
What just happened?
Have a go hero listing languages and countries
Pop quiz splitting a stream
Merging streams
Time for action gathering progress and merging all together
What just happened?
PDI options for merging streams
Time for action giving priority to Bouchard by using Append Stream
What just happened?
Have a go hero sorting and merging all tasks
Have a go hero trying to find missing countries
Summary
5. Transforming Your Data with JavaScript Code and the JavaScript Step
Doing simple tasks with the JavaScript step
Time for action calculating scores with JavaScript
What just happened?
Using the JavaScript language in PDI
Inserting JavaScript code using the Modified Java Script Value step
Adding fields
Modifying fields
Turning on the compatibility switch
Have a go hero adding and modifying fields to the contest data
Testing your code
Time for action testing the calculation of averages
What just happened?
Testing the script using the Test script button
Have a go hero testing the new calculation of the average
Enriching the code
Time for action calculating flexible scores by using variables
What just happened?
Using named parameters
Using the special Start, Main, and End scripts
Using transformation predefined constants
Pop quiz finding the 7 errors
Have a go hero keeping the top 10 performances
Have a go hero calculating scores with Java code
Reading and parsing unstructured files
Time for action changing a list of house descriptions with JavaScript
What just happened?
Looking at previous rows
Have a go hero enhancing the houses file
Have a go hero fill gaps in the contest file
Avoiding coding by using purpose-built steps
Have a go hero creating alternative solutions
Summary
6. Transforming the Row Set
Converting rows to columns
Time for action enhancing a films file by converting rows to columns
What just happened?
Converting row data to column data by using the Row denormalizer step
Have a go hero houses revisited
Aggregating data with a Row denormalizer step
Time for action calculating total scores by performances by country
What just happened?
Using Row denormalizer for aggregating data
Have a go hero calculating scores by skill by continent
Normalizing data
Time for action enhancing the matches file by normalizing the dataset
What just happened?
Modifying the dataset with a Row Normalizer step
Summarizing the PDI steps that operate on sets of rows
Have a go hero verifying the benefits of normalization
Have a go hero normalizing the Films file
Have a go hero calculating scores by judge
Generating a custom time dimension dataset by using Kettle variables
Time for action creating the time dimension dataset
What just happened?
Getting variables
Time for action getting variables for setting the default starting date
What just happened?
Using the Get Variables step
Have a go hero enhancing the time dimension
Pop quiz using Kettle variables inside transformations
Summary
7. Validating Data and Handling Errors
Capturing errors
Time for action capturing errors while calculating the age of a film
What just happened?
Using PDI error handling functionality
Aborting a transformation
Time for action aborting when there are too many errors
What just happened?
Aborting a transformation using the Abort step
Fixing captured errors
Time for action treating errors that may appear
What just happened?
Treating rows coming to the error stream
Pop quiz PDI error handling
Have a go hero capturing errors while seeing who wins
Avoiding unexpected errors by validating data
Time for action validating genres with a Regex Evaluation step
What just happened?
Validating data
Time for action checking films file with the Data Validator
What just happened?
Defining simple validation rules using the Data Validator
Have a go hero validating the football matches file
Cleansing data
Have a go hero cleansing films data
Summary
8. Working with Databases
Introducing the Steel Wheels sample database
Connecting to the Steel Wheels database
Time for action creating a connection with the Steel Wheels database
What just happened?
Connecting with Relational Database Management Systems
Pop quiz defining database connections
Have a go hero connecting to your own databases
Exploring the Steel Wheels database
Time for action exploring the sample database
What just happened?
A brief word about SQL
Exploring any configured database with the PDI Database explorer
Have a go hero exploring the sample data in depth
Have a go hero exploring your own databases
Querying a database
Time for action getting data about shipped orders
What just happened?
Getting data from the database with the Table input step
Using the SELECT statement for generating a new dataset
Making flexible queries by using parameters
Time for action getting orders in a range of dates by using parameters
What just happened?
Adding parameters to your queries
Making flexible queries by using Kettle variables
Time for action getting orders in a range of dates by using variables
What just happened?
Using Kettle variables in your queries
Pop quiz database datatypes versus PDI datatypes
Have a go hero querying the sample data
Sending data to a database
Time for action loading a table with a list of manufacturers
What just happened?
Inserting new data into a database table with the Table output step
Inserting or updating data by using other PDI steps
Time for action inserting new products or updating existent ones
What just happened?
Time for action testing the update of existing products
What just happened?
Inserting or updating data with the Insert/Update step
Have a go hero populating a films database
Have a go hero creating the time dimension
Have a go hero populating the products table
Pop quiz Insert/Update step versus Table Output/Update steps
Pop quiz filtering the first 10 rows
Eliminating data from a database
Time for action deleting data about discontinued items
What just happened?
Deleting records of a database table with the Delete step
Have a go hero deleting old orders
Summary
9. Performing Advanced Operations with Databases
Preparing the environment
Time for action populating the Jigsaw database
What just happened?
Exploring the Jigsaw database model
Looking up data in a database
Doing simple lookups
Time for action using a Database lookup step to create a list of products to buy
What just happened?
Looking up values in a database with the Database lookup step
Have a go hero preparing the delivery of the products
Have a go hero refining the transformation
Doing complex lookups
Time for action using a Database join step to create a list of suggested products to buy
What just happened?
Joining data from the database to the stream data by using a Database join step
Have a go hero rebuilding the list of customers
Introducing dimensional modeling
Loading dimensions with data
Time for action loading a region dimension with a Combination lookup/update step
What just happened?
Time for action testing the transformation that loads the region dimension
What just happened?
Describing data with dimensions
Loading Type I SCD with a Combination lookup/update step
Have a go hero adding regions to the Region Dimension
Have a go hero loading the manufacturers dimension
Have a go hero loading a mini-dimension
Keeping a history of changes
Time for action keeping a history of product changes with the Dimension lookup/update step
What just happened?
Time for action testing the transformation that keeps a history of product changes
What just happened?
Keeping an entire history of data with a Type II slowly changing dimension
Loading Type II SCDs with the Dimension lookup/update step
Have a go hero keeping a history just for the theme of a product
Have a go hero loading a Type II SCD dimension
Pop quiz loading slowly changing dimensions
Pop quiz loading type III slowly changing dimensions
Summary
10. Creating Basic Task Flows
Introducing PDI jobs
Time for action creating a simple hello world job
What just happened?
Executing processes with PDI jobs
Using Spoon to design and run jobs
Using the transformation job entry
Pop quiz defining PDI jobs
Have a go hero loading the dimension tables
Receiving arguments and parameters in a job
Time for action customizing the hello world file with arguments and parameters
What just happened?
Using named parameters in jobs
Have a go hero backing up your work
Running jobs from a terminal window
Time for action executing the hello world job from a terminal window
What just happened?
Have a go hero experiencing Kitchen
Using named parameters and command-line arguments in transformations
Time for action calling the hello world transformation with fixed arguments and parameters
What just happened?
Have a go hero saying hello again and again
Have a go hero loading the time dimension from a job
Deciding between the use of a command-line argument and a named parameter
Have a go hero analysing the use of arguments and named parameters
Running job entries under conditions
Time for action sending a sales report and warning the administrator if something is wrong
What just happened?
Changing the flow of execution on the basis of conditions
Have a go hero refining the sales report
Creating and using a file results list
Have a go hero sharing your work
Summary
11. Creating Advanced Transformations and Jobs
Enhancing your processes with the use of variables
Time for action updating a file with news about examinations by setting a variable with the name of the file
What just happened?
Setting variables inside a transformation
Have a go hero enhancing the examination tutorial even more
Have a go hero enhancing the jigsaw database update process
Have a go hero executing the proper jigsaw database update process
Enhancing the design of your processes
Time for action generating files with top scores
What just happened?
Pop quiz using the Add Sequence step
Reusing part of your transformations
Time for action calculating the top scores with a subtransformation
What just happened?
Creating and using subtransformations
Have a go hero refining the subtransformation
Have a go hero counting words more precisely (second version)
Creating a job as a process flow
Time for action splitting the generation of top scores by copying and getting rows
What just happened?
Transferring data between transformations by using the copy/get rows mechanism
Have a go hero modifying the flow
Nesting jobs
Time for action generating the files with top scores by nesting jobs
What just happened?
Running a job inside another job with a job entry
Understanding the scope of variables
Pop quiz deciding the scope of variables
Iterating jobs and transformations
Time for action generating custom files by executing a transformation for every input row
What just happened?
Executing for each row
Have a go hero processing several files at once
Have a go hero building lists of products to buy
Have a go hero e-mail students to let them know how they did
Summary
12. Developing and Implementing a Simple Datamart
Exploring the sales datamart
Deciding the level of granularity
Loading the dimensions
Time for action loading dimensions for the sales datamart
What just happened?
Extending the sales datamart model
Have a go hero loading the dimensions for the puzzles star model
Loading a fact table with aggregated data
Time for action loading the sales fact table by looking up dimensions
What just happened?
Getting the information from the source with SQL queries
Translating the business keys into surrogate keys
Obtaining the surrogate key for a Type I SCD
Obtaining the surrogate key for a Type II SCD
Obtaining the surrogate key for the Junk dimension
Obtaining the surrogate key for the Time dimension
Pop quiz modifying a star model and loading the star with PDI
Have a go hero loading a puzzles fact table
Getting facts and dimensions together
Time for action loading the fact table using a range of dates obtained from the command line
What just happened?
Time for action loading the sales star
What just happened?
Have a go hero enhancing the loading process of the sales fact table
Have a go hero loading the puzzles sales star
Have a go hero loading the facts once a month
Getting rid of administrative tasks
Time for action automating the loading of the sales datamart
What just happened?
Have a go hero creating a backup of your work automatically
Have a go hero enhancing the automated process by sending an e-mail if an error occurs
Summary
13. Taking it Further
PDI best practices
Getting the most out of PDI
Extending Kettle with plugins
Have a go hero listing the top 10 students by using the Head plugin step
Overcoming real world risks with some remote execution
Scaling out to overcome bigger risks
Pop quiz remote execution and clustering
Integrating PDI and the Pentaho BI suite
PDI as a process action
PDI as a datasource
More about the Pentaho suite
PDI Enterprise Edition and Kettle Developer Support
Summary
A. Working with Repositories
Creating a repository
Time for action creating a PDI repository
What just happened?
Creating repositories to store your transformations and jobs
Working with the repository storage system
Time for action logging into a repository
What just happened?
Logging into a repository by using credentials
Defining repository user accounts
Creating transformations and jobs in repository folders
Creating database connections, partitions, servers, and clusters
Backing up and restoring a repository
Examining and modifying the contents of a repository with the Repository explorer
Migrating from a file-based system to a repository-based system and vice-versa
Summary
B. Pan and Kitchen: Launching Transformations and Jobs from the Command Line
Running transformations and jobs stored in files
Running transformations and jobs from a repository
Specifying command line options
Checking the exit code
Providing options when running Pan and Kitchen
Log details
Named parameters
Arguments
Variables
C. Quick Reference: Steps and Job Entries
Transformation steps
Job entries
D. Spoon Shortcuts
General shortcuts
Designing transformations and jobs
Grids
Repositories
E. Introducing PDI 4 Features
Agile BI
Visual improvements for designing transformations and jobs
Experiencing the mouse-over assistance
Time for action creating a hop with the mouse-over assistance
What just happened?
Using the mouse-over assistance toolbar
Experiencing the sniff-testing feature
Experiencing the job drill-down feature
Experiencing even more visual changes
Enterprise features
Summary
F. Pop Quiz Answers
Chapter 1
PDI data sources
PDI prerequisites
PDI basics
Chapter 2
formatting data
Chapter 3
concatenating strings
Chapter 4
data movement (copying and distributing)
splitting a stream
Chapter 5
finding the seven errors
Chapter 6
using Kettle variables inside transformations
Chapter 7
PDI error handling
Chapter 8
defining database connections
database datatypes versus PDI datatypes
Insert/Update step versus Table Output/Update steps
filtering the first 10 rows
Chapter 9
loading slowly changing dimensions
loading type III slowly changing dimensions
Chapter 10
defining PDI jobs
Chapter 11
using the Add sequence step
deciding the scope of variables
Chapter 12
modifying a star model and loading the star with PDI
Chapter 13
remote execution and clustering
Index