Getting Started with Kudu: Perform Fast Analytics on Fast Data

Fast data ingestion, serving, and analytics in the Hadoop ecosystem have forced developers and architects to choose solutions using the least common denominator—either fast analytics at the cost of slow data ingestion or fast data ingestion at the cost of slow analytics. There is an answer to this problem. With the Apache Kudu column-oriented data store, you can easily perform fast analytics on fast data. This practical guide shows you how.

Begun as an internal project at Cloudera, Kudu is an open source solution compatible with many data processing frameworks in the Hadoop environment. In this book, current and former solutions professionals from Cloudera provide use cases, examples, best practices, and sample code to help you get up to speed with Kudu.

  • Explore Kudu’s high-level design, including how it spreads data across servers
  • Fully administer a Kudu cluster, enable security, and add or remove nodes
  • Learn Kudu’s client-side APIs, including how to integrate Apache Impala, Spark, and other frameworks for data manipulation
  • Examine Kudu’s schema design, including basic concepts and primitives necessary to make your project successful
  • Explore case studies for using Kudu for real-time IoT analytics, predictive modeling, and in combination with another storage engine

Product Details

ISBN-13: 9781491980200
Publisher: O'Reilly Media, Incorporated
Publication date: 07/09/2018
Sold by: Barnes & Noble
Format: eBook
Pages: 156
File size: 5 MB

About the Authors

Jean-Marc Spaggiari, an early adopter of Kudu, works as a Principal Solutions Architect for Cloudera to support Hadoop, Kudu, HBase and other tools through technical support and consulting work. His deep knowledge of HBase and HDFS allows him to better understand Kudu and its applications.

Jean-Marc’s primary role is to support HBase users with their cluster deployments, upgrades, configuration, and optimization, as well as with HBase-related application development. He is also a very active HBase community member, testing every release for performance and stability. However, with Kudu poised to quickly penetrate the market, he will also begin recommending it, building demo applications, and deploying proofs of concept around it.

Prior to Cloudera, Jean-Marc worked as a Project Manager and as a Solutions Architect for CGI and insurance companies. He has almost 20 years of Java development experience. In addition to regularly attending Strata+Hadoop World and HBaseCon, he has spoken at various Hadoop User Group meetings and many conferences in North America, usually focusing on HBase-related presentations and demonstrations. Jean-Marc is also the author of Architecting HBase Applications (O'Reilly).


Mladen Kovacevic comes from a development background in RDBMS technology and sees Kudu as a game changer in the Hadoop ecosystem. He has presented Kudu at several local meetups and spoke on the state of Spark on Kudu during its beta, providing feedback early enough to ensure that Spark with Kudu was a first-class citizen at its launch. He is a contributor to the Apache Kudu and Kite SDK projects, and works as a Solutions Architect at Cloudera. Mladen’s experience includes years of RDBMS engine development, systems optimization, performance, and architecture, including optimizing Hadoop on the Power 8 platform while developing IBM’s Big SQL technology.


Brock Noland began following Kudu months before the first line of code was written, by tracking Todd Lipcon’s paper-reading habits. Brock is Chief Architect of phData, a pure-play Hadoop Managed Service Provider. Prior to founding phData, Brock spent four years at Cloudera as a Trainer, Solution Architect, Engineer, Sales Engineer, and Engineering Manager. Brock is a co-founder of Apache Sentry and an Apache Project Committee Member on Apache Hive, Parquet, Crunch, Flume, and Incubator. Brock was a mentor to Kudu in the incubator and currently mentors Apache Impala (incubating). In addition, he is a member of the Apache Software Foundation.

Brock is a frequent public speaker, having spoken at dozens of conferences, including HBaseCon, and at numerous Hadoop User Groups.


Ryan Bosshart is a Principal Systems Engineer at Cloudera. Ryan has spent the last 10 years building and architecting distributed systems. At Cloudera, Ryan leads the field storage specialization team, where he focuses on Apache HDFS, HBase, and Kudu. He has worked with many early users of Kudu to build their relational, time-series, IoT, or real-time architectures. He has seen firsthand Kudu’s ability to improve performance and simplify architectures. Ryan is a co-chair of the Twin Cities Spark and Hadoop User Group and the author of the training video Getting Started with Kudu (O'Reilly).

Table of Contents

Preface

1 Why Kudu?
Why Does Kudu Matter?
Simplicity Drives Adoption
New Use Cases
IoT
Current Approaches to Real-Time Analytics
Real-Time Processing
Hardware Landscape
Kudu's Unique Place in the Big Data Ecosystem
Comparing Kudu with Other Ecosystem Components
Big Data-HDFS, HBase, Cassandra
Conclusion

2 About Kudu
Kudu High-Level Design
Kudu Roles
Master Server
Tablet Server
Kudu Concepts and Mechanisms
Hotspotting
Partitioning

3 Getting Up and Running
Installation
Apache Kudu Quickstart VM
Using Cloudera Manager
Building from Source
Packages
Cloudera Quickstart VM
Quick Install: Three Minutes or Less
Conclusion

4 Kudu Administration
Planning for Kudu
Master and Tablet Servers
Write-Ahead Log
Data Servers and Storage
Replication Strategies
Deployment Considerations: New or Existing Clusters?
New Kudu-Only Cluster
New Hadoop Cluster with Kudu
Add Kudu to Existing Hadoop Cluster
Web UI of Tablet and Master Servers
Master Server UI and Tablet Server UI
Master Server UI
Tablet Server UI
The Kudu Command-Line Interface
Cluster
Filesystem
Tablet Replica
Consensus Metadata
Adding and Removing Tablet Servers
Adding Tablet Servers
Removing a Tablet Server
Security
A Simple Analogy
Kudu Security Features
Basic Performance Tuning
Kudu Memory Limits
Maintenance Manager Threads
Monitoring Performance
Getting Ahead and Staying Out of Trouble
Avoid Running Out of Disk Space
Disk Failures Tolerance
Backup
Conclusion

5 Common Developer Tasks for Kudu
Client API
Kudu Client
Kudu Table
Kudu DDL
Kudu Scanner Read Modes
C++ API
Python API
Preparing the Python Development Environment
Python Kudu Application
Java
Java Application
Spark
Impala with Kudu

6 Table and Schema Design
Schema Design Basics
Schema for Hybrid Transactional/Analytical Processing
Lambda Architecture
OLTP/OLAP Split
Primary Key and Column Design
Other Column Schema Considerations
Partitioning Basics
Range Partitioning
Hash Partitioning
Schema Alteration
Best Practices and Tips
Partitioning
Large Objects
Decimal
Unique Strings
Compression
Object Names
Number of Columns
Binary Types
Network Packet Example
Conclusion

7 Kudu Use Cases
Real-Time Internet of Things Analytics
Predictive Modeling
Mixed Platforms Solution

Index
