MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems
Paperback
Overview
Until now, design patterns for the MapReduce framework have been scattered among various research papers, blogs, and books. This handy guide brings together a unique collection of valuable MapReduce patterns that will save you time and effort regardless of the domain, language, or development framework you're using.
Each pattern is explained in context, with pitfalls and caveats clearly identified to help you avoid common design mistakes when modeling your big data architecture. This book also provides a complete overview of MapReduce that explains its origins and implementations, and why design patterns are so important. All code examples are written for Hadoop.
- Summarization patterns: get a top-level view by summarizing and grouping data
- Filtering patterns: view data subsets such as records generated from one user
- Data organization patterns: reorganize data to work with other systems, or to make MapReduce analysis easier
- Join patterns: analyze different datasets together to discover interesting relationships
- Metapatterns: piece together several patterns to solve multi-stage problems, or to perform several analytics in the same job
- Input and output patterns: customize the way you use Hadoop to load or store data
"A clear exposition of MapReduce programs for common data processing patterns--this book is indispensable for anyone using Hadoop." --Tom White, author of Hadoop: The Definitive Guide
Product Details
ISBN-13: | 9781449327170 |
---|---|
Publisher: | O'Reilly Media, Incorporated |
Publication date: | 12/22/2012 |
Pages: | 247 |
Product dimensions: | 6.90(w) x 9.10(h) x 0.60(d) |
About the Author
Donald Miner serves as a Solutions Architect at EMC Greenplum, advising customers and helping them implement and use Greenplum's big data systems. Prior to working with Greenplum, Dr. Miner architected several large-scale and mission-critical Hadoop deployments with the U.S. Government as a contractor. He is also involved in teaching, having previously instructed industry classes on Hadoop and a variety of artificial intelligence courses at the University of Maryland, Baltimore County (UMBC). Dr. Miner received his PhD in Computer Science from UMBC, where his dissertation focused on Machine Learning and Multi-Agent Systems.
Adam Shook is a Software Engineer at ClearEdge IT Solutions, LLC, working with a number of big data technologies such as Hadoop, Accumulo, Pig, and ZooKeeper. Shook graduated with a B.S. in Computer Science from the University of Maryland Baltimore County (UMBC) and took a job building a new high-performance graphics engine for a game studio. Seeking new challenges, he enrolled in the graduate program at UMBC with a focus on distributed computing technologies. He quickly found development work as a U.S. government contractor on a large-scale Hadoop deployment. Shook is involved in developing and instructing training curriculum for both Hadoop and Pig. He spends what little free time he has working on side projects and playing video games.
Table of Contents
Preface
1 Design Patterns and MapReduce
Design Patterns
MapReduce History
MapReduce and Hadoop Refresher
Hadoop Example: Word Count
Pig and Hive
2 Summarization Patterns
Numerical Summarizations
Pattern Description
Numerical Summarization Examples
Inverted Index Summarizations
Pattern Description
Inverted Index Example
Counting with Counters
Pattern Description
Counting with Counters Example
3 Filtering Patterns
Filtering
Pattern Description
Filtering Examples
Bloom Filtering
Pattern Description
Bloom Filtering Examples
Top Ten
Pattern Description
Top Ten Examples
Distinct
Pattern Description
Distinct Examples
4 Data Organization Patterns
Structured to Hierarchical
Pattern Description
Structured to Hierarchical Examples
Partitioning
Pattern Description
Partitioning Examples
Binning
Pattern Description
Binning Examples
Total Order Sorting
Pattern Description
Total Order Sorting Examples
Shuffling
Pattern Description
Shuffle Examples
5 Join Patterns
A Refresher on Joins
Reduce Side Join
Pattern Description
Reduce Side Join Example
Reduce Side Join with Bloom Filter
Replicated Join
Pattern Description
Replicated Join Examples
Composite Join
Pattern Description
Composite Join Examples
Cartesian Product
Pattern Description
Cartesian Product Examples
6 Metapatterns
Job Chaining
With the Driver
Job Chaining Examples
With Shell Scripting
With JobControl
Chain Folding
The ChainMapper and ChainReducer Approach
Chain Folding Example
Job Merging
Job Merging Examples
7 Input and Output Patterns
Customizing Input and Output in Hadoop
InputFormat
RecordReader
OutputFormat
RecordWriter
Generating Data
Pattern Description
Generating Data Examples
External Source Output
Pattern Description
External Source Output Example
External Source Input
Pattern Description
External Source Input Example
Partition Pruning
Pattern Description
Partition Pruning Examples
8 Final Thoughts and the Future of Design Patterns
Trends in the Nature of Data
Images, Audio, and Video
Streaming Data
The Effects of Yarn
Patterns as a Library or Component
How You Can Help
A. Bloom Filters
Index