Spark GraphX in Action

Summary

Spark GraphX in Action starts out with an overview of Apache Spark and the GraphX graph processing API. This example-based tutorial then teaches you how to configure GraphX and how to use it interactively. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

Purchase of the print book includes a free eBook in PDF, Kindle, and ePub formats from Manning Publications.

About the Technology

GraphX is a graph processing API for the Apache Spark analytics engine that lets you draw insights from large datasets. GraphX gives you the speed and capacity to run massively parallel graph algorithms and machine learning algorithms on that data.

About the Book

Spark GraphX in Action begins with the big picture of what graphs can be used for. This example-based tutorial teaches you how to use GraphX interactively. You'll start with a crystal-clear introduction to building big data graphs from regular data, and then explore the problems and possibilities of implementing graph algorithms and architecting graph processing pipelines. Along the way, you'll collect practical techniques for enhancing applications and applying machine learning algorithms to graph data.

What's Inside

Understanding graph technology
Using the GraphX API
Developing algorithms for big graphs
Machine learning with graphs
Graph visualization

About the Reader

Readers should be comfortable writing code. Experience with Apache Spark and Scala is not required.

About the Authors

Michael Malak has worked on Spark applications for Fortune 500 companies since early 2013. Robin East has worked as a consultant to large organizations for over 15 years and is a data scientist at Worldpay.

Table of Contents

Two important technologies: Spark and graphs
GraphX quick start
Some fundamentals
GraphX Basics
Built-in algorithms
Other useful graph algorithms
Machine learning
The missing algorithms
Performance and monitoring
Other languages and tools

Product Details

ISBN-13:	9781638353300
Publisher:	Manning
Publication date:	06/12/2016
Sold by:	SIMON & SCHUSTER
Format:	eBook
Pages:	280
File size:	6 MB

Preface

Acknowledgments

About this book

About the cover illustration

Part 1 Spark and Graphs

1 Two important technologies: Spark and graphs

1.1 Spark: the step beyond Hadoop MapReduce

The elusive definition of Big Data

Hadoop: the world before Spark

Spark: in-memory MapReduce processing

1.2 Graphs: finding meaning from relationships

Uses of graphs

Types of graph data

Plain RDBMS inadequate for graphs

1.3 Putting them together for lightning fast graph processing: Spark GraphX

Property graph: adding richness

Graph partitioning: graphs meet Big Data

GraphX lets you choose: graph parallel or data parallel

Various ways GraphX fits into a processing flow

GraphX vs. other systems

Storing the graphs: distributed file storage vs. graph database

1.4 Summary

2 GraphX quick start

2.1 Getting set up and getting data

2.2 Interactive GraphX querying using the Spark Shell

2.3 PageRank example

2.4 Summary

3 Some fundamentals

3.1 Scala, the native language of Spark

Scala's philosophy: conciseness and expressiveness

Functional programming

Inferred typing

Class declaration

Map and reduce

Everything is a function

Java interoperability

3.2 Spark

Distributed in-memory data: RDDs

Laziness

Cluster requirements and terminology

Serialization

Common RDD operations

Hello World with Spark and sbt

3.3 Graph terminology

Basics

RDF graphs vs. property graphs

Adjacency matrix

Graph querying systems

3.4 Summary

Part 2 Connecting Vertices

4 GraphX Basics

4.1 Vertex and edge classes

4.2 Mapping operations

Simple graph transformation

Map/Reduce

Iterated Map/Reduce

4.3 Serialization/deserialization

Reading/writing binary format

JSON format

GEXF format for Gephi visualization software

4.4 Graph generation

Deterministic graphs

Random graphs

4.5 Pregel API

4.6 Summary

5 Built-in algorithms

5.1 Seek out authoritative nodes: PageRank

PageRank algorithm explained

Invoking PageRank in GraphX

Personalized PageRank

5.2 Measuring connectedness: Triangle Count

Uses of Triangle Count

Slashdot friends and foes example

5.3 Find the fewest hops: ShortestPaths

5.4 Finding isolated populations: Connected Components

Predicting social circles

Reciprocated love only, please: Strongly Connected Components

5.5 Community detection: LabelPropagation

5.6 Summary

6 Other useful graph algorithms

6.1 Your own GPS: Shortest Paths with Weights

6.2 Travelling Salesman: greedy algorithm

6.3 Route utilities: Minimum Spanning Trees

Deriving taxonomies with Word2Vec and Minimum Spanning Trees

6.4 Summary

7 Machine learning

7.1 Supervised, unsupervised, and semi-supervised learning

7.2 Recommend a movie: SVDPlusPlus

Explanation of the Koren formula

7.3 Using GraphX with MLlib

Determine topics: Latent Dirichlet Allocation

Detect spam: LogisticRegressionWithSGD

Image segmentation (for computer vision) using Power Iteration Clustering

7.4 Poor man's training data: graph-based semi-supervised learning

K-Nearest Neighbors graph construction

Semi-supervised learning label propagation

7.5 Summary

Part 3 Over the Arc

8 The missing algorithms

8.1 Missing basic graph operations

Common sense subgraphs

Merge two graphs

8.2 Reading RDF graph files

Matching vertices and constructing the graph

Improving performance with IndexedRDD, the RDD HashMap

8.3 Poor man's graph isomorphism: finding missing Wikipedia infobox items

8.4 Global clustering coefficient: compare connectedness

8.5 Summary

9 Performance and monitoring

9.1 Monitoring your Spark application

How Spark runs your application

Understanding your application runtime with Spark monitoring

History server

9.2 Configuring Spark

Utilizing all CPU cores

9.3 Spark performance tuning

Speeding up Spark with caching and persistence

Checkpointing

Reducing memory pressure with serialization

9.4 Graph partitioning

9.5 Summary

10 Other languages and tools

10.1 Using languages other than Scala with GraphX

Using GraphX with Java 7

Using GraphX with Java 8

Whether GraphX may gain Python or R bindings in the future

10.2 Another visualization tool: Apache Zeppelin plus d3.js

10.3 Almost a database: Spark Job Server

Example: Query Slashdot friends degree of separation
