Malware Data Science: Attack Detection and Attribution

Paperback

$49.95 

Overview

Malware Data Science explains how to identify, analyze, and classify large-scale malware using machine learning and data visualization.

Security has become a "big data" problem. The growth rate of malware has accelerated to tens of millions of new files per year while our networks generate an ever-larger flood of security-relevant data each day. In order to defend against these advanced attacks, you'll need to know how to think like a data scientist.

In Malware Data Science, security data scientist Joshua Saxe introduces machine learning, statistics, social network analysis, and data visualization, and shows you how to apply these methods to malware detection and analysis.

You'll learn how to:
- Analyze malware using static analysis
- Observe malware behavior using dynamic analysis
- Identify adversary groups through shared code analysis
- Catch 0-day vulnerabilities by building your own machine learning detector
- Measure malware detector accuracy
- Identify malware campaigns, trends, and relationships through data visualization

Whether you're a malware analyst looking to add skills to your existing arsenal, or a data scientist interested in attack detection and threat intelligence, Malware Data Science will help you stay ahead of the curve.

Product Details

ISBN-13: 9781593278595
Publisher: No Starch Press
Publication date: 09/25/2018
Pages: 272
Sales rank: 415,652
Product dimensions: 7.00(w) x 9.20(h) x 0.90(d)

About the Author

Joshua Saxe is Chief Data Scientist at major security vendor Sophos, where he leads a security data science research team. He's also a principal inventor of Sophos' neural network-based malware detector, which defends tens of millions of Sophos customers from malware infections. Before joining Sophos, Joshua spent five years leading DARPA-funded security data research projects for the US government.

Hillary Sanders leads the infrastructure data science team at Sophos, which develops the frameworks used to build Sophos' deep learning models. Before joining Sophos, Hillary created a recipe web app and spent three years as a data scientist at Premise Data Corporation.

Table of Contents

Foreword Anup Ghosh xvii

Acknowledgements xix

Introduction xxi

What Is Data Science? xxii

Why Data Science Matters for Security xxii

Applying Data Science to Malware xxiii

Who Should Read This Book? xxiv

About This Book xxiv

How to Use the Sample Code and Data xxv

1 Basic Static Malware Analysis 1

The Microsoft Windows Portable Executable Format 2

The PE Header 3

The Optional Header 3

Section Headers 4

Dissecting the PE Format Using pefile 5

Examining Malware Images 7

Examining Malware Strings 8

Using the strings Program 8

Analyzing Your strings Dump 9

Summary 10

2 Beyond Basic Static Analysis: x86 Disassembly 11

Disassembly Methods 12

Basics of x86 Assembly Language 12

CPU Registers 13

Arithmetic Instructions 15

Data Movement Instructions 15

Disassembling ircbot.exe Using pefile and capstone 20

Factors That Limit Static Analysis 21

Packing 21

Resource Obfuscation 22

Anti-disassembly Techniques 22

Dynamically Downloaded Data 22

Summary 23

3 A Brief Introduction to Dynamic Analysis 25

Why Use Dynamic Analysis? 26

Dynamic Analysis for Malware Data Science 26

Basic Tools for Dynamic Analysis 27

Typical Malware Behaviors 27

Loading a File on malwr.com 27

Analyzing Results on malwr.com 28

Limitations of Basic Dynamic Analysis 33

Summary 34

4 Identifying Attack Campaigns Using Malware Networks 35

Nodes and Edges 37

Bipartite Networks 37

Visualizing Malware Networks 39

The Distortion Problem 39

Force-Directed Algorithms 40

Building Networks with NetworkX 40

Adding Nodes and Edges 41

Adding Attributes 42

Saving Networks to Disk 42

Network Visualization with GraphViz 43

Using Parameters to Adjust Networks 44

The GraphViz Command Line Tools 44

Adding Visual Attributes to Nodes and Edges 48

Building Malware Networks 51

Building a Shared Image Relationship Network 54

Summary 58

5 Shared Code Analysis 59

Preparing Samples for Comparison by Extracting Features 62

How Bag of Features Models Work 62

What Are N-Grams? 63

Using the Jaccard Index to Quantify Similarity 64

Using Similarity Matrices to Evaluate Malware Shared Code Estimation Methods 66

Instruction Sequence-Based Similarity 67

Strings-Based Similarity 70

Import Address Table-Based Similarity 71

Dynamic API Call-Based Similarity 72

Building a Similarity Graph 73

Scaling Similarity Comparisons 77

Minhash in a Nutshell 77

Minhash in Depth 78

Building a Persistent Malware Similarity Search System 79

Running the Similarity Search System 85

Summary 87

6 Understanding Machine Learning-Based Malware Detectors 89

Steps for Building a Machine Learning-Based Detector 90

Gathering Training Examples 91

Extracting Features 91

Designing Good Features 92

Training Machine Learning Systems 92

Testing Machine Learning Systems 93

Understanding Feature Spaces and Decision Boundaries 93

What Makes Models Good or Bad: Overfitting and Underfitting 98

Major Types of Machine Learning Algorithms 101

Logistic Regression 102

K-Nearest Neighbors 105

Decision Trees 109

Random Forest 115

Summary 117

7 Evaluating Malware Detection Systems 119

Four Possible Detection Outcomes 120

True and False Positive Rates 120

Relationship Between True and False Positive Rates 121

ROC Curves 123

Considering Base Rates in Your Evaluation 124

How Base Rate Affects Precision 124

Estimating Precision in a Deployment Environment 125

Summary 126

8 Building Machine Learning Detectors 127

Terminology and Concepts 128

Building a Toy Decision Tree-Based Detector 129

Training Your Decision Tree Classifier 130

Visualizing the Decision Tree 131

Complete Sample Code 133

Building Real-World Machine Learning Detectors with sklearn 134

Real-World Feature Extraction 134

Why You Can't Use All Possible Features 137

Using the Hashing Trick to Compress Features 138

Building an Industrial-Strength Detector 141

Extracting Features 141

Training the Detector 142

Running the Detector on New Binaries 144

What We've Implemented So Far 144

Evaluating Your Detector's Performance 146

Using ROC Curves to Evaluate Detector Efficacy 147

Computing ROC Curves 147

Splitting Data into Training and Test Sets 148

Computing the ROC Curve 149

Cross-Validation 150

Next Steps 153

Summary 154

9 Visualizing Malware Trends 155

Why Visualizing Malware Data Is Important 156

Understanding Our Malware Dataset 158

Loading Data into pandas 158

Working with a pandas DataFrame 159

Filtering Data Using Conditions 161

Using matplotlib to Visualize Data 162

Plotting the Relationship Between Malware Size and Detection 162

Plotting Ransomware Detection Rates 164

Plotting Ransomware and Worm Detection Rates 165

Using seaborn to Visualize Data 168

Plotting the Distribution of Antivirus Detections 169

Creating a Violin Plot 172

Summary 174

10 Deep Learning Basics 175

What Is Deep Learning? 176

How Neural Networks Work 177

Anatomy of a Neuron 177

A Network of Neurons 180

Universal Approximation Theorem 181

Building Your Own Neural Network 182

Adding Another Neuron to the Network 186

Automatic Feature Generation 188

Training Neural Networks 189

Using Backpropagation to Optimize a Neural Network 190

Path Explosion 192

Vanishing Gradient 192

Types of Neural Networks 193

Feed-Forward Neural Network 193

Convolutional Neural Network 193

Autoencoder Neural Network 194

Generative Adversarial Network 195

Recurrent Neural Network 196

ResNet 196

Summary 197

11 Building a Neural Network Malware Detector with Keras 199

Defining a Model's Architecture 200

Compiling the Model 202

Training the Model 203

Extracting Features 203

Creating a Data Generator 204

Incorporating Validation Data 207

Saving and Loading the Model 209

Evaluating the Model 209

Enhancing the Model Training Process with Callbacks 211

Using a Built-in Callback 212

Using a Custom Callback 213

Summary 214

12 Becoming a Data Scientist 215

Paths to Becoming a Security Data Scientist 216

A Day in the Life of a Security Data Scientist 216

Traits of an Effective Security Data Scientist 218

Open-Mindedness 218

Boundless Curiosity 218

Obsession with Results 219

Skepticism of Results 219

Where to Go from Here 219

Appendix: An Overview of Datasets and Tools 221

Overview of Datasets 222

Chapter 1 Basic Static Malware Analysis 222

Chapter 2 Beyond Basic Static Analysis: x86 Disassembly 222

Chapter 3 A Brief Introduction to Dynamic Analysis 222

Chapter 4 Identifying Attack Campaigns Using Malware Networks 222

Chapter 5 Shared Code Analysis 223

Chapter 6 Understanding Machine Learning-Based Malware Detectors and Chapter 7 Evaluating Malware Detection Systems 223

Chapter 8 Building Machine Learning Detectors 224

Chapter 9 Visualizing Malware Trends 224

Chapter 10 Deep Learning Basics 224

Chapter 11 Building a Neural Network Malware Detector with Keras 224

Chapter 12 Becoming a Data Scientist 224

Tool Implementation Guide 225

Shared Hostname Network Visualization 225

Shared Image Network Visualization 226

Malware Similarity Visualization 227

Malware Similarity Search System 229

Machine Learning Malware Detection System 230

Index 233
