Machine Learning & Graph Processing w/ Spark and Accumulo

http://purdygoodengineering.com http://anant.us
Accumulo and Spark
With MLlib and GraphX

Introduction
● Section 1: Understanding the Technology
○ Big Picture
○ Accumulo
○ Spark
○ Example Code
● Section 2: Use Cases
○ Multi-Tenant Data Processing
○ Machine Learning / Graph Processing in Spark
○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information

Section 1: Understanding the Technology
○ Big Picture
■ Why Accumulo
■ Why Spark
○ Accumulo
■ Key/Value Structure
■ Table Structure
■ Cell Level Security
■ Splits
■ Reads (scans)
■ Writes (upserts)
■ Deletes
○ Spark
■ Batch/Streaming
■ Machine Learning
■ Graph Processing
○ Example Code
■ Writing to Accumulo
■ Reading from Accumulo
■ Shell

Section 1: Big Picture
● Accumulo
○ Scalable, sorted, distributed key/value store with cell level security
● Spark
○ General compute engine for large-scale data processing
■ Batch Processing
■ Streaming
■ Machine Learning Library
● Use Spark for compute and Accumulo for storage to get a secure, distributed,
scalable solution

Section 1: Accumulo: Key Structure
(image from accumulo.apache.org)

Section 1: Accumulo: Key Structure
(figure: Accumulo table design compared with RDBMS table design)

Section 1: Accumulo: Table Structure
● Each table has many tablets (distributed across nodes)
● Tablet data is stored in HDFS, which replicates it (default replication factor is 3)
● All data for a given row resides on a single tablet
○ A Row Id design strategy needs to ensure binning is
evenly distributed
○ Each table has “splits” which determine binning
○ If rows are still too large, a sharding strategy is
required

Section 1: Accumulo: Cell Level Security
● Each cell (or field) has its own access control determined
by visibility
● Each user has authorizations which correspond to
visibilities
● Only fields with visibilities which a user has authorization
to access can be retrieved by that user
● Visibilities support limited Boolean logic: AND (&), OR (|), and parentheses
○ e.g. private | system
○ e.g. public & dna_partner
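
The visibility check described on this slide can be sketched in a few lines of Python. This is a toy evaluator for illustration only, not Accumulo's implementation (the real client API uses the Java classes ColumnVisibility and Authorizations). The function names `tokenize` and `is_visible` are assumptions made for this sketch. It handles `&`, `|`, and parentheses, and rejects expressions that mix `&` and `|` without parentheses.

```python
import re

# A token is an operator, a parenthesis, or a label made of simple characters.
TOKEN = re.compile(r"\s*([&|()]|[A-Za-z0-9_\-.:]+)")

def tokenize(expression):
    """Split a visibility expression into labels, operators, and parentheses."""
    expression = expression.strip()
    tokens, pos = [], 0
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def is_visible(expression, authorizations):
    """Return True if a user holding `authorizations` may read the cell."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an empty visibility is readable by everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        value = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

For example, `is_visible("public & dna_partner", {"public"})` is False because the user lacks the `dna_partner` authorization, while `is_visible("private | system", {"system"})` is True.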

Section 1: Splits
● Each table has a default split
● Splits can be added to tables
● Accumulo auto splits when tablets get too large
● Table splits and tablet max size are configurable
● Row ids are generally hashed to support distribution
● Example splits based on hashing
○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
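
As a rough sketch of how hashed row ids bin against these split points, the Python below mimics the lookup with a sorted list. It is not Accumulo code. The 4-character SHA-256 prefix, the `_` separator, and the names `hashed_row_id`, `tablet_index`, and `distribution` are assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter

# Split points copied from the slide: one per hex digit.
SPLITS = list("0123456789abcdef")

def hashed_row_id(natural_key):
    """Prefix a key with a short hex hash so row ids spread across the splits.

    The 4-character SHA-256 prefix and the "_" separator are illustrative choices.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:4]}_{natural_key}"

def tablet_index(row_id, splits=SPLITS):
    """Return which tablet holds row_id, given sorted split points.

    Tablet i covers rows in (splits[i-1], splits[i]]; the final tablet covers
    every row after the last split point.
    """
    return bisect.bisect_left(splits, row_id)

def distribution(natural_keys):
    """Count how many hashed row ids land in each tablet."""
    return Counter(tablet_index(hashed_row_id(key)) for key in natural_keys)
```

Calling `distribution(f"customer-{i}" for i in range(10000))` gives a roughly even count per tablet. Note that 16 split points produce 17 tablets, and in this sketch tablet 0 stays empty because every hashed row id sorts after its one-character prefix.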

Section 1: Accumulo Reads
● Reads (are scans)
○ Scanner
○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
○ AccumuloInputFormat (one field at a time)
○ AccumuloRowInputFormat (one row at a time)

Section 1: Accumulo: Writes
● Writes
○ Writer
○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
○ AccumuloOutputFormat
○ AccumuloFileOutputFormat (bulk ingest)
● Both use Mutations to write to Accumulo

Section 1: Accumulo: Mutations (write and delete)
● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are upserts (inserts or updates)
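
To make the upsert and delete semantics concrete, here is a toy in-memory model in Python. It is only a sketch: the class `ToyTable` and its methods are hypothetical stand-ins for Mutation.put and Mutation.putDelete, and it ignores visibilities, timestamps, and tablets. A real table keeps timestamped versions; this model keeps only the latest write per cell.

```python
class ToyTable:
    """In-memory stand-in for one table, keyed by (row, family, qualifier)."""

    _DELETED = object()  # marker written by put_delete

    def __init__(self):
        self._cells = {}

    def put(self, row, family, qualifier, value):
        """Upsert: the write succeeds whether or not the cell already exists."""
        self._cells[(row, family, qualifier)] = value

    def put_delete(self, row, family, qualifier):
        """Delete by writing a marker that hides the cell from scans."""
        self._cells[(row, family, qualifier)] = self._DELETED

    def scan(self, start_row="", end_row=None):
        """Yield live cells in sorted key order for rows in [start_row, end_row]."""
        for key in sorted(self._cells):
            row = key[0]
            if row < start_row or (end_row is not None and row > end_row):
                continue
            value = self._cells[key]
            if value is not self._DELETED:
                yield key, value
```

Writing the same key twice leaves one cell holding the second value, and a delete is itself a write that hides the cell from later scans.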

Section 1: Accumulo
● accumulo.apache.org
● Download accumulo
● Examples
● Documentation
Concerned about scaling? How about a graph with 4T nodes and 70T edges? See:
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf

Section 1: Spark: MapReduce first
● Hadoop MapReduce (batch processing)
○ Mapping
○ Reducing
○ Chain jobs
○ 95% IO (each job must read/write to disk)
○ scalable

Section 1: Spark
● Batch Processing - MapReduce (many more functions)
● Streaming - micro-batch processing
● Machine Learning - MLlib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)

Section 1: Spark
● spark.apache.org
● Download spark
● Example code
● Documentation

Section 1: Example Code
Simple examples for bookkeeping with Spark and Accumulo
https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo

Section 2: Use Case(s): Machine Learning and Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data

Section 2: Multi-Tenant Data Processing Needs
Customer (C) | (P) & (C) | Provider (P)
Team | Customer Private | Customer Data shared w/ Provider | Private Provider Data for Economy of Scale
Sales, Marketing | IBM Indicators | Relationships, Classification | Classification Model, Relationship Graph
Marketing, Finance | Apple Indicators | Correlation, Prediction | Correlation Model, Prediction Model
Sales, Marketing, Finance | Microsoft Indicators | Relationships, Correlation, Prediction | Correlation Model, Prediction Model, Relationship Graph
Finance | Google Indicators | Correlation, Prediction | Correlation Model, Prediction Model

Customer (C) | (P) & (C) | Provider (P)
C User | C Team | C Management | C Management, P Analytics | P Analytics, P Support
CU Manager, CU Employee | CT Sales | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Manager, CU Employee | CT Marketing | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee | CT Research | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee | CT Finance | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *

● Analyze Sales Team successes (Closed Accounts) to recommend companies
to target for Marketing campaigns.
● Analyze Sales Team users' social accounts against social network users at the
recommended companies to create Call Lists.
● Correlate historic Marketing data (Traffic & Conversions) with historic Sales
data (Leads & Closed Accounts) and historic Finance data (Revenue & Profit) to
predict Sales from current Marketing & Sales activities.

Section 2: Out of the Box : MLlib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence
● “Based on past performance in the companies in the CRM, the most successful
sales have come from these categories, so go after these companies.”

Section 2: Out of the Box : MLlib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
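
The five steps above can be walked through end to end in plain Python. This is a sketch of the workflow shape only and does not use MLlib: it swaps in a nearest-centroid classifier, and the CRM records, feature choices, and function names are all made up for illustration.

```python
# Toy CRM records, made up for illustration:
# (industry, employees, past_deals, closed_account)
RECORDS = [
    ("software", 120, 3, 1),
    ("software", 80, 2, 1),
    ("retail", 400, 0, 0),
    ("retail", 350, 1, 0),
    ("software", 60, 4, 1),
    ("finance", 900, 0, 0),
    ("finance", 700, 2, 1),
    ("retail", 500, 0, 0),
]

def extract_features(record):
    """Turn a raw record into a numeric feature vector and a label."""
    industry, employees, past_deals, closed = record
    features = [1.0 if industry == "software" else 0.0,
                employees / 1000.0,
                float(past_deals)]
    return features, closed

def train(rows, columns):
    """Nearest-centroid training: average the chosen feature columns per label."""
    grouped = {}
    for features, label in rows:
        grouped.setdefault(label, []).append([features[c] for c in columns])
    return {label: [sum(col) / len(col) for col in zip(*vectors)]
            for label, vectors in grouped.items()}

def predict(model, features, columns):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    point = [features[c] for c in columns]
    return min(model, key=lambda label: sum((p - q) ** 2
                                            for p, q in zip(point, model[label])))

def accuracy(model, rows, columns):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(model, features, columns) == label for features, label in rows)
    return hits / len(rows)

def find_best(train_rows, validation_rows, candidates):
    """Train one model per candidate column set; keep the best on validation data."""
    scored = [(accuracy(train(train_rows, cols), validation_rows, cols), cols)
              for cols in candidates]
    best_accuracy, best_columns = max(scored, key=lambda pair: pair[0])
    return train(train_rows, best_columns), best_columns, best_accuracy

# Load data and extract features.
rows = [extract_features(record) for record in RECORDS]
train_rows, validation_rows = rows[:6], rows[6:]

# Train candidate models and find the best one.
model, columns, score = find_best(train_rows, validation_rows,
                                  [(0,), (1,), (2,), (0, 1, 2)])

# Use the model to predict for a new, hypothetical company.
new_features, _ = extract_features(("retail", 300, 5, None))
prediction = predict(model, new_features, columns)
```

In Spark the same stages would operate on distributed data with MLlib's own algorithms; the point here is only the order of the stages and the use of held-out data to choose among candidate models.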

Section 2: KeystoneML - End to End ML
http://keystone-ml.org/

Section 2: Out of the Box : GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
● “Based on the social graph of sales team members and the companies in your
CRM, talk to the companies you are “closest” to.”

● Load Vertices (Nodes) RDD
● Load Edges RDD
● Create Graph from Vertices & Edges RDDs
● Run Graph Process / Query
● Get Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
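
The PageRank steps above can be sketched without Spark. The Python below is a plain power-iteration PageRank followed by a join against a user lookup; it is not GraphX code, and GraphX normalizes ranks differently, so the scores need not match its output. The users, edges, and the function name `pagerank` are hypothetical.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node with no outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical data: node id -> user name, and directed "follows" edges.
USERS = {1: "alice", 2: "bob", 3: "carol"}
EDGES = [(1, 3), (2, 3), (3, 1)]

ranks = pagerank(EDGES)                                   # run PageRank
ranked_users = sorted(((USERS[node], score)               # join users with rank
                       for node, score in ranks.items()),
                      key=lambda pair: -pair[1])
```

Here `ranked_users` lists carol first, since two users point to her, and bob last, since nobody points to him.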

Questions and Answers
?

Contact Information
Matthew Purdy
● matthew.purdy@purdygoodengineering.com
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy
Rahul Singh
● rahul.singh@anant.us
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh
