http://purdygoodengineering.com http://anant.us
Accumulo and Spark
With MLlib and GraphX
Introduction
● Section 1: Understanding the Technology
○ Big Picture
○ Accumulo
○ Spark
○ Example Code
● Section 2: Use Cases
○ Multi-Tenant Data Processing
○ Machine Learning / Graph Processing in Spark
○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information
● Section 1: Understanding the Technology
○ Big Picture
■ Why Accumulo
■ Why Spark
○ Accumulo
■ Key/Value Structure
■ Table Structure
■ Cell Level Security
■ Splits
■ Reads (scans)
■ Writes (upserts)
■ Deletes
○ Spark
■ Batch/Streaming
■ Machine Learning
■ Graph Processing
○ Example Code
■ Writing to Accumulo
■ Reading from Accumulo
■ Shell
Section 1: Understanding the Technology
Section 1: Big Picture
● Accumulo
○ Scalable, sorted, distributed key/value store with cell level security
● Spark
○ General compute engine for large-scale data processing
■ Batch Processing
■ Streaming
■ Machine Learning Library
■ Graph Processing
● Use Spark for compute and Accumulo for storage to get a secure, distributed,
scalable solution
Section 1: Accumulo: Key Structure
(image from accumulo.apache.org)
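The key layout in the figure can be sketched as a small Python model. This is an illustration only, not the Accumulo client API: the `Key` tuple and the `sort_order` and `sorted_entries` helpers are names invented here. It assumes the standard Accumulo ordering (ascending by row, column family, column qualifier and visibility, then newest timestamp first) and compares Python strings where Accumulo compares bytes.

```python
from collections import namedtuple

# Toy model of an Accumulo key; the value is stored alongside it.
Key = namedtuple("Key", "row family qualifier visibility timestamp")


def sort_order(key):
    """Ascending on every field except timestamp, which sorts newest first."""
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


def sorted_entries(entries):
    """Sort (Key, value) pairs the way a scan would return them."""
    return sorted(entries, key=lambda entry: sort_order(entry[0]))
```

Because the newest timestamp sorts first, a reader that keeps only the first entry per column sees the latest version of each cell.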
Section 1: Accumulo: Key Structure
(figure: Accumulo table design compared with RDBMS table design)
Section 1: Accumulo: Table Structure
● Each table has many tablets (distributed across nodes)
● Tablet data is stored in HDFS, which replicates it (default replication is 3)
● All entries for a row reside on the same tablet
○ A Row Id design strategy needs to ensure binning is evenly distributed
○ Each table has “splits” which determine binning
○ If rows are still too large, a sharding strategy is required
Section 1: Accumulo: Cell Level Security
● Each cell (or field) has its own access control determined
by visibility
● Each user has authorizations which correspond to
visibilities
● Only fields with visibilities which a user has authorization
to access can be retrieved by that user
● Visibilities support limited boolean logic: AND (&), OR (|), and parentheses
○ e.g. private | system, public & dna_partner
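The visibility check described above can be sketched in plain Python. This is a toy evaluator, not Accumulo's implementation: `tokenize` and `is_visible` are names invented for this sketch, and labels are assumed to contain only letters, digits and underscores. Like Accumulo, it rejects expressions that mix & and | without parentheses, and it treats an empty visibility as visible to everyone.

```python
import re


def tokenize(expression):
    """Split a visibility expression into labels, '&', '|', '(' and ')'."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    if "".join(tokens) != expression.replace(" ", ""):
        raise ValueError("unexpected character in " + repr(expression))
    return tokens


def is_visible(expression, authorizations):
    """Return True if the user's authorizations satisfy the expression."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # an empty visibility is readable by everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("expression ended early")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expression()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if token in "&|)":
            raise ValueError("unexpected " + repr(token))
        return token in authorizations

    def parse_expression():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in "&|":
            if operator is not None and tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            operator = tokens[pos]
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expression()
    if pos != len(tokens):
        raise ValueError("unexpected " + repr(tokens[pos]))
    return result
```

For example, `is_visible("private | system", {"system"})` returns True, while `is_visible("public & dna_partner", {"public"})` returns False because the user lacks the dna_partner authorization.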
Section 1: Splits
● Each table has a default split
● Splits can be added to tables
● Accumulo auto splits when tablets get too large
● Table splits and tablet max size are configurable
● Row ids are generally hashed to support distribution
● Example splits based on hashing
○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
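The hashing idea above can be sketched in plain Python. This is an illustration, not Accumulo code: the helper names, the four-character SHA-256 prefix and the "_" separator are assumptions made for this sketch. It models each split point as the inclusive end row of a tablet, so tablet i covers rows greater than `splits[i-1]` and up to `splits[i]`.

```python
import bisect
import hashlib
from collections import Counter

# Split points from the slide: one per hex digit.
SPLITS = list("0123456789abcdef")


def hashed_row_id(natural_key):
    """Prefix the natural key with a short hex hash so rows spread evenly."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return digest[:4] + "_" + natural_key


def tablet_index(row_id, splits=SPLITS):
    """Index of the tablet holding row_id; tablet i covers (splits[i-1], splits[i]]."""
    return bisect.bisect_left(splits, row_id)


def distribution(natural_keys):
    """Count how many rows land in each tablet."""
    return Counter(tablet_index(hashed_row_id(key)) for key in natural_keys)
```

With these split points, the first tablet (rows up to and including "0") stays nearly empty, and hashed row ids spread over the remaining 16 tablets. Sequential keys such as invoice-1, invoice-2, ... would otherwise pile into a single tablet.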
Section 1: Accumulo Reads
● Reads (are scans)
○ Scanner
○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
○ AccumuloInputFormat (one field at a time)
○ AccumuloRowInputFormat (one row at a time)
Section 1: Accumulo: Writes
● Writes
○ Writer
○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
○ AccumuloOutputFormat
○ AccumuloFileOutputFormat (bulk ingest)
● Both use Mutations to write to Accumulo
Section 1: Accumulo: Mutations (write and delete)
● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are upserts (inserts or updates)
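The upsert and delete behavior above can be sketched with a toy in-memory model. `ToyMutation` and `ToyTable` are hypothetical classes written for this sketch, only loosely modeled on Accumulo's Mutation; they ignore visibilities, timestamps and versioning, and they remove a cell immediately where Accumulo writes a delete marker.

```python
class ToyMutation:
    """Collects changes for one row (toy stand-in for a Mutation)."""

    def __init__(self, row):
        self.row = row
        self.updates = []  # (family, qualifier, value); value None marks a delete

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None))


class ToyTable:
    """Sorted key/value table where a write to an existing key replaces it."""

    def __init__(self):
        self.cells = {}  # (row, family, qualifier) -> value

    def write(self, mutation):
        for family, qualifier, value in mutation.updates:
            key = (mutation.row, family, qualifier)
            if value is None:
                self.cells.pop(key, None)  # delete: the cell disappears
            else:
                self.cells[key] = value  # upsert: insert or overwrite

    def scan(self):
        """Return all cells in sorted key order, as a scan would."""
        return sorted(self.cells.items())
```

Writing the same row and column twice leaves one cell holding the second value, which is the upsert behavior the slide describes.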
Section 1: Accumulo
● accumulo.apache.org
● Download accumulo
● Examples
● Documentation
Concerned about scaling? How about a graph with 4T nodes and 70T edges; see:
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Section 1: Spark: MapReduce first
● Hadoop MapReduce (batch processing)
○ Mapping
○ Reducing
○ Chain jobs
○ 95% IO (each job must read/write to disk)
○ scalable
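The map, reduce and chain steps listed above can be sketched in plain Python. This is an in-memory illustration, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase` and `run_job` are names invented here, and nothing is written to disk between jobs the way Hadoop MapReduce would.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and emit its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce each key's list of values to a single result."""
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def run_job(records, mapper, reducer):
    """One complete job: map, then shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)
```

Chaining means feeding one job's output into the next, for example a word count followed by a job that groups words by their count:

```python
lines = ["spark accumulo", "spark graph"]
counts = run_job(lines, lambda line: [(word, 1) for word in line.split()],
                 lambda word, ones: sum(ones))
by_count = run_job(counts.items(), lambda item: [(item[1], item[0])],
                   lambda count, words: sorted(words))
```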
Section 1: Spark
● Batch Processing - MapReduce style (with many more functions)
● Streaming - micro-batch processing
● Machine Learning - MLlib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)
Section 1: Spark
● spark.apache.org
● Download spark
● Example code
● Documentation
Section 1: Example Code
Simple bookkeeping examples with Spark and Accumulo
https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
Section 2: Use Case(s): Machine Learning and Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data
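Section 2: Multi-Tenant Data Processing Needs
Data ownership: Customer (C), shared (P) & (C), Provider (P)
● Teams: Sales, Marketing
○ Customer Private: IBM Indicators
○ Customer Data shared w/ Provider: Relationships, Classification
○ Private Provider Data for Economy of Scale: Classification Model, Relationship Graph
● Teams: Marketing, Finance
○ Customer Private: Apple Indicators
○ Customer Data shared w/ Provider: Correlation, Prediction
○ Private Provider Data for Economy of Scale: Correlation Model, Prediction Model
● Teams: Sales, Marketing, Finance
○ Customer Private: Microsoft Indicators
○ Customer Data shared w/ Provider: Relationships, Correlation, Prediction
○ Private Provider Data for Economy of Scale: Correlation Model, Prediction Model, Relationship Graph
● Teams: Finance
○ Customer Private: Google Indicators
○ Customer Data shared w/ Provider: Correlation, Prediction
○ Private Provider Data for Economy of Scale: Correlation Model, Prediction Model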
Section 2: Multi-Tenant Data Processing Needs
Access roles: Customer (C), shared (P) & (C), Provider (P)
● Role groups: C User (CU), C Team (CT), C Management (CM), P Analytics (PA), P Support (PS)
● CT Sales: CU Manager, CU Employee, CM Executive, CM Executive, CU Manager, PA * / PS *, PA * / PS *
● CT Marketing: CU Manager, CU Employee, CM Executive, CM Executive, CU Manager, PA * / PS *, PA * / PS *
● CT Research: CU Employee, CM Executive, CM Executive, CU Manager, PA * / PS *, PA * / PS *
● CT Finance: CU Employee, CM Executive, CM Executive, CU Manager, PA * / PS *, PA * / PS *
Section 2: Multi-Tenant Data Processing Needs
● Analyze Sales Team successes (Closed Accounts) to recommend companies to
target for Marketing campaigns
● Match Sales Team users' social accounts against social network users and the
recommended companies to create Call Lists
● Correlate historic Marketing (Traffic & Conversions), Sales (Leads & Closed
Accounts) and Finance (Revenue & Profit) data to predict Sales from current
Marketing & Sales activities
Section 2: Out of the Box: MLlib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence
● “Based on past performance in the companies in the CRM, the most successful
sales have come from these categories, so go after these companies.”
Section 2: Out of the Box: MLlib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
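The five steps above can be sketched end to end in plain Python. This is not MLlib code: every function name here is invented for the sketch, the data is made up, and the two candidate models (a constant mean and a least-squares line) stand in for whatever algorithms a real pipeline would compare.

```python
def load_data():
    """Made-up (marketing_visits, closed_sales) history; not real data."""
    return [(100, 12), (200, 22), (300, 31), (400, 42), (500, 52), (600, 61)]


def extract_features(rows):
    """Split raw rows into a feature list and a label list."""
    return [float(v) for v, _ in rows], [float(s) for _, s in rows]


def train_mean(xs, ys):
    """Baseline model: always predict the average label."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def train_linear(xs, ys):
    """Ordinary least-squares fit of a straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mean_squared_error(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def find_best_model(rows, holdout=2):
    """Train on the early rows and keep the model with the lowest holdout error."""
    train_xs, train_ys = extract_features(rows[:-holdout])
    test_xs, test_ys = extract_features(rows[-holdout:])
    candidates = {
        "mean": train_mean(train_xs, train_ys),
        "linear": train_linear(train_xs, train_ys),
    }
    best = min(candidates,
               key=lambda name: mean_squared_error(candidates[name], test_xs, test_ys))
    return best, candidates[best]
```

Calling `name, model = find_best_model(load_data())` selects the model, and `model(700.0)` then predicts sales for a new traffic level.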
Section 2: KeystoneML - End to End ML
http://keystone-ml.org/
Section 2: Out of the Box : GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
● “Based on the social graph of sales team members and the companies in your
CRM, talk to the companies you are closest to.”
Section 2: Out of the Box : GraphX in Spark
● Load Vertices (nodes) RDD
● Load Edges RDD
● Create Graph from Vertices & Edges RDDs
● Run Graph Process / Query
● Get Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
Section 2: Out of the Box : GraphX in Spark
● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
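The flow above (build a graph from edges, rank it, join the ranks back to user names) can be sketched in plain Python. This is not GraphX code: `page_rank` and `join_users_with_rank` are names invented here, and the function implements the textbook normalized PageRank (ranks sum to 1), so its numbers will differ from what GraphX reports.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, destination in edges:
        out_links[source].append(destination)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {node: base for node in nodes}
        for source in nodes:
            for destination in out_links[source]:
                new_rank[destination] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


def join_users_with_rank(users, rank):
    """Join {id: name} with {id: rank}, highest rank first."""
    joined = [(users[node], score) for node, score in rank.items() if node in users]
    return sorted(joined, key=lambda pair: pair[1], reverse=True)
```

A usage example with made-up users and follower edges:

```python
users = {1: "ann", 2: "bob", 3: "cat", 4: "dan"}
edges = [(1, 2), (3, 2), (4, 2), (2, 1)]
ranked = join_users_with_rank(users, page_rank(edges))
```

Here bob ranks first because three users link to him.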
Questions and Answers
?
Contact Information
Matthew Purdy
● matthew.purdy@purdygoodengineering.com
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy
Rahul Singh
● rahul.singh@anant.us
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh

Machine Learning & Graph Processing w/ Spark and Accumulo
