BUILDING  A  GRAPH  OF  ALL  
U.S.  BUSINESSES  USING  
SPARK  TECHNOLOGIES.  
Alexis  Roos @alexisroos
Engineering  manager  /  Lead  Data  Science  Engineer
Radius  Intelligence
AGENDA
• Radius  Intelligence
• Objectives  and  terminology
• Business  graph  data  pipeline
• Lessons  learned
• Q&As
2
Radius  Messaging  Framework
June  8,  2015  
Radius transforms the way B2B
marketers discover markets, acquire
customers, and measure success.
RADIUS  INTELLIGENCE
Our software is powered by the
Radius Intelligence Cloud
a proprietary data science engine
which provides predictive analytics,
powerful segmentation, and
seamless integrations.
Predictive Marketing software
RADIUS  INTELLIGENCE
➔ Founded  in  2009
➔ 125+  employees (and  growing)
➔ Headquartered  in  San  Francisco,  CA
➔ $125  million in  funding  to  date
➔ Cutting  edge  in  Data  Science and  Data  
Engineering  Technologies
4
Produce  an  accurate and comprehensive  Business  Graph from  records coming  from  
dozens  of  sources  comprising  hundreds  of  data  points  such  as:
BUSINESS  GRAPH  OBJECTIVE
5
…
Answer  complex  connected  questions  such  as:
6
BUSINESS  GRAPH  BENEFITS
• Employees:  Find  engineering  executives  at  a  given  company  
focusing  on  Big  Data  or  security
• News:  Show  News related  to  companies  in  my  marketing  segment  
such  as  Funding  announcements  or  mergers  and  acquisitions
• Intent:  Show  companies  with  a  buying  Intent for  my  product  or  
engineering  managers  looking  for  security  product
• Organization:  Show  all  locations  or  Employees  for  a  given  business
etc
o Business:  a  company  or  organization
ex:  Blue  Bottle  company
BUSINESS  GRAPH  TERMINOLOGY
7
E I
N T
o Location:  a  business  at  a  particular  address
ex:  Blue  Bottle  at  Sansome St,  San  Francisco
o Attributes:  information  associated  with  a
business  and/or  location
ex:  address,  website,  phone  number,  etc.
o Additional  Data  types:
ex:  Employees,  Intent,  News,  Technologies,  etc.
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
8
Dozens sources
7B+ records
50B+ Data  Points
• Crawling
• Licensing
Raw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
9
• Standardization
• Validation
• Normalization
Normalized  RecordsRaw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
10
Cluster  records  on:
• Address
• Name  of  the  business
• Websites
• Phone  numbers
• Employees
• Description
• Category  codes
• Etc.
Clusters  of  records  by  
address/business
Normalized  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering Construction
-­ Locations      
-­ Businesses  
&  more
Data  acquisition
11
Construct  Locations:
• Filter  bad  clusters
• Use  all  records  in  cluster  to  
create  Location  with  attributes:
o Select  best  value(s)
o Score  values:  confidence  
based  on  specific  record  
features  including  historical
o Impute  missing  values
• Etc.
Locations
MLLib
Clusters
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation ClusteringData  acquisition Construction
Businesses  
&  more
12
Construct  business  &  more:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
MLLib
E
I
N
T
E
GraphX models  a  “property  graph”  which is  an  multi  directed,  attributed graph.
SPARK  /  GRAPHX
13
(1,  Location) (2,  Domain)
(1,  2,  source1)
(3,  Phone)
(1,  2,  source2)
class Graph[VD,  ED]  {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD,  ED]]  
}
VertexRDD[VD] extends RDD[(VertexID,  VD)]
EdgeRDD[ED] extends RDD[Edge[ED]]
VD  -­ Vertex  data attribute type
ED  -­ Edge data attribute type
Graph  operations  are  like  RDD  operations:  functional   and  lazily  evaluated
SPARK  /  GRAPHX
14
Operations  from  Graph  and  GraphOps:
• Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees
• Collections:  vertices,  edges,  triplets
• Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets,
filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges
• Join:  joinVertices,  outerJoinVertices
• Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields
• Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy
• Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount
And  data  operations  on  VertexRDD and  EdgeRDD:
• VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex
• EdgeRDD:  mapValues,  reverse,  innerJoin
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Data  preparation ClusteringData  acquisition
Construct  business  graph:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  
etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
Construction
Businesses  
&  more
15
E
I
N
T
E
1. Create  vertex  for  each  location  and  each  known  business
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
16
VertexRDD[Location]
Clusters
map
VertexRDD[Business]
Curated  list
map
2. Link  Locations  to  Known  Business  per  relationships  (such  as  domain)
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
flatMap
attributes
Create
Edges
Location Business
17
Locations
Businesses
join
3.  Create  vertex  for  unique  top  attributes  such  as  address,  phone,  domain,  etc.
and  link  locations  to  these  attributes
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
attributes
groupBy
Create
Vertices
& Edges
Location  and  attributes
18
Locations
Aggregate  messages:  
sendMsg,  mergeMsg
SPARK  /  GRAPHX
Seq
19
sendMsg:  (srcID,  name) mergeMsg:  x  ++  y
Location  1
Location  2
Domain
sendMsg mergeMsg
4.  Create  location  to  location  edges  using  attributes  relationship  and  graph  cutting
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Locations  and  attributes
aggregate
Messages
.combinations(2)
20
Locations
.filter(isSimilar)  *
*  such  as  name  matching
Note:  .combinations  (2)  and  bottleneck
PERFORMANCE  TIP
.combinations(2)
O(n):  N^2
5
1
4
2
3
21
5
1
4
2
3
smarter  combination
O(n):  N
1 2 3 4 5
SPARK  /  GRAPHX
22
Connected  components
Group  1
Group  2
connected
Components
Locations
5.  Call  connected  components  on  linked  locations  and  create  businesses
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
connected
Components
Locations
Location Business
Create
Vertices
& Edges
23
GraphX Connected  components  implementation  already  performs  optimizations:
– Incremental  Updates  to  Mirror  Caches
– Join  Elimination
– Index  Scanning  for  Active  Sets
– Local  Vertex  and  Edge  Indices
– Index  and  Routing  Table  Reuse
But  algorithm  is  iterative  so  optimization  matters:
PERFORMANCE  TIP
24
• Vertices  and  edges  should  be  in  memory:  more  efficient  to
subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later
• Check-­pointing  improves  performance  (less  lineage)
6.  More  steps:
• Assign  headquarter
• Join  additional  information:  Employees,  News,  Intent,  etc.
• Further  corrections  /  imputations:  attributes,  revenues,  etc.
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Location Business
more
operations
25
Business  graph
E
I
N
T
E
GRAPHICAL  VIEW
26
From  Locations  with  attributes  to  Locations  with  Business
GRAPHICAL  VIEW
27
From  Locations  with  attributes  to  Locations  with  Business
• Using  a  Graph  allows  us  to  naturally  model  business  relationships
and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data!
LESSONS  LEARNED
28
• GraphX scales:  250M  vertices  and  1b  edges  in  clustering
• Creating  and  updating  the  graph  is  easy  using  RDD  operations
• Edge  cutting  is  an  art  as  much  as  a  science
• Connected  Components  is  an  expensive  operation
THANK  YOU.
WE  ARE  HIRING!
http://radius.com/jobs/
Alexis  Roos
alexis@radius.com
@alexisroos

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

  • 1.
    BUILDING  A  GRAPH OF  ALL   U.S.  BUSINESSES  USING   SPARK  TECHNOLOGIES.   Alexis  Roos @alexisroos Engineering  manager  /  Lead  Data  Science  Engineer Radius  Intelligence
  • 2.
    AGENDA • Radius  Intelligence •Objectives  and  terminology • Business  graph  data  pipeline • Lessons  learned • Q&As 2
  • 3.
    Radius  Messaging  Framework June 8,  2015   Radius transforms the way B2B marketers discover markets, acquire customers, and measure success. RADIUS  INTELLIGENCE Our software is powered by the Radius Intelligence Cloud a proprietary data science engine which provides predictive analytics, powerful segmentation, and seamless integrations. Predictive Marketing software
  • 4.
    RADIUS  INTELLIGENCE ➔ Founded in  2009 ➔ 125+  employees (and  growing) ➔ Headquartered  in  San  Francisco,  CA ➔ $125  million in  funding  to  date ➔ Cutting  edge  in  Data  Science and  Data   Engineering  Technologies 4
  • 5.
    Produce  an  accurateand comprehensive  Business  Graph from  records coming  from   dozens  of  sources  comprising  hundreds  of  data  points  such  as: BUSINESS  GRAPH  OBJECTIVE 5 …
  • 6.
    Answer  complex  connected questions  such  as: 6 BUSINESS  GRAPH  BENEFITS • Employees:  Find  engineering  executives  at  a  given  company   focusing  on  Big  Data  or  security • News:  Show  News related  to  companies  in  my  marketing  segment   such  as  Funding  announcements  or  mergers  and  acquisitions • Intent:  Show  companies  with  a  buying  Intent for  my  product  or   engineering  managers  looking  for  security  product • Organization:  Show  all  locations  or  Employees  for  a  given  business etc
  • 7.
    o Business:  a company  or  organization ex:  Blue  Bottle  company BUSINESS  GRAPH  TERMINOLOGY 7 E I N T o Location:  a  business  at  a  particular  address ex:  Blue  Bottle  at  Sansome St,  San  Francisco o Attributes:  information  associated  with  a business  and/or  location ex:  address,  website,  phone  number,  etc. o Additional  Data  types: ex:  Employees,  Intent,  News,  Technologies,  etc.
  • 8.
    BUSINESS  GRAPH  DATA PIPELINE Data  preparation Clustering ConstructionData  acquisition 8 Dozens sources 7B+ records 50B+ Data  Points • Crawling • Licensing Raw  Records
  • 9.
    BUSINESS  GRAPH  DATA PIPELINE Data  preparation Clustering ConstructionData  acquisition 9 • Standardization • Validation • Normalization Normalized  RecordsRaw  Records
  • 10.
    BUSINESS  GRAPH  DATA PIPELINE Data  preparation Clustering ConstructionData  acquisition 10 Cluster  records  on: • Address • Name  of  the  business • Websites • Phone  numbers • Employees • Description • Category  codes • Etc. Clusters  of  records  by   address/business Normalized  Records
  • 11.
    BUSINESS  GRAPH  DATA PIPELINE Data  preparation Clustering Construction -­ Locations       -­ Businesses   &  more Data  acquisition 11 Construct  Locations: • Filter  bad  clusters • Use  all  records  in  cluster  to   create  Location  with  attributes: o Select  best  value(s) o Score  values:  confidence   based  on  specific  record   features  including  historical o Impute  missing  values • Etc. Locations MLLib Clusters
  • 12.
    BUSINESS  GRAPH  DATA PIPELINE Data  preparation ClusteringData  acquisition Construction Businesses   &  more 12 Construct  business  &  more: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,  etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph MLLib E I N T E
  • 13.
    GraphX models  a “property  graph”  which is  an  multi  directed,  attributed graph. SPARK  /  GRAPHX 13 (1,  Location) (2,  Domain) (1,  2,  source1) (3,  Phone) (1,  2,  source2) class Graph[VD,  ED]  { val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] val triplets: RDD[EdgeTriplet[VD,  ED]]   } VertexRDD[VD] extends RDD[(VertexID,  VD)] EdgeRDD[ED] extends RDD[Edge[ED]] VD  -­ Vertex  data attribute type ED  -­ Edge data attribute type
  • 14.
    Graph  operations  are like  RDD  operations:  functional   and  lazily  evaluated SPARK  /  GRAPHX 14 Operations  from  Graph  and  GraphOps: • Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees • Collections:  vertices,  edges,  triplets • Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets, filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges • Join:  joinVertices,  outerJoinVertices • Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields • Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy • Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount And  data  operations  on  VertexRDD and  EdgeRDD: • VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex • EdgeRDD:  mapValues,  reverse,  innerJoin
  • 15.
    GRAPH  DATA  PIPELINE CONSTRUCTION  ZOOM  IN Data  preparation ClusteringData  acquisition Construct  business  graph: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,   etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph Construction Businesses   &  more 15 E I N T E
  • 16.
    1. Create  vertex for  each  location  and  each  known  business GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN 16 VertexRDD[Location] Clusters map VertexRDD[Business] Curated  list map
  • 17.
    2. Link  Locations to  Known  Business  per  relationships  (such  as  domain) GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap flatMap attributes Create Edges Location Business 17 Locations Businesses join
  • 18.
    3.  Create  vertex for  unique  top  attributes  such  as  address,  phone,  domain,  etc. and  link  locations  to  these  attributes GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap attributes groupBy Create Vertices & Edges Location  and  attributes 18 Locations
  • 19.
    Aggregate  messages:   sendMsg, mergeMsg SPARK  /  GRAPHX Seq 19 sendMsg:  (srcID,  name) mergeMsg:  x  ++  y Location  1 Location  2 Domain sendMsg mergeMsg
  • 20.
    4.  Create  location to  location  edges  using  attributes  relationship  and  graph  cutting GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Locations  and  attributes aggregate Messages .combinations(2) 20 Locations .filter(isSimilar)  * *  such  as  name  matching
  • 21.
    Note:  .combinations  (2) and  bottleneck PERFORMANCE  TIP .combinations(2) O(n):  N^2 5 1 4 2 3 21 5 1 4 2 3 smarter  combination O(n):  N 1 2 3 4 5
  • 22.
    SPARK  /  GRAPHX 22 Connected components Group  1 Group  2 connected Components Locations
  • 23.
    5.  Call  connected components  on  linked  locations  and  create  businesses GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN connected Components Locations Location Business Create Vertices & Edges 23
  • 24.
    GraphX Connected  components implementation  already  performs  optimizations: – Incremental  Updates  to  Mirror  Caches – Join  Elimination – Index  Scanning  for  Active  Sets – Local  Vertex  and  Edge  Indices – Index  and  Routing  Table  Reuse But  algorithm  is  iterative  so  optimization  matters: PERFORMANCE  TIP 24 • Vertices  and  edges  should  be  in  memory:  more  efficient  to subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later • Check-­pointing  improves  performance  (less  lineage)
  • 25.
    6.  More  steps: •Assign  headquarter • Join  additional  information:  Employees,  News,  Intent,  etc. • Further  corrections  /  imputations:  attributes,  revenues,  etc. GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Location Business more operations 25 Business  graph E I N T E
  • 26.
    GRAPHICAL  VIEW 26 From  Locations with  attributes  to  Locations  with  Business
  • 27.
    GRAPHICAL  VIEW 27 From  Locations with  attributes  to  Locations  with  Business
  • 28.
    • Using  a Graph  allows  us  to  naturally  model  business  relationships and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data! LESSONS  LEARNED 28 • GraphX scales:  250M  vertices  and  1b  edges  in  clustering • Creating  and  updating  the  graph  is  easy  using  RDD  operations • Edge  cutting  is  an  art  as  much  as  a  science • Connected  Components  is  an  expensive  operation
  • 29.
    THANK  YOU. WE  ARE HIRING! http://radius.com/jobs/ Alexis  Roos alexis@radius.com @alexisroos