Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos


Spark Summit East talk

Published in: Data & Analytics

  1. 1. BUILDING A GRAPH OF ALL U.S. BUSINESSES USING SPARK TECHNOLOGIES. Alexis Roos (@alexisroos), Engineering manager / Lead Data Science Engineer, Radius Intelligence
  2. 2. AGENDA: • Radius Intelligence • Objectives and terminology • Business graph data pipeline • Lessons learned • Q&A
  3. 3. RADIUS INTELLIGENCE: Predictive marketing software. Radius transforms the way B2B marketers discover markets, acquire customers, and measure success. Our software is powered by the Radius Intelligence Cloud, a proprietary data science engine that provides predictive analytics, powerful segmentation, and seamless integrations. (Radius Messaging Framework, June 8, 2015)
  4. 4. RADIUS INTELLIGENCE: ➔ Founded in 2009 ➔ 125+ employees (and growing) ➔ Headquartered in San Francisco, CA ➔ $125 million in funding to date ➔ Cutting edge in data science and data engineering technologies
  5. 5. BUSINESS GRAPH OBJECTIVE: Produce an accurate and comprehensive Business Graph from records coming from dozens of sources comprising hundreds of data points such as: …
  6. 6. BUSINESS GRAPH BENEFITS: Answer complex connected questions such as:
     • Employees: Find engineering executives at a given company focusing on Big Data or security
     • News: Show news related to companies in my marketing segment, such as funding announcements or mergers and acquisitions
     • Intent: Show companies with a buying intent for my product, or engineering managers looking for a security product
     • Organization: Show all locations or employees for a given business
     • etc.
  7. 7. BUSINESS GRAPH TERMINOLOGY:
     o Business: a company or organization (ex: Blue Bottle company)
     o Location: a business at a particular address (ex: Blue Bottle at Sansome St, San Francisco)
     o Attributes: information associated with a business and/or location (ex: address, website, phone number, etc.)
     o Additional data types (ex: Employees, Intent, News, Technologies, etc.)
  8. 8. BUSINESS GRAPH DATA PIPELINE (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction). Data acquisition: • Crawling • Licensing. Dozens of sources, 7B+ records, 50B+ data points. Output: Raw Records
  9. 9. BUSINESS GRAPH DATA PIPELINE (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction). Data preparation: • Standardization • Validation • Normalization. Input: Raw Records. Output: Normalized Records
  10. 10. BUSINESS GRAPH DATA PIPELINE (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction). Clustering: cluster records on: • Address • Name of the business • Websites • Phone numbers • Employees • Description • Category codes • Etc. Input: Normalized Records. Output: Clusters of records by address/business
  11. 11. BUSINESS GRAPH DATA PIPELINE (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction: Locations, then Businesses & more). Construct Locations (uses MLlib):
     • Filter bad clusters
     • Use all records in a cluster to create a Location with attributes:
       o Select best value(s)
       o Score values: confidence based on specific record features, including historical ones
       o Impute missing values
     • Etc.
     Input: Clusters. Output: Locations
  12. 12. BUSINESS GRAPH DATA PIPELINE (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction: Businesses & more). Construct business & more (uses MLlib):
     • Create business and link locations
     • Add business-level info: news, employees, intent, etc.
     • Correct/impute fields at business level
     • Etc.
     Input: Locations. Output: Business graph
  13. 13. SPARK / GRAPHX: GraphX models a “property graph”, which is a directed multigraph with attributes attached to vertices and edges.
     class Graph[VD, ED] {
       val vertices: VertexRDD[VD]
       val edges: EdgeRDD[ED]
       val triplets: RDD[EdgeTriplet[VD, ED]]
     }
     VertexRDD[VD] extends RDD[(VertexId, VD)]
     EdgeRDD[ED] extends RDD[Edge[ED]]
     VD - vertex data attribute type; ED - edge data attribute type
     Example vertices: (1, Location), (2, Domain), (3, Phone). Example edges: (1, 2, source1), (1, 2, source2)
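The property graph on this slide can be mimicked with plain Scala collections. This is a minimal sketch, not GraphX itself: `Map` and `Seq` stand in for `VertexRDD` and `EdgeRDD` so that it runs without Spark, and the object and class names are illustrative assumptions. The vertex and edge values are the ones shown on the slide.

```scala
// Plain-Scala stand-in for a property graph (assumption: Map/Seq replace RDDs).
object PropertyGraphSketch {
  type VertexId = Long

  // An edge carries its own attribute, like an Edge[ED] in GraphX.
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  // A triplet is an edge joined with the attributes of both endpoints.
  final case class Triplet[VD, ED](srcAttr: VD, dstAttr: VD, attr: ED)

  final case class Graph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]]) {
    // Assumes every edge endpoint is present in `vertices`.
    def triplets: Seq[Triplet[VD, ED]] =
      edges.map(e => Triplet(vertices(e.srcId), vertices(e.dstId), e.attr))
  }

  // Values from the slide: three vertices and two parallel edges 1 -> 2,
  // which is what makes this a multigraph.
  val graph: Graph[String, String] = Graph(
    Map(1L -> "Location", 2L -> "Domain", 3L -> "Phone"),
    Seq(Edge(1L, 2L, "source1"), Edge(1L, 2L, "source2"))
  )
}
```

The two edges share the same endpoints and differ only in their attribute, so the source of each link is kept rather than collapsed into a single edge.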
  14. 14. SPARK / GRAPHX: Graph operations are like RDD operations: functional and lazily evaluated.
     Operations from Graph and GraphOps:
     • Information: numEdges, numVertices, inDegrees, outDegrees, degrees
     • Collections: vertices, edges, triplets
     • Transformations and modifications: mapVertices, mapEdges, mapTriplets, filter, reverse, subgraph, mask, groupEdges, convertToCanonicalEdges
     • Join: joinVertices, outerJoinVertices
     • Aggregations: collectNeighbors, collectNeighborIds, aggregateMessages (parameters: sendMsg, mergeMsg, tripletFields)
     • Caching and partitioning: persist, cache, checkpoint, unpersist, unpersistVertices, partitionBy
     • Graph algorithms: pregel, pageRank, connectedComponents, triangleCount
     And data operations on VertexRDD and EdgeRDD:
     • VertexRDD: filter, mapValues, minus, diff, innerJoin, leftJoin, aggregateUsingIndex
     • EdgeRDD: mapValues, reverse, innerJoin
  15. 15. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN (Data acquisition ➔ Data preparation ➔ Clustering ➔ Construction: Businesses & more). Construct business graph:
     • Create business and link locations
     • Add business-level info: news, employees, intent, etc.
     • Correct/impute fields at business level
     • Etc.
     Input: Locations. Output: Business graph
  16. 16. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 1. Create a vertex for each location and each known business. Clusters ➔ map ➔ VertexRDD[Location]; Curated list ➔ map ➔ VertexRDD[Business]
  17. 17. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 2. Link Locations to known Businesses via relationships (such as domain). Locations ➔ flatMap attributes; Businesses ➔ flatMap attributes; join ➔ create Location-to-Business edges
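The flatMap-then-join flow of step 2 can be sketched as follows. This is an illustration under assumptions, not the pipeline from the talk: plain `Seq` replaces RDDs, the case classes and the `linkByDomain` helper are hypothetical, and domain is used as the only linking relationship.

```scala
// Sketch of step 2: link locations to known businesses that share a domain.
object LinkLocationsSketch {
  final case class Location(id: Long, name: String, domains: Seq[String])
  final case class Business(id: Long, name: String, domains: Seq[String])
  final case class Edge(srcId: Long, dstId: Long, attr: String)

  def linkByDomain(locations: Seq[Location], businesses: Seq[Business]): Seq[Edge] = {
    // flatMap: one (domain, id) pair per domain on each side.
    val locByDomain = locations.flatMap(l => l.domains.map(d => d -> l.id))
    val bizByDomain = businesses.flatMap(b => b.domains.map(d => d -> b.id)).groupBy(_._1)
    // join on domain: every match becomes a Location -> Business edge.
    for {
      (domain, locId) <- locByDomain
      (_, bizId) <- bizByDomain.getOrElse(domain, Seq.empty)
    } yield Edge(locId, bizId, domain)
  }
}
```

For example, a location with the made-up domain `bluebottle.example` is linked to the curated business carrying the same domain, and locations with no matching domain produce no edge.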
  18. 18. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 3. Create a vertex for each unique top attribute, such as address, phone, domain, etc., and link locations to these attributes. Locations ➔ flatMap attributes ➔ groupBy ➔ create vertices and edges (Location and attributes)
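Step 3 can be illustrated with the same flatMap and groupBy shape on plain collections. This is a sketch under assumptions: the case classes, the `build` helper, and the id scheme (attribute vertex ids allocated from `firstAttrId` upward) are hypothetical and not taken from the talk.

```scala
// Sketch of step 3: one vertex per unique attribute, plus Location -> attribute edges.
object AttributeVerticesSketch {
  // attributes are (kind, value) pairs, e.g. ("phone", "555-0100")
  final case class Location(id: Long, attributes: Seq[(String, String)])
  final case class AttributeVertex(id: Long, kind: String, value: String)
  final case class Edge(srcId: Long, dstId: Long)

  def build(locations: Seq[Location], firstAttrId: Long): (Seq[AttributeVertex], Seq[Edge]) = {
    // flatMap: one (attribute, locationId) pair per attribute occurrence.
    val pairs = locations.flatMap(l => l.attributes.map(a => a -> l.id))
    // groupBy: one group per unique attribute; sorted so ids are deterministic.
    val indexed = pairs.groupBy(_._1).toSeq.sortBy(_._1).zipWithIndex
    val vertices = indexed.map { case (((kind, value), _), i) =>
      AttributeVertex(firstAttrId + i, kind, value)
    }
    val edges = indexed.flatMap { case ((_, members), i) =>
      members.map { case (_, locId) => Edge(locId, firstAttrId + i) }
    }
    (vertices, edges)
  }
}
```

Two locations that share a domain end up pointing at the same attribute vertex, which is what later lets them exchange messages through it.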
  19. 19. SPARK / GRAPHX: Aggregate messages with sendMsg and mergeMsg. sendMsg: (srcID, name); mergeMsg: x ++ y. Location 1 and Location 2 each send a message (sendMsg) to their shared Domain vertex, where the messages are merged (mergeMsg) into a Seq.
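The send-then-merge pattern on this slide can be emulated on plain collections. This is a simplified sketch, not the GraphX `aggregateMessages` signature: here `sendMsg` takes a bare edge and returns one message for the destination vertex, and the location names and vertex ids are made-up sample data.

```scala
// Emulation of the sendMsg / mergeMsg pattern (assumption: Seq and Map replace RDDs).
object AggregateMessagesSketch {
  final case class Edge(srcId: Long, dstId: Long)

  // Each edge sends one message to its destination; messages arriving at the
  // same vertex are combined pairwise with mergeMsg.
  def aggregate[A](edges: Seq[Edge])(sendMsg: Edge => A, mergeMsg: (A, A) => A): Map[Long, A] =
    edges
      .map(e => e.dstId -> sendMsg(e))
      .groupBy(_._1)
      .map { case (dst, msgs) => dst -> msgs.map(_._2).reduce(mergeMsg) }

  // Sample data: locations 1 and 2 both point at domain vertex 10.
  val names: Map[Long, String] = Map(1L -> "Blue Bottle", 2L -> "Blue Bottle Coffee")
  val edges: Seq[Edge] = Seq(Edge(1L, 10L), Edge(2L, 10L))

  // sendMsg: Seq((srcId, name)); mergeMsg: x ++ y, as on the slide.
  val locationsPerDomain: Map[Long, Seq[(Long, String)]] =
    aggregate[Seq[(Long, String)]](edges)(e => Seq(e.srcId -> names(e.srcId)), _ ++ _)
}
```

After aggregation the domain vertex holds the list of locations that reference it, which is the input for pairing locations in the next step.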
  20. 20. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 4. Create location-to-location edges using attribute relationships and graph cutting. Locations and attributes ➔ aggregateMessages ➔ .combinations(2) ➔ .filter(isSimilar)* ➔ Locations. * such as name matching
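Step 4 can be sketched for a single attribute group as below. The talk only says that `isSimilar` is something like name matching, so the prefix-based rule here is a deliberately crude, hypothetical stand-in, and the helper names are assumptions.

```scala
// Sketch of step 4: pair up locations that share an attribute, keep similar pairs.
object LocationEdgesSketch {
  def normalize(name: String): String =
    name.toLowerCase.replaceAll("[^a-z0-9]", "")

  // Hypothetical similarity rule: one normalized name is a prefix of the other.
  def isSimilar(a: String, b: String): Boolean = {
    val (x, y) = (normalize(a), normalize(b))
    x.nonEmpty && y.nonEmpty && (x.startsWith(y) || y.startsWith(x))
  }

  // group: the (locationId, name) pairs collected at one attribute vertex.
  def locationEdges(group: Seq[(Long, String)]): Seq[(Long, Long)] =
    group.combinations(2).collect {
      case Seq((idA, nameA), (idB, nameB)) if isSimilar(nameA, nameB) => (idA, idB)
    }.toList
}
```

Dropping pairs that fail the similarity check is the "graph cutting" part: two locations that merely share a phone number or address do not automatically end up in the same business.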
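  21. 21. PERFORMANCE TIP: Note: .combinations(2) is a bottleneck. For a group of N locations, .combinations(2) produces O(N^2) pairs, whereas a smarter combination produces O(N) pairs. [Diagrams: five nodes linked by all pairs versus by a linear number of links]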
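The difference in pair counts can be checked directly. The slide does not spell out what the "smarter combination" is, so the chain built with `sliding(2)` below is only one possible scheme, shown as an assumption. It keeps a group connected with N - 1 edges, but it is a safe replacement only when every member of the group is meant to land in the same component.

```scala
// Comparing all-pairs linking with a linear chain (assumed alternative).
object PairingSketch {
  // All pairs: n * (n - 1) / 2 edges for a group of n ids.
  def allPairs(ids: Seq[Long]): Seq[(Long, Long)] =
    ids.combinations(2).collect { case Seq(a, b) => (a, b) }.toList

  // Chain: n - 1 edges, and the group still forms one connected component.
  def chain(ids: Seq[Long]): Seq[(Long, Long)] =
    ids.sliding(2).collect { case Seq(a, b) => (a, b) }.toList
}
```

For the five nodes drawn on the slide, `allPairs` yields 10 edges and `chain` yields 4. Since connected components only needs reachability, both give the same grouping for that group.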
  22. 22. SPARK / GRAPHX: Connected components. Locations ➔ connectedComponents ➔ Group 1, Group 2
  23. 23. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 5. Call connected components on linked locations and create businesses. Locations ➔ connectedComponents ➔ create Business vertices and Location-to-Business edges
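Step 5 can be sketched without Spark by propagating the smallest vertex id through each component. GraphX's `connectedComponents` likewise labels every vertex with the lowest vertex id in its component, but the loop below is a simple single-machine stand-in and not the GraphX implementation; the `businesses` helper is a hypothetical name.

```scala
// Sketch of step 5: connected components by min-label propagation.
object ConnectedComponentsSketch {
  // Returns vertexId -> smallest vertex id in its component.
  // Assumes every edge endpoint is contained in `vertices`.
  def connectedComponents(vertices: Set[Long], edges: Seq[(Long, Long)]): Map[Long, Long] = {
    var labels: Map[Long, Long] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) { // iterate until no label changes
      changed = false
      for ((a, b) <- edges) {
        val m = math.min(labels(a), labels(b))
        if (labels(a) != m || labels(b) != m) {
          labels = labels + (a -> m) + (b -> m)
          changed = true
        }
      }
    }
    labels
  }

  // Each component becomes one business, keyed by the component label.
  def businesses(labels: Map[Long, Long]): Map[Long, Set[Long]] =
    labels.groupBy(_._2).map { case (root, members) => root -> members.keySet }
}
```

With locations 1 to 5 and similarity edges (1,2), (2,3), (4,5), this produces two businesses: one holding locations 1, 2, 3 and one holding locations 4, 5.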
  24. 24. PERFORMANCE TIP: The GraphX connected components implementation already performs optimizations:
     – Incremental updates to mirror caches
     – Join elimination
     – Index scanning for active sets
     – Local vertex and edge indices
     – Index and routing table reuse
     But the algorithm is iterative, so optimization matters:
     • Vertices and edges should be in memory: it is more efficient to subset / keep the graph minimal, or to filter and rejoin later
     • Checkpointing improves performance (less lineage)
  25. 25. GRAPH DATA PIPELINE CONSTRUCTION ZOOM IN: 6. More steps: • Assign headquarters • Join additional information: Employees, News, Intent, etc. • Further corrections / imputations: attributes, revenues, etc. Location ➔ Business ➔ more operations ➔ Business graph
  26. 26. GRAPHICAL VIEW: From Locations with attributes to Locations with Business
  27. 27. GRAPHICAL VIEW: From Locations with attributes to Locations with Business
  28. 28. LESSONS LEARNED:
     • Using a graph allows us to naturally model business relationships and append new data types over time. It also allowed us to debug data!
     • GraphX scales: 250M vertices and 1B edges in clustering
     • Creating and updating the graph is easy using RDD operations
     • Edge cutting is an art as much as a science
     • Connected components is an expensive operation
  29. 29. THANK YOU. WE ARE HIRING! Alexis Roos (@alexisroos)