
Tackling a 1 billion member social network



Enabling blazing fast search on a 1 billion member social network demands special weapons and tactics. At Evojam we took advantage of the Scala ecosystem and modern tools to handle persistence and processing of such a data set.

A case study focused on the client's requirements and the tools we used to meet them. Thoughts, injuries and ideas from a real-life fast & big data challenge, brought to you directly from the trenches.

This talk was given at the Scalar 2016 conference.



  1. Scala SWAT Artur Bańkowski artur@evojam.com @abankowski TACKLING A 1 BILLION MEMBER SOCIAL NETWORK This is a story about our participation in one of our customers' projects.
  2. Previous Experience: ~10 million records, ~60 million documents, a ~60 million vertex graph (non-commercial). The datasets I had previously worked on were much smaller! This time we were about to deal with a much bigger social network, which can easily be represented as a graph. We were going to join the team…
  3. Graph structure [diagram: User and Company vertices connected by "knows" and "worked" edges] 2 types of vertices (Users and Companies), 2 types of relations (knows and worked).
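For orientation, a minimal Scala sketch of this model; the case classes below are purely illustrative and not code from the project:

      // Two vertex types and two relation types, as on the slide.
      sealed trait Vertex
      final case class User(id: Long, name: String) extends Vertex
      final case class Company(id: Long, name: String) extends Vertex

      sealed trait Relation
      final case class Knows(a: Long, b: Long) extends Relation               // User -[knows]- User
      final case class Worked(userId: Long, companyId: Long) extends Relation // User -[worked]- Company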
  4. The concept: provide a graph subset [diagram: the full graph on the left, the extracted subset around Foo Inc. on the right] A fulltext-searchable, sortable, filterable result for Foo Inc.
  5. The 1 billion challenge: 835,272,759 vertices (751,857,081 users, 83,415,678 companies), 6,956,990,209 relations. Subsets range from thousands to millions of users. When we finally got our hands on the data, we realized the data set was smaller than expected. What else did we find?
  6. Existing app workflow: one engineer-evening, multiple custom scripts, JSON files. Extract a subset (e.g. 3M out of 750M profiles); data duplication; only manually selected subsets available (60M profiles incl. duplicates). A manual process that takes a few days from the user's perspective. This was already being used by end customers.
  7. PoC - The Goal: handle 1 billion profiles automatically. Our primary goal.
  8. PoC - Definition of done • Graph traversable on demand • First results available under 1 minute • Entire subset ready in a few minutes • REST API with search
  9. PoC - Concept [diagram: API backed by a Graph DB and a Document DB] The whole dataset is stored in two engines: all 6,956,990,209 relations in the graph DB, all 751,857,081 profiles in the document DB.
  10. Why a graph DB? • Fast graph traversal • Easily extendable with new relations and vertices • Convenient algorithm description
  11. PoC - Flow [diagram: API, Graph DB, Document DB] 1. Generate the subset 2. Tag documents 3. Perform search with the tag as a filter. Generate the subset from the graph of user and company relations; tag the matching documents with a unique id in the database holding all profiles; search with that unique id as the primary filter.
  12. Weapon of choice [diagram: API with the graph DB and document DB still marked "?"] Scala and Play were a natural choice for the API app; some research was required for the databases.
  13. First Steps: Extraction. Extract anonymized profiles, companies and relations; clean up the data, sort it and generate input files. A few days to pull, streaming with Akka and Slick. To do the research we had to get our hands on the real data.
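A minimal sketch of that kind of extraction stream, assuming a Slick 3 relational source with a hypothetical profiles table and the akka-stream 2.0-era FileIO API (all names below are illustrative):

      import java.io.File
      import akka.actor.ActorSystem
      import akka.stream.ActorMaterializer
      import akka.stream.scaladsl.{FileIO, Source}
      import akka.util.ByteString
      import slick.driver.PostgresDriver.api._

      // Hypothetical Slick table holding the source profiles.
      class Profiles(tag: Tag) extends Table[(Long, String)](tag, "profiles") {
        def id   = column[Long]("id")
        def name = column[String]("name")
        def *    = (id, name)
      }

      object ExtractProfiles extends App {
        implicit val system = ActorSystem("extract")
        implicit val mat    = ActorMaterializer()

        val db       = Database.forConfig("source-db") // hypothetical config key
        val profiles = TableQuery[Profiles]

        // Stream rows reactively out of the database and write them as CSV lines.
        Source.fromPublisher(db.stream(profiles.result))
          .map { case (id, name) => ByteString(s"$id,$name\n") }
          .runWith(FileIO.toFile(new File("profiles.csv")))
      }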
  14. First Steps: Pushing forward. Push profiles to the document DB for searching; push vertices and relations to the graph DB for traversal. Two loading tools, highly dependent on the DB engines.
  15. Fulltext Searchable Document DB • Mature • Horizontally scalable • Fast indexing (~3k documents per second on a single node) • Well documented • With Scala libraries: https://github.com/sksamuel/elastic4s and https://github.com/evojam/play-elastic4s. We already had significant experience scaling Elasticsearch to 80 million documents.
  16. Which graph DB? We considered three engines; OrientDB and Neo4j in their community editions.
  17. Goodbye Titan • Performed well in the ~60M-vertex tests • Fast traversal • Development has been put on hold?
  18. Goodbye OrientDB • No batch insertion, slow initial load • Stalled writes after 200 million vertices with relations • Horror stories on the internet
  19. Neo4j FTW (https://github.com/AnormCypher/AnormCypher) • Fast • Already known • Convenient in Scala with AnormCypher • Offline bulk loading • Result streaming
  20. Weapon of choice [diagram: API, Graph DB, Document DB] The final stack :)
  21. Final Setup on AWS [diagram: Neo4j, the API and a two-node ES cluster (hr-1, hr-2); instance types shown: i2.xlarge (4 vCPU, 30.5GB), i2.2xlarge (8 vCPU, 61GB), m4.large (2 vCPU, 8GB)] ES: 2 nodes, 2 indexes, 10 shards per index, 0 replicas.
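A hedged sketch of creating one of those indexes with the settings from the slide (10 shards, 0 replicas) through Elasticsearch's HTTP API; the index name and the use of Play WS are assumptions for illustration:

      import play.api.libs.ws.WSClient
      import scala.concurrent.{ExecutionContext, Future}

      // Create the "users" index with 10 shards and no replicas, as on the slide.
      def createUsersIndex(ws: WSClient, esUrl: String)
                          (implicit ec: ExecutionContext): Future[Unit] = {
        val settings = """{"settings":{"number_of_shards":10,"number_of_replicas":0}}"""
        ws.url(s"$esUrl/users").put(settings).map(_ => ())
      }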
  22. Step #1 - Bulk loading into Neo4j. Importing the contents of these files into data/graph.db. […] IMPORT DONE in 3h 24m 58s 140ms. Imported: 835273352 nodes, 6956990209 relationships, 0 properties. It took 12 hours on an Amazon instance half this size.
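That log comes from Neo4j's offline bulk import tool, which consumes node and relationship CSV files with typed headers. A tiny illustrative sketch of producing such files from Scala (field names and values are made up):

      import java.io.{File, PrintWriter}

      // Node CSV: an :ID column, plain property columns and a :LABEL column.
      val nodes = new PrintWriter(new File("users.csv"))
      nodes.println("id:ID,name,:LABEL")
      nodes.println("42,Jane Doe,User")
      nodes.close()

      // Relationship CSV: :START_ID, :END_ID and the relationship :TYPE.
      val rels = new PrintWriter(new File("knows.csv"))
      rels.println(":START_ID,:END_ID,:TYPE")
      rels.println("42,43,knows")
      rels.close()

      // Loaded offline with something along the lines of:
      //   neo4j-import --into data/graph.db --nodes users.csv --relationships knows.csv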
  23. Step #2 - Bulk loading into ES, grouped insert: 1. Create a Source from the CSV file, framed by newlines 2. Decode the id from the ByteString, generate the user JSON 3. Group by the bulk size (e.g. 4500) 4. Throttle 5. Execute the bulk insert into Elasticsearch. Akka advantage: CPU utilization, parallel data enrichment, human-readable code.
  24. Step #2 - Bulk loading into ES
      FileIO.fromFile(new File(sourceOfUserIds))
        .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024, allowTruncation = false))
        .mapAsyncUnordered(16)(prepareUserJson)
        .grouped(4500)
        .throttle(elements = 2, per = 2 seconds, maximumBurst = 2, mode = ThrottleMode.Shaping)
        .mapAsyncUnordered(2)(executeBulkInsert)
        .runWith(Sink.ignore)
      Flow description with Akka Streams
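The two helpers referenced above are elided on the slide. A hedged sketch of what they might look like; this version goes straight to Elasticsearch's _bulk HTTP endpoint via Play WS instead of elastic4s, and the index, type and field names are illustrative:

      import akka.util.ByteString
      import play.api.libs.ws.WSClient
      import scala.concurrent.{ExecutionContext, Future}

      class UserBulkLoader(ws: WSClient, esUrl: String)(implicit ec: ExecutionContext) {

        // Turn one CSV frame (a user id) into the JSON document to be indexed.
        def prepareUserJson(frame: ByteString): Future[(String, String)] = Future {
          val id = frame.utf8String.trim
          id -> s"""{"userId":"$id"}"""
        }

        // Build an NDJSON _bulk payload for one group and POST it to Elasticsearch.
        def executeBulkInsert(batch: Seq[(String, String)]): Future[Unit] = {
          val body = batch.map { case (id, json) =>
            s"""{"index":{"_index":"users","_type":"profile","_id":"$id"}}""" + "\n" + json
          }.mkString("", "\n", "\n")
          ws.url(s"$esUrl/_bulk").post(body).map(_ => ())
        }
      }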
  25. Step #3 - Tagging [diagram: the Foo Inc. subset being selected out of the full graph]
      MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(acquaintance:User)
      WHERE ID(c)={foo-inc-id} AND NOT (acquaintance)-[:worked]-(c)
      RETURN DISTINCT ID(acquaintance)
      Neo4j traversal query in Cypher
  26. Akka Streams to the rescue
      val idEnum: Enumerator[CypherRow] = ???
      val src = Source.fromPublisher(Streams.enumeratorToPublisher(idEnum))
        .map(_.data.head.asInstanceOf[BigDecimal].toInt)
        .via(new TimeoutOnEmptyBuffer())
        .map(UserId(_))
        .mapAsyncUnordered(parallelism = 1)(id => dao.tagProfiles(id, companyId))
      A readable flow; buffering protects Neo4j when indexing is too slow; the timeout works around a bug in the underlying implementation.
  27. Bottleneck: 1.5 hours for a 3 million subset, that’s too long!
  28. Bulk update with Akka Streams tuning
      src.grouped(20000)
        .throttle(elements = 2, per = 6 seconds, maximumBurst = 2, mode = ThrottleMode.Shaping)
        .mapAsyncUnordered(parallelism = 1)(ids => dao.bulkTag(ids, ...))
      Bulk tagging in a few lines
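The bulkTag above is also elided. A hedged sketch of what such a bulk tagging step could do, again via Elasticsearch's _bulk endpoint with partial-document updates; index, type and field names are illustrative, and a real implementation would more likely append to a tag list (e.g. with a script) than overwrite a single field:

      import play.api.libs.ws.WSClient
      import scala.concurrent.{ExecutionContext, Future}

      final case class UserId(value: Int)

      class ProfileDao(ws: WSClient, esUrl: String)(implicit ec: ExecutionContext) {

        // Tag every profile in the batch with the company id in one _bulk request.
        def bulkTag(ids: Seq[UserId], companyTag: String): Future[Unit] = {
          val body = ids.map { id =>
            s"""{"update":{"_index":"users","_type":"profile","_id":"${id.value}"}}""" + "\n" +
            s"""{"doc":{"tag":"$companyTag"}}"""
          }.mkString("", "\n", "\n")
          ws.url(s"$esUrl/_bulk").post(body).map(_ => ())
        }
      }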
  29. Tagging Foo-Company: ~14 seconds until the first batch is tagged, ~7 minutes 11 seconds until all are tagged (reference implementation: a few hours). 2,222,840 profiles matching the criteria. A sample subset :)
  30. Step #5 - Search [diagram: the same AWS setup as on slide 21: Neo4j, API, two-node ES cluster] With the data ready in ES, the search implementation was pretty straightforward.
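A minimal sketch of that search, assuming the tag written in the previous step is the primary filter and the phrase is a fulltext match on top of it; it queries the _search endpoint directly, and the index and field names are illustrative (the project itself used elastic4s):

      import play.api.libs.ws.WSClient
      import scala.concurrent.{ExecutionContext, Future}

      class UserSearch(ws: WSClient, esUrl: String)(implicit ec: ExecutionContext) {

        // Cheap tag filter first, fulltext phrase match on top of it.
        def search(companyTag: String, phrase: String, size: Int = 50): Future[String] = {
          val query =
            s"""{
               |  "size": $size,
               |  "query": {
               |    "bool": {
               |      "filter": [ { "term":  { "tag":  "$companyTag" } } ],
               |      "must":   [ { "match": { "name": "$phrase" } } ]
               |    }
               |  }
               |}""".stripMargin
          ws.url(s"$esUrl/users/profile/_search").post(query).map(_.body)
        }
      }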
  31. Step #5 - Search benchmark. Fulltext search on the 2 million subset: GET /users?company=foo-company&phrase=John. 2000 phrases for the benchmark; the response is JSON with 50 profiles; searching within the 750 million profile database. Phrases based on real names and surnames used during profile enrichment; random request ordering.
  32. Step #5 - Search under siege. Response time ~0.14s with 50 concurrent users; constant latency, constant search rate.
  33. Objective achieved: search response time from the API ~0.14s; 14 seconds until the first batch is tagged; 7 minutes until the 2 million subset is ready.
  34. Summary (reference implementation vs PoC): manual vs automatic; a few days vs 14 seconds to the first results; a few days vs 7 minutes for the whole subset; ~40 million vs ~750 million profiles; no analytics vs GraphX ready. A tool ready for data scientists. We can implement core traversal modifications almost instantly.
  35. Summary. Try at home! Sample graphs ready to use: https://snap.stanford.edu/data/ Tools: http://neo4j.com/ • https://www.elastic.co/ • https://playframework.com/ • http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0/scala/stream-index.html Artur Bańkowski artur@evojam.com @abankowski
