Datafiniti: The Internet in a Database - Cassandra Use Case

Austin Cassandra Users Meetup on July 15th 2013: http://www.meetup.com/Austin-Cassandra-Users/events/125837112/

Datafiniti will be presenting on some of the unique and interesting challenges they've faced when trying to build out their data search engine, including a detailed use case around their Cassandra data model and other integrated technologies like Solr.

Published in: Technology, Design

  1. The Internet in a Database: A Cassandra Use Case
  2. Data on the Web
     DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
     ● 48 billion pages on the Internet
     ● 56 million GB of data
     ● Incredibly powerful connections
     ● 70% of useful data is unstructured
     ● User generated data + facts
  3. Too Much Data…
  4. Search
     ● Modern search engines
       ○ Unstructured data
       ○ Unconnected data
       ○ Unnormalized data
  5. Goals
     ○ Collect vast amounts of data through web crawling
     ○ Normalize and deduplicate data
     ○ Make it searchable and meaningful
  6. Needs
     ● Speed
     ● Scale
     ● Adaptable
  7. The Solution
     ● Very fast
       ○ Log-structured storage
     ● Easily scalable
       ○ Decentralized rings
     ● Completely adaptable
       ○ Schema-less key/value store
  8. …Almost
     ● Useful searching was missing
       ○ Secondary indexes not flexible
       ○ No free text searches
       ○ No (reasonable) range queries
  9. What We Needed
     ● Pros: Full control over indexing
     ● Cons: Not scalable
  10. Putting It All Together
     ● Reasons to go with DSE
       ○ Combines Cassandra and Solr
       ○ Constant refinements and integrations
       ○ Support
  11. Our Stack
     [Diagram labels: Web Crawling, Normalization, three Cassandra + Solr nodes, Load Balancing]
  12. Cassandra / Solr Setup
     ● 3 column families / 3 cores
       ○ Locations
       ○ Products
       ○ People
     ● 73,114,909 records
  13. Data: Locations
     ● 29,818,644 records
     ● Interesting data
       ○ Reviews
       ○ Revenue
       ○ Contact information
     ● Businesses vs. Locations
       ○ Unique key
       ○ Location-specific user data
  14. Data: Products
     ● 18,470,005 records
     ● Interesting data
       ○ Categories
       ○ Price
       ○ Reviews
     ● Challenges
       ○ Too many unique keys
  15. Data: People
     ● 24,826,260 records
     ● Interesting data
       ○ Work history
       ○ Education history
       ○ Location
     ● Challenges
       ○ Normalization
       ○ Identification
  16. Challenges
     ● Memory
     ● Speed
     ● Space
     ● Representation
  17. Challenges: Memory
     ● Multi-minute garbage collection
     ● Exponential increase in frequency
     ● Virtual memory confusion
     ● Solr + Cassandra
     ● Heap Size vs Buffer Cache
     ● Bash Scripts
  18. Challenges: Speed
     ● Upgrade
       ○ Better memory management
       ○ Smaller index size
     ● Reduce index size
     ● Future: Solaris
  19. Challenges: Speed
     ● Providing a real-time service
     ● Issues
       ○ Solr not inherently real time
       ○ Search speeds
       ○ I/O
  20. Challenges: Speed
     ● Solr Solution: DSE integration leverages
       ○ Cassandra's speed
       ○ Cassandra's caches
       ○ Cassandra's distribution
       ○ Solr caches less useful
  21. Challenges: Speed
     ● Search complexity solution
       ○ Text vs String indexing
       ○ Uniqueness vs Flexibility
       ○ Leveraging Cassandra
  22. Challenges: Speed
     ● I/O Solution
       ○ Cassandra's built-in mapping
       ○ Increase disk access speeds (SSDs)
         ■ Not cost effective
       ○ Future: Solaris
  23. Challenges: Space
     ● Field corruption
       ○ Caused by improper encoding
       ○ Exponential growth
       ○ Fills up Solr index
     ● Locate, inspect & remove corrupt records
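The slides do not say how the encoding problem arose, so the following is only a minimal sketch of one common cause, assumed here for illustration: text that is UTF-8 encoded and then wrongly decoded as Latin-1 on every pass through a pipeline. Each pass doubles the length of the non-ASCII part of a field, which matches the "exponential growth" symptom. The names `mangle`, `looks_mis_encoded`, and `find_corrupt`, and the `max_len` threshold, are hypothetical.

```python
def mangle(text, rounds):
    """Simulate the assumed bug: UTF-8 bytes wrongly decoded as Latin-1."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


def looks_mis_encoded(text):
    """Heuristic: True if reversing one Latin-1/UTF-8 mix-up changes the text."""
    try:
        repaired = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # the field does not fit this particular corruption pattern.
        return False
    return repaired != text


def find_corrupt(records, max_len=10_000):
    """Yield (key, field) for string fields that look corrupt or oversized."""
    for key, fields in records.items():
        for name, value in fields.items():
            if isinstance(value, str) and (
                len(value) > max_len or looks_mis_encoded(value)
            ):
                yield key, name


if __name__ == "__main__":
    records = {
        "loc-1": {"name": "café"},             # clean
        "loc-2": {"name": mangle("café", 3)},  # corrupted three times over
    }
    for key, field in find_corrupt(records):
        print(key, field, len(records[key][field]))
```

The heuristic only flags candidates; per the slide, flagged records would still need to be inspected before removal.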
  24. Challenges: Space
     ● Solr index issue
       ○ No compression (vs Cassandra)
       ○ Must adjust indexing
     ● Key things to keep in mind
       ○ Size of fields
       ○ Scale vs Flexibility
       ○ Index as little as possible
  25. Challenges: Representation
     ● Cassandra is flat
     ● Actual data is not flat
       ○ Reviews
       ○ Price information
     ● Many different output formats
       ○ CSV, JSON, XML, etc.
  26. Challenges: Representation
     ● Solution: Flatten when possible
       ○ E.g. Address object -> Separate fields
     ● Internal subgroup representation
       ○ Composite keys (occasionally)
         ■ Known subgroups
         ■ Non multiple subgroups
       ○ Dynamic fields
         ■ Composite field + Dynamic tag
         ■ E.g. review.text_<tag>
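The flattening idea on slide 26 can be sketched as follows. This is an illustration under stated assumptions, not Datafiniti's implementation: the function name `flatten_record`, the use of an `id` member (or the list position) as the dynamic tag, and the sample record are all hypothetical. Only the naming pattern `review.text_<tag>` comes from the slide.

```python
def flatten_record(record):
    """Flatten one nested record into a single-level dict of column names."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup (e.g. an address object):
            # one separate column per member.
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list) and all(isinstance(v, dict) for v in value):
            # Repeating subgroup (e.g. reviews): composite field + dynamic tag.
            for index, item in enumerate(value):
                tag = item.get("id", index)  # assumed tag source
                for sub_key, sub_value in item.items():
                    if sub_key != "id":
                        flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            # Scalars (and lists of scalars) are stored unchanged.
            flat[key] = value
    return flat


if __name__ == "__main__":
    location = {
        "name": "Cafe Uno",
        "address": {"street": "1 Main St", "city": "Austin"},
        "review": [
            {"id": "r1", "text": "Great coffee", "rating": 5},
            {"id": "r2", "text": "Slow service"},
        ],
    }
    for column, value in flatten_record(location).items():
        print(column, "=", value)
```

Because every review contributes its own column names, rows can differ in width, which a schema-less column family tolerates.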
  27. Challenges: Representation
     ● Robust and adaptable conversion package
     ● JSON -> Internal
       ○ Solr returns JSON
     ● Internal -> CSV, JSON, XML
       ○ User defined views
       ○ Specify field groupings
       ○ Specify partitioning
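A conversion step of the kind slide 27 describes might look like the sketch below. It assumes flat, dot-separated column names and models a "user defined view" as a plain list of column names; both are assumptions, as are the function names and the sample payload. The `response` / `docs` layout is the standard shape of a Solr JSON query response.

```python
import csv
import io
import json


def docs_from_solr(payload):
    """JSON -> internal: pull the flat documents out of a Solr JSON response."""
    return json.loads(payload)["response"]["docs"]


def to_csv(rows, view):
    """Internal -> CSV: keep only the columns named in `view`, in that order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=view, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows, view):
    """Internal -> JSON: regroup 'group.member' columns into nested objects."""
    documents = []
    for row in rows:
        doc = {}
        for field in view:
            if field not in row:
                continue
            group, _, member = field.partition(".")
            if member:
                doc.setdefault(group, {})[member] = row[field]
            else:
                doc[field] = row[field]
        documents.append(doc)
    return json.dumps(documents, sort_keys=True)


if __name__ == "__main__":
    payload = json.dumps({
        "response": {
            "numFound": 1,
            "start": 0,
            "docs": [{"name": "Cafe Uno", "address.city": "Austin", "revenue": 1200}],
        }
    })
    rows = docs_from_solr(payload)
    view = ["name", "address.city"]  # hypothetical user-defined view
    print(to_csv(rows, view))
    print(to_json(rows, view))
```

Keeping the view separate from the converters means a new output format needs only one more function that walks the same column list.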
  29. Future Work
     ● Memory Usage
     ● Speed
     ● Space
     ● Containers
  30. Future Work: Memory
     ● Java 7 G1 (Garbage First) Collector
       ○ Ideal for large heaps
       ○ Big data sets
       ○ Bursty workloads
  31. Future Work: Speed
     ● Solaris Kernel Scheduler > Linux Kernel Scheduler
       ○ (At large numbers of cores)
     ● Drastically increase IOPS
       ○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
       ○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
       ○ Reduce needed size of SSD
         ■ More, smaller SSDs in ZFS pool
       ○ Fewer moving parts
  32. Future Work: Space
     ● Caching at PCIe, storing on SATA III
       ○ Cheaper, larger storage via ZFS pools
       ○ Easier to grow
     ● ZFS Compression (LZ4)
       ○ Replaces Cassandra's Snappy compression
       ○ Very fast lossless compression (400 MB/s per core)
       ○ Scales to multiple CPUs
       ○ Hits the RAM speed limit
  33. Future Work: Containers
     ● OS-level virtualization
       ○ Resource control
       ○ Boundary separation
     ● More control over Cassandra resources
     ● Better snapshots (whole machine)
     ● Hardware abstracted out
       ○ Many disks represented as single space
       ○ Easily add or remove hardware
  34. Questions?
     https://www.datafiniti.net
     http://blog.datafiniti.net
     @datafiniti
  35. Addendum 1: ZFS Comparison
     Name              Ratio   Compression (MB/s)   Decompression (MB/s)
     LZ4 (r97)         2.084   410                  1810
     LZO 2.06          2.106   409                  600
     QuickLZ 1.5.1b6   2.237   373                  420
     Snappy 1.1.0      2.091   323                  1070
     LZF               2.077   270                  570
     zlib 1.2.8 -1     2.730   65                   280
     LZ4 HC (r97)      2.720   25                   2040
     zlib 1.2.8 -6     3.099   21                   300
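The ratio versus speed trade-off in the addendum can be explored on your own data with a short measurement script. This sketch covers only zlib levels 1 and 6 (the two zlib rows in the table), because zlib ships with Python's standard library; LZ4, Snappy, and the others would need third-party packages and are not shown. The `measure` helper and the sample data are hypothetical, and the numbers it prints depend on the machine and input, so they will not match the table.

```python
import time
import zlib


def measure(data, level):
    """Return compression ratio and throughput (MB/s) for one zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero timer
    if zlib.decompress(packed) != data:
        raise ValueError("round trip failed")
    return {
        "level": level,
        "ratio": len(data) / len(packed),
        "compress_mb_s": len(data) / elapsed / 1e6,
    }


if __name__ == "__main__":
    # Synthetic, repetitive sample; real crawl data will compress differently.
    sample = b"".join(
        b"product-%d,category-%d,%d.99\n" % (i, i % 50, i % 300)
        for i in range(200_000)
    )
    for level in (1, 6):
        result = measure(sample, level)
        print(
            "zlib -%d  ratio %.3f  %.0f MB/s"
            % (result["level"], result["ratio"], result["compress_mb_s"])
        )
```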
