Big Data Analysis Patterns - TriHUG 6/27/2013

Big Data Analysis Patterns: Tying real-world use cases to strategies for analysis using big data technologies and tools.

Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.

This session tackles big data analysis with a practical description of strategies for several classes of applications, each illustrated with a concrete use case. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, and Titan.

Published in: Technology


  1. Big Data Analysis Patterns, TriHUG 6/27/2013
  2. whoami • Brad Anderson • Solutions Architect at MapR (Atlanta) • ATLHUG co-chair • NoSQL East Conference 2009 • “boorad” most places (Twitter, GitHub) • banderson@maprtech.com
  3. BIG DATA
  4. (no text)
  5. Big Data is not new! But the tools are.
  6. The Good News in Big Data: “Simple algorithms and lots of data trump complex models” (Halevy, Norvig, and Pereira, Google; IEEE Intelligent Systems)
  7. The Challenge: So Many Solutions! What solutions fit your business problem? For example, do you need… • Apache Hadoop? • Apache Mahout? • Storm? • Apache Solr/Lucene? • Apache HBase (or MapR M7)? • Apache Drill (or Impala)? • d3.js or Tableau? • Node.js? • Titan?
  8. Ask a Different Question: It may be more useful to define the problem better by asking some of these questions: • How large is the data to be stored? • How large is the data to be queried? (the analysis volume) • What time frame is appropriate for your query response? • How fast is data arriving? (bursts or continuously?) • Are queries by sophisticated users? • Are you looking for common patterns or outliers? • How are your data sources structured?
  9. Picking the Best Solution: Your responses to these questions can help you better: • define the problem • recognize the analysis pattern to which it belongs • guide the choice of solutions to try. But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.
  10. Apache Solr/Lucene: Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries, including data such as • Full text • Geographical data • Statistically weighted data. Solr is a small data tool that has flourished in a big data world.
  11. Apache Mahout: Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems. Mahout algorithms are mainly used for • Recommendation (collaborative filtering) • Clustering • Classification. Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.
  12. Apache Drill • Google Dremel clone • Pluggable Query Languages – Starts with ANSI SQL 2003 – Hive, Pig, Cascading, MongoQL, … • Pluggable Storage Backends – Hadoop, HBase – MongoDB (BSON) – RDBMS? • Bypasses MapReduce
  13. Storm • Realtime Stream Computation Engine • Horizontal Scalability • Guaranteed Data Processing • Fault Tolerance • Higher level abstraction over: – Message Queues – Worker Logic • “The Hadoop of Realtime”
  14. Titan • Distributed Graph Database • Property Graph • Pluggable Backend Storage – HBase or M7 – Cassandra – Berkeley DB • Search Integrated – Solr/Lucene – Elasticsearch • Faunus – Graph traversals on subset – In-memory
  15. Using the Answers to Guide Your Choices: For simplicity, let’s focus on the first three questions: • How large is the data to be stored? • How large is the data to be queried? (the analysis volume) • What time frame is appropriate for your query response?
  16. Big Data Decision Tree (diagram; branch layout not recoverable from the transcript) • How big is your data? <10 GB, mid, >200 GB • Big storage • What size queries? Single element at a time, One pass over 100%, Multiple passes over big chunks • Response time? Streaming, < 100s (human scale), throughput not response • Outcomes labelled A, B, C, D, E
  17. Use Cases • Company • Data Shape • Technique(s) • Business Value
  18. Business Value
  19. Business Value
  20. Telecommunications Giant: ETL Offload
  21. Telecommunications: Data Shape • Lots of Data • Lots of Queries across Large Sets • Throughput important
  22. Telecommunications: Techniques • Analytics • ETL
  23. Telecommunications: Techniques • ETL (Hadoop) + Analytics (Teradata)
  24. Telecommunications: Business Value
  25. Credit Card Issuer
  26. Credit Card Issuer: Data Shape • Customer Purchase History (big) • Merchant Designations • Merchant Special Offers • Throughput important • Recommendations
  27. Search Abuse • Techniques: A Recommendation Engine with Mahout and Solr/Lucene • History matrix: one row per user, one column per thing
  28. Techniques: Recommendation based on co-occurrence • Co-occurrence gives item-item mapping • One row and column per thing
  29. Techniques: Co-occurrence matrix can also be implemented as a search index
  30. Techniques (indexing diagram) • Complete history • Co-occurrence (Mahout) • Item metadata • Solr Indexers • Index shards • 20 Hrs → 3 Hrs
  31. Techniques (search diagram) • User history • Item metadata • Solr Indexers • Index shards • Web tier • 8 Hrs → 3 Min
  32. Techniques (diagram) • Hadoop • Purchase History • Merchant Information • Merchant Offers • Recommendation Engine Results (Mahout) • Export (4 hrs) • Import (4 hrs) • Presentation Data Store (DB2) • Apps
  33. Techniques (diagram) • Hadoop • Purchase History • Merchant Information • Merchant Offers • Recommendation Engine Results (Mahout) • Index Update (3 min) • Recommendation Search Index (Solr) • Apps
  34. Business Value
  35. Waste & Recycling Leader: Idle Alerts
  36. Data Shape • Truck Geolocation Data – 20,000 trucks – 5 sec interval (arriving quickly) • Landfill Geographic Boundaries
  37. Techniques (diagram) • Truck Geolocation Data • Realtime Stream Computation (Storm) • Immediate Alerts • Hadoop Storage • Batch Computation (MapReduce) • Tax Reduction Reporting • Shortest Path Graph Algorithm (Titan) • Route Optimization
  38. Business Value
  39. Beverage Company: Social Engagement Application
  40. Data Shape • Tweets, FB Messages • Person, Activity links • Graph Traversal
  41. Consumer Activity Graph • Wal*Mart.com • eBay • Shopping.com • Sam’s • eBay Motors • Dollar General • StubHub • CVS • Toys R Us
  42. Techniques • Social Activity Stream • Property Graph (Titan) • Key/Value Store (MapR M7) • Graph Traversal (Faunus)
  43. Business Value
  44. Questions?
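
The recommendation slides (history matrix, co-occurrence, item-item mapping) can be illustrated with a small sketch. This is not the Mahout or Solr implementation from the talk: it is plain Python on a made-up toy history, it uses raw co-occurrence counts with no statistical filtering, and the names `HISTORY`, `cooccurrence`, and `recommend` are hypothetical. The lookup in `recommend` loosely mirrors the idea of treating the co-occurrence matrix as a search index that is queried with a user's history.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy history matrix: one entry per user, holding the set of
# "things" that user interacted with (one row per user, one column per thing).
HISTORY = {
    "alice": {"bread", "butter", "jam"},
    "bob": {"bread", "butter"},
    "carol": {"bread", "jam"},
    "dave": {"tea", "jam"},
}


def cooccurrence(history):
    """Build the item-item matrix: for each pair of items, count the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in combinations(sorted(items), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Score items the user has not seen by summed co-occurrence with the user's items."""
    scores = defaultdict(int)
    for item in user_items:
        for other, n in counts.get(item, {}).items():
            if other not in user_items:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


if __name__ == "__main__":
    counts = cooccurrence(HISTORY)
    print(recommend({"bread", "butter"}, counts))
```

In this toy data, a user who has bread and butter is pointed at jam, because jam co-occurs with both. At scale, the counting step would be the batch job and the lookup step would be the search query, which is the split the slides describe.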
