Big Data Analysis Patterns with Hadoop, Mahout and Solr

Big Data Analysis Patterns: Tying real-world use cases to strategies for analysis using big data technologies and tools.

Big data is ushering in a new era for analytics, with large-scale data and relatively simple algorithms driving results rather than complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.

This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.

  1. Big Data Analysis Patterns. Atlanta Big Data User Group, 8/15/2013
  2. whoami
      • Brad Anderson
      • Solutions Architect at MapR (Atlanta)
      • ATLHUG co-chair
      • NoSQL East Conference 2009
      • “boorad” most places (twitter, github)
      • banderson@maprtech.com
  3. Announcements
      • Next ATLHUG Meeting - Sept. 26
        – How Google Does Big Data
      • Wednesday – MapR Data Warehouse Offload Roadshow
      • MapR Upcoming Training
        – MapR M7 & HBase for Developers on August 27 in Campbell, CA
        – MapR M7 & HBase for Developers on Sept 17 in Reston, VA
        – MapR M5 for Administrators on Oct 3 in Campbell, CA
  4. BIG DATA
  5. (no text on slide)
  6. Big Data is not new! But the tools are.
  7. The Good News in Big Data: “Simple algorithms and lots of data trump complex models” (Halevy, Norvig, and Pereira, Google, IEEE Intelligent Systems)
  8. The Challenge: So Many Solutions! What solutions fit your business problem? For example, do you need…
      • Apache Hadoop?
      • Apache Mahout?
      • Storm?
      • Apache Solr/Lucene?
      • Apache HBase (or MapR M7)?
      • Apache Drill (or Impala)?
      • d3.js or Tableau?
      • Node.js?
      • Titan?
  9. Ask a Different Question. It may be more useful to better define the problem by asking some of these questions:
      • How large is the data to be stored?
      • How large is the data to be queried? (the analysis volume)
      • What time frame is appropriate for your query response?
      • How fast is data arriving? (bursts or continuously?)
      • Are queries by sophisticated users?
      • Are you looking for common patterns or outliers?
      • How are your data sources structured?
  10. Picking the Best Solution. Your responses to these questions can help you better:
      • define the problem
      • recognize the analysis pattern to which it belongs
      • guide the choice of solutions to try
      But first, here’s a quick review of a few of the technologies you might choose, and then we will focus on three of the questions as a part of the landscape.
  11. Apache Solr/Lucene. Solr/Lucene is a powerful search engine used for flexible, heavily indexed queries including data such as
      • Full text
      • Geographical data
      • Statistically weighted data
      Solr is a small data tool that has flourished in a big data world.
  12. Apache Mahout. Mahout provides a library of scalable machine learning algorithms useful for big data analysis based on Hadoop or other storage systems. Mahout algorithms are mainly used for
      • Recommendation (collaborative filtering)
      • Clustering
      • Classification
      Mahout can be used in conjunction with solutions such as Solr: you might use Mahout to create a co-occurrence database that could then be queried using a search tool such as Solr.
  13. Apache Drill
      • Google Dremel clone
      • Pluggable Query Languages
        – Starts with ANSI SQL 2003
        – Hive, Pig, Cascading, MongoQL, …
      • Pluggable Storage Backends
        – Hadoop, HBase
        – MongoDB (BSON)
        – RDBMS?
      • Bypasses MapReduce
  14. Storm
      • Realtime Stream Computation Engine
      • Horizontal Scalability
      • Guaranteed Data Processing
      • Fault Tolerance
      • Higher level abstraction over:
        – Message Queues
        – Worker Logic
      • “The Hadoop of Realtime”
  15. Titan
      • Distributed Graph Database
      • Property Graph
      • Pluggable Backend Storage
        – HBase or M7
        – Cassandra
        – Berkeley DB
      • Search Integrated
        – Solr/Lucene
        – Elastic Search
      • Faunus
        – Batch processing of large graphs
      • Fulgora
        – Graph traversals on subset
        – In-memory
  16. Using the Answers to Guide Your Choices. For simplicity, let’s focus in on the first three questions:
      • How large is the data to be stored?
      • How large is the data to be queried? (the analysis volume)
      • What time frame is appropriate for your query response?
  17. Big Data Decision Tree (diagram; the branch structure did not survive in the transcript)
      • Questions: How big is your data? What size queries? Response time?
      • Branch labels: <10 GB; mid; >200 GB; single element at a time; one pass over 100%; multiple passes over big chunks; big storage; streaming; < 100s (human scale); throughput not response
      • Outcomes: A, B, C, D, E
  18. Use Cases
      • Company
      • Data Shape
      • Technique(s)
      • Business Value
  19. Business Value
  20. Business Value
  21. Telecommunications Giant: ETL Offload
  22. Telecommunications: Data Shape
      • Lots of Data
      • Lots of Queries across Large Sets
      • Throughput important
  23. Telecommunications: Techniques. Analytics; ETL
  24. Telecommunications: Techniques. ETL (Hadoop) + Analytics (Teradata)
  25. Telecommunications: Business Value
  26. Credit Card Issuer
  27. Credit Card Issuer: Data Shape
      • Customer Purchase History (big)
      • Merchant Designations
      • Merchant Special Offers
      • Throughput important
      • Recommendations
  28. Credit Card Issuer: Techniques. A Recommendation Engine with Mahout and Solr/Lucene
      • History matrix
        – One row per user
        – One column per thing
  29. Credit Card Issuer: Techniques. Recommendation based on cooccurrence
      • Cooccurrence gives item-item mapping
      • One row and column per thing
  30. Credit Card Issuer: Techniques. Cooccurrence matrix can also be implemented as a search index
  31. Credit Card Issuer: Techniques (indexing diagram). Components: complete history; cooccurrence (Mahout); item metadata; Solr indexers (indexing); index shards. 20 Hrs → 3 Hrs
  32. Credit Card Issuer: Techniques (search diagram). Components: user history; item metadata; Solr indexers (search); index shards; web tier. 8 Hrs → 3 Min
  33. Credit Card Issuer: Techniques (diagram). Components: Hadoop; Purchase History; Merchant Information; Recommendation Engine Results (Mahout); Export (4 hrs); Presentation Data Store (DB2); Import (4 hrs); Merchant Offers; Apps
  34. Credit Card Issuer: Techniques (diagram). Components: Hadoop; Purchase History; Merchant Information; Recommendation Engine Results (Mahout); Index Update (3 min); Recommendation Search Index (Solr); Merchant Offers; Apps
  35. Credit Card Issuer: Business Value
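
Slides 28 to 30 describe a history matrix (one row per user, one column per thing) and a cooccurrence matrix that maps items to items. A minimal sketch of that idea in plain Python follows. It is not Mahout or Solr code: the purchase data is invented, the function names `cooccurrence` and `recommend` are hypothetical, and it scores with raw cooccurrence counts, which is a simplification of whatever similarity measure a production recommender would use.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count how often each pair of items appears in the same user's history.

    `histories` maps a user id to the items that user bought, i.e. one row
    of the history matrix per user.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, top_n=3):
    """Rank unseen items by total cooccurrence with the user's own items."""
    seen = set(user_items)
    scores = defaultdict(int)
    for item in seen:
        for other, n in counts.get(item, {}).items():
            if other not in seen:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


# Invented example data (assumption, not from the slides).
histories = {
    "u1": ["coffee", "books", "fuel"],
    "u2": ["coffee", "books"],
    "u3": ["books", "fuel", "grocery"],
    "u4": ["coffee", "grocery"],
}
counts = cooccurrence(histories)
print(recommend(["coffee"], counts))  # ['books', 'fuel', 'grocery']
```

One way to read slide 30 is that each row of `counts` becomes a document in a search index, with the cooccurring items as an indexed field, and the user's history becomes the query. The ranking step in `recommend` is then done by the search engine instead of application code.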
  36. Waste & Recycling Leader: Idle Alerts
  37. Data Shape
      • Truck Geolocation Data
        – 20,000 trucks
        – 5 sec interval (arriving quickly)
      • Landfill Geographic Boundaries
  38. Techniques. Input: Truck Geolocation Data
      • Realtime Stream Computation (Storm): Immediate Alerts
      • Hadoop Storage
      • Batch Computation (MapReduce): Tax Reduction Reporting
      • Shortest Path Graph Algorithm (Titan): Route Optimization
  39. Business Value
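
The idle-alert use case (slides 36 to 38) combines a stream of truck positions with landfill boundaries. The slides do not give the alert rule, so the sketch below assumes one: a truck is idle when it stays within a small distance of one spot for longer than a threshold, and no alert is raised while it sits inside a landfill polygon. It is plain Python rather than a Storm topology, and `Ping`, `IdleDetector`, the thresholds, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ping:
    truck_id: str
    ts: float  # seconds
    x: float   # assumed: projected coordinates in meters
    y: float


def in_polygon(x, y, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class IdleDetector:
    """Keeps per-truck state and emits at most one alert per idle period."""

    def __init__(self, landfills, idle_after_s=300.0, move_tolerance_m=15.0):
        self.landfills = landfills
        self.idle_after_s = idle_after_s
        self.move_tolerance_m = move_tolerance_m
        self.anchor = {}      # truck_id -> first Ping of the current standstill
        self.alerted = set()  # trucks already alerted for this standstill

    def process(self, ping):
        """Handle one position update; return an alert string or None."""
        anchor = self.anchor.get(ping.truck_id)
        moved = anchor is None or (
            ((ping.x - anchor.x) ** 2 + (ping.y - anchor.y) ** 2) ** 0.5
            > self.move_tolerance_m
        )
        if moved:
            self.anchor[ping.truck_id] = ping
            self.alerted.discard(ping.truck_id)
            return None
        idle_for = ping.ts - anchor.ts
        if idle_for < self.idle_after_s or ping.truck_id in self.alerted:
            return None
        if any(in_polygon(ping.x, ping.y, p) for p in self.landfills):
            return None  # assumed: standing inside a landfill is expected
        self.alerted.add(ping.truck_id)
        return f"truck {ping.truck_id} idle for {idle_for:.0f}s"


# Invented data: one square landfill, one truck parked outside it,
# one truck parked inside it, both reporting every 5 seconds.
landfill = [(0, 0), (100, 0), (100, 100), (0, 100)]
detector = IdleDetector([landfill])
alerts = []
for t in range(0, 605, 5):
    for ping in (Ping("T1", t, 500, 500), Ping("T2", t, 50, 50)):
        alert = detector.process(ping)
        if alert:
            alerts.append(alert)
print(alerts)  # ['truck T1 idle for 300s']
```

In a stream engine the `process` method would be the per-message worker logic, with pings partitioned by `truck_id` so that each truck's state lives on one worker.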
  40. Beverage Company: Social Engagement Application
  41. Data Shape
      • Tweets, FB Messages
      • Person, Activity links
      • Graph Traversal
  42. Consumer Activity Graph (diagram). Nodes: Wal*Mart.com, Ebay, Shopping.com, Sam’s, Ebay Motors, Dollar General, StubHub, CVS, Toys R Us
  43. Techniques
      • Social Activity Stream
      • Property Graph (Titan)
      • Key/Value Store (MapR M7)
      • Graph Traversal (Faunus/Fulgora)
  44. Business Value
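
Slides 41 to 43 model social activity as a property graph of person and activity links and then traverse it. The sketch below shows the shape of such a traversal with an in-memory structure. It does not use Titan, Faunus, or Fulgora; the `PropertyGraph` class, the `mentioned` edge label, the people, and the edges are all invented for illustration, with retailer names borrowed from slide 42.

```python
from collections import Counter, defaultdict


class PropertyGraph:
    """Tiny in-memory property graph: vertices and edges both carry properties."""

    def __init__(self):
        self.vertices = {}                   # id -> properties
        self.out_edges = defaultdict(list)   # id -> [(label, target, props)]
        self.in_edges = defaultdict(list)    # id -> [(label, source, props)]

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, dst, **props):
        self.out_edges[src].append((label, dst, props))
        self.in_edges[dst].append((label, src, props))

    def out(self, vid, label):
        """Targets of outgoing edges with the given label."""
        return [dst for lbl, dst, _ in self.out_edges[vid] if lbl == label]

    def in_(self, vid, label):
        """Sources of incoming edges with the given label."""
        return [src for lbl, src, _ in self.in_edges[vid] if lbl == label]


def also_engaged_with(graph, retailer, label="mentioned"):
    """Two-hop traversal: retailer <- person -> other retailers, with counts."""
    counts = Counter()
    for person in set(graph.in_(retailer, label)):
        for other in set(graph.out(person, label)):
            if other != retailer:
                counts[other] += 1
    return counts.most_common()


# Invented people and edges (assumption, not data from the talk).
g = PropertyGraph()
for name in ("alice", "bob", "carol"):
    g.add_vertex(name, kind="person")
for name in ("StubHub", "Ebay", "CVS"):
    g.add_vertex(name, kind="retailer")
g.add_edge("alice", "mentioned", "StubHub", source="tweet")
g.add_edge("alice", "mentioned", "Ebay", source="tweet")
g.add_edge("bob", "mentioned", "StubHub", source="fb_message")
g.add_edge("bob", "mentioned", "Ebay", source="tweet")
g.add_edge("bob", "mentioned", "CVS", source="tweet")
g.add_edge("carol", "mentioned", "CVS", source="fb_message")

print(also_engaged_with(g, "StubHub"))  # [('Ebay', 2), ('CVS', 1)]
```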
  45. Fraud Detection: Data Lake
  46. Data Sources
      • Anti-Money Laundering
      • Consumer Transactions
  47. Techniques (diagram). Anti-Money Laundering System; Consumer Transactions System
  48. Techniques (diagram). Components: AML; Consumer Transactions; Data Lake (Hadoop); Suspicious Events; Analyst. Methods: Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
  49. Business Value
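
Slide 48 names peer group analysis among the fraud detection methods without describing it. One common reading, assumed here, is to compare each account with the other accounts in its peer group and flag those that deviate strongly. The sketch uses a z-score on per-account totals; the function name, the threshold, the group labels, and the amounts are all hypothetical.

```python
from statistics import mean, pstdev


def peer_group_outliers(totals, peer_group, z_threshold=2.0):
    """Flag accounts whose total is far from their peer group's mean.

    `totals` maps account -> transaction total; `peer_group` maps
    account -> group label. The group mean and standard deviation include
    the account itself, which keeps the sketch short but dampens the score
    of extreme accounts in small groups.
    """
    by_group = {}
    for account, total in totals.items():
        by_group.setdefault(peer_group[account], []).append(total)

    flagged = []
    for account, total in totals.items():
        peers = by_group[peer_group[account]]
        mu, sigma = mean(peers), pstdev(peers)
        if sigma > 0 and abs(total - mu) / sigma > z_threshold:
            flagged.append(account)
    return flagged


# Invented data (assumption): six retail accounts, three business accounts.
totals = {
    "a1": 100, "a2": 110, "a3": 95, "a4": 105, "a5": 90, "a6": 900,
    "b1": 5000, "b2": 5200, "b3": 4800,
}
peer_group = {acct: ("retail" if acct.startswith("a") else "business")
              for acct in totals}

print(peer_group_outliers(totals, peer_group))  # ['a6']
```

Account `a6` is flagged even though its total is far below any business account, which is the point of grouping: an amount is judged against comparable accounts rather than against the whole population.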
  50. Machine Learning: Search Relevance, DNA Matching
  51. Data Sources
      • Birth, Death, Census, Military, Immigration records
      • Search Behavior Activity
      • DNA SNP (snips)
  52. Techniques
      • Record Linking
      • Search Relevance
      • Clickstream Behavior
      • Security Forensics
      • DNA Matching
  53. Business Value
  54. Traffic Analytics
  55. Data Sources
      • Inrix Road Segment Data
        – Avg Speed / minute / segment
        – Reference Speeds
      • Road Segment Geolocation Data
  56. Techniques
      • Bottleneck Detection Algorithm
      • Time Offset Correlations
        – Alternate Routes
      • Predictive Congestion Analysis
        – Growth & Term Assumptions
  57. (no text on slide)
  58. (no text on slide)
  59. Business Value
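
Slide 56 lists time offset correlations over the per-minute, per-segment average speeds from slide 55. The slides do not say how the correlation is computed, so this sketch assumes a lagged Pearson correlation: shift one segment's speed series against another's and keep the offset with the highest correlation. The function names and the speed values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_offset(a, b, max_lag):
    """Return (lag, r): the delay in minutes at which series b best follows a."""
    best = (0, pearson(a, b))
    for lag in range(1, max_lag + 1):
        # Compare a[t] with b[t + lag] over the overlapping window.
        r = pearson(a[:-lag], b[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Invented per-minute average speeds (mph) for two road segments.
# Segment B shows the same slowdown as segment A, two minutes later.
segment_a = [60, 60, 35, 20, 25, 50, 60, 60, 60, 60]
segment_b = [60, 60, 60, 60, 35, 20, 25, 50, 60, 60]

lag, r = best_offset(segment_a, segment_b, max_lag=4)
print(lag, round(r, 3))  # 2 1.0
```

A strong correlation at a nonzero lag suggests that congestion on one segment tends to reach the other after that delay, which is the kind of signal an alternate-route suggestion could use.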
  60. Similar Characteristics
      • Lots of Data
      • Structured, Semi-Structured, Unstructured
      • Varied Systems Interoperating
        – Hadoop, Storm, Solr, MPP, Visualizations
      • Increase Revenue
      • Decrease Costs
  61. Questions?
