Big Data Use Cases*
DevNexus Conference
2/18/2013

*Fully buzzword-compliant title

1
whoami
• Brad Anderson
• Solutions Architect at MapR (Atlanta)
• ATLHUG co-chair
• NoSQL East Conference 2009
• “boorad” most places (twitter, github)
• banderson@maprtech.com
2
Mobile

Virtualization

Social Media

B2B

Application Service Provider

Cloud
Client/Server
Web 2.0

Service Bureau

Software-as-a-Service
3
BIG DATA
4
5
Business Value
6
Business Value
7
Big Data is not new!
but the tools are.

8
Ship the Function to the Data
[Diagram, two panels. Distributed Computing: each function is shown together with its data. Traditional Architecture: functions are shown apart from the data, which is held in an RDBMS and on SAN/NAS.]
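A minimal sketch of the "ship the function to the data" idea, in plain Python rather than Hadoop. The `PARTITIONS` list, `count_locally`, and `run_on_cluster` are hypothetical names invented for illustration; on a real cluster each partition would live on a different node and the calls would run in parallel. The point it tries to show is that only the small per-partition results travel, not the raw data.

```python
from collections import Counter
from functools import reduce

# Hypothetical cluster: each inner list stands for the records stored on one node.
PARTITIONS = [
    ["error", "ok", "ok"],
    ["ok", "error", "error"],
    ["ok"],
]

def count_locally(partition):
    """The function that gets shipped to a node; it reads only that node's data."""
    return Counter(partition)

def run_on_cluster(func, partitions):
    """Apply func to every partition in place, then merge the small partial results."""
    partials = [func(p) for p in partitions]  # sequential here; parallel on real nodes
    return reduce(lambda a, b: a + b, partials, Counter())

totals = run_on_cluster(count_locally, PARTITIONS)
print(dict(totals))
```

In the traditional layout, the equivalent job would first copy every record from the RDBMS or SAN/NAS to the machine running the function.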

9
Variation: Multiple MapReduces
Example: Fraud Detection in User Transactions
MapReduce

Transaction data
HBase / MapR M7 Edition

LDA training
LDA scoring
G2 score
95th-percentile LDA anomaly

Candidate events for analyst review
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
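The staged flow on this slide (train, score, keep the top percentile) can be sketched in a few lines of Python. This is an illustration under loud assumptions: it swaps in a simple deviation-from-median score in place of LDA and the G2 score, the sample transactions are made up, and each function stands in for what would be a separate MapReduce job.

```python
import math
import statistics

# Hypothetical (user, amount) rows; real input would come from HBase or M7 tables.
transactions = [
    ("u1", 20.0), ("u1", 22.0), ("u1", 19.0), ("u1", 21.0),
    ("u2", 100.0), ("u2", 105.0), ("u2", 98.0), ("u2", 990.0),
]

def train(rows):
    """Stage 1: learn what 'normal' looks like (here, each user's median amount)."""
    by_user = {}
    for user, amount in rows:
        by_user.setdefault(user, []).append(amount)
    return {user: statistics.median(amounts) for user, amounts in by_user.items()}

def score(rows, profile):
    """Stage 2: score each transaction by relative distance from the user's median."""
    return [(user, amount, abs(amount - profile[user]) / profile[user])
            for user, amount in rows]

def flag(scored, percentile=95):
    """Stage 3: keep transactions at or above the nearest-rank percentile cutoff."""
    ranked = sorted(s for _, _, s in scored)
    cutoff = ranked[math.ceil(percentile / 100 * len(ranked)) - 1]
    return [(user, amount) for user, amount, s in scored if s >= cutoff]

candidates = flag(score(transactions, train(transactions)))
print(candidates)  # candidate events for analyst review
```

Each stage consumes the previous stage's output, which is the same shape as chaining several MapReduce programs.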
10
MapR Distribution for Apache Hadoop
• Complete Hadoop distribution
• Comprehensive management suite
• Industry-standard interfaces
• Enterprise-grade dependability
• Higher performance
11
Big Data Ecosystem

12
Use Case
• Company
• Data Source(s)
• Technique(s)
• Business Value
13
Proactive Monitoring
14
Data Sources
• Server Telemetry
• Monitoring Logs
• Network Flow
15
Techniques
• Pattern Recognition
• Proactive Monitoring
• Early Alert Delivery
16
Business Value

17
Telecommunications Giant

ETL Offload
18
Telecommunications

Data Sources
• Customer Records
• Contract Data
• Purchase Orders
• Call Center
19
Telecommunications

Techniques
Analytics

ETL

20
Telecommunications

Techniques

ETL (Hadoop) + Analytics (Teradata)
21
Telecommunications

Business Value

22
Credit Card Issuer

Data Sources
• Customer Purchase History
• Merchant Designations
• Merchant Special Offers
23
Credit Card Issuer

Techniques
[Diagram labels: Hadoop; Purchase History; Merchant Information; Merchant Offers; Recommendation Engine Results (Mahout); Export (4 hrs); Import (4 hrs); Presentation Data Store (DB2); Apps.]
24
Credit Card Issuer

Techniques
[Diagram labels: Hadoop; Purchase History; Merchant Information; Merchant Offers; Recommendation Engine Results (Mahout); Index Update (2 min); Recommendation Search Index (Solr); Apps.]

25
Credit Card Issuer

Business Value

26
Waste & Recycling Leader

Idle Alerts
27
Data Sources
• Truck Geolocation Data
  – 20,000 trucks
  – 5 sec interval
• Landfill Geographic Boundaries
28
Techniques
Realtime Stream Computation (Storm)
Truck Geolocation Data
Hadoop Storage
Immediate Alerts
Batch Computation (MapReduce)
Tax Reduction Reporting
Shortest Path Graph Algorithm
Route Optimization
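A rough sketch of how the realtime side might raise an idle alert from the 5-second truck pings. It is plain Python, not Storm code, and rests on assumptions the slides do not state: that "idle" means the reported position has not changed for `IDLE_THRESHOLD` seconds, that exact position equality is good enough, and that landfill boundaries can be ignored for the moment. `TruckState` and `process_ping` are hypothetical names.

```python
from dataclasses import dataclass

PING_SECONDS = 5      # ping interval given on the Data Sources slide
IDLE_THRESHOLD = 15   # assumed: seconds without movement before alerting

@dataclass
class TruckState:
    last_pos: tuple
    still_for: int = 0
    alerted: bool = False

def process_ping(states, truck_id, pos):
    """Update one truck's state from a ping; return an alert string or None."""
    state = states.get(truck_id)
    if state is None or pos != state.last_pos:
        # First ping, or the truck moved: start tracking from scratch.
        states[truck_id] = TruckState(last_pos=pos)
        return None
    state.still_for += PING_SECONDS
    if state.still_for >= IDLE_THRESHOLD and not state.alerted:
        state.alerted = True  # alert once per idle period
        return f"truck {truck_id} idle for {state.still_for}s at {pos}"
    return None

states = {}
# Five pings at the same spot, then one after the truck moves.
pings = [("t1", (33.75, -84.39))] * 5 + [("t1", (33.76, -84.39))]
alerts = [a for a in (process_ping(states, t, p) for t, p in pings) if a]
print(alerts)
```

Keeping only a small per-truck state is what makes this workable as a stream computation; the full ping history goes to Hadoop storage for the batch jobs.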

29
Business Value

30
Fraud Detection
Data Lake
31
Data Sources
• Anti-Money Laundering
• Consumer Transactions

32
Techniques
Anti-Money Laundering System

Consumer Transactions System

33
Techniques
AML
Data Lake (Hadoop)
Suspicious Events
Consumer Transactions
Analyst
Latent Dirichlet Allocation, Bayesian Learning Neural Network, Peer Group Analysis
34
Business Value

35
Machine Learning
Search Relevance
DNA Matching
36
Data Sources
• Birth, Death, Census, Military, Immigration records
• Search Behavior Activity
• DNA SNP (snips)
37
Techniques
• Record Linking
• Search Relevance
• Clickstream Behavior
• Security Forensics
• DNA Matching
38
Business Value

39
Traffic Analytics
40
Data Sources
• Inrix Road Segment Data
  – Avg Speed / minute / segment
  – Reference Speeds
• Road Segment Geolocation Data
41
Techniques
• Bottleneck Detection Algorithm
• Time Offset Correlations
  – Alternate Routes
• Predictive Congestion Analysis
  – Growth & Term Assumptions
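One possible shape for a bottleneck check over per-minute average speeds and reference speeds, sketched in Python. The slides do not describe the actual algorithm, so everything here is an assumption: the 60% ratio, the three-minute persistence rule, the segment names, and the sample speeds are all invented for illustration.

```python
# Hypothetical reference speeds (mph) and per-minute average speeds by road segment.
reference_speed = {"seg-A": 60.0, "seg-B": 45.0}
speeds_by_minute = {
    "seg-A": [58, 55, 30, 28, 25, 50],
    "seg-B": [44, 43, 20, 42, 44, 45],
}

RATIO = 0.6       # assumed: congested when below 60% of the reference speed
MIN_MINUTES = 3   # assumed: slowdown must last this many consecutive minutes

def find_bottlenecks(speeds, reference, ratio=RATIO, min_minutes=MIN_MINUTES):
    """Return {segment: minute index where a sustained slowdown began}."""
    found = {}
    for segment, series in speeds.items():
        run = 0  # length of the current streak of slow minutes
        for minute, speed in enumerate(series):
            run = run + 1 if speed < ratio * reference[segment] else 0
            if run >= min_minutes:
                found[segment] = minute - run + 1
                break
    return found

bottlenecks = find_bottlenecks(speeds_by_minute, reference_speed)
print(bottlenecks)
```

The persistence rule is there so that a single slow minute, like the dip on `seg-B`, is not reported as a bottleneck.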
42
43
44
Business Value

45
Similar Characteristics
• Lots of Data
• Structured, Semi-Structured, Unstructured
• Varied Systems Interoperating
  – Hadoop, Storm, Solr, MPP, Visualizations
• Increase Revenue
• Decrease Costs
46
Thank You

47

Editor's Notes

  • #11 SCRIPT: You can see from the Word Count example that MapReduce is a low-level construct. Typical applications require more complex processing, which is accomplished by chaining multiple MapReduce stages. Here is an example of a Hadoop system that uses machine learning models to detect account fraud after a security breach. (*) Each step is its own MapReduce program. We’ll return to this example in more detail later.
    ---------------
    [DON’T explain the algorithm here. Just twinkle the MR stages.]
    (*) User transaction data is loaded into a distributed datastore for massive tables, such as HBase running on Hadoop, or the native tables available with MapR’s M7 Edition.
    (*) A training phase teaches the system what normal transactions look like.
    (*) Later, individual user transactions are scored against the “normal behavior” pattern.
    (*) Then, transactions with highly anomalous behavior are singled out as candidate events to be manually reviewed by analysts for potential fraud.
    In your data flow, any place you have a group-by, join, filter, or count of occurrences typically equates to one or more MapReduce jobs.
  • #12 MapR provides a complete distribution for Apache Hadoop. MapR has integrated, tested, and hardened a broad array of packages as part of this distribution: Hive, Pig, Oozie, and Sqoop, plus additional packages such as Cascading. We have spent more than two years on a well-funded effort to make deep architectural improvements and create the next-generation distribution for Hadoop. MapR has made significant updates of its own and combined them with a dozen open source packages. All of the innovations MapR has delivered remain 100% compatible with the Apache Hadoop APIs. This is in stark contrast with the alternative distributions from Cloudera, Hortonworks, and Apache, which are all equivalent.