© 2013 Datameer, Inc. All rights reserved.
Top 3 Things to Consider with
Machine Learning on Big Data
Karen Hsu
Elliott Cordo


Top 3 Considerations for Machine Learning on Big Data


View the full recording of this deck here:

http://info.datameer.com/Slideshare-Top-3-Things-to-Consider-for-Machine-Learning-on-Big-Data.html

Machine learning is powerful but requires coding and access to all the relevant datasets to get full insights. With new Big Data analytic tools, business users can now use machine learning to gain a competitive edge.

Based on best practices and customer experiences, join Datameer and Caserta Concepts as we discuss what to look for and what value organizations get out of Machine Learning on Big Data.

This webinar will provide:

* an overview of challenges and tools available today
* use cases for machine learning on Hadoop
* capabilities to look for
* a comparison of available solutions

Published in: Technology

Top 3 Considerations for Machine Learning on Big Data

  1. © 2013 Datameer, Inc. All rights reserved.
  2. Top 3 Things to Consider with Machine Learning on Big Data. Karen Hsu, Elliott Cordo.
  3. About our Speakers: Karen Hsu
     • Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
     • Most recently she came from Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
     • Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.
  4. About our Speakers: Elliott Cordo
     • Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
     • At Caserta Concepts, Elliott oversees large-scale technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.
  5. Drivers & Challenges, Use Cases, Key Criteria, Best Practices, Next Steps
  6. Drivers & Challenges
  7. Big Data Drives Results [Two charts titled "Big Data Analytics Drives Results": Amazon vs Barnes & Noble, and Netflix vs Blockbuster. Y-axis $0 to $300; x-axis dates from 12/31/09 into 2013.]
  8. Alternatives Are Lacking
     Data Mining
     • Hard to use
     • Requires PhD experts
     • Must write code
     • Expensive
     Traditional BI
     • Fixed DW models
     • Must write code for analytics
     • Very high IT labor costs
     • Not agile
     Visualization
     • Easy for small teams
     • Can’t manage large data volume
     • Lacks support for advanced analytics
  9. Costs of Building Can be $1M+
     $1M+ in Salaries
     Job Title                     | Bay Area   | New York
     IT Project Manager            | $140,000   | $126,000
     System Administrator          | $117,000   | $105,000
     Network Administrator         | $119,000   | $107,000
     Database Administrator        | $125,000   | $119,000
     IT Security Manager           | $116,000   | $104,000
     Business Intelligence Analyst | $137,000   | $133,000
     Data Scientist                | $138,000   | $133,000
     Java Developer                | $136,000   | $133,000
     QA Engineer                   | $120,000   | $114,000
     Total                         | $1,148,000 | $1,074,000
     $1M+ in Capital
     Solution       | Cost / 100TB
     Teradata EDW   | $1,650,000
     Oracle Exadata | $1,400,000
     IBM Netezza    | $1,000,000
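The salary totals on this slide can be checked with a few lines of Python. The figures come from the slide; pairing each figure with a job title is an assumption that the flattened table lists both columns in job-title order.

```python
# Salary figures from the slide.
# Assumption: each column's values appear in the same order as the job titles.
JOB_TITLES = [
    "IT Project Manager", "System Administrator", "Network Administrator",
    "Database Administrator", "IT Security Manager",
    "Business Intelligence Analyst", "Data Scientist", "Java Developer",
    "QA Engineer",
]
BAY_AREA = [140_000, 117_000, 119_000, 125_000, 116_000,
            137_000, 138_000, 136_000, 120_000]
NEW_YORK = [126_000, 105_000, 107_000, 119_000, 104_000,
            133_000, 133_000, 133_000, 114_000]


def total(salaries):
    """Sum one column of annual salaries."""
    return sum(salaries)


if __name__ == "__main__":
    for title, bay, ny in zip(JOB_TITLES, BAY_AREA, NEW_YORK):
        print(f"{title:<30} {bay:>10,} {ny:>10,}")
    print(f"{'Total':<30} {total(BAY_AREA):>10,} {total(NEW_YORK):>10,}")
```

Both columns sum to the totals printed on the slide, which supports reading the flattened figures as two nine-row columns.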
  10. Use Cases
  11. Use Cases
      Use Case | What is Revealed
      Profiling and segmentation | Customer, product, market characteristics and segments
      Acquisition and retention | What leads a person to become a customer or stop being a customer
      Product development and operations optimization | What led to product or network failure
      Campaign management | Patterns of successful campaigns
      Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile
  12. Customer Examples (Industry: Use Case)
      Financial Services
      • Show correlation between services purchased and investments/trades made
      • Identify customer segments
      • Recommendations for research articles to drive trading
      eCommerce
      • Show types of events a person will like
      • Decision tree based on likelihood to click through
      • Recommendations for a large “cold start” population
      Gaming
      • Clustering for user profiles
      • Correlation between attributes of a game and behavior
      • Churn analysis
      Healthcare
      • Recommend tests or other offerings
      • Identify factors/trends that lead to disease
  13. Polling Question I
  14. Key Criteria
  15. Ease of Use, Quality
  16. Clustering
  17. Clustering Overview
      • K-means is a popular and versatile general-purpose clustering algorithm.
      • Commonly used to group people and objects together to form segments
      • Often leveraged to enhance recommendation and search systems
      K-Means: How it works
      1. Treats items as coordinates
      2. Places a number of random “centroids” and assigns the nearest items
      3. Moves the centroids around based on average location
      4. Process repeats until the assignments stop changing
      *Diagram from Programming Collective Intelligence by Toby Segaran
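The four K-means steps above can be sketched in plain Python. This is a minimal illustration, not the implementation used by Datameer, R, or Mahout. It assumes Euclidean distance and picks the initial centroids from the data points themselves, which is one common choice; the function name and parameters are made up for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns (centroids, assignments), where assignments[i] is the index of
    the centroid that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: items are treated as coordinates (tuples of numbers).
    # Step 2: place k random centroids (here: k randomly chosen items).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2 (continued): assign every item to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assignments


if __name__ == "__main__":
    data = [(0.0,), (1.0,), (2.0,), (100.0,), (101.0,), (102.0,)]
    found_centroids, labels = kmeans(data, k=2)
    print(found_centroids, labels)
```

The result depends on the random starting centroids, which is why the best-practices slide later suggests trying different cluster sizes; running with several seeds is a similar safeguard.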
  18. Ease of Use
      First, the set up...
      In Datameer, you select the columns... And get the results
      And then run the results...
      And the quality of results increases with larger data sets…
      And write additional code to scale...
  19. Ease of Use
      First, you have to set up...
          pca <- princomp(iris[1:4])                # principal components of the four measurements
          colors <- kmeans(iris[1:4], 3)$cluster    # cluster label (1 to 3) for each row
          plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5)   # first two components, colored by cluster
      And then run the results...
      And then write more code to scale...
      In Datameer, you select the columns... And get the results
  20. Ease of Use
      First, select the data...
      In Datameer, you select the columns... And get the results
      Second, you need to create the cluster...
      And then see the results
  21. Ease of Use
      1. First, a dataset’s attributes must be converted to numeric representations:
         User    | Location   | Company  | Favorite Algo
         Elliott | New Jersey | Caserta  | K-Means
         Karen   | California | Datameer | K-Means

         User | Location | Company | Favorite Algo
         1001 | 1        | 101     | 1001
         1002 | 2        | 102     | 1001
      2. This numeric dataset is then converted to a sequence file, then to sparse vectors, using seqdirectory and seq2sparse.
      3. Mahout is called, specifying the number of clusters and the distance calculation:
             bin/mahout kmeans -i /user/kmeans/vectors -c /user/kmeans/input -o /user/kmeans/output -k 200 -dm CosineSimilarity -x 20 -ow
      4. The sparse vector output is then converted back to a delimited format.
      5. Textual attributes are appended back to the record; numeric values are preserved for ad hoc distance comparison of members within a cluster.
      *Diagram from Programming Collective Intelligence by Toby Segaran
      In Datameer, you select the columns... And get the results
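Step 1 on this slide, converting textual attributes to numeric surrogate keys, could look roughly like the sketch below. The function name and the per-column starting keys are assumptions chosen to reproduce the slide's example table; the slides do not say how the keys were actually generated.

```python
def encode_table(rows, starts):
    """Replace each textual attribute with an integer surrogate key.

    rows:   list of dicts, one per record
    starts: first key to hand out for each column (hypothetical numbering
            scheme; values chosen to match the slide's example)
    Returns (encoded_rows, mappings). The mappings are kept so the textual
    attributes can be appended back to the records later (step 5).
    """
    mappings = {col: {} for col in starts}
    encoded_rows = []
    for row in rows:
        encoded = {}
        for col, start in starts.items():
            mapping = mappings[col]
            if row[col] not in mapping:
                # Number distinct values in order of first appearance.
                mapping[row[col]] = start + len(mapping)
            encoded[col] = mapping[row[col]]
        encoded_rows.append(encoded)
    return encoded_rows, mappings


if __name__ == "__main__":
    records = [
        {"User": "Elliott", "Location": "New Jersey",
         "Company": "Caserta", "Favorite Algo": "K-Means"},
        {"User": "Karen", "Location": "California",
         "Company": "Datameer", "Favorite Algo": "K-Means"},
    ]
    key_starts = {"User": 1001, "Location": 1,
                  "Company": 101, "Favorite Algo": 1001}
    encoded, lookup = encode_table(records, key_starts)
    for record in encoded:
        print(record)
```

Keys assigned this way carry no real notion of distance (ID 2 is not "closer" to ID 1 than to ID 50), which is presumably why the best-practices slide advises numbering schemes, scaling, and avoiding accidental numeric similarities.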
  22. Quality Comparison
  23. Column Dependencies
  24. Column Dependencies Overview
      A | B | C | D
      a | x | a | x
      b | y | b | x
      b | y | b | y
      a | x | a | z
      c | z | c | y
      a | y | a | y
      Column Dependency ~ 0.99
      Column Dependency ~ 0.01
      Value
      • See how data is related after joining multiple sets of data
      • See column dependencies on multiple types of data
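The slides do not define how Datameer computes a column dependency, so the sketch below swaps in a standard information-theoretic measure, the uncertainty coefficient U(B|A) = (H(B) - H(B|A)) / H(B), purely to illustrate the idea of scoring how much one column tells you about another. It will not reproduce the 0.99 and 0.01 values on the slide. The sample data assumes the flattened table reads as six rows of four columns.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def dependency(a, b):
    """Uncertainty coefficient U(B|A): fraction of B's entropy explained by A.

    1.0 means A fully determines B; 0.0 means A says nothing about B.
    This is a stand-in measure, NOT Datameer's ColumnDependency formula.
    """
    h_b = entropy(b)
    if h_b == 0:
        return 0.0  # B is constant, so there is nothing to explain
    n = len(a)
    groups = {}
    for x, y in zip(a, b):
        groups.setdefault(x, []).append(y)
    # H(B|A): entropy of B within each value of A, weighted by group size.
    h_b_given_a = sum(len(g) / n * entropy(g) for g in groups.values())
    return (h_b - h_b_given_a) / h_b


if __name__ == "__main__":
    col_a = ["a", "b", "b", "a", "c", "a"]
    col_b = ["x", "y", "y", "x", "z", "y"]
    col_c = ["a", "b", "b", "a", "c", "a"]
    col_d = ["x", "x", "y", "z", "y", "y"]
    print("A -> B:", round(dependency(col_a, col_b), 3))
    print("C -> D:", round(dependency(col_c, col_d), 3))
```

Because it works on value counts rather than numeric distances, a measure of this kind applies equally to string and numeric columns, which matches the slide's point about seeing dependencies across multiple types of data.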
  25. Quality Comparison [Six plots of Column B against Column A, each labeled with its ColumnDependency(A,B) value: 0.5, 0.5, 0, 1, 0.5 and 1. Some panels plot Column A (NUMBER) against Column B (STRING).]
  26. Decision Tree
  27. Decision Tree Overview. Goal: Create a model that predicts the value of a target based on several inputs.
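To make the goal concrete, here is a small decision-tree sketch in plain Python. It is a simplified illustration, not how rpart, Weka, or Datameer build trees: it assumes numeric inputs, splits on the Gini impurity, and uses observed values as thresholds. The function names and the toy measurements are made up for this example, and class labels must not be tuples, since tuples are used for internal nodes.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels, max_depth=3):
    """Build a tree: a leaf is a label, a node is (feature, threshold, left, right)."""
    split = best_split(rows, labels) if max_depth > 0 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f, t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return the predicted label."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


if __name__ == "__main__":
    # Toy (petal length, petal width) measurements, invented for illustration.
    inputs = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.1, 1.9)]
    targets = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
    model = build_tree(inputs, targets)
    print(model)
    print(predict(model, (6.5, 2.0)))
```

The inputs here play the role of the several columns the slide mentions, and the class label is the target whose value the model predicts.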
  28. Ease of Use
      First, you need to code...
          install.packages("rpart")    # one-time install of the rpart package
          library(rpart)
          treeInput <- read.csv("/PathToData/iris.csv")
          # fit a classification tree predicting class from the four measurements
          fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data=treeInput)
          par(mfrow=c(1,2), xpd=NA)
          plot(fit)
          text(fit, use.n=TRUE)
      And then run the results...
      And then write more code to scale...
      In Datameer, you select the columns... And get the results
  29. Ease of Use
      First, select the data...
      In Datameer, you select the columns... And get the results
      Second, you configure the settings...
      And then see the results
  30. Quality Comparison
      Tool     | Iris   | Wine   | Breast Cancer Wisconsin
      R        | 92.66% | 86.47% | 92.86%
      Weka     | 95.33% | 89.33% | 93.5%
      Datameer | 93.33% | 91.18% | 93.04%
  31. Recommendations
  32. Recommendations Overview
      • Increased revenue
      • Your customers expect them
      • What makes a good recommendation?
      • The combination of algorithms and Hadoop makes an effective recommendations platform achievable
  33. Ease of Use
      First, the set up...
          # run factorization of ratings matrix
          $MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out \
              --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2
          # compute recommendations
          $MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ \
              --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ \
              --numRecommendations 6 --maxRating 5 --numThreads 2
      And then run the results...
          1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
          2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
          3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
          4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
          5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
          6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
      In Datameer, you select the columns... And get the results
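The raw recommendation output on this slide still has to be turned into something usable, which is part of the extra code the slide alludes to. A minimal parser might look like the sketch below. The line format (a user ID followed by a bracketed list of itemID:score pairs) is inferred from the sample output on the slide, and the function name is made up for this example.

```python
def parse_recommendations(line):
    """Parse 'userID [itemID:score,itemID:score,...]' into
    (user_id, [(item_id, score), ...])."""
    user_part, _, items_part = line.strip().partition("[")
    user_id = int(user_part)  # int() tolerates the trailing space or tab
    items = []
    for pair in items_part.rstrip("]").split(","):
        if not pair:
            continue  # empty recommendation list
        item_id, score = pair.split(":")
        items.append((int(item_id), float(score)))
    return user_id, items


if __name__ == "__main__":
    sample = "3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8]"
    user, recs = parse_recommendations(sample)
    print(user, recs)
```

The parsed item IDs would still need to be joined back to item names before the recommendations mean anything to a business user.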
  34. Quality Comparison
      User    | Shawshank | Godfather | Pulp Fiction | Fight Club
      Dianna  | 4.76      | 4.98      | 1.95         | 2.44
      Jon     | 1.99      | 2.51      | 2.87         | 4.83
      Karen   | 3.28      | 4.72      | 1.89         | 2.95
      Elliott | 2.92      | 3.64      | 2.97         | 4.83
      Same Results
  35. Best Practices
  36. Big Data Analytics Process [Diagram labels: Integrate; Define; Ad Hoc; Prepare and Analyze; Deploy; Visualize; Production]
  37. Clustering
      • Leverage Hierarchies
      • If possible, use numbering schemes
      • Scale the surrogate key of attributes
      • Try different cluster sizes
      • Avoid numeric similarities when building your data
  38. Recommendations
      • Leverage a combination of algorithms
      • Clustering is your friend!
      • Treat cold start situations differently
      • Think about ranking
      • Don’t let recommendations go wild
      [Diagram labels: K-Means: Similar; Item-Based; Item Similarity; Best Recommendations]
  39. Process Best Practices: Map, Chain, Iterate
  40. Demonstration
  41. Polling Question II
  42. Return on Investment
      Funnel Optimization: Increase customer conversion by 3x
      Behavioral Analytics: Increase revenue by 2x
      Fraud Prevention: Identify $2B in potential fraud
      EDW Optimization: 98% OpEx savings, $1M+ CapEx savings
      Customer Segmentation: Lower customer acquisition costs by 30%
  43. Call to Action
      • Workshop
      • Contact
        • Elliott Cordo elliott@casertaconcepts.com
        • Karen Hsu khsu@datameer.com
