Top 3 Considerations for Machine Learning on Big Data

View the full recording of this deck here:

http://info.datameer.com/Slideshare-Top-3-Things-to-Consider-for-Machine-Learning-on-Big-Data.html

Machine learning is powerful but requires coding and access to all the relevant datasets to get full insights. With new Big Data analytic tools, business users can now use machine learning to gain a competitive edge.

Based on best practices and customer experiences, join Datameer and Caserta Concepts as we discuss what to look for and what value organizations get out of Machine Learning on Big Data.

This webinar will provide:

*an overview of challenges and tools available today
*use cases for machine learning on Hadoop
*capabilities to look for
*comparison of available solutions

Published in: Technology

Transcript

  • 1. © 2013 Datameer, Inc. All rights reserved.
  • 2. Top 3 Things to Consider with Machine Learning on Big Data. Karen Hsu, Elliott Cordo.
  • 3. About our Speakers: Karen Hsu
      • Karen is Senior Director, Product Marketing at Datameer. With over 15 years of experience in enterprise software, she has co-authored 4 patents and worked in a variety of engineering, marketing and sales roles.
      • Most recently she was at Informatica, where she worked with the start-ups Informatica purchased to bring data quality, master data management, B2B and data security solutions to market.
      • Karen has a Bachelor of Science degree in Management Science and Engineering from Stanford University.
  • 4. About our Speakers: Elliott Cordo
      • Elliott is a data warehouse and information management expert. He brings more than a decade of experience in implementing data solutions, with hands-on experience in every component of the data warehouse software development lifecycle.
      • At Caserta Concepts, Elliott oversees large-scale technology projects, including those involving business intelligence, data analytics, Big Data and data warehousing.
  • 5. Drivers & Challenges, Use Cases, Key Criteria, Best Practices, Next Steps
  • 6. Drivers & Challenges
  • 7. Big Data Drives Results. Big Data Analytics Drives Results. [Two line charts: Amazon vs Barnes & Noble, and NetFlix vs Blockbuster. Vertical axis from $0 to $300; horizontal axis in quarterly dates from 12/31/09 to March 2013.]
  • 8. Alternatives Are Lacking
      Data Mining
      • Hard to use
      • Requires PhD experts
      • Must write code
      • Expensive
      Traditional BI
      • Fixed DW models
      • Must write code for analytics
      • Very high IT labor costs
      • Not agile
      Visualization
      • Easy for small teams
      • Can’t manage large data volume
      • Lacks support for advanced analytics
  • 9. Costs of Building Can be $1M+
      $1M+ in Salaries
      Job Title | Bay Area | New York
      IT Project Manager | $140,000.00 | $126,000.00
      System Administrator | $117,000.00 | $105,000.00
      Network Administrator | $119,000.00 | $107,000.00
      Database Administrator | $125,000.00 | $119,000.00
      IT Security Manager | $116,000.00 | $104,000.00
      Business Intelligence Analyst | $137,000.00 | $133,000.00
      Data Scientist | $138,000.00 | $133,000.00
      Java Developer | $136,000.00 | $133,000.00
      QA Engineer | $120,000.00 | $114,000.00
      Total | $1,148,000.00 | $1,074,000.00
      $1M+ in Capital
      Solution | Cost / 100TB
      Teradata EDW | $1,650,000.00
      Oracle Exadata | $1,400,000.00
      IBM Netezza | $1,000,000.00
  • 10. Use Cases
  • 11. Use Cases
      Use Case | What is Revealed
      Profiling and segmentation | Customer, product, market characteristics and segments
      Acquisition and retention | What leads a person to become a customer or stop being a customer
      Product development and operations optimization | What led to product or network failure
      Campaign management | Patterns of successful campaigns
      Cross-sell / up-sell | Recommendations on services, products, or advisors for a given user/customer profile
  • 12. Customer Examples (Industry: Use Case)
      Financial Services
      • Show correlation between services purchased and investments/trades made
      • Identify customer segments
      • Recommendations for research articles to drive trading
      eCommerce
      • Show types of events a person will like
      • Decision tree based on likelihood to click through
      • Recommendations for a large “cold start” population
      Gaming
      • Clustering for user profiles
      • Correlation between attributes of a game and behavior
      • Churn analysis
      Healthcare
      • Recommend tests or other offerings
      • Identify factors/trends that lead to disease
  • 13. Polling Question I
  • 14. Key Criteria
  • 15. Ease of Use, Quality
  • 16. Clustering
  • 17. Clustering Overview
      • K-means is a popular and versatile general-purpose clustering algorithm.
      • Commonly used to group people and objects together to form segments
      • Often leveraged to enhance recommendation and search systems
      K-Means: How it works
      1. Treats items as coordinates
      2. Places a number of random “centroids” and assigns the nearest items
      3. Moves the centroids around based on average location
      4. Process repeats until the assignments stop changing
      *Diagram from Programming Collective Intelligence by Toby Segaran
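The four steps on this slide can be sketched in a few lines of plain Python. This is only an illustration of the general algorithm, not the code behind Datameer, R, or Mahout. The function name `kmeans`, the choice of squared Euclidean distance, and seeding the centroids from randomly chosen items (rather than arbitrary random positions) are assumptions made for the example, as are the sample points.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centroids, assignments), where assignments[i] is the
    index of the centroid that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 2: start from k centroids (assumption: picked from the items themselves).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Step 2 (continued): assign every item to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average location of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical sample data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignments = kmeans(points, k=2)
```

On the sample data the first three points end up in one cluster and the last three in the other. Which cluster gets index 0 depends on the random starting centroids.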
  • 18. Ease of Use
      First, the set up... And then run the results... And write additional code to scale...
      In Datameer, you select the columns... And get the results
      And the quality of results increases with larger data sets…
  • 19. Ease of Use
      First, you have to set up...
          pca <- princomp(iris[1:4]);
          colors <- kmeans(iris[1:4], 3)$cluster;
          plot(pca$scores[,1], pca$scores[,2], col=colors, pch=5);
      And then run the results... And then write more code to scale...
      In Datameer, you select the columns... And get the results
  • 20. Ease of Use
      First, select the data... Second, you need to create the cluster... And then see the results
      In Datameer, you select the columns... And get the results
  • 21. Ease of Use
      1. First, a dataset’s attributes must be converted to numeric representations
          User | Location | Company | Favorite Algo
          Elliott | New Jersey | Caserta | K-Means
          Karen | California | Datameer | K-Means
          User | Location | Company | Favorite Algo
          1001 | 1 | 101 | 1001
          1002 | 2 | 102 | 1001
      2. This numeric dataset is then converted to a sequence file, then to a sparse vector, using seqdirectory and seq2sparse
      3. Mahout is called, specifying the number of clusters and the distance calculation
          bin/mahout kmeans -i /user/kmeans/vectors -c /user/kmeans/input -o /user/kmeans/output -k 200 -dm CosineSimilarity -x 20 -ow
      4. The sparse vector output is then converted back to a delimited format
      5. Textual attributes will be appended back to the record, with numeric values preserved for ad-hoc distance comparison of members within a cluster
      *Diagram from Programming Collective Intelligence by Toby Segaran
      In Datameer, you select the columns... And get the results
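Steps 1 and 5 of this slide, turning text attributes into numbers and later attaching the text back, can be sketched as below. This is a minimal illustration under stated assumptions, not the presenters' pipeline: the helper names `encode_columns` and `decode_row` are hypothetical, and the IDs are simple per-column counters starting at 1 rather than the 1001/101-style numbering shown on the slide.

```python
def encode_columns(rows, columns):
    """Replace each text value with an integer ID, using one ID space per column.

    Returns (encoded_rows, lookups), where lookups[column] maps text -> ID.
    """
    lookups = {col: {} for col in columns}
    encoded = []
    for row in rows:
        new_row = {}
        for col in columns:
            ids = lookups[col]
            # Reuse the ID if the value was seen before, otherwise take the next counter.
            new_row[col] = ids.setdefault(row[col], len(ids) + 1)
        encoded.append(new_row)
    return encoded, lookups

def decode_row(encoded_row, lookups):
    """Map the integer IDs of one row back to the original text values."""
    decoded = {}
    for col, ident in encoded_row.items():
        reverse = {i: text for text, i in lookups[col].items()}
        decoded[col] = reverse[ident]
    return decoded

# The two records shown on the slide.
columns = ["User", "Location", "Company", "Favorite Algo"]
rows = [
    {"User": "Elliott", "Location": "New Jersey", "Company": "Caserta", "Favorite Algo": "K-Means"},
    {"User": "Karen", "Location": "California", "Company": "Datameer", "Favorite Algo": "K-Means"},
]
encoded, lookups = encode_columns(rows, columns)
```

Both records share the same ID in the Favorite Algo column because they share the same value. Keeping `lookups` around is what makes the final step, appending the textual attributes back to each clustered record, possible.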
  • 22. Quality Comparison
  • 23. Column Dependencies
  • 24. Column Dependencies Overview
      A | B | C | D
      a | x | a | x
      b | y | b | x
      b | y | b | y
      a | x | a | z
      c | z | c | y
      a | y | a | y
      Column Dependency ~ 0.99; Column Dependency ~ 0.01
      Value
      • See how data is related after joining multiple sets of data
      • See column dependencies on multiple types of data
  • 25. Quality Comparison. [Six scatter plots of Column B against Column A. Panel titles, in order: ColumnDependency(A,B) = 0.5, 0.5, 0, 1, 0.5, 1. Two of the panels plot a string-valued Column B against a numeric Column A.]
  • 26. Decision Tree
  • 27. Decision Tree Overview. Goal: Create a model that predicts the value of a target based on several inputs.
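To make the goal concrete, here is a rough sketch of the basic building block of a decision tree: picking the single input and threshold that best separates the target values. It assumes numeric inputs and uses Gini impurity as the split criterion, which is one common choice and not necessarily what R's rpart, Weka, or Datameer use by default. The names `gini` and `best_split` and the toy measurements are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of target labels (0.0 means all labels agree)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Find the (input_index, threshold) whose split 'value <= threshold'
    gives the lowest weighted Gini impurity. Returns None if no split helps."""
    best = None
    best_score = gini(labels)  # a split must beat not splitting at all
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [lab for row, lab in zip(rows, labels) if row[f] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

# Toy data (illustrative values only): two inputs per row, one target label each.
rows = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5)]
labels = ["setosa", "setosa", "versicolor", "versicolor"]
split = best_split(rows, labels)
```

A full tree repeats this search on each side of the chosen split until the groups are pure or too small to divide further.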
  • 28. Ease of Use
      First, you need to code...
          install.packages("rpart");
          library(rpart);
          treeInput <- read.csv("/PathToData/iris.csv");
          fit <- rpart(class ~ sepalLength + sepalWidth + petalLength + petalWidth, data=treeInput);
          par(mfrow=c(1,2), xpd=NA);
          plot(fit);
          text(fit, use.n=TRUE);
      And then run the results... And then write more code to scale...
      In Datameer, you select the columns... And get the results
  • 29. Ease of Use
      First, select the data... Second, you configure the settings... And then see the results
      In Datameer, you select the columns... And get the results
  • 30. Quality Comparison
      Tool | Iris | Wine | Breast Cancer Wisconsin
      R | 92.66% | 86.47% | 92.86%
      Weka | 95.33% | 89.33% | 93.5%
      Datameer | 93.33% | 91.18% | 93.04%
  • 31. Recommendations
  • 32. Recommendations Overview
      • Increased revenue
      • Your customers expect them
      • What makes a good recommendation?
      • The combination of algorithms and Hadoop makes an effective recommendations platform achievable
  • 33. Ease of Use
      First, the set up...
          # run factorization of ratings matrix
          $MAHOUT parallelALS --input ${WORK_DIR}/dataset/trainingSet/ --output ${WORK_DIR}/als/out --tempDir ${WORK_DIR}/als/tmp --numFeatures 20 --numIterations 10 --lambda 0.065 --numThreadsPerSolver 2
          # compute recommendations
          $MAHOUT recommendfactorized --input ${WORK_DIR}/als/out/userRatings/ --output ${WORK_DIR}/recommendations/ --userFeatures ${WORK_DIR}/als/out/U/ --itemFeatures ${WORK_DIR}/als/out/M/ --numRecommendations 6 --maxRating 5 --numThreads 2
      And then run the results...
          1 [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
          2 [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
          3 [137:5.0,284:5.0,508:4.832,24:4.82,285:4.8,845:4.75,124:4.7,319:4.703,29:4.67,591:4.6]
          4 [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
          5 [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
          6 [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
      In Datameer, you select the columns... And get the results
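Conceptually, once the ratings matrix has been factorized into user features and item features, a recommendation is a predicted rating: the dot product of one user's feature vector with each item's feature vector, capped at the maximum rating and sorted. The sketch below illustrates that idea only. It is not Mahout's implementation, and the function name `recommend`, the feature values, and the item IDs are all hypothetical.

```python
def recommend(user_features, item_features, already_rated, num_recommendations, max_rating):
    """Score unseen items for one user from factorized features.

    user_features: list of floats for the user.
    item_features: dict mapping item ID -> list of floats of the same length.
    Returns up to num_recommendations (item_id, predicted_rating) pairs, best first.
    """
    scores = []
    for item_id, item_vector in item_features.items():
        if item_id in already_rated:
            continue  # do not recommend what the user has already rated
        predicted = sum(u * m for u, m in zip(user_features, item_vector))
        scores.append((item_id, min(predicted, max_rating)))  # cap at the rating scale
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:num_recommendations]

# Hypothetical two-feature example.
user = [1.0, 2.0]
items = {10: [1.0, 1.0], 11: [3.0, 2.0], 12: [0.5, 0.5], 13: [2.0, 0.0]}
top = recommend(user, items, already_rated={13}, num_recommendations=2, max_rating=5.0)
```

Here item 11 scores 7.0 before capping and is reported as 5.0, which mirrors how many entries in the output above sit exactly at the maximum rating.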
  • 34. Quality Comparison: Same Results
      User | Shawshank | Godfather | Pulp Fiction | Fight Club
      Dianna | 4.76 | 4.98 | 1.95 | 2.44
      Jon | 1.99 | 2.51 | 2.87 | 4.83
      Karen | 3.28 | 4.72 | 1.89 | 2.95
      Elliott | 2.92 | 3.64 | 2.97 | 4.83
  • 35. Best Practices
  • 36. Big Data Analytics Process. [Diagram labels: Integrate, Define, Ad Hoc, Prepare and Analyze, Deploy, Visualize, Production.]
  • 37. Clustering
      • Leverage Hierarchies
      • If possible, use numbering schemes
      • Scale the surrogate key of attributes
      • Try different cluster sizes
      • Avoid numeric similarities when building your data
  • 38. Recommendations
      • Leverage a combination of algorithms
      • Clustering is your friend!
      • Treat cold start situations differently
      • Think about ranking
      • Don’t let recommendations go wild
      [Diagram labels: K-Means: Similar, Item-Based, Item Similarity, Best Recommendations.]
  • 39. Process Best Practices: Map, Chain, Iterate
  • 40. Demonstration
  • 41. Polling Question II
  • 42. Return on Investment
      Funnel Optimization | Increase Customer conversion by 3x
      Behavioral Analytics | Increase Revenue by 2x
      Fraud Prevention | Identify $2B in potential fraud
      EDW Optimization | 98% OpEx savings, $1M+ CapEx savings
      Customer Segmentation | Lower Customer Acquisition Costs by 30%
  • 43. Call to Action: Workshop
      Contact
      • Elliott Cordo, elliott@casertaconcepts.com
      • Karen Hsu, khsu@datameer.com
