predictive-analytics-san-diego-2013-02-21

The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? I explain what is needed for three important use cases.

Usage Rights

© All Rights Reserved

predictive-analytics-san-diego-2013-02-21 Presentation Transcript

  • 1. Remembering the Future ©MapR Technologies - Confidential
  • 2. My Background
    University, Startups
      – Aptex, MusicMatch, ID Analytics, Veoh
      – big data since before it was big
    Open source
      – even before the internet
      – Apache Hadoop, Mahout, Zookeeper, Drill
      – bought the beer at first HUG
    MapR
    Founding member of Apache Drill
  • 3. MapR Technologies
    Silicon Valley Startup
      – Top investors
      – Top technical and management team
        • Google, Microsoft, EMC, NetApp, Oracle
    Enterprise quality distribution for Hadoop
    Many extensions to basic Hadoop function
    Strong supporter of Apache Drill
  • 4. Philosophy First: What is History?
  • 5. The study of the past (what came before now)
  • 6. What is the future? (it comes after now)
  • 10. But the future also has a past!
  • 11. Do you remember the future?
  • 17. Some things turned out as expected
  • 18. Guys wearing Fedoras
  • 19. Many things are different!
  • 20. Hadoop has a history
  • 21. Hadoop also has a future
  • 22. The Old Future of Hadoop
    Map-reduce and HDFS
      – more and more, but not really different
    Eco-system additions
      – Simpler programming (Hive and Pig)
      – Key-value store
      – Ad hoc query
    Stands apart from other computing
      – Required by HDFS and other limitations
  • 23. The New Future of Hadoop
    Real-time processing
      – Combines real-time and long-time
    Integration with traditional IT
      – No need to stand apart
    Integration with new technologies
      – Solr, Node.js, Twisted all should interface directly
    Fast and flexible computation
      – Drill logical plan language
  • 24. Example #1: Search Abuse
  • 25. History matrix
    One row per user
    One column per thing
  • 26. Recommendation based on cooccurrence
    Cooccurrence gives item-item mapping
    One row and column per thing
  • 27. Cooccurrence matrix can also be implemented as a search index
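
The idea on slides 25 to 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the users, the things, and the helper names (`cooccurrence`, `build_index`, `recommend`) are made up, a plain dictionary stands in for the search index, and raw counts are used where a production system would likely weight or filter them.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical history matrix: one row per user, one column per thing.
# Stored sparsely as user -> set of things that user interacted with.
history = {
    "u1": {"apple", "banana"},
    "u2": {"apple", "banana", "cherry"},
    "u3": {"banana", "cherry"},
}

def cooccurrence(history):
    """Count, for each pair of things, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for things in history.values():
        for a, b in combinations(sorted(things), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def build_index(counts):
    """One 'document' per thing; its indicator field lists cooccurring things."""
    return {thing: set(others) for thing, others in counts.items()}

def recommend(index, user_things):
    """Search step: rank documents by how many of the user's things they match."""
    scores = {}
    for thing, indicators in index.items():
        if thing in user_things:
            continue  # do not recommend what the user already has
        overlap = len(indicators & user_things)
        if overlap:
            scores[thing] = overlap
    return sorted(scores, key=lambda t: (-scores[t], t))

counts = cooccurrence(history)
index = build_index(counts)
print(recommend(index, {"apple"}))
```

The user's own history acts as the query, which is why an ordinary search engine can serve the recommendations.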
  • 28. [Diagram. Labels: Complete history, Cooccurrence (Mahout), Item meta-data, Solr indexing, Solr Indexer (two instances), Index shards]
  • 29. [Diagram. Labels: User history, Web tier, Item meta-data, Solr search, Solr Indexer (two instances), Index shards]
  • 30. Objective Results
    At a very large credit card company
    History is all transactions, all web interaction
    Processing time cut from 20 hours per day to 3 hours per day
    Recommendation engine load time decreased from 8 hours to 3 minutes
  • 31. Example #2: Web Technology
  • 32. [Diagram. Labels: Real-time data, Fast analysis (Storm), Raw logs, Analytic output]
  • 33. [Diagram. Labels: Large analysis (map-reduce), Raw logs, Analytic output]
  • 34. [Diagram. Labels: Browser query, Presentation tier (d3 + node.js), Raw logs, Analytic output]
  • 35. Objective Results
    Real-time + long-time analysis is seamless
    Web tier can be rooted directly on Hadoop cluster
    No need to move data
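
One way to picture the combination of real-time and long-time analysis from slides 32 to 35 is the Python sketch below. It is not taken from the talk: the log records, the `cutoff` split, and the function names are assumptions, and plain functions stand in for the Storm and map-reduce jobs. Both analyses read the same raw logs, and the presentation tier adds their outputs.

```python
from collections import Counter

# Hypothetical raw log records: (timestamp, page).
raw_logs = [
    (1, "/home"), (2, "/docs"), (3, "/home"),
    (8, "/home"), (9, "/docs"), (9, "/blog"),
]

def large_analysis(logs, cutoff):
    """Stand-in for the batch job: counts records older than the cutoff."""
    return Counter(page for ts, page in logs if ts < cutoff)

def fast_analysis(logs, cutoff):
    """Stand-in for the streaming job: counts records at or after the cutoff."""
    return Counter(page for ts, page in logs if ts >= cutoff)

def query(page, batch_output, realtime_output):
    """Presentation tier: one answer that covers old and recent data."""
    return batch_output[page] + realtime_output[page]

cutoff = 5  # assumed time of the last completed batch run
batch_output = large_analysis(raw_logs, cutoff)
realtime_output = fast_analysis(raw_logs, cutoff)
print(query("/home", batch_output, realtime_output))
```

Because each record falls on exactly one side of the cutoff, the two outputs never double-count.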
  • 36. Example #3: Apache Drill
  • 37. Big Data Processing – Hadoop
                          | Batch processing
      Query runtime       | Minutes to hours
      Data volume         | TBs to PBs
      Programming model   | MapReduce
      Users               | Developers
      Google project      | MapReduce
      Open source project | Hadoop MapReduce
  • 38. Big Data Processing – Hadoop and Storm
                          | Batch processing | Stream processing
      Query runtime       | Minutes to hours | Never-ending
      Data volume         | TBs to PBs       | Continuous stream
      Programming model   | MapReduce        | DAG (pre-programmed)
      Users               | Developers       | Developers
      Google project      | MapReduce        |
      Open source project | Hadoop MapReduce | Storm or Apache S4
  • 39. Big Data Processing – The missing part
                          | Batch processing | Interactive analysis | Stream processing
      Query runtime       | Minutes to hours |                      | Never-ending
      Data volume         | TBs to PBs       |                      | Continuous stream
      Programming model   | MapReduce        |                      | DAG (pre-programmed)
      Users               | Developers       |                      | Developers
      Google project      | MapReduce        |                      |
      Open source project | Hadoop MapReduce |                      | Storm and S4
  • 40. Big Data Processing – The missing part
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries (ad hoc)        | DAG (pre-programmed)
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        |                         |
      Open source project | Hadoop MapReduce |                         | Storm and S4
  • 41. Big Data Processing
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries                 | DAG
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        | Dremel                  |
      Open source project | Hadoop MapReduce |                         | Storm and S4
  • 42. Big Data Processing
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries                 | DAG
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        | Dremel                  |
      Open source project | Hadoop MapReduce | Apache Drill            | Storm and S4
  • 43. Design Principles
    Flexible
      • Pluggable query languages
      • Extensible execution engine
      • Pluggable data formats
      • Column-based and row-based
      • Schema and schema-less
      • Pluggable data sources
    Easy
      • Unzip and run
      • Zero configuration
      • Reverse DNS not needed
      • IP addresses can change
      • Clear and concise log messages
    Dependable
      • No SPOF
      • Instant recovery from crashes
    Fast
      • C/C++ core with Java support
      • Google C++ style guide
      • Min latency and max throughput (limited only by hardware)
  • 44. Simple Architecture
    [Diagram. Labels: Query language, Interface, Logical language, Transform, Physical plan, Optimize, Execute]
  • 45. Standard Interfaces
    [Diagram. Labels: Query language (SQL 2003), Interface, Logical language (Drill logical syntax), Transform, Scanner API, Physical plan, Optimize, Execute]
  • 46. Logical Plan Syntax:
    query:[ { op:"sequence", do:[
      { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
      { op: "transform", transforms: [ { ref: "donuts.quanity", expr: "donuts.sales"} ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      …
  • 47. Logical Streaming Example
    [Diagram. Input records 0 1 2 3 4 produce the frames 0, 01, 012, 123, 234]
    { @id: <refnum>, op: "window-frame", input: <input>,
      keys: [ <name>,... ],
      ref: <name>,
      before: 2,
      after: here }
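
The frames on slide 47 (0, 01, 012, 123, 234) are what a window that keeps two earlier records plus the current one produces. The Python sketch below illustrates that behavior only; the function name `window_frames` is hypothetical and this is not Drill's implementation.

```python
from collections import deque

def window_frames(records, before=2):
    """For each record, yield the frame of up to `before` earlier records
    followed by the record itself (the slide's `after: here`)."""
    frame = deque(maxlen=before + 1)  # oldest record falls out automatically
    for record in records:
        frame.append(record)
        yield list(frame)

frames = list(window_frames([0, 1, 2, 3, 4], before=2))
print(frames)
```

Using a bounded `deque` means the input can be an unbounded stream, since only `before + 1` records are held at any time.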
  • 48. Logical Plan
    [Diagram. Operators, in order: scan-json "table-1", filter exp1, flatten, aggregate exp2]
  • 49. Execution Plan
    [Diagram. node1, node2 and node3 each show scan-json "table-1", filter exp1, flatten; one aggregate exp2 is shown for all three]
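
A rough Python sketch of how the logical plan on slide 48 might fan out into the execution plan on slide 49: scan, filter, and flatten run per node, and a single aggregate combines the results. Everything concrete here is an assumption made for illustration, including the sample JSON records, the meaning given to `exp1` (keep records flagged `ok`), and the meaning given to `exp2` (count rows per item).

```python
import json
from collections import Counter
from itertools import chain

# Hypothetical partitions of "table-1", one per node.
partitions = {
    "node1": ['{"user": "a", "items": ["x", "y"], "ok": true}'],
    "node2": ['{"user": "b", "items": ["x"], "ok": false}'],
    "node3": ['{"user": "c", "items": ["y", "z"], "ok": true}'],
}

def scan_json(lines):
    """scan-json: parse one JSON record per line."""
    for line in lines:
        yield json.loads(line)

def filter_exp1(records):
    """filter exp1: exp1 is a placeholder on the slide; here it keeps ok records."""
    return (r for r in records if r["ok"])

def flatten(records):
    """flatten: emit one row per element of the repeated `items` field."""
    for r in records:
        for item in r["items"]:
            yield {"user": r["user"], "item": item}

def local_plan(lines):
    """The part of the plan that runs independently on each node."""
    return flatten(filter_exp1(scan_json(lines)))

def aggregate_exp2(streams):
    """aggregate exp2: placeholder aggregation, counting rows per item."""
    return Counter(row["item"] for row in chain.from_iterable(streams))

result = aggregate_exp2(local_plan(lines) for lines in partitions.values())
print(dict(result))
```

The per-node stages are generators, so nothing is materialized until the aggregate pulls rows from each node's stream.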
  • 50. Representing a DAG
    [Diagram. Node 18 feeds node 19, aggregate exp2]
    { @id: 19, op: "aggregate", input: 18,
      type: <simple|running|repeat>,
      keys: [<name>,...],
      aggregations: [ {ref: <name>, expr: <aggexpr> },... ] }
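
Since each operator on slide 50 names its upstream operator through `input`, a list of such operators is enough to describe a DAG. The Python sketch below shows one way to recover an execution order from those references. It is an illustration, not Drill code: operators 16 and 17 and the function `execution_order` are invented for the example, and only the `@id` and `input` fields come from the slide.

```python
# Operators in arbitrary order; 18 and 19 mirror the slide, 16 and 17 are assumed.
plan = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 17, "op": "filter", "input": 16},
    {"@id": 16, "op": "scan-json"},
    {"@id": 18, "op": "flatten", "input": 17},
]

def execution_order(operators):
    """Return operator ids so that every operator comes after its inputs."""
    by_id = {op["@id"]: op for op in operators}
    order, seen = [], set()

    def visit(op_id):
        if op_id in seen:
            return
        seen.add(op_id)
        inputs = by_id[op_id].get("input", [])
        if isinstance(inputs, int):  # a single upstream operator
            inputs = [inputs]
        for upstream in inputs:
            visit(upstream)
        order.append(op_id)

    for op_id in by_id:
        visit(op_id)
    return order

print(execution_order(plan))
```

Allowing `input` to be a list as well as a single id lets the same walk handle operators with several upstream branches, such as a join.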
  • 51. Non-SQL queries
    [Diagram. Labels: scan-json "table-1" (twice), streaming k-means, ball k-means, aggregate exp2, k-means join, cluster features]
  • 52. Design Principles
    Flexible
      • Pluggable query languages
      • Extensible execution engine
      • Pluggable data formats
      • Column-based and row-based
      • Schema and schema-less
      • Pluggable data sources
    Easy
      • Unzip and run
      • Zero configuration
      • Reverse DNS not needed
      • IP addresses can change
      • Clear and concise log messages
    Dependable
      • No SPOF
      • Instant recovery from crashes
    Fast
      • C/C++ core with Java support
      • Google C++ style guide
      • Min latency and max throughput (limited only by hardware)
  • 53. The future is not what we thought it would be
  • 54. It is better!
  • 55. Get Involved!
    Tweet: #hcj13w #mapr @ted_dunning
  • 56. Get Involved!
    Download these slides
      – http://www.mapr.com/company/events/hcj-01-21-2013
    Join the Drill project
      – drill-dev-subscribe@incubator.apache.org
      – #apachedrill
    Contact me:
      – tdunning@maprtech.com
      – tdunning@apache.org
      – @ted_dunning
    Join MapR (in Japan!)
      – jobs@mapr.com