predictive-analytics-san-diego-2013-02-21
The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? I explain what is needed for three important use cases.

Published in: Technology
Transcript of "predictive-analytics-san-diego-2013-02-21"

  1. Remembering the Future ©MapR Technologies - Confidential
  2. My Background
     University, Startups
       – Aptex, MusicMatch, ID Analytics, Veoh
       – big data since before it was big
     Open source
       – even before the internet
       – Apache Hadoop, Mahout, Zookeeper, Drill
       – bought the beer at first HUG
     MapR
     Founding member of Apache Drill
  3. MapR Technologies
     Silicon Valley Startup
       – Top investors
       – Top technical and management team
            • Google, Microsoft, EMC, NetApp, Oracle
     Enterprise quality distribution for Hadoop
     Many extensions to basic Hadoop function
     Strong supporter of Apache Drill
  4. Philosophy First
     What is History?
  5. The study of the past (what came before now)
  6. What is the future? (it comes after now)
  10. But the future also has a past!
  11. Do you remember the future?
  17. Some things turned out as expected
  18. Guys wearing Fedoras
  19. Many things are different!
  20. Hadoop has a history
  21. Hadoop also has a future
  22. The Old Future of Hadoop
      Map-reduce and HDFS
        – more and more, but not really different
      Eco-system additions
        – Simpler programming (Hive and Pig)
        – Key-value store
        – Ad hoc query
      Stands apart from other computing
        – Required by HDFS and other limitations
  23. The New Future of Hadoop
      Real-time processing
        – Combines real-time and long-time
      Integration with traditional IT
        – No need to stand apart
      Integration with new technologies
        – Solr, Node.js, Twisted all should interface directly
      Fast and flexible computation
        – Drill logical plan language
  24. Example #1: Search Abuse
  25. History matrix
      One row per user
      One column per thing
  26. Recommendation based on cooccurrence
      Cooccurrence gives item-item mapping
      One row and column per thing
  27. Cooccurrence matrix can also be implemented as a search index
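A minimal sketch of slides 25 to 27, under stated assumptions: the history matrix is held as a dict of per-user item sets, cooccurrence is a plain pair count (a production system such as Mahout would typically apply a statistical test rather than raw counts), and a Python dict stands in for the search index. The names `cooccurrence`, `build_index`, and `recommend` are hypothetical and are not part of Mahout or Solr.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence(history):
    """history maps user -> set of things (the rows of the history matrix).

    Returns item -> {other_item: number of users who have both}.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def build_index(counts, min_count=1):
    """One 'document' per item; its indicator field lists cooccurring items.

    Assumption: a raw count threshold decides which indicators are kept.
    """
    return {
        item: {other for other, c in others.items() if c >= min_count}
        for item, others in counts.items()
    }


def recommend(index, user_items, top_n=3):
    """Use the user's history as the query against the indicator fields."""
    scores = {}
    for item, indicators in index.items():
        if item in user_items:
            continue  # do not recommend what the user already has
        score = len(indicators & user_items)
        if score:
            scores[item] = score
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


history = {"u1": {"a", "b"}, "u2": {"a", "b", "c"}, "u3": {"b", "c"}}
index = build_index(cooccurrence(history), min_count=2)
print(recommend(index, {"a"}))
```

In a search engine the same idea becomes one document per item with an indicator field, and a recommendation is an ordinary query made of the user's recent items.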
  28. Diagram labels: Complete history; Cooccurrence (Mahout); Item metadata; Solr indexing; Solr Indexer (two boxes); Index shards
  29. Diagram labels: User history; Web tier; Solr search; Item metadata; Solr Indexer (two boxes); Index shards
  30. Objective Results
      At a very large credit card company
      History is all transactions, all web interaction
      Processing time cut from 20 hours per day to 3
      Recommendation engine load time decreased from 8 hours to 3 minutes
  31. Example #2: Web Technology
  32. Diagram labels: Real-time data; Fast analysis (Storm); Raw logs; Analytic output
  33. Diagram labels: Large analysis (map-reduce); Raw logs; Analytic output
  34. Diagram labels: Browser; Presentation tier (d3 + node.js); query; Raw logs; Analytic output
  35. Objective Results
      Real-time + long-time analysis is seamless
      Web tier can be rooted directly on Hadoop cluster
      No need to move data
  36. Example #3: Apache Drill
  37. Big Data Processing – Hadoop
                          | Batch processing
      Query runtime       | Minutes to hours
      Data volume         | TBs to PBs
      Programming model   | MapReduce
      Users               | Developers
      Google project      | MapReduce
      Open source project | Hadoop MapReduce
  38. Big Data Processing – Hadoop and Storm
                          | Batch processing | Stream processing
      Query runtime       | Minutes to hours | Never-ending
      Data volume         | TBs to PBs       | Continuous stream
      Programming model   | MapReduce        | DAG (pre-programmed)
      Users               | Developers       | Developers
      Google project      | MapReduce        |
      Open source project | Hadoop MapReduce | Storm or Apache S4
  39. Big Data Processing – The missing part
                          | Batch processing | Interactive analysis | Stream processing
      Query runtime       | Minutes to hours |                      | Never-ending
      Data volume         | TBs to PBs       |                      | Continuous stream
      Programming model   | MapReduce        |                      | DAG (pre-programmed)
      Users               | Developers       |                      | Developers
      Google project      | MapReduce        |                      |
      Open source project | Hadoop MapReduce |                      | Storm and S4
  40. Big Data Processing – The missing part
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries (ad hoc)        | DAG (pre-programmed)
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        |                         |
      Open source project | Hadoop MapReduce |                         | Storm and S4
  41. Big Data Processing
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries                 | DAG
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        | Dremel                  |
      Open source project | Hadoop MapReduce |                         | Storm and S4
  42. Big Data Processing
                          | Batch processing | Interactive analysis    | Stream processing
      Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
      Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
      Programming model   | MapReduce        | Queries                 | DAG
      Users               | Developers       | Analysts and developers | Developers
      Google project      | MapReduce        | Dremel                  |
      Open source project | Hadoop MapReduce | Apache Drill            | Storm and S4
  43. Design Principles
      Flexible
        • Pluggable query languages
        • Extensible execution engine
        • Pluggable data formats
        • Column-based and row-based
        • Schema and schema-less
        • Pluggable data sources
      Easy
        • Unzip and run
        • Zero configuration
        • Reverse DNS not needed
        • IP addresses can change
        • Clear and concise log messages
      Dependable
        • No SPOF
        • Instant recovery from crashes
      Fast
        • C/C++ core with Java support
        • Google C++ style guide
        • Min latency and max throughput (limited only by hardware)
  44. Simple Architecture
      Diagram labels: Query language; Interface; Logical Language; Transform; Physical Plan; Optimize; Execute
  45. Standard Interfaces
      Diagram labels: Query language (SQL 2003); Interface; Logical Language (Drill logical syntax); Transform; Scanner API; Physical Plan; Optimize; Execute
  46. Logical Plan Syntax:
        query:[
          {
            op:"sequence",
            do:[
              { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
              { op: "transform", transforms: [ { ref: "donuts.quantity", expr: "donuts.sales" } ] },
              { op: "filter", expr: "donuts.ppu < 1.00" },
              …
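To make the shape of such a plan concrete, here is a toy interpreter for only the three operators visible on slide 46. It is a sketch under assumptions, not Drill's implementation: `run_sequence`, `eval_expr`, and the `sources` dictionary are hypothetical, the expression evaluator understands only a bare field reference and `field < number`, and `selection: {data: ...}` is assumed to name a dataset inside the source. The slide's plan continues past the filter; nothing beyond it is modelled here.

```python
def get_path(record, path):
    """Read a dotted reference such as 'donuts.ppu' from a nested dict."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def set_path(record, path, value):
    """Write a dotted reference, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    target = record
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value


def eval_expr(expr, record):
    """Toy evaluator: either 'a.b' or 'a.b < number' (assumption)."""
    tokens = expr.split()
    if len(tokens) == 1:
        return get_path(record, tokens[0])
    if len(tokens) == 3 and tokens[1] == "<":
        return get_path(record, tokens[0]) < float(tokens[2])
    raise ValueError(f"unsupported expression: {expr!r}")


def run_sequence(ops, sources):
    """Apply the operators of a 'sequence' one after another."""
    rows = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            data = sources[op["source"]][op["selection"]["data"]]
            # Each scanned record is exposed under the scan's ref name.
            rows = [{op["ref"]: dict(rec)} for rec in data]
        elif kind == "transform":
            for row in rows:
                for t in op["transforms"]:
                    set_path(row, t["ref"], eval_expr(t["expr"], row))
        elif kind == "filter":
            rows = [row for row in rows if eval_expr(op["expr"], row)]
        else:
            raise ValueError(f"unsupported operator: {kind!r}")
    return rows


plan = {"query": [{"op": "sequence", "do": [
    {"op": "scan", "memo": "initial_scan", "ref": "donuts",
     "source": "local-logs", "selection": {"data": "activity"}},
    {"op": "transform",
     "transforms": [{"ref": "donuts.quantity", "expr": "donuts.sales"}]},
    {"op": "filter", "expr": "donuts.ppu < 1.00"},
]}]}

# Hypothetical sample data standing in for the "local-logs" source.
sources = {"local-logs": {"activity": [
    {"sales": 10, "ppu": 0.5},
    {"sales": 3, "ppu": 1.5},
]}}

print(run_sequence(plan["query"][0]["do"], sources))
```

The point of the sketch is that a logical plan is plain data: any front end that can emit this structure can drive the same execution engine.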
  47. Logical Streaming Example
        { @id: <refnum>, op: "window-frame", input: <input>, keys: [ <name>,... ], ref: <name>, before: 2, after: here }
      Figure: input stream 0 1 2 3 4; output frames 0, 01, 012, 123, 234
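The window-frame behaviour on slide 47 (two records before, none after the current one) can be sketched as a small generator. This is an illustration only: `window_frames` is a hypothetical name, and the optional `key` function is an assumption about how the `keys` field might partition the stream.

```python
from collections import defaultdict, deque


def window_frames(stream, before=2, key=None):
    """Yield, for each record, the frame of up to `before` earlier records
    plus the record itself ("after: here").

    If `key` is given, a separate window is kept per key value.
    """
    windows = defaultdict(lambda: deque(maxlen=before + 1))
    for record in stream:
        window = windows[key(record) if key else None]
        window.append(record)  # the deque drops the oldest record itself
        yield list(window)


frames = list(window_frames(range(5)))
print(frames)
```

For the input 0 1 2 3 4 this yields the frames shown in the slide's figure: 0, 01, 012, 123, 234.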
  48. Logical Plan
      Diagram: scan-json "table-1" -> filter exp1 -> flatten -> aggregate exp2
  49. Execution Plan
      Diagram: node1, node2 and node3 each run scan-json "table-1" -> filter exp1 -> flatten; a single aggregate exp2 combines their outputs
  50. Representing a DAG
      Figure: node 18 feeds node 19 (aggregate exp2)
        { @id: 19, op: "aggregate", input: 18, type: <simple|running|repeat>, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ] }
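Slide 50 encodes the DAG by giving every operator an `@id` and letting `input` refer to another operator's id. The sketch below evaluates such a list by following those references. It is hedged throughout: the `scan` operator with inline `rows`, the `("sum", "sales")` tuple form of an aggregate expression, and the function names are all assumptions made for illustration, and only the `simple` aggregate type is handled.

```python
AGG_FUNCS = {"sum": sum, "count": len, "max": max, "min": min}


def aggregate(rows, keys, aggregations):
    """Group rows by `keys` and compute each aggregation per group."""
    groups = {}
    for row in rows:
        groups.setdefault(tuple(row[k] for k in keys), []).append(row)
    result = []
    for key_values, members in groups.items():
        out = dict(zip(keys, key_values))
        for agg in aggregations:
            func, field = agg["expr"]  # hypothetical (function, field) form
            out[agg["ref"]] = AGG_FUNCS[func]([m[field] for m in members])
        result.append(out)
    return result


def evaluate(ops, target_id):
    """Evaluate operator `target_id`, resolving `input` references by @id."""
    by_id = {op["@id"]: op for op in ops}
    cache = {}  # a node shared by several consumers is computed once

    def run(op_id):
        if op_id in cache:
            return cache[op_id]
        op = by_id[op_id]
        if op["op"] == "scan":
            out = list(op["rows"])
        elif op["op"] == "aggregate":
            out = aggregate(run(op["input"]), op["keys"], op["aggregations"])
        else:
            raise ValueError(f"unsupported operator: {op['op']!r}")
        cache[op_id] = out
        return out

    return run(target_id)


ops = [
    # Hypothetical stand-in for whatever operator 18 is in the slide.
    {"@id": 18, "op": "scan", "rows": [
        {"shop": "a", "sales": 2},
        {"shop": "b", "sales": 5},
        {"shop": "a", "sales": 3},
    ]},
    {"@id": 19, "op": "aggregate", "input": 18, "type": "simple",
     "keys": ["shop"],
     "aggregations": [{"ref": "total", "expr": ("sum", "sales")}]},
]

print(evaluate(ops, 19))
```

Because operators refer to each other only by id, the same flat list can describe a branching graph as easily as a linear pipeline.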
  51. Non-SQL queries
      Diagram labels: scan-json "table-1" (two scans); streaming k-means; ball k-means; aggregate exp2; k-means join; cluster features
  52. Design Principles
      Flexible
        • Pluggable query languages
        • Extensible execution engine
        • Pluggable data formats
        • Column-based and row-based
        • Schema and schema-less
        • Pluggable data sources
      Easy
        • Unzip and run
        • Zero configuration
        • Reverse DNS not needed
        • IP addresses can change
        • Clear and concise log messages
      Dependable
        • No SPOF
        • Instant recovery from crashes
      Fast
        • C/C++ core with Java support
        • Google C++ style guide
        • Min latency and max throughput (limited only by hardware)
  53. The future is not what we thought it would be
  54. It is better!
  55. Get Involved!
      Tweet: #hcj13w #mapr @ted_dunning
  56. Get Involved!
      Download these slides
        – http://www.mapr.com/company/events/hcj-01-21-2013
      Join the Drill project
        – drill-dev-subscribe@incubator.apache.org
        – #apachedrill
      Contact me:
        – tdunning@maprtech.com
        – tdunning@apache.org
        – @ted_dunning
      Join MapR (in Japan!)
        – jobs@mapr.com