
Predictive Analytics San Diego

The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? Ted Dunning explains what is needed for three important use cases.

Published in: Technology

  1. Remembering the Future (©MapR Technologies - Confidential)
  2. My Background
     • University, Startups
       – Aptex, MusicMatch, ID Analytics, Veoh
       – big data since before it was big
     • Open source
       – even before the internet
       – Apache Hadoop, Mahout, Zookeeper, Drill
       – bought the beer at first HUG
     • MapR
     • Founding member of Apache Drill
  3. MapR Technologies
     • Silicon Valley Startup
       – Top investors
       – Top technical and management team
         • Google, Microsoft, EMC, NetApp, Oracle
     • Enterprise quality distribution for Hadoop
     • Many extensions to basic Hadoop function
     • Strong supporter of Apache Drill
  4. Philosophy First: What is History?
  5. The study of the past (what came before now)
  6. What is the future? (it comes after now)
  7. (no text on slide)
  8. (no text on slide)
  9. (no text on slide)
  10. But the future also has a past!
  11. Do you remember the future?
  12. (no text on slide)
  13. (no text on slide)
  14. (no text on slide)
  15. (no text on slide)
  16. (no text on slide)
  17. Some things turned out as expected
  18. Guys wearing Fedoras
  19. Many things are different!
  20. Hadoop has a history
  21. Hadoop also has a future
  22. The Old Future of Hadoop
      • Map-reduce and HDFS
        – more and more, but not really different
      • Eco-system additions
        – Simpler programming (Hive and Pig)
        – Key-value store
        – Ad hoc query
      • Stands apart from other computing
        – Required by HDFS and other limitations
  23. The New Future of Hadoop
      • Real-time processing
        – Combines real-time and long-time
      • Integration with traditional IT
        – No need to stand apart
      • Integration with new technologies
        – Solr, Node.js, Twisted all should interface directly
      • Fast and flexible computation
        – Drill logical plan language
  24. Example #1: Search Abuse
  25. History matrix
      • One row per user
      • One column per thing
  26. Recommendation based on cooccurrence
      • Cooccurrence gives item-item mapping
      • One row and column per thing
  27. Cooccurrence matrix can also be implemented as a search index
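
The idea on slides 25 to 27 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the toy histories, and the raw-count threshold `min_count` are all hypothetical, and the slides do not say how the real system decides which cooccurrences are significant. Each item's row of the cooccurrence matrix becomes a small "document" of related item ids, and a user's history is then used as the query against those documents.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(histories):
    """Count, for each pair of items, how many users have both in their history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def indicator_documents(counts, min_count=2):
    """Turn each item's row into a 'document' listing strongly cooccurring items.

    min_count is an illustrative stand-in for a proper significance test.
    """
    return {
        item: sorted(other for other, n in row.items() if n >= min_count)
        for item, row in counts.items()
    }


def recommend(docs, user_history, top_n=3):
    """Score items as a search engine might: by overlap between query and document."""
    query = set(user_history)
    scores = {
        item: len(query & set(indicators))
        for item, indicators in docs.items()
        if item not in query
    }
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]


# History matrix as a dict: one entry (row) per user, items are the columns.
histories = {
    "u1": ["a", "b", "c"],
    "u2": ["a", "b"],
    "u3": ["b", "c", "d"],
    "u4": ["a", "b", "d"],
}
counts = cooccurrence(histories)
docs = indicator_documents(counts)
print(recommend(docs, ["a"]))
```

In a deployment like the one in the slides, the indicator documents would presumably be stored as fields in a search index rather than in a Python dict, and the overlap scoring would be left to the search engine.
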
  28. (diagram labels) Complete history; Cooccurrence (Mahout); Item metadata; Solr indexing; Solr Indexer (x2); Index shards
  29. (diagram labels) User history; Web tier; Item metadata; Solr search; Solr Indexer (x2); Index shards
  30. Objective Results
      • At a very large credit card company
      • History is all transactions, all web interaction
      • Processing time cut from 20 hours per day to 3
      • Recommendation engine load time decreased from 8 hours to 3 minutes
  31. Example #2: Web Technology
  32. (diagram labels) Raw logs; Real-time data; Fast analysis (Storm); Analytic output
  33. (diagram labels) Raw logs; Large analysis (map-reduce); Analytic output
  34. (diagram labels) Raw logs; Analytic output; Presentation tier (d3 + node.js); Browser query
  35. Objective Results
      • Real-time + long-time analysis is seamless
      • Web tier can be rooted directly on Hadoop cluster
      • No need to move data
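
One way to picture "real-time + long-time analysis is seamless" is a query that adds a batch result to a streaming result over the same logs. The sketch below is an assumption about how such a combination could work, not a description of the system in the slides: the cutoff timestamp, the event fields `ts` and `page`, and all function names are hypothetical.

```python
from collections import Counter


def count_pages(events):
    """Count page views in a collection of log events."""
    return Counter(event["page"] for event in events)


def split_at(events, cutoff):
    """Split events into those before the cutoff and those at or after it."""
    older = [e for e in events if e["ts"] < cutoff]
    newer = [e for e in events if e["ts"] >= cutoff]
    return older, newer


def page_views(page, batch_view, realtime_view):
    """Answer a query by adding the long-time and real-time counts."""
    return batch_view[page] + realtime_view[page]


raw_logs = [
    {"ts": 1, "page": "/home"},
    {"ts": 2, "page": "/docs"},
    {"ts": 3, "page": "/home"},
    {"ts": 4, "page": "/home"},
]
older, newer = split_at(raw_logs, cutoff=3)
batch_view = count_pages(older)     # stands in for the large (map-reduce) analysis
realtime_view = count_pages(newer)  # stands in for the fast (Storm) analysis
print(page_views("/home", batch_view, realtime_view))
```

Because both views are derived from the same raw logs, the presentation tier only has to add two numbers; nothing needs to be copied to a separate system.
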
  36. Example #3: Apache Drill
  37. Big Data Processing – Hadoop

                            Batch processing
      Query runtime         Minutes to hours
      Data volume           TBs to PBs
      Programming model     MapReduce
      Users                 Developers
      Google project        MapReduce
      Open source project   Hadoop MapReduce

  38. Big Data Processing – Hadoop and Storm

                            Batch processing    Stream processing
      Query runtime         Minutes to hours    Never-ending
      Data volume           TBs to PBs          Continuous stream
      Programming model     MapReduce           DAG (pre-programmed)
      Users                 Developers          Developers
      Google project        MapReduce           -
      Open source project   Hadoop MapReduce    Storm or Apache S4

  39. Big Data Processing – The missing part

                            Batch processing    Interactive analysis      Stream processing
      Query runtime         Minutes to hours    -                         Never-ending
      Data volume           TBs to PBs          -                         Continuous stream
      Programming model     MapReduce           -                         DAG (pre-programmed)
      Users                 Developers          -                         Developers
      Google project        MapReduce           -                         -
      Open source project   Hadoop MapReduce    -                         Storm and S4

  40. Big Data Processing – The missing part

                            Batch processing    Interactive analysis      Stream processing
      Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
      Data volume           TBs to PBs          GBs to PBs                Continuous stream
      Programming model     MapReduce           Queries (ad hoc)          DAG (pre-programmed)
      Users                 Developers          Analysts and developers   Developers
      Google project        MapReduce           -                         -
      Open source project   Hadoop MapReduce    -                         Storm and S4

  41. Big Data Processing

                            Batch processing    Interactive analysis      Stream processing
      Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
      Data volume           TBs to PBs          GBs to PBs                Continuous stream
      Programming model     MapReduce           Queries                   DAG
      Users                 Developers          Analysts and developers   Developers
      Google project        MapReduce           Dremel                    -
      Open source project   Hadoop MapReduce    -                         Storm and S4

  42. Big Data Processing

                            Batch processing    Interactive analysis      Stream processing
      Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
      Data volume           TBs to PBs          GBs to PBs                Continuous stream
      Programming model     MapReduce           Queries                   DAG
      Users                 Developers          Analysts and developers   Developers
      Google project        MapReduce           Dremel                    -
      Open source project   Hadoop MapReduce    Apache Drill              Storm and S4

  43. Design Principles
      Flexible
        • Pluggable query languages
        • Extensible execution engine
        • Pluggable data formats
        • Column-based and row-based
        • Schema and schema-less
        • Pluggable data sources
      Easy
        • Unzip and run
        • Zero configuration
        • Reverse DNS not needed
        • IP addresses can change
        • Clear and concise log messages
      Dependable
        • No SPOF
        • Instant recovery from crashes
      Fast
        • C/C++ core with Java support
        • Google C++ style guide
        • Min latency and max throughput (limited only by hardware)
  44. Simple Architecture
      (diagram labels) Interface; Query language; Transform; Logical Language; Optimize; Physical Plan; Execute
  45. Standard Interfaces
      (diagram labels) Interface; Query language; Transform; Logical Language; Optimize; Physical Plan; Execute; SQL 2003; Drill logical syntax; Scanner API
  46. Logical Plan Syntax:

          query:[
            {
              op:"sequence", do:[
                { op: "scan",
                  memo: "initial_scan",
                  ref: "donuts",
                  source: "local-logs",
                  selection: {data: "activity"}
                },
                { op: "transform",
                  transforms: [
                    { ref: "donuts.quanity", expr: "donuts.sales"}
                  ]
                },
                { op: "filter", expr: "donuts.ppu < 1.00" },
                …

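
To make the `sequence` of `scan`, `transform`, and `filter` operators on slide 46 concrete, here is a toy interpreter over in-memory records. It is a sketch under heavy assumptions and not Drill's implementation: the `sources` dict, the helper names, the two sample records, and the tiny `path operator number` expression format are all hypothetical, and the `selection` and `memo` fields are ignored. The field name `donuts.quanity` is copied from the slide as written.

```python
import operator

# Comparison operators supported by this toy filter (an illustrative subset).
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def get_path(row, path):
    """Read a dotted path such as 'donuts.ppu' from nested dicts."""
    value = row
    for key in path.split("."):
        value = value[key]
    return value


def set_path(row, path, value):
    """Write a dotted path, creating intermediate dicts as needed."""
    *parents, leaf = path.split(".")
    target = row
    for key in parents:
        target = target.setdefault(key, {})
    target[leaf] = value


def run_sequence(ops, sources):
    """Apply each operator in order; every operator consumes the previous output."""
    rows = []
    for op in ops:
        kind = op["op"]
        if kind == "scan":
            # Each record becomes visible under the name given by 'ref'.
            rows = [{op["ref"]: dict(rec)} for rec in sources[op["source"]]]
        elif kind == "transform":
            for row in rows:
                for t in op["transforms"]:
                    set_path(row, t["ref"], get_path(row, t["expr"]))
        elif kind == "filter":
            left, symbol, right = op["expr"].split()
            rows = [r for r in rows if OPS[symbol](get_path(r, left), float(right))]
        else:
            raise ValueError(f"unsupported operator: {kind}")
    return rows


plan = {"op": "sequence", "do": [
    {"op": "scan", "memo": "initial_scan", "ref": "donuts", "source": "local-logs"},
    {"op": "transform", "transforms": [
        {"ref": "donuts.quanity", "expr": "donuts.sales"}]},
    {"op": "filter", "expr": "donuts.ppu < 1.00"},
]}
sources = {"local-logs": [{"sales": 10, "ppu": 0.5}, {"sales": 3, "ppu": 1.5}]}
result = run_sequence(plan["do"], sources)
print(result)
```
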
  47. Logical Streaming Example

          { @id: <refnum>, op: "window-frame",
            input: <input>,
            keys: [ <name>,... ],
            ref: <name>,
            before: 2, after: here
          }

      (figure) Input sequence: 0 1 2 3 4. Window frames: 0 | 0 1 | 0 1 2 | 1 2 3 | 2 3 4

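
The numbers on slide 47 (0 1 2 3 4, followed by 0, 0 1, 0 1 2, 1 2 3, 2 3 4) are consistent with a frame that holds up to two preceding elements plus the current one. A minimal Python sketch of that reading follows; the function name is hypothetical, `after: here` is interpreted as "no following elements", and the `keys` partitioning is left out.

```python
def window_frames(values, before, after=0):
    """Return, for each position, the slice covering `before` earlier
    elements, the current element, and `after` later elements."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)              # clip at the start of the stream
        end = min(len(values), i + after + 1)   # clip at the end of the stream
        frames.append(values[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```
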
  48. Logical Plan
      (diagram labels) "table-1"; scan-json; filter; flatten; exp1; aggregate; exp2
  49. Execution Plan
      (diagram labels) node1, node2, node3; each node shows "table-1", scan-json, filter, flatten, exp1; aggregate and exp2 appear once
  50. Representing a DAG

          { @id: 19, op: "aggregate",
            input: 18,
            type: <simple|running|repeat>,
            keys: [<name>,...],
            aggregations: [
              {ref: <name>, expr: <aggexpr> },...
            ]
          }

      (diagram labels) aggregate; exp2; 18; 19

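
Slide 50 shows each operator carrying an `@id` and naming its upstream operator through `input`, which is enough to encode a DAG as a flat list. The sketch below derives an execution order from such a list. It is an illustration only: the function name and the sample operators are hypothetical, allowing `input` to be a list (for operators such as a join) is an assumption, and the plan is assumed to be acyclic.

```python
def execution_order(ops):
    """Return operator ids so that every operator comes after its inputs.

    Assumes the references form a DAG; cycles are not detected.
    """
    by_id = {op["@id"]: op for op in ops}
    order, seen = [], set()

    def visit(op_id):
        if op_id in seen:
            return
        seen.add(op_id)
        inputs = by_id[op_id].get("input", [])
        if not isinstance(inputs, list):
            inputs = [inputs]        # a single upstream id, as on the slide
        for dep in inputs:
            visit(dep)               # schedule upstream operators first
        order.append(op_id)

    for op_id in by_id:
        visit(op_id)
    return order


ops = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "flatten", "input": 17},
    {"@id": 17, "op": "scan-json"},
    {"@id": 20, "op": "join", "input": [19, 17]},  # hypothetical two-input operator
]
print(execution_order(ops))  # [17, 18, 19, 20]
```
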
  51. Non-SQL queries
      (diagram labels) "table-1" (x2); scan-json (x2); streaming k-means; ball k-means; k-means; k; cluster; features; join; aggregate; exp2
  52. Design Principles (same content as slide 43)
  53. The future is not what we thought it would be
  54. It is better!
  55. Get Involved! Tweet: #hcj13w #mapr @ted_dunning
  56. Get Involved!
      • Download these slides
        – http://www.mapr.com/company/events/hcj-01-21-2013
      • Join the Drill project
        – drill-dev-subscribe@incubator.apache.org
        – #apachedrill
      • Contact me:
        – tdunning@maprtech.com
        – tdunning@apache.org
        – @ted_dunning
      • Join MapR (in Japan!)
        – jobs@mapr.com