Dataiku - Hadoop Ecosystem - @Epitech Paris - January 2014

Overview of Hadoop and its ecosystem: Pig, Hive, GraphLab, Mahout, Impala, Spark, Storm, ...

Presented at Epitech Paris in January 2014



  1. 1. How do Elephants Make Babies? Florian Douetteau, CEO, Dataiku
  2. 2. Agenda • Part #1 Big Data • Part #2 Why Hadoop, How, and When • Part #3 Overview of the Coding Ecosystem: Pig / Hive / Cascading • Part #4 Overview of the Machine Learning Ecosystem: Mahout • Part #5 Overview of the Extended Ecosystem
  3. 3. PART #1 BIG DATA
  4. 4. Collocation (n.): a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Big Apple. Big Mama. Big Data.
  5. 5. “Big” Data in 1999: struct Element { Key key; void* stat_data; }; C-optimized data structures, perfect hashing; HP-UX servers with 4 GB RAM; 100 GB of data; web crawler with socket reuse, HTTP 0.9. Time to process: 1 month.
  6. 6. Big Data in 2013: Hadoop; Java / Pig / Hive / Scala / Clojure / …; a dozen NoSQL data stores; MPP databases; real-time. Time to process: 1 hour.
  7. 7. To Hadoop (chart: dataset size and budget per use case over time). Web Search 1999 (1 TB, $1B), Logistics 2004 (1 TB), Banking CRM 2008 (10 TB, $10M; SQL or ad hoc), Web Search 2010 (1000 TB, $500M), Social Gaming 2011 (50 TB), Online Advertising 2012 (100 TB), E-Commerce 2013 (1 TB, $100M): SQL + Hadoop.
  8. 8. Meet Hal Alowne. Hal Alowne, BI Manager at Dim's Private Showroom, a European e-commerce website: $100M revenue, 1 million customers, 1 data analyst (Hal himself). Dim Sum, CEO & Founder of Dim's Private Showroom: "Hey Hal! We need a big data platform like the big guys. Let's just do as they do!" The Big Data Copy Cat Project. The big guys: $10B+ revenue, 100M+ customers, 100+ data scientists.
  9. 9. QUESTION #1: IS IT EASY OR NOT?
  10. 10. SUBTLE PATTERNS
  11. 11. "MORE BUSINESS" BUTTONS
  12. 12. QUESTION #2: WHO TO HIRE?
  13. 13. DATA SCIENTIST AT NIGHT
  14. 14. DATA CLEANER BY DAY
  15. 15. PARADOX #3: WHERE?
  16. 16. MY DATA IS WORTH MILLIONS
  17. 17. I SEND IT TO THE MARKETING CLOUD
  18. 18. QUESTION #4: IS IT BIG OR NOT?
  19. 19. WE ALL LIVE IN A BIG DATA LAKE
  20. 20. ALL MY DATA PROBABLY FITS IN HERE
  21. 21. QUESTION #5 (at last): HUMAN OR NOT?
  22. 22. MACHINE LEARNING WILL SAVE US ALL
  23. 23. I JUST WANT MORE REPORTS
  24. 24. MERIT = TIME + ROI. TIME: 6 months; ROI: apps. The slow way (2013): find the right people (6 months?), choose the technology (6 months?), make it work (6 months?). The fast way (2013-2014): build the lab in 6 months (rather than 18) by training people and reusing working patterns, then deploy apps that actually deliver value: targeted newsletters, recommender systems, adapted products / promotions.
  25. 25. Statistics and machine learning are complex! Try to understand myself
  26. 26. (Some books you might want to read)
  27. 27. CHOOSE TECHNOLOGY (a tongue-in-cheek map of the ecosystem). Regions: NoSQL-Slavia, Machine Learning Mystery Land, Scalability Central, Real-Time Island, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House. Technologies on the map: Hadoop, ElasticSearch, Ceph, SOLR, Scikit-Learn, GraphLab, prediction.io, Jubatus, Mahout, WEKA, Sphere, Cassandra, MongoDB, Riak, CouchBase, MLBase, LibSVM, InfiniDB, Drill, Kafka, Flume, Spark, Storm, RapidMiner, Vertica, Greenplum, Impala, Netezza, QlikView, Cascading, Tableau, SPSS, pandas, Pig, Kibana, Spotfire, D3, R, SAS, Talend.
  28. 28. Big Data Use Case #1: Manage Volumes • The business intelligence stack has scalability and maintenance issues • The back office implements business rules that are challenged • The existing infrastructure cannot cope with per-user information • Main pain point: 23 hours 52 minutes to compute the business intelligence aggregates for one day.
  29. 29. Big Data Use Case #1: Manage Volumes • Relieve the current DWH and accelerate production of some aggregates/KPIs • Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc. • Train existing people on machine learning and segmentation. Results: 1h12 to perform the aggregates, available every morning; new home page personalization deployed in a few weeks. Stack: Hadoop cluster (24 cores) on Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project.
  30. 30. Big Data Use Case #2: Find Patterns • Correlation between community size and engagement / virality • Some mid-size communities show meaningful patterns: 2 players / family / group • What is the minimum number of friends to have in the application to get additional engagement? Observed: one very large community, and lots of small clusters (mostly 2 players).
  31. 31. How do I (pre)process data? Inputs: implicit user data (views, searches…), 500 TB; explicit user data (clicks, buys…), 50 TB; user information (location, graph…), 1 TB; content data (title, categories, price…), 200 GB; A/B test data. Transformations compute per-user stats, per-content stats, user similarity and content similarity into a transformation matrix, which feeds rank predictors evaluated at runtime against online user information.
  32. 32. Always the same: Pour Data In → Compute Something Smart About It → Make Available
  33. 33. The Questions. Pour Data In: How often? What kind of interaction? How much? Compute Something Smart About It: How complex? Do you need all data at once? How incremental? Make Available: Interaction? Random access?
  34. 34. PART #2 AT THE BEGINNING WAS THE ELEPHANT
  35. 35. The Text Use Case. Pour Data In: large volume, 1 TB of text-like data (logs, docs…). Compute Something Smart About It: massive global transformation, then aggregation (counting, inverted index…). Make Available: every day.
  36. 36. What's difficult (back in 2000) • Large data won't fit in one server • Large computations (a few hours) are bound to fail at one time or another • Data is so big that my memory is too small to perform full aggregations • Parallelization with threading is error-prone • Data is so big that my Ethernet cable is not that big
  37. 37. What's difficult (back in 2000), with Hadoop's answers • Large data won't fit in one server → HDFS • Large computations (a few hours) are bound to fail at one time or another → job tracker • Memory too small for full aggregations, error-prone threaded parallelization, limited network bandwidth → MapReduce
  38. 38. MapReduce: how to count words in many, many boxes
  39. 39. MapReduce prerequisites: groups can be determined at the row level; the aggregation operation is idempotent. (A word-count sketch in Java follows.)
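     To make the model concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API; this is our illustration rather than a slide from the deck, and the input/output paths come from the command line:

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {
          // Mapper: emit (word, 1) for every token in the input split.
          public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
              for (String token : value.toString().split("\\s+")) {
                word.set(token);
                ctx.write(word, ONE);
              }
            }
          }

          // Reducer: after the shuffle groups rows by word, sum the 1s.
          public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) sum += v.get();
              ctx.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }

     The sketch satisfies exactly the two prerequisites above: the group (the word) is determined at the row level, and summing partial counts does not depend on the order in which they arrive.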
  40. 40. Questions?
  41. 41. PART #3 CODING HADOOP
  42. 42. Pig History • Yahoo Research in 2006; inspired by Sawzall, a Google paper from 2003; an Apache project since 2007 • Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary over the course of a day/week/month? Example:
     words = LOAD '/training/hadoop-wordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
     sorted_words = ORDER words BY count DESC;
     first_words = LIMIT sorted_words 10;
     DUMP first_words;
  43. 43. Hive History • Developed by Facebook from January 2007 • Open sourced in August 2008 • Initial motivation: provide a SQL-like abstraction to perform statistics on status updates. Example:
     create external table wordcounts (word string, count int)
     row format delimited fields terminated by '\t'
     location '/training/hadoop-wordcount/output';
     select * from wordcounts order by count desc limit 10;
     select SUM(count) from wordcounts where word like 'th%';
  44. 44. Cascading History • Authored by Chris Wensel in 2008 • Associated projects: ◦ Cascalog: Cascading in Clojure ◦ Scalding: Cascading in Scala (Twitter, 2012) ◦ Lingual (to be released soon): a SQL layer on top of Cascading. (A Cascading sketch follows.)
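     The transcript keeps no Cascading code, so here is a hedged word-count sketch against the Cascading 2.x Java API (our illustration; the paths are placeholders):

        import java.util.Properties;
        import cascading.flow.hadoop.HadoopFlowConnector;
        import cascading.operation.aggregator.Count;
        import cascading.operation.regex.RegexSplitGenerator;
        import cascading.pipe.Each;
        import cascading.pipe.Every;
        import cascading.pipe.GroupBy;
        import cascading.pipe.Pipe;
        import cascading.scheme.hadoop.TextLine;
        import cascading.tap.SinkMode;
        import cascading.tap.Tap;
        import cascading.tap.hadoop.Hfs;
        import cascading.tuple.Fields;

        public class CascadingWordCount {
          public static void main(String[] args) {
            Tap source = new Hfs(new TextLine(new Fields("line")), "/input/docs");          // placeholder path
            Tap sink = new Hfs(new TextLine(), "/output/wordcount", SinkMode.REPLACE);      // overwrite output

            Pipe pipe = new Pipe("wordcount");
            // Split each line into words.
            pipe = new Each(pipe, new Fields("line"), new RegexSplitGenerator(new Fields("word"), "\\s+"));
            // Group by word and count occurrences.
            pipe = new GroupBy(pipe, new Fields("word"));
            pipe = new Every(pipe, new Count(new Fields("count")));

            new HadoopFlowConnector(new Properties()).connect(source, sink, pipe).complete();
          }
        }

     The pipe assembly is plain Java, which is precisely what Scalding and Cascalog wrap in Scala and Clojure.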
  45. 45. Pig/Hive: mapping to MapReduce jobs. Example:
     events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
     events_filtered = FILTER events BY type;
     by_user = GROUP events_filtered BY user;
     price_by_user = FOREACH by_user GENERATE type, SUM(price) AS total_price, MAX(timestamp) AS max_ts;
     high_pbu = FILTER price_by_user BY total_price > 1000;
Job 1, mapper: LOAD, FILTER. Shuffle and sort by user. Job 1, reducer: GROUP, FOREACH, FILTER.
  46. 46. Pig/Hive: mapping to MapReduce jobs (continued). Adding a final sort to the same script creates a second job:
     recent_high = ORDER high_pbu BY max_ts DESC;
     STORE recent_high INTO '/output';
Job 1, mapper: LOAD, FILTER; shuffle and sort by user; Job 1, reducer: GROUP, FOREACH, FILTER. Job 2, mapper: LOAD (from tmp); shuffle and sort by max_ts; Job 2, reducer: STORE.
  47. 47. Pig: how does it work. The data execution plan is compiled into 10 MapReduce jobs, executed in parallel (or not).
  48. 48. Hive Joins: how to join with MapReduce? Mappers tag every row with a table index (tbl_idx: 1 for the name table, 2 for the type table). Rows are shuffled by the join key (uid) and sorted by (uid, tbl_idx), so each reducer sees, for a given uid, the name row first (e.g. Dupont, Durand) followed by the matching type rows (Type1, Type2), and emits the joined rows (uid, name, type).
  49. 49. WHAT IS THE BEST TOOL?
  50. 50. Comparing without Comparable  Philosophy ◦ Procedural vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Partitioning ◦ Formats Integration ◦ External Code Integration  Performance and optimization
  51. 51. Procedural vs Declarative. Transformation as a sequence of operations (Pig):
     Users = load 'users' as (name, age, ipaddr);
     Clicks = load 'clicks' as (user, url, value);
     ValuableClicks = filter Clicks by value > 0;
     UserClicks = join Users by name, ValuableClicks by user;
     Geoinfo = load 'geoinfo' as (ipaddr, dma);
     UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
     ByDMA = group UserGeo by dma;
     ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
     store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
Transformation as a set of formulas (SQL):
     insert into ValuableClicksPerDMA
     select dma, count(*)
     from geoinfo join (
       select name, ipaddr from users join clicks on (users.name = clicks.user)
       where value > 0
     ) using ipaddr
     group by dma;
  52. 52. Data Type and Model: rationale. All three extend the basic data model with extended data types: ◦ array-like: [event1, event2, event3] ◦ map-like: {type1:value1, type2:value2, …}. Different approaches: ◦ resilient schema ◦ static typing ◦ no static typing.
  53. 53. Hive Data Types and Schema:
     CREATE TABLE visit (
       user_name    STRING,
       user_id      INT,
       user_details STRUCT<age:INT, zipcode:INT>
     );
Simple types: TINYINT, SMALLINT, INT, BIGINT (1, 2, 4 and 8 bytes); FLOAT, DOUBLE (4 and 8 bytes); BOOLEAN; STRING (arbitrary length, replaces VARCHAR); TIMESTAMP. Complex types: ARRAY (array of typed items, 0-indexed); MAP (associative map); STRUCT (complex class-like objects).
  54. 54. Pig Data Types and Schema:
     rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
Simple types: int, long, float, double (32 and 64 bits, signed); chararray (a string); bytearray (an array of… bytes); boolean. Complex types: tuple (an ordered fieldname:value map); bag (a set of tuples).
  55. 55. Cascading Data Types and Schema • Support for any Java types, provided they can be serialized in Hadoop • No support for typing. Simple types: Int, Long, Float, Double (32 and 64 bits, signed); String; byte[] (an array of… bytes); Boolean. Complex types: any Object, as long as it is « Hadoop serializable ».
  56. 56. Style Summary • Pig: procedural; static + dynamic typing; data model: scalar + tuple + bag (fully recursive); metadata store: no (HCatalog) • Hive: declarative; static + dynamic typing, enforced at execution time; data model: scalar + list + map; metadata store: integrated • Cascading: procedural; weak typing; data model: scalar + Java objects; metadata store: no.
  57. 57. Comparing without Comparable  Philosophy ◦ Procedural vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing, error management and environment  Integration ◦ Partitioning ◦ Formats Integration ◦ External Code Integration  Performance and optimization
  58. 58. Headachability: motivation. Does debugging the tool lead to bad headaches?
  59. 59. Headaches: Pig • Out of Memory errors (reducer) • Exceptions in builtin / extended functions (handling of null) • Null vs "" • Nested FOREACH and scoping • Date management (Pig 0.10) • Field implicit ordering
  60. 60. A Pig Error
  61. 61. Headaches: Hive • Out of Memory errors in reducers • Few debugging options • Null vs "" • No builtin "first"
  62. 62. Headaches: Cascading • Weak typing errors (comparing Int and String…) • Illegal operation sequences (Group after group…) • Field implicit ordering
  63. 63. Testing: motivation • How to perform unit tests? • How to have different versions of the same script (parameters)?
  64. 64. Testing: Pig • System variables • Comment to test • No metaprogramming • pig -x local to execute on local files
  65. 65. Testing / Environment: Cascading • JUnit tests are possible • Ability to use code to actually comment out some variables
  66. 66. Checkpointing: motivation • Lots of iterations while developing on Hadoop • Sometimes jobs fail • Sometimes you need to restart from the start… Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output; fix and relaunch.
  67. 67. Pig: manual checkpointing • STORE command to manually store intermediate files. Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output; // comment out the beginning of the script and relaunch.
  68. 68. Cascading: automated checkpointing • Ability to re-run a flow automatically from the last saved checkpoint: addCheckpoint(…)
  69. 69. Cascading: topological scheduler • Checks each intermediate file's timestamp • Executes a step only if its inputs are more recent. Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output.
  70. 70. Productivity Summary • Pig: headaches: lots; checkpointing/replay: manual save; testing/metaprogramming: difficult metaprogramming, easy local testing • Hive: headaches: few, but without debugging options; checkpointing/replay: none (that's SQL); testing/metaprogramming: none (that's SQL) • Cascading: headaches: weak typing complexity; checkpointing/replay: checkpointing, partial updates; testing/metaprogramming: possible.
  71. 71. Comparing without Comparable  Philosophy ◦ Procedural vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Formats Integration ◦ Partitioning ◦ External Code Integration  Performance and optimization
  72. 72. Formats Integration: motivation • Ability to integrate different file formats: ◦ text delimited ◦ sequence file (binary Hadoop format) ◦ Avro, Thrift… • Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases…). Format impact on size and performance (Hive processing time on 24 cores):
     Text file, uncompressed:      18.7 GB, 1m32s
     Text file, gzipped:           3.89 GB, 6m23s (no parallelization)
     JSON, compressed:             7.89 GB, 2m42s
     Multiple text files, gzipped: 4.02 GB, 43s
     Sequence file, block, gzip:   5.32 GB, 1m18s
     Text file, LZO indexed:       7.03 GB, 1m22s
  73. 73. Format Integration • Hive: SerDe (Serializer / Deserializer) • Pig: Storage • Cascading: Tap
  74. 74. Partitions: motivation • No support for UPDATE patterns; any increment is performed by adding or deleting a partition • Common partition schemes on Hadoop: ◦ by date: /apache_logs/dt=2013-01-23 ◦ by data center: /apache_logs/dc=redbus01/… ◦ by country ◦ … or any combination of the above
  75. 75. Hive Partitioning. Partitioned tables:
     CREATE TABLE event (
       user_id INT,
       type STRING,
       message STRING)
     PARTITIONED BY (day STRING, server_id STRING);
Disk structure:
     /hive/event/day=2013-01-27/server_id=s1/file0
     /hive/event/day=2013-01-27/server_id=s1/file1
     /hive/event/day=2013-01-27/server_id=s2/file0
     /hive/event/day=2013-01-27/server_id=s2/file1
     …
     /hive/event/day=2013-01-28/server_id=s2/file0
     /hive/event/day=2013-01-28/server_id=s2/file1
Loading a partition:
     INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
     SELECT * FROM event_tmp;
  76. 76. Cascading Partitions • No direct support for partitions • Support for “Glob” Tap, to read from files using patterns ➔ You can code your own custom or virtual partition schemes
  77. 77. External Code Integration: simple UDFs (code screenshots for Pig, Hive and Cascading; a stand-in Hive example follows).
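     The UDF screenshots do not survive the transcript; as a stand-in, here is a minimal Hive simple UDF sketch using the classic org.apache.hadoop.hive.ql.exec.UDF base class (our illustration; the function name is arbitrary):

        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;

        // A trivial UDF: lower-case a string. Hive picks the evaluate() matching the call signature.
        public class LowerCase extends UDF {
          public Text evaluate(Text input) {
            if (input == null) return null;   // Hive UDFs must handle NULLs explicitly
            return new Text(input.toString().toLowerCase());
          }
        }

     Registered from the Hive CLI with something like:

        ADD JAR my-udfs.jar;
        CREATE TEMPORARY FUNCTION my_lower AS 'LowerCase';
        SELECT my_lower(word) FROM wordcounts;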
  78. 78. Hive Complex UDF (Aggregators)
  79. 79. Cascading Direct Code Evaluation: uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
  80. 80. Integration Summary • Pig: partitions/incremental updates: no direct support; external code: simple; format integration: doable, rich community • Hive: partitions/incremental updates: fully integrated, SQL-like; external code: complex UDFs, but regular; format integration: doable, existing community • Cascading: partitions/incremental updates: with coding; external code: very simple (Java expressions embeddable), but complex dev setup; format integration: doable, growing community.
  81. 81. Comparing without Comparable  Philosophy ◦ Procedural vs Declarative ◦ Data Model and Schema  Productivity ◦ Headachability ◦ Checkpointing ◦ Testing and environment  Integration ◦ Formats Integration ◦ Partitioning ◦ External Code Integration  Performance and optimization
  82. 82. Optimization • Several common MapReduce optimization patterns: ◦ combiners ◦ map join ◦ job fusion ◦ job parallelism ◦ reducer parallelism • Different support per framework: ◦ fully automatic ◦ pragmas / directives / options ◦ coding style / code to write
  83. 83. Combiner: perform partial aggregates at the mapper stage. Example: SELECT date, COUNT(*) FROM product GROUP BY date. Without a combiner, every raw (date, product_id) row is shuffled from the mappers to the reducers, which then compute the full counts (e.g. 2012-02-14 → 20, 2012-02-15 → 35, 2012-02-16 → 1).
  84. 84. Combiner: with a combiner, each mapper first aggregates its own output (e.g. 2012-02-14 → 8 on one mapper, 2012-02-14 → 12 on another) and the reducers only sum these partial counts. Reduced network bandwidth, better parallelism. (A Java wiring sketch follows.)
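     In the Java MapReduce API the combiner is usually the reducer class itself, declared in the job driver; a hedged sketch reusing the WordCount classes from the earlier example (safe here because summing counts is associative and commutative):

        Job job = Job.getInstance(new Configuration(), "word count with combiner");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // partial sums at the mapper stage
        job.setReducerClass(WordCount.IntSumReducer.class);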
  85. 85. Join Optimization: Map Join • Hive: set hive.auto.convert.join = true; • Pig and Cascading: code screenshots (note: no aggregation support after Cascading's HashJoin)
  86. 86. Number of Reducers • Critical for performance • Estimated from the size of the input: ◦ Hive: input size divided by hive.exec.reducers.bytes.per.reducer (default 1 GB) ◦ Pig: input size divided by pig.exec.reducers.bytes.per.reducer (default 1 GB). (A sketch of the explicit knobs follows.)
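     The DIY option is to set the reducer count explicitly; hedged examples of the knobs involved, as a Java driver call with the Hive and Pig equivalents in comments:

        // Java MapReduce: fix the reducer count on the Job object
        job.setNumReduceTasks(24);
        // Hive equivalent: SET mapred.reduce.tasks=24;
        // Pig equivalent:  SET default_parallel 24;  (or a PARALLEL clause per operator)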
  87. 87. Performance Optimization Summary • Pig: combiner: automatic; join: option; reducers: estimate or DIY • Cascading: combiner: DIY; join: HashJoin, DIY; reducers: DIY • Hive: combiner: partial; join: automatic (map join); reducers: estimate or DIY.
  88. 88. Questions?
  89. 89. PART #4 QUICK MAHOUT
  90. 90. Clustering (scatter plot: Revenue vs Age)
  91. 91. Clustering: a cluster centroid is the center of the cluster (scatter plot: Revenue vs Age)
  92. 92. Clustering applications • Fraud: detect outliers • CRM: mine for customer segments • Image processing: similar images • Search: similar documents • Search: allocate topics
  93. 93. K-Means: guess an initial placement for the centroids; assign each point to the closest center (MAP); reposition the centers (REDUCE); repeat until convergence. (A single-machine sketch follows.)
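     A hedged, single-machine sketch of the loop in plain Java (our illustration, on 2-D points; Mahout's MapReduce implementation distributes exactly these assign and reposition steps):

        import java.util.Arrays;

        public class KMeansStep {
          // Squared Euclidean distance between two 2-D points.
          static double dist2(double[] a, double[] b) {
            double dx = a[0] - b[0], dy = a[1] - b[1];
            return dx * dx + dy * dy;
          }

          // One iteration: assign each point to its closest centroid (the "map"),
          // then recompute each centroid as the mean of its points (the "reduce").
          static double[][] iterate(double[][] points, double[][] centroids) {
            int k = centroids.length;
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (double[] p : points) {
              int best = 0;
              for (int c = 1; c < k; c++)
                if (dist2(p, centroids[c]) < dist2(p, centroids[best])) best = c;
              sums[best][0] += p[0];
              sums[best][1] += p[1];
              counts[best]++;
            }
            double[][] next = new double[k][2];
            for (int c = 0; c < k; c++) {
              if (counts[c] == 0) { next[c] = centroids[c]; continue; }  // keep empty clusters in place
              next[c][0] = sums[c][0] / counts[c];
              next[c][1] = sums[c][1] / counts[c];
            }
            return next;
          }

          public static void main(String[] args) {
            double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9}};
            double[][] centroids = {{0, 0}, {10, 10}};
            for (int i = 0; i < 10; i++) centroids = iterate(points, centroids);
            System.out.println(Arrays.deepToString(centroids));
          }
        }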
  94. 94. Clustering challenges • Curse of dimensionality • Choice of distance / number of parameters • Performance • Choice of the number of clusters
  95. 95. Mahout clustering challenges • No integrated feature engineering stack: get ready to write data processing in Java • Hadoop SequenceFile required as input • Iterations as Map/Reduce jobs that read from and write to disk: relatively slow compared to in-memory processing
  96. 96. Data processing: image, voice, logs / DB → data processing → vectorized data
  97. 97. Mahout K-Means on text: workflow. Text files → mahout seqdirectory → Mahout sequence files → mahout seq2sparse → TF-IDF vectors → mahout kmeans → clusters
  98. 98. Mahout K-Means on a database extract: workflow. Database dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout vectors → mahout kmeans → clusters
  99. 99. Convert a CSV file to Mahout vectors (code screenshot) • Real code would also: convert categorical variables to dimensions; rescale variables; drop IDs (name, forename…). See the sketch below.
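     Since the screenshot is lost, here is a hedged sketch of the usual pattern: parse numeric CSV columns into a Mahout DenseVector and write (key, VectorWritable) pairs into a SequenceFile that mahout kmeans can read; the column layout and paths are made up for the example:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.SequenceFile;
        import org.apache.hadoop.io.Text;
        import org.apache.mahout.math.DenseVector;
        import org.apache.mahout.math.VectorWritable;

        public class CsvToMahoutVectors {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path out = new Path("/mahout/input/vectors.seq");  // placeholder path

            try (BufferedReader in = new BufferedReader(new FileReader("customers.csv"));
                 SequenceFile.Writer writer =
                     SequenceFile.createWriter(fs, conf, out, Text.class, VectorWritable.class)) {
              String line;
              while ((line = in.readLine()) != null) {
                String[] cols = line.split(",");
                // Assumed layout: an id column, then numeric features. Real code would also
                // one-hot encode categorical variables and rescale each dimension.
                double[] features = new double[cols.length - 1];
                for (int i = 1; i < cols.length; i++)
                  features[i - 1] = Double.parseDouble(cols[i]);
                writer.append(new Text(cols[0]), new VectorWritable(new DenseVector(features)));
              }
            }
          }
        }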
  100. 100. Mahout clustering algorithms (parameters / implicit assumption / output):
     K-Means: K (number of clusters), convergence / circles / point → cluster id
     Fuzzy K-Means: K (number of clusters), convergence / circles / point → cluster id*, probability
     Expectation Maximization: K (number of clusters), convergence / Gaussian distribution / point → cluster id*, probability
     Mean-Shift Clustering: distance boundaries, convergence / gradient-like distribution / point → cluster id
     Top Down Clustering: two clustering algorithms / hierarchy / point → large cluster id, small cluster id
     Dirichlet Process: model distribution / points are a mixture of distributions / point → cluster id, probability
     Spectral Clustering: - / - / point → cluster id
     MinHash Clustering: number of hashes / keys, hash type / high dimension / point → hash*
  101. 101. Comparing clustering algorithms (figure): K-Means, Mean-Shift, Dirichlet, Fuzzy K-Means
  102. 102. Questions?
  103. 103. PART #5 ELEPHANTS MAKE BABIES
  104. 104. What if? • Pour data in: data comes continuously? • Compute something smart about it: aggregation patterns are not “hashable”? • Make available: human interaction requires results fast or incrementally available?
  105. 105. After Hadoop (versus massive batch MapReduce over HDFS) • Random access (HBase) • In-memory, multi-core machine learning (GraphLab) • Faster in-memory computation (Spark) • Real-time distributed computation (Storm) • Faster SQL analytics queries (Impala)
  106. 106. HBase • Started by Powerset (now part of Bing) in 2007 • Provides a key-value store on top of Hadoop (see the client sketch after the next slide)
  107. 107. HBASE
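     As a taste of random access, a hedged sketch against the classic HBase Java client; the users table and profile column family are assumptions, and this era's Put.add was later renamed addColumn:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.HTable;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseHello {
          public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // A "users" table with a "profile" column family is assumed to exist.
            HTable table = new HTable(conf, "users");
            try {
              // Write: one cell addressed by (row key, column family, qualifier).
              Put put = new Put(Bytes.toBytes("user42"));
              put.add(Bytes.toBytes("profile"), Bytes.toBytes("country"), Bytes.toBytes("FR"));
              table.put(put);

              // Random-access read by row key: this is what plain HDFS cannot do.
              Result result = table.get(new Get(Bytes.toBytes("user42")));
              byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("country"));
              System.out.println(Bytes.toString(value));
            } finally {
              table.close();
            }
          }
        }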
  108. 108. GRAPHLAB • High-performance distributed computing framework, in C++ • Started in 2009 at Carnegie Mellon • Main applications in machine learning tasks: topic modeling, collaborative filtering, computer vision • Can read data from HDFS
  109. 109. SPARK • Developed in 2010 at UC Berkeley • Provides a distributed memory abstraction for efficient sequences of map/filter/join applications • Can read from / store to HDFS or files (see the sketch after the next slide)
  110. 110. SPARK
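     A hedged word-count sketch against the Spark Java API; note this uses Spark 2.x signatures (flatMap returning an Iterator, mapToPair), while the 2014-era 0.9 API differed slightly, and the paths are placeholders:

        import java.util.Arrays;
        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import scala.Tuple2;

        public class SparkWordCount {
          public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("wordcount");
            JavaSparkContext sc = new JavaSparkContext(conf);

            JavaRDD<String> lines = sc.textFile("hdfs:///input/docs");  // placeholder path
            // Intermediate RDDs live in distributed memory, which is what makes
            // iterative and multi-pass jobs much faster than chained MapReduce.
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);
            counts.saveAsTextFile("hdfs:///output/wordcount");
            sc.stop();
          }
        }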
  111. 111. STORM • Developed in 2011 by Nathan Marz at BackType (then Twitter) • Provides a framework for distributed, real-time, fault-tolerant computation • Not a message queuing system: a complex event processing system (a topology sketch follows the next two slides)
  112. 112. STORM
  113. 113. STORM WITH HADOOP
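     To show the programming model, a hedged, self-contained topology sketch against the era's backtype.storm Java API (the packages moved to org.apache.storm in later releases; the spout and bolt here are our toy stand-ins for a real event source):

        import java.util.HashMap;
        import java.util.Map;
        import backtype.storm.Config;
        import backtype.storm.LocalCluster;
        import backtype.storm.spout.SpoutOutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.BasicOutputCollector;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.TopologyBuilder;
        import backtype.storm.topology.base.BaseBasicBolt;
        import backtype.storm.topology.base.BaseRichSpout;
        import backtype.storm.tuple.Fields;
        import backtype.storm.tuple.Tuple;
        import backtype.storm.tuple.Values;

        public class WordCountTopology {
          // Spout: endlessly emits random words, standing in for a real event stream.
          public static class WordSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private final String[] words = {"click", "buy", "view"};
            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
              this.collector = collector;
            }
            public void nextTuple() {
              collector.emit(new Values(words[(int) (Math.random() * words.length)]));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
              declarer.declare(new Fields("word"));
            }
          }

          // Bolt: counts words incrementally, tuple by tuple, as they arrive.
          public static class CountBolt extends BaseBasicBolt {
            private final Map<String, Integer> counts = new HashMap<>();
            public void execute(Tuple input, BasicOutputCollector collector) {
              counts.merge(input.getStringByField("word"), 1, Integer::sum);
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {}
          }

          public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("words", new WordSpout(), 1);
            // fieldsGrouping routes the same word to the same bolt instance.
            builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("words", new Fields("word"));
            new LocalCluster().submitTopology("wordcount", new Config(), builder.createTopology());
          }
        }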
  114. 114. IMPALA • Started by Cloudera in 2012 • Provides real-time answers to SQL queries on top of HDFS
  115. 115. BENCHMARK
  116. 116. Questions?
