BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
 


Presentation given in Rennes in January for the handsome BreizhJUG. It is a mixed presentation on big data technologies, covering topics such as: Why Hadoop? What next? Machine learning for big data in practice.


Usage Rights

© All Rights Reserved

    Presentation Transcript

    • BIG DATA: How do elephants make babies? Florian Douetteau, CEO, Dataiku
    • Agenda
      ◦ Big Data & Hadoop Overview
      ◦ Practical Big Data Coding: Pig / Hive / Cascading
      ◦ PagesJaunes Big Data Use Case
      ◦ Machine Learning For Big Data
    • Motivation 3 Dataiku 1/8/14
    • Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association. Examples: Big Apple, Big Mama, Big Data.
    • “Big” Data in 1999: struct Element { Key key; void* stat_data; } …; C; optimized data structures; perfect hashing; HP-UX servers with 4 GB RAM; 100 GB of data; web crawler with socket reuse; HTTP 0.9. 1 Month.
    • Big Data in 2013: Hadoop; Java / Pig / Hive / Scala / Clojure / …; a dozen NoSQL data stores; MPP databases; real-time. 1 Hour.
    • Data Analytics: The Stakes. Data sizes and values on the chart: 1 TB / 1B$, 1 TB / ?$, 1 TB / 100M$, 10 TB / 10M$, 100 TB / ?$, 50 TB / 1B$, 1000 TB / 500M$. Use cases on the chart: Web Search 1999, Logistics 2004, Banking CRM 2008, Web Search 2010, Social Gaming 2011, Online Advertising 2012, E-Commerce 2013.
    • Meet Hal Alowne
      ◦ Hal Alowne, BI Manager, Dim's Private Showroom: a European e-commerce web site with 100M$ revenue, 1 million customers and 1 data analyst (Hal himself)
      ◦ Dim Sum, CEO & Founder, Dim's Private Showroom: “Hey Hal! We need a big data platform like the big guys. Let's just do as they do!”
      ◦ Big Data Copy Cat Project
      ◦ The big guys: 10B$+ revenue, 100M+ customers, 100+ data scientists
      Dataiku - Data Tuesday
    • QUESTION #1 IS IT EASY OR NOT ?
    • SUBTLE PATTERNS
    • "MORE BUSINESS" BUTTONS
    • QUESTION #2 WHO TO HIRE ?
    • DATA SCIENTIST AT NIGHT
    • DATA CLEANER BY DAY
    • PARADOX #3 WHERE ?
    • MY DATA IS WORTH MILLIONS
    • I SEND IT TO THE MARKETING CLOUD
    • QUESTION #4 IS IT BIG OR NOT ?
    • WE ALL LIVE IN A BIG DATA LAKE
    • ALL MY DATA PROBABLY FITS IN HERE
    • QUESTION #5 (at last) HUMAN OR NOT ?
    • MACHINE LEARNING WILL SAVE US ALL
    • I JUST WANT MORE REPORTS
    • MERIT = TIME + ROI
      ◦ TIME: 6 months. Instead of finding the right people (6 months?), choosing the technology (6 months?) and making it work (6 months?), build the lab (6 months): train people, reuse working patterns. → Build a lab in 6 months (rather than 18 months).
      ◦ ROI: apps, such as targeted newsletters, recommender systems, adapted products / promotions. → Deploy apps that actually deliver value.
      ◦ Timeline on the slide: 2013, 2014
    • Statistics and Machine Learning are complex! → Try to understand myself
    • (Some books you might want to read)
    • CHOOSE TECHNOLOGY
      ◦ Regions on the map: NoSQL-Slavia, Machine Learning Mystery Land, Scalability Central, Real-time Island, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
      ◦ Tools placed on the map: Hadoop, Elastic Search, Ceph, SOLR, Riak, Cassandra, MongoDB, Membase, Scikit-Learn, GraphLAB, prediction.io, jubatus, Mahout, WEKA, Sphere, Kafka, Flume, Spark, Storm, MLBase, RapidMiner, Vertica, Netezza, QlikView, Kibana, SpotFire, D3, Cascading, Tableau, SPSS, Panda, Pig, R, SAS, InfiniDB, Drill, GreenPlum, Impala, LibSVM, Talend
      Dataiku - Pig, Hive and Cascading
    • Large E-Retailer
      ◦ Business Intelligence stack has scalability and maintenance issues
      ◦ Backoffice implements business rules that are challenged
      ◦ Existing infrastructure cannot cope with per-user information
      ◦ Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day
    • Large E-Retailer: The Datalab
      ◦ Relieve their current DWH and accelerate production of some aggregates/KPIs
      ◦ Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
      ◦ Train existing people around machine learning and segmentation experience
      ◦ Results: 1h12 to perform the aggregate, available every morning; new home page personalization deployed in a few weeks
      ◦ Setup: Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project
    • Example (Social Gaming): social gaming communities
      ◦ Correlation between community size and engagement / virality
      ◦ Meaningful patterns: 2 players / family / group
      ◦ What is the minimum number of friends to have in the application to get additional engagement?
      ◦ Graph annotations: a very large community; some mid-size communities; lots of small clusters (mostly 2 players)
    • How do I (pre)process data? [Diagram]
      ◦ Inputs: Implicit User Data (Views, Searches…), Explicit User Data (Click, Buy, …), User Information (Location, Graph…), Content Data (Title, Categories, Price, …), A/B Test Data
      ◦ Sizes shown: 500TB, 50TB, 1TB, 200GB
      ◦ Other boxes: Online User Information, Transformation Predictor, Transformation Matrix, Predictor Runtime, Per User Stats, Per Content Stats, Rank Predictor, User Similarity, Content Similarity
    • Always the same: Pour Data In → Compute Something Smart About It → Make Available
    • The Questions
      ◦ Pour Data In: How often? What kind of interaction? How much?
      ◦ Compute Something Smart About It: How complex? Do you need all data at once? How incremental?
      ◦ Make Available: Interaction? Random access?
    • In the beginning was the elephant
    • MapReduce: how to count words in many, many boxes. Dataiku - Innovation Services
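A minimal Python sketch of the word-count idea on this slide, run in a single process rather than on Hadoop. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative and not part of any Hadoop API; each "box" is assumed to be a list of text lines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the partial counts for one word."""
    return key, sum(values)


def word_count(boxes):
    """Run map -> shuffle -> reduce over several boxes (lists of lines)."""
    mapped = chain.from_iterable(
        map_phase(line) for box in boxes for line in box
    )
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each box can be mapped independently, which is what lets the real framework spread the work over many machines; only the shuffle needs to see every pair.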
    • ELEPHANTS MAKE BABIES
    • After Hadoop
      ◦ Random Access
      ◦ In-Memory MultiCore Machine Learning
      ◦ Faster In-Memory Computation
      ◦ Massive Batch MapReduce over HDFS
      ◦ Real-Time Distributed Computation
      ◦ Faster SQL Analytics Queries
    • MapReduce: simplicity is a complexity
    • Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer)
    • Pig History
      ◦ Yahoo Research in 2006
      ◦ Inspired by Sawzall, a Google paper from 2003
      ◦ 2007 as an Apache project
      ◦ Initial motivation: search log analytics. How long is the average user session? How many links does a user click on before leaving a website? How do click patterns vary in the course of a day/week/month? …
      words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t') AS (word:chararray, count:int);
      sorted_words = ORDER words BY count DESC;
      first_words = LIMIT sorted_words 10;
      DUMP first_words;
    • Hive History
      ◦ Developed by Facebook in January 2007
      ◦ Open source in August 2008
      ◦ Initial motivation: provide a SQL-like abstraction to perform statistics on status updates
      create external table wordcounts (word string, count int) row format delimited fields terminated by '\t' location '/training/hadoop-wordcount/output';
      select * from wordcounts order by count desc limit 10;
      select SUM(count) from wordcounts where word like 'th%';
    • Cascading History
      ◦ Authored by Chris Wensel, 2008
      ◦ Associated projects: Cascalog (Cascading in Clojure); Scalding (Cascading in Scala, Twitter in 2012); Lingual (to be released soon): SQL layer on top of Cascading
    • Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer)
    • Pig & Hive: mapping to MapReduce jobs (one job)
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      -- the filter condition is cut off on the slide; a null check is assumed here
      events_filtered = FILTER events BY type IS NOT NULL;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      ◦ Job 1, Mapper: LOAD, FILTER
      ◦ Job 1, Reducer (shuffle and sort by user): GROUP, FOREACH, FILTER
      * VAT excluded
    • Pig & Hive: mapping to MapReduce jobs (two jobs)
      events = LOAD '/events' USING PigStorage('\t') AS (type:chararray, user:chararray, price:int, timestamp:int);
      -- the filter condition is cut off on the slide; a null check is assumed here
      events_filtered = FILTER events BY type IS NOT NULL;
      by_user = GROUP events_filtered BY user;
      price_by_user = FOREACH by_user GENERATE group AS user, SUM(events_filtered.price) AS total_price, MAX(events_filtered.timestamp) AS max_ts;
      high_pbu = FILTER price_by_user BY total_price > 1000;
      recent_high = ORDER high_pbu BY max_ts DESC;
      STORE recent_high INTO '/output';
      ◦ Job 1, Mapper: LOAD, FILTER
      ◦ Job 1, Reducer (shuffle and sort by user): GROUP, FOREACH, FILTER
      ◦ Job 2, Mapper: LOAD (from tmp)
      ◦ Job 2, Reducer (shuffle and sort by max_ts): STORE
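To make the two-job split concrete, here is a hedged Python sketch of the same pipeline over in-memory rows. It is not how Pig compiles scripts; `job1` and `job2` are hypothetical names, events are assumed to be dicts with the four fields of the script, and the first filter is assumed to be a null check on `type`.

```python
from collections import defaultdict


def job1(events, min_total=1000):
    """Job 1: map-side filter, then a per-user aggregate on the reduce side."""
    by_user = defaultdict(list)
    for event in events:
        if event["type"] is not None:             # mapper: FILTER (assumed null check)
            by_user[event["user"]].append(event)  # shuffle: GROUP BY user
    rows = []
    for user, group in by_user.items():           # reducer: FOREACH ... GENERATE
        total = sum(e["price"] for e in group)
        max_ts = max(e["timestamp"] for e in group)
        if total > min_total:                     # reducer: second FILTER
            rows.append({"user": user, "total_price": total, "max_ts": max_ts})
    return rows


def job2(rows):
    """Job 2: re-read the intermediate rows and sort by max_ts, newest first."""
    return sorted(rows, key=lambda r: r["max_ts"], reverse=True)
```

The ORDER BY needs its own shuffle on a different key (`max_ts` instead of `user`), which is why it cannot be folded into the first job.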
    • Pig: how does it work? Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
    • Hive Joins: how to join with MapReduce? [Diagram] Two tables, (uid, name) and (uid, type), are read by the mappers, which tag each row with a table index (tbl_idx). The mappers' output is shuffled by uid and sorted by (uid, tbl_idx) before reaching Reducer 1 and Reducer 2. Sample values: names Dupont and Durand, types Type1 and Type2.
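The join described on this slide can be sketched in plain Python as follows. This is only an in-memory illustration of the reduce-side join pattern, not Hive's implementation; `reduce_side_join` is a hypothetical name, and the table indexes 1 (names) and 2 (types) are assumptions.

```python
from collections import defaultdict


def reduce_side_join(names, types):
    """Join (uid, name) rows with (uid, type) rows like a reduce-side join."""
    # Map: tag each record with its table index so the reducer can tell them apart.
    tagged = [(uid, 1, name) for uid, name in names]
    tagged += [(uid, 2, typ) for uid, typ in types]

    # Shuffle by uid: every record for one uid lands in the same partition.
    partitions = defaultdict(list)
    for uid, tbl_idx, value in tagged:
        partitions[uid].append((tbl_idx, value))

    joined = []
    for uid in sorted(partitions):
        # Sort by tbl_idx so that table-1 rows reach the reducer first.
        rows = sorted(partitions[uid], key=lambda r: r[0])
        left = [value for idx, value in rows if idx == 1]
        # Reduce: buffer the table-1 rows, then stream the table-2 rows.
        for idx, value in rows:
            if idx == 2:
                joined.extend((uid, name, value) for name in left)
    return joined
```

Sorting on the table index is what lets a real reducer keep only one side of the join in memory while the other side streams past.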
    • Agenda: Hadoop and Context (->0:03); Pig, Hive, Cascading, … (->0:09); How they work (->0:15); Comparing the tools (->0:35); Make them work together (->0:40); Wrap-up and questions (->Beer)
    • Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Partitioning; Formats Integration; External Code Integration); Performance and optimization
    • Procedural vs Declarative
      ◦ Procedural (Pig): transformation as a sequence of operations
      Users = load 'users' as (name, age, ipaddr);
      Clicks = load 'clicks' as (user, url, value);
      ValuableClicks = filter Clicks by value > 0;
      UserClicks = join Users by name, ValuableClicks by user;
      Geoinfo = load 'geoinfo' as (ipaddr, dma);
      UserGeo = join UserClicks by ipaddr, Geoinfo by ipaddr;
      ByDMA = group UserGeo by dma;
      ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
      store ValuableClicksPerDMA into 'ValuableClicksPerDMA';
      ◦ Declarative (SQL): transformation as a set of formulas
      insert into ValuableClicksPerDMA
      select dma, count(*) from geoinfo join (
        select name, ipaddr from users join clicks on (users.name = clicks.user) where value > 0
      ) using (ipaddr) group by dma;
    • Data Type and Model Rationale
      ◦ All three extend the basic data model with extended data types: array-like [event1, event2, event3] and map-like {type1:value1, type2:value2, …}
      ◦ Different approaches: Resilient Schema; Static Typing; No Static Typing
    • Hive Data Type and Schema
      CREATE TABLE visit (user_name STRING, user_id INT, user_details STRUCT<age:INT, zipcode:INT>);
      ◦ Simple types: TINYINT, SMALLINT, INT, BIGINT (1, 2, 4 and 8 bytes); FLOAT, DOUBLE (4 and 8 bytes); BOOLEAN; STRING (arbitrary length, replaces VARCHAR); TIMESTAMP
      ◦ Complex types: ARRAY (array of typed items, 0-indexed); MAP (associative map); STRUCT (complex class-like objects)
      Dataiku Training – Hadoop for Data Science
    • Data Types and Schema, Pig
      rel = LOAD '/folder/path/' USING PigStorage('\t') AS (col:type, col:type, col:type);
      ◦ Simple types: int, long, float, double (32 and 64 bits, signed); chararray (a string); bytearray (an array of … bytes); boolean (a boolean)
      ◦ Complex types: tuple (an ordered fieldname:value map); bag (a set of tuples)
    • Data Type and Schema, Cascading
      ◦ Support for any Java type, provided it can be serialized in Hadoop
      ◦ No support for typing
      ◦ Simple types: Int, Long, Float, Double (32 and 64 bits, signed); String (a string); byte[] (an array of … bytes); Boolean (a boolean)
      ◦ Complex type: Object (the object must be “Hadoop serializable”)
    • Style Summary
      ◦ Pig: style Procedural; typing Static + Dynamic; data model scalar + tuple + bag (fully recursive); metadata store No (HCatalog)
      ◦ Hive: style Declarative; typing Static + Dynamic, enforced at execution time; data model scalar + list + map; metadata store Integrated
      ◦ Cascading: style Procedural; typing Weak; data model scalar + Java objects; metadata store No
    • Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing, error management and environment); Integration (Partitioning; Formats Integration; External Code Integration); Performance and optimization
    • Headachability, motivation: does debugging the tool lead to bad headaches?
    • Headaches, Pig: Out Of Memory errors (reducer); exceptions in built-in / extended functions (handling of null); null vs ""; nested FOREACH and scoping; date management (Pig 0.10); implicit field ordering
    • A Pig Error
    • Headaches, Hive: Out of Memory errors in reducers; few debugging options; null vs ""; no built-in “first”
    • Headaches, Cascading: weak typing errors (comparing Int and String, …); illegal operation sequences (group after group, …); implicit field ordering
    • Testing, motivation: how to perform unit tests? How to have different versions of the same script (parameters)?
    • Testing, Pig: system variables; comment out to test; no metaprogramming; pig -x local to execute on local files
    • Testing / Environment, Cascading: JUnit tests are possible; ability to use code to actually comment out some variables
    • Checkpointing, motivation: lots of iterations while developing on Hadoop; sometimes jobs fail; sometimes you need to restart from the start … Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output. FIX and relaunch.
    • Pig Manual Checkpointing: STORE command to manually store files. Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output. // COMMENT the beginning of the script and relaunch
    • Cascading Automated Checkpointing: ability to re-run a flow automatically from the last saved checkpoint. addCheckpoint(…)
    • Cascading Topological Scheduler: check each intermediate file's timestamp; execute a step only if its input is more recent. Pipeline: Parse Logs → Per Page Stats → Page User Correlation → Filtering → Output
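The timestamp check on this slide resembles what `make` does. The sketch below is a plain-Python illustration of that idea under stated assumptions, not Cascading's API: `stale_steps` is a hypothetical helper, steps are given in dependency order as `(name, inputs, output)`, and timestamps come from a dict rather than the file system.

```python
def stale_steps(steps, mtimes):
    """Return the steps whose output is missing or older than one of their inputs.

    steps  : list of (name, inputs, output) tuples in dependency order
    mtimes : dict mapping file name -> last-modified timestamp (absent = missing)
    """
    mtimes = dict(mtimes)                    # work on a copy; re-run steps refresh it
    clock = max(mtimes.values(), default=0)
    to_run = []
    for name, inputs, output in steps:
        # A missing input counts as infinitely recent, which forces a run.
        newest_input = max(
            (mtimes.get(path, float("inf")) for path in inputs), default=0
        )
        if output not in mtimes or mtimes[output] < newest_input:
            to_run.append(name)
            clock += 1
            mtimes[output] = clock           # pretend the step ran just now
    return to_run
```

Because a re-run step refreshes its output timestamp, every step downstream of it is also selected, while untouched branches are skipped.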
    • Productivity Summary
      ◦ Pig: headaches Lots; checkpointing/replay Manual Save; testing / metaprogramming Difficult metaprogramming, easy local testing
      ◦ Hive: headaches Few, but without debugging options; checkpointing/replay None (that's SQL); testing / metaprogramming None (that's SQL)
      ◦ Cascading: headaches Weak Typing Complexity; checkpointing/replay Checkpointing, Partial Updates; testing / metaprogramming Possible
    • Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Formats Integration; Partitioning; External Code Integration); Performance and optimization
    • Formats Integration, motivation
      ◦ Ability to integrate different file formats: text delimited; Sequence File (binary Hadoop format); Avro, Thrift, …
      ◦ Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
      ◦ Format impact on size and performance (format: size on disk in GB, Hive processing time on 24 cores):
        Text file, uncompressed: 18.7 GB, 1m32s
        1 text file, gzipped: 3.89 GB, 6m23s (no parallelization)
        JSON compressed: 7.89 GB, 2m42s
        Multiple text files, gzipped: 4.02 GB, 43s
        Sequence File, block, gzip: 5.32 GB, 1m18s
        Text file, LZO indexed: 7.03 GB, 1m22s
    • Format Integration: Hive: SerDe (Serializer-Deserializer); Pig: Storage; Cascading: Tap
    • Partitions, motivation
      ◦ No support for “UPDATE” patterns; any increment is performed by adding or deleting a partition
      ◦ Common partition schemes on Hadoop: by date (/apache_logs/dt=2013-01-23); by data center (/apache_logs/dc=redbus01/…); by country; … or any combination of the above
    • Hive Partitioning: partitioned tables
      CREATE TABLE event (user_id INT, type STRING, message STRING) PARTITIONED BY (day STRING, server_id STRING);
      Disk structure:
      /hive/event/day=2013-01-27/server_id=s1/file0
      /hive/event/day=2013-01-27/server_id=s1/file1
      /hive/event/day=2013-01-27/server_id=s2/file0
      /hive/event/day=2013-01-27/server_id=s2/file1
      …
      /hive/event/day=2013-01-28/server_id=s2/file0
      /hive/event/day=2013-01-28/server_id=s2/file1
      INSERT OVERWRITE TABLE event PARTITION (day='2013-01-27', server_id='s1') SELECT * FROM event_tmp;
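The `key=value` directory layout above is simple enough to sketch in a few lines of Python. The helpers below (`partition_path`, `parse_partition`, `prune`) are hypothetical names used only to illustrate how such paths encode partition values and how a reader can skip irrelevant directories; they are not Hive functions.

```python
def partition_path(root, table, **partition):
    """Build the directory for one partition, e.g. day=.../server_id=..."""
    parts = "/".join(f"{key}={value}" for key, value in partition.items())
    return f"{root}/{table}/{parts}"


def parse_partition(path):
    """Recover the partition key/value pairs encoded in a path."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


def prune(paths, **wanted):
    """Keep only the files whose partition values match the requested ones."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]
```

For example, `prune(paths, day="2013-01-28")` keeps only the files under `day=2013-01-28`, which is the effect a query filtered on the partition column relies on to avoid reading the other days.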
    • Cascading Partition: no direct support for partitions; support for a “Glob” Tap, to read from files using patterns. ➔ You can code your own custom or virtual partition schemes.
    • External Code Integration, simple UDF: Pig, Hive, Cascading
    • Hive Complex UDF (Aggregators)
    • Cascading Direct Code Evaluation: uses Janino, a very cool project: http://docs.codehaus.org/display/JANINO
    • Spring Batch Cascading Integration: allows calling a Cascading flow from a Spring Batch job; no full integration with Spring MessageSource or MessageHandler yet (only for local flows)
    • Integration Summary
      ◦ Pig: partition / incremental updates No Direct Support; external code Simple; format integration Doable and rich community
      ◦ Hive: partition / incremental updates Fully integrated, SQL-like; external code Very simple, but complex dev setup; format integration Doable and existing community
      ◦ Cascading: partition / incremental updates With Coding; external code Complex UDFs but regular, and Java expressions embeddable; format integration Doable and growing community
    • Comparing without Comparable: Philosophy (Procedural vs Declarative; Data Model and Schema); Productivity (Headachability; Checkpointing; Testing and environment); Integration (Formats Integration; Partitioning; External Code Integration); Performance and optimization
    • Optimization
      ◦ Several common MapReduce optimization patterns: Combiners; MapJoin; Job Fusion; Job Parallelism; Reducer Parallelism
      ◦ Different support per framework: fully automatic; pragma / directives / options; coding style / code to write
    • Combiner: perform partial aggregates at the mapper stage (before)
      SELECT date, COUNT(*) FROM product GROUP BY date
      ◦ Map output: raw rows such as (2012-02-14, 4354), (2012-02-15, 21we2), (2012-02-14, qa334), (2012-02-15, 23aq2), … go to the reducer unchanged
      ◦ Reduce output: 2012-02-14 → 20; 2012-02-15 → 35; 2012-02-16 → 1
    • Combiner: perform partial aggregates at the mapper stage (after)
      SELECT date, COUNT(*) FROM product GROUP BY date
      ◦ Map output with combiner: one mapper emits 2012-02-14 → 8 and 2012-02-15 → 12; the other emits 2012-02-14 → 12, 2012-02-15 → 23 and 2012-02-16 → 1
      ◦ Reduce output: 2012-02-14 → 20; 2012-02-15 → 35; 2012-02-16 → 1
      ◦ Reduced network bandwidth. Better parallelism.
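A small Python sketch of the combiner idea, assuming each mapper receives a list of `(date, product_id)` rows. `mapper_with_combiner` and `reducer` are illustrative names; the point is only that each mapper ships one partial count per date instead of one record per row.

```python
from collections import Counter


def mapper_with_combiner(records):
    """Map one input split and pre-aggregate it: date -> partial count."""
    return Counter(date for date, _product_id in records)


def reducer(partials):
    """Sum the partial counts coming from every mapper."""
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds counts together
    return dict(total)
```

This works because COUNT is associative: summing partial counts gives the same result as counting all rows in one place. An aggregate such as a median could not be combined this way.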
    • Join Optimization: Map Join
      ◦ Hive: set hive.auto.convert.join = true;
      ◦ Pig
      ◦ Cascading (no aggregation support after HashJoin)
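As an illustration of what a map join does, the following Python sketch joins a large stream of rows against a small table held in memory. `map_join` is a hypothetical helper, rows are assumed to be dicts, and nothing here is Hive, Pig or Cascading code.

```python
def map_join(big_rows, small_rows, key):
    """Hash join done entirely on the map side.

    The small table is loaded into an in-memory dict (on Hadoop it would be
    shipped to every mapper); the big table is streamed and never shuffled.
    """
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in big_rows:
        for match in lookup.get(row[key], []):
            yield {**match, **row}
```

The approach only pays off when the small side fits in each mapper's memory; otherwise the reduce-side join remains the fallback.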
    • Number of Reducers
      ◦ Critical for performance
      ◦ Estimated from the size of the input files
      ◦ Hive: divide the input size by hive.exec.reducers.bytes.per.reducer (default 1GB)
      ◦ Pig: divide the input size by pig.exec.reducers.bytes.per.reducer (default 1GB)
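The estimate on this slide is a simple division, sketched below in Python. `estimate_reducers` is a hypothetical helper; the 1 GB default is taken from the slide and assumed here to mean 10^9 bytes, and `max_reducers` is an optional cap added for illustration.

```python
import math

ONE_GB = 10 ** 9  # assumption: "1GB" on the slide read as 10^9 bytes


def estimate_reducers(input_bytes, bytes_per_reducer=ONE_GB, max_reducers=None):
    """Estimate the reducer count as input size divided by bytes per reducer."""
    reducers = max(1, math.ceil(input_bytes / bytes_per_reducer))
    if max_reducers is not None:
        reducers = min(reducers, max_reducers)   # optional, hypothetical cap
    return reducers
```

With these assumptions, the 12 TB dataset mentioned earlier in the deck would be estimated at 12,000 reducers, which shows why the estimate is often overridden by hand ("DIY").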
    • Performance & Optimization Summary
      ◦ Pig: combiner optimization Automatic; join optimization Option; number of reducers Estimate or DIY
      ◦ Cascading: combiner optimization DIY; join optimization HashJoin; number of reducers DIY
      ◦ Hive: combiner optimization Partial DIY; join optimization Automatic (Map Join); number of reducers Estimate or DIY
    • BIG DATA AND MACHINE LEARNING USE CASE: search quality. Erwan Pigneul, Team Leader / Project Manager
    • PAGESJAUNES CONTEXT. Core business: local search for professionals. PagesJaunes uses a specific interpretation engine that requires manual indexing. This handles the most frequent queries well, but it does not handle the long tail.
    • HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYSING USER BEHAVIOUR? Figures on the slide: >200M searches; 20M; 1.4M queries with >10 occurrences; 0.5M prioritised queries. Steps: analysis & corrections; automation.
    • SOLUTION [Diagram]: pagesjaunes.fr, crawl, Hadoop, Pig + Hive, interpretation engine, scikit-learn, indexing, directory, other reference data, export
    • TECHNICAL LESSONS
      ◦ HADOOP / PIG / HIVE: effective; challenges some test/prod assumptions (problems appear on large volumes); be careful, it is still young (compatibility, …)
      ◦ DATAIKU STUDIO: accelerates big data development; schedules processing by integrating all our jobs and managing dependencies; easy machine learning
      ◦ ELASTICSEARCH: indexed volume and search speed
    • EFFECTIVENESS OF THE APPROACH: evolution of the fragility of the query ‘Parc enfant’. [Chart] Series: query ‘Parc enfant’ and overall average; scale from Fragile to Not fragile.
    • Mahout 102 Clustering
    • Goal for Today • Quick Introduction To Clustering • How does it work in Practice • How does it work in Mahout • Overview of Mahout Algorithms
    • Clustering [Scatter plot: Revenue vs. Age]
    • Clustering [Scatter plot: Revenue vs. Age, with one cluster highlighted]. Centroid == center of the cluster
    • clustering applications • Fraud: Detect Outliers • CRM : Mine for customer segments • Image Processing : Similar Images • Search : Similar documents • Search : Allocate Topics
    • K-Means
      ◦ Guess an initial placement for the centroids
      ◦ Assign each point to the closest center (MAP)
      ◦ Reposition each center (REDUCE)
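The three steps above fit in a short pure-Python sketch. This is a single-machine illustration, not Mahout's implementation; `closest` and `kmeans` are illustrative names, points are tuples of numbers, the initial centroids are passed in by the caller, and a fixed iteration count stands in for a convergence test.

```python
import math


def closest(point, centroids):
    """Index of the centroid nearest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points (the map step), re-centre (the reduce step)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:                       # map: emit (cluster_id, point)
            clusters[closest(point, centroids)].append(point)
        centroids = [                              # reduce: average each cluster
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroid               # keep an empty cluster's centre
            for members, centroid in zip(clusters, centroids)
        ]
    return centroids
```

On Hadoop each iteration becomes one MapReduce job, so the loop body is where the per-iteration disk reads and writes come from.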
    • clustering challenges • Curse of Dimensionality • Choice of distance / number of parameters • Performance • Choice # of clusters
    • Mahout Clustering Challenges • No Integrated Feature Engineering Stack: Get ready to write data processing in Java • Hadoop SequenceFile required as an input • Iterations as Map/Reduce read and write to disks: Relatively slow compared to in-memory processing
    • Data Processing: Image, Voice, Log / DB → Data Processing → Vectorized Data
    • Mahout K-Means on Text, workflow: Text Files → mahout seqdirectory → Mahout Sequence Files → mahout seq2sparse → Tfidf Vectors → mahout kmeans → Clusters
    • Mahout K-Means on Database Extract, workflow: Database Dump (CSV) → org.apache.mahout.clustering.conversion.InputDriver → Mahout Vectors → mahout kmeans → Clusters
    • Convert a CSV File to Mahout Vector. Real code would also have: conversion of categorical variables to dimensions; variable rescaling; dropping of IDs (name, forename, …)
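The three preparation steps listed on this slide can be sketched in plain Python. This is a stand-in for illustration, not Mahout code: `vectorize` is a hypothetical function, the column names in the usage note are invented, and min-max scaling to [0, 1] is one assumed choice of rescaling.

```python
import csv
import io


def vectorize(csv_text, id_columns, categorical_columns):
    """Turn CSV rows into numeric vectors: drop IDs, one-hot encode, rescale."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numeric = [
        c for c in rows[0]
        if c not in id_columns and c not in categorical_columns
    ]
    # One dimension per (column, value) pair seen in the data.
    categories = sorted({(c, r[c]) for r in rows for c in categorical_columns})
    lo = {c: min(float(r[c]) for r in rows) for c in numeric}
    hi = {c: max(float(r[c]) for r in rows) for c in numeric}

    vectors = []
    for r in rows:
        scaled = [
            (float(r[c]) - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0
            for c in numeric
        ]
        one_hot = [1.0 if r[col] == value else 0.0 for col, value in categories]
        vectors.append(scaled + one_hot)
    return vectors
```

For a file with columns `name,age,revenue,city`, calling `vectorize(text, {"name"}, {"city"})` yields one vector per row made of the scaled `age` and `revenue` followed by one dimension per city.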
    • Mahout Algorithms (algorithm: parameters; implicit assumption; output)
      ◦ K-Means: K (number of clusters), convergence; circles; Point -> ClusterId
      ◦ Fuzzy K-Means: K (number of clusters), convergence; circles; Point -> ClusterId*, Probability
      ◦ Expectation Maximization: K (number of clusters), convergence; Gaussian distribution; Point -> ClusterId*, Probability
      ◦ Mean-Shift Clustering: distance boundaries, convergence; gradient-like distribution; Point -> ClusterId
      ◦ Top Down Clustering: two clustering algorithms; hierarchy; Point -> Large ClusterId, Small ClusterId
      ◦ Dirichlet Process: model distribution; points are a mixture of distributions; Point -> ClusterId, Probability
      ◦ Spectral Clustering: -; -; Point -> ClusterId
      ◦ MinHash Clustering: number of hashes / keys, hash type; high dimension; Point -> Hash*
    • Comparing Clustering: KMeans, MeanShift, Dirichlet, Fuzzy KMeans
    • Canopy Optimization: pick a random point. Within T2: surely in the cluster. Beyond T1: surely not in the cluster.
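A hedged Python sketch of canopy clustering with the two thresholds named on the slide, assuming T1 > T2 and Euclidean distance. `canopy` is an illustrative function, not Mahout's implementation, and it takes the first remaining point instead of a random one so that the result is reproducible.

```python
import math


def canopy(points, t1, t2):
    """Canopy clustering with a loose threshold t1 and a tight threshold t2."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)       # the slide picks a random point
        members = [center]
        still_open = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)       # within T1: belongs to this canopy
            if d >= t2:
                still_open.append(p)    # outside T2: may seed or join another canopy
        remaining = still_open
        canopies.append((center, members))
    return canopies
```

The canopy centres are a cheap way to pick the number of clusters and their starting positions before running k-means, since only one distance pass per canopy is needed.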