Dataiku - From Big Data To Machine Learning


This presentation was given to an audience of CIOs to raise awareness of big data in practical terms and of the new uses of machine learning and analytics.

Published in: Technology, Education

  1. 1. Dataiku, 6/4/2013
  2. 2. Hi! Florian Douetteau
      Current life: CEO, Dataiku
      Past life: Criteo, IsCool Entertainment, Exalead
      Tweet about this: @dataiku @club_dsi_gun
      Available on SlideShare
      Today:
      • Concrete feedback on data analytics projects
      • Data team in practice and key technologies
      • Motivate you to start a data science project
      Slide deck allergic? Check:
  3. 3. Dataiku: “An open source platform to help you build your data lab”
  4. 4. (no text on this slide)
  5. 5. Collocation: a familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
      Examples: Big Apple, Big Mama, Big Data
  6. 6. “Big” Data in 1999
      • C: struct Element { Key key; void* stat_data; } ….
      • Optimized data structures
      • Perfect hashing
      • HP-UNIX servers, 4 GB RAM
      • 100 GB data
      • Web crawler, socket reuse
      • HTTP 0.9
      • 1 month
  7. 7. Big Data in 2013
      • Hadoop
      • Java / Pig / Hive / Scala / Clojure / …
      • A dozen NoSQL data stores
      • MPP databases
      • Real-time
      • 1 hour
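  8. 8. Data Analytics: The Stakes (chart; the pairing of labels and values is not recoverable from this transcript)
      Labels: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013)
      Values, in transcribed order: 1 TB / ? $; 1 TB / 100M $; 1 TB / 1B $; 100 TB / ? $; 10 TB / 10M $; 1000 TB / 500M $; 50 TB / 1B $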
  9. 9. Meet Hal Alowne
      Hal Alowne, BI Manager, Dim’s Private Showroom
      European e-commerce web site:
      • 100M$ revenue
      • 1 million customers
      • 1 data analyst (Hal himself)
      Dim Sum, CEO & Founder, Dim’s Private Showroom: “Hey Hal! We need a big data platform like the big guys. Let’s just do as they do!”
      Big guys:
      • 10B$+ revenue
      • 100M+ customers
      • 100+ data scientists
      Big Data Copy Cat Project
      Footer: Dataiku - Data Tuesday
  10. 10. Technology is complex (map of tools; the placement of each tool on the map is not recoverable from this transcript)
      Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
      Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, Pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R
  11. 11. Statistics and Machine Learning are complex!
      • Try to understand myself
  12. 12. (Some books you might want to read)
  13. 13. Plumbing is not complex (but difficult)
      Inputs:
      • Implicit user data (views, searches…)
      • Content data (title, categories, price, …)
      • Explicit user data (click, buy, …)
      • User information (location, graph…)
      Volumes shown: 500 TB, 50 TB, 1 TB, 200 GB
      Processing steps: transformation matrix, transformation predictor, per-user stats, per-content stats, user similarity, rank predictor, content similarity
  14. 14. MERIT = TIME + ROI
      TIME: 6 months
      • Build a lab in 6 months (rather than 18 months)
      • 18 months: find the right people (6 months?), choose the technology (6 months?), make it work (6 months?)
      • 6 months: build the lab (2013), by training people and reusing working patterns
      ROI: apps
      • Deploy apps that actually deliver value (2013 to 2014): targeted newsletter, recommender systems, adapted product / promotions
  15. 15. The Problem: it’s utterly complex and unreasonable
  16. 16. Our Goal: change his perspective on data science projects (sorry, we couldn’t find a picture of Hal smiling)
  17. 17. Agenda
      • Why and for what?
        ◦ Business theory
        ◦ Concrete projects
      • How (people and project)?
        ◦ How to start
        ◦ Dedicated team?
      • What technologies?
        ◦ Machine learning
        ◦ Architecture
  18. 18. Embodiment of Knowledge
  19. 19. Example: Launching an App on the App Store
      • Product success driven by quality!
      • Margin / customer value / traffic / acquisition
  20. 20. You continue growing ….
      • Margin for new customers might decline …
      • Margin for new features might decline …
      • Is your business really scalable?
  21. 21. Where is your core business advantage?
      • Existing customer profiles
      • Existing product assets
      • Existing specific business model
      • And your KNOWLEDGE of it
  22. 22. Data-Driven Business: what is your value?
      • Your value: number of customers, customer knowledge
      • Increases over time with:
        - Time spent in your app
        - User relationships (network effect)
        - Partner / other apps interactions
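  23. 23. Data Impact: not all businesses are equal (table; cell positions are not recoverable from this transcript)
      Sectors: Online Advertising, Telecommunication, Insurance
      Criteria: Ability to Acquire, Margin, New Services, Overall
      Entries, in transcribed order: Subscription Market, Infrastructure Driver, Selling Data, Risk / Price Optimization, Subscription Market, Subscription Market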
  24. 24. From Theory To Practice
  25. 25. Freemium Application
      • What should be free in the application?
      • How to optimize conversion?
      • How to plan and create a business model?
      Main pain point: how to plan and optimize pricing in the application?
  26. 26. Example (Freemium Application): Freemium Model Optimization
      Business model, user cluster, simulation
      • Optimized pricing: margin +23%
      • Business planning capability: from 1 month to 9 months
      • R + Python + InfiniDB, on-premise, 1 TB dataset, 5-week project
  27. 27. Large E-Retailer
      • The business intelligence stack has scalability and maintenance issues
      • The back office implements business rules that are challenged
      • The existing infrastructure cannot cope with per-user information
      Main pain point: 23 hours 52 minutes to compute business intelligence aggregates for one day.
  28. 28. Large E-Retailer: The Datalab
      Goals:
      • Relieve their current DWH and accelerate production of some aggregates/KPIs
      • Be the backbone for new personalized user experience on their website: more recommendations, more profiling, etc.
      • Train existing people around machine learning and segmentation experience
      Results:
      • 1h12 to perform the aggregate, available every morning
      • New home page personalization deployed in a few weeks
      • Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project
  29. 29. Large Photo Bank
      • BI performed directly on production databases
      • New reports required the CTO’s direct work for design and implementation
      • Each photo tag manually validated and completed
      Main pain point: no visibility on new users’ behaviours
  30. 30. Large Photo Bank: The Datalab
      • Implementing a cloud-based data lab to:
        - centralize all available data, previously scattered between SQL DB and file systems,
        - improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,
        - create content-based recommendation engines with keyword clustering and association.
      • R + Vertica + Hadoop, Amazon Web Services, 8-week project
      • Automated content filtering and recommendation
  31. 31. Large Online Directory
      • Large set of manually crafted linguistic resources for interpreting users’ queries
      • New brands, rare terms… hard to maintain
      Main pain point: ability to maintain a very large set of ontological knowledge, with more than 100k concepts
  32. 32. Large Online Directory: The Data Lab
      • Analyze clicks, rephrasing navigation to detect queries that require specific processing
      • Gather web and external data to enrich the existing index
      • Train the team on Hadoop and machine learning
      • Continuous relevance monitoring
      • Automated enrichment
      • 2x more productivity
      • Hadoop (48 cores), Python, on-premise, 10-week project
  33. 33. Example (E-Application): Marketing Campaign Prediction
      • Launch a marketing campaign
      • After a few days, PREDICT based on behaviours:
        ◦ Total ARPU for users after 3 months
        ◦ Efficiency of a campaign
        ◦ Continue or not?
  34. 34. Example (Social Gaming): Social Gaming Communities
      A very large community, some mid-size communities, lots of small clusters (mostly 2 players)
      • Correlation
        ◦ between community size and engagement / virality
      • Meaningful patterns
        ◦ 2 players / family / group
      • What is the minimum number of friends to have in the application to get additional engagement?
  35. 35. Agenda
      • What do others do?
        ◦ Concrete projects
      • How (people and project)?
        ◦ How to start
        ◦ Dedicated team?
      • What technologies?
        ◦ Machine learning
        ◦ Architecture
  36. 36. (no text on this slide)
  37. 37. (1) Be Data Driven
      • An A/B test (or the equivalent for your business) is the first step to get into a “data-driven” mindset
      • No advanced analytics required; some existing tools can help
      • Changing a button color: +21%
  38. 38. (2) Use Excel
      • People
      • Microsoft Excel
  39. 39. (3) Build a team
      • Data team
      • Data tools
      The business expert who knows maths; the analyst that reveals patterns; the coding guy that is enthusiastic
  40. 40. TEAM + TOOLS = LAB
      • data lab (noun): a small group with all the expertise, including business-minded people, machine learning knowledge and the right technology
      • A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
  41. 41. Organization (diagram; the links between labels are not recoverable from this transcript)
      Labels: Data, Product Designer, Business & Marketing, Engineers, User Voice
      Activities: targeted campaigns, price optimization, personalized experience, quality assurance, workload and yield management, user feedback (A/B test), continuous improvement
  42. 42. It’s just a new team (short-term focus / long-term drive)
      • Business people: optimize margin, … / create new business revenue streams
      • Marketing people: optimize click ratio / brand awareness and impact
      • IT people: make IT work / clean and efficient architecture
      • Data people: get stats right, make predictions / create data-driven features
  43. 43. Super Intern: what is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product?
  44. 44. Agenda
      • What do others do?
        ◦ Concrete projects
      • How (people and project)?
        ◦ How to start
        ◦ Dedicated team?
      • What technologies?
        ◦ Machine learning
        ◦ Architecture
  45. 45. An oversimplified view of big data architecture
  46. 46. Database, Business Layer, Application
  47. 47. (What it really looks like)
  48. 48. What kind of scale? Database, Business Layer, Application, or Data Science App, or ?
  49. 49. What kind of interaction? Database, Business Layer, Application, Data Science App (question marks on the diagram)
  50. 50. Classic Columnar Architecture
      • Some data
      • Some place to pour it in
      • Some tool to do some maths and graphs
  51. 51. Classic Columnar Architecture
      • Lots of data: web tracking logs, raw server logs, order / product / customer, Facebook info, open data (weather, currency…)
      • Some place to pour it in
      • Some tool to do some maths and graphs
  52. 52. The Corinthian Architecture
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some place to perform rapid calculations
      • Some tools to do some maths and charts
  53. 53. Data Storage And Preparation
      • Large scale: Hadoop cluster, Cassandra, MPP SQL columnar
      • Medium/large scale: CouchBase, MongoDB, ….
      • Selection drivers: volume, scalability
  54. 54. Calculations
      • Classic database: PostgreSQL, MySQL, ….
      • MPP SQL database: Vertica, Vectorwise, InfiniDB, GreenplumHD….
      • Hadoop new databases: Impala…
      • Selection drivers: speed (interactivity), expressivity
  55. 55. The Corinthian Architecture
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some place to perform rapid calculations
      • Some tools to do some maths and charts: statistics, cohorts, regressions, bar charts for marketing, nice infographics for your company board
  56. 56. The Corinthian Architecture
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations
      • Some tools to do some maths
      • Some other to do some charts
  57. 57. Statistical Tools
      • Open source: IPython, RStudio
      • Commercial: RapidMiner, SAS, RevolutionR
      • Selection drivers: existing know-how, scalability
  58. 58. What is a statistical tool?
      • Interact and explore data
      • Some stats capabilities
      • Some graph capabilities
  59. 59. Visualization Tools
      • Commercial: SpotFire, Tableau, QlikView
      • SaaS: BIME, ChartIO, RevolutionR
      • HTML5 / ad hoc: D3, GraphViz
      • Selection drivers: how many contributors / readers?, scalability
  60. 60. The “one database won’t do it all” problem
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations: JOIN / aggregate, rapid GROUP BY computations, direct access to the computed results for production, etc.
      • Some tools to do some maths
      • Some other to do some charts
  61. 61. The Roman Social Forum
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations, and some database for graphs
      • Some tools to do some maths
      • Some other to do some charts
  62. 62. Graph
      • Databases: Neo4J, Titan, OrientDB, InfiniteGraph
      • Analytics / visualization: Gephi
      • Selection drivers: scalability, what algorithms?, licensing constraints
  63. 63. The Key Value Store
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations, and some database for graphs, and some distributed key-value store
      • Some tools to do some maths
      • Some other to do some charts
  64. 64. NoSQL
      • Search: SOLR, ElasticSearch
      • Document: MongoDB, CouchDB
      • Key-value: Redis, HBase…
      • Selection drivers: durability / availability…, performance, ease of use and API, indexing
  65. 65. Action requires Prediction
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations, and some database for graphs, and some distributed key-value store
      • Some tools to do some maths
      • Some other to do some charts
      Questions: Draw a line for the future. What are my real user groups? Should I launch a discount offering or not? To everybody or to specific users only?
  66. 66. The Medieval Fairy Land
      • Lots of data
      • Some place to pour it in and clean / prepare it
      • Some database to perform rapid calculations, and some database for graphs, and some distributed key-value store
      • Some tools to do some maths
      • Some other to do some charts, and some MACHINE LEARNING
  67. 67. Predictions
      • Java: Mahout (Hadoop), WEKA
      • Python: Scikit-Learn, PyML
      • R
      • Commercial: Kxen, SAS, SPSS…
      • Selection drivers: scalability, black box / white box?, data management integration
  68. 68. Can be fun
  69. 69. Classes of Machine Learning Problems
      • Exploratory data analysis
        ◦ Identifying and visualizing key patterns and correlations within the dataset
      • Unsupervised learning
        ◦ Create groups of similar observations sharing the same patterns (aka clustering, segmentation)
      • Supervised learning
        ◦ Modeling a variable using independent features (aka scoring, predictive modeling, classification)
      • Time series forecasting
        ◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)
      • Graph analysis
        ◦ Analyzing relationships between a set of “nodes”, linked by “edges”
      • Association / sequence mining
        ◦ Identifying frequently associated items within transaction / event databases, sometimes ordered over time
      • And many more…
      Footer: 04/06/2013, Dataiku - Innovation Services
  70. 70. Mapping ML to Business Questions (class: sample business questions)
      • Exploratory data analysis: What does my dataset look like? What are the key correlations in my data?
      • Unsupervised learning: Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
      • Supervised learning: Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
      • Time series forecasting: What is my revenue forecast for next month? Given the weather forecast, can I also forecast my sales? Product sale forecast (for overbooking).
      • Graph analysis: Can I identify influencers in my user community? Can I recommend new friends to my users?
      • Association & sequence mining: Which products are frequently bought together? What is the typical navigation path on my website?
  71. 71. Machine Learning Methods Detailed (ML task: sample algorithms; shape of dataset)
      Exploratory data analysis:
      • Univariate analysis: distribution, frequencies, histogram, boxplots, fit tests...; N obs. (1 row per obs.) * P features
      • Bivariate analysis: scatterplots, correlations (Pearson, Spearman), GLM, Chi Square...; N obs. (1 row per obs.) * P features
      • Multivariate analysis: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis…; N obs. (1 row per obs.) * P features
      “Oriented” data analysis:
      • Unsupervised learning: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering...; N obs. (1 row per obs.) * P features
      • Supervised learning: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests…; N obs. (1 row per obs.) * P features
      • Time series forecasting: ARMA, VARMAX, ARIMA…; time series (rows: time period, columns: measures)
      • Graph analysis: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)…; nodes and edges lists (+ attributes)
      • Associations & sequences: frequent itemsets, Apriori, market basket…; (timestamped) events or transactions
  72. 72. Unsupervised Method: K-Means
      • Cluster a dataset into K buckets by choosing the “closest” neighbours
  73. 73. Supervised: K-Nearest-Neighbours
      • Predict the color of a point depending on the colors of its K closest neighbours
  74. 74. Supervised: Decision Tree
      • Find the most “significant” input variable and split value
      • Split the dataset recursively
  75. 75. Several Paths to Machine Learning (flowchart; the yes/no branch targets are not recoverable from this transcript)
      Starting from an analytical dataset:
      • I’m looking for clusters (unsupervised learning). Questions: I know how many groups to look for? Small dataset (<<1K)? Medium dataset (<<100K)? I can sample? Methods: HCA…, partitioning (K-means…), GMM…, DPGMM…, K-means + Gap | Silhouette | …, 2-steps clustering, Affinity Propagation, Mean Shift…
      • I just want to explore (exploratory data analysis). Questions: All my variables are numeric? I have a distance matrix? Methods: CA…, MDS..., PCA…, DataViz...
      • I’m looking variable by variable, or pairs: correlation analysis, GLM, parametric and non-parametric stat. tests
      • I want to predict a variable (supervised learning*). Question: I value interpretability? (Yes / Not only). Methods: generalized linear model, simple decision tree, support vector machines, neural networks, K-nearest neighbors, ensembles (random forest, gradient boosted tree), MARS, generalized additive model
      * Methods generally working for both classification & regression
  76. 76. Questions?
      • Take away
        ◦ There are new ways to perform data analytics that are within your reach and can bring business value
      • Some additional resources
        ◦ Open source projects: Dataiku Cloud Transport Client, Dataiku Web Tracker
        ◦ Our technical blog
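
Slide 34 describes a social graph made of one very large community, some mid-size communities and many small clusters. The deck does not say how those communities were found. One simple reading, assumed here, is that a community is a connected component of the friendship graph. The sketch below uses only the standard library; the function name `community_sizes` and the sample edges are hypothetical.

```python
from collections import defaultdict, deque


def community_sizes(friendships):
    """Return connected-component sizes (largest first) for undirected friendship edges."""
    graph = defaultdict(set)
    for a, b in friendships:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    sizes = []
    for start in graph:
        if start in seen:
            continue
        # Breadth-first search over one community.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical edges: one group of four players and two pairs.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f"), ("g", "h")]
sizes = community_sizes(edges)
print(sizes)
```

Joining these sizes with per-player engagement figures would be one way to look at the correlation between community size and engagement that the slide mentions.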
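
Slide 37 quotes a +21% result from changing a button color. Before acting on a lift like that, it helps to check that it is unlikely to be noise. The sketch below is one common way to do so, a pooled two-proportion z-test, and is not taken from the deck. The visitor and conversion counts are hypothetical, chosen only so that the relative lift comes out at 21%.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative_lift, z, two_sided_p_value) for variant B versus control A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the hypothesis that A and B perform the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value


# Hypothetical counts: 10,000 visitors per variant.
lift, z, p_value = two_proportion_z_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:+.1%} z={z:.2f} p={p_value:.2g}")
```

With much smaller samples the same 21% lift can easily fail this test, which is one reason an A/B test needs enough traffic before it is read.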
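
Slide 70 asks "Which products are frequently bought together?" under association mining. As a minimal illustration, the sketch below counts co-occurring item pairs by brute force and keeps those above a support threshold. It is not the Apriori algorithm that slide 71 lists, which prunes candidates to scale to larger itemsets. The function name `frequent_pairs` and the baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=0.5):
    """Return {pair: support} for item pairs present in at least min_support of the baskets."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


# Hypothetical baskets from a photo equipment shop.
baskets = [
    ["camera", "sd card", "tripod"],
    ["camera", "sd card"],
    ["camera", "bag"],
    ["sd card", "bag"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```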
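
Slide 72 summarizes K-Means as clustering a dataset into K buckets around the "closest" points. A minimal sketch of the standard iteration (Lloyd's algorithm) follows, in plain Python rather than Scikit-Learn so that both steps stay visible. The random initialization, the `seed` parameter and the sample points are assumptions made for this illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels), one label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the bucket of its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its bucket.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty bucket keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

The result depends on the starting centroids in general, so real uses typically run several initializations and keep the best one.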
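
Slide 73 describes K-Nearest-Neighbours as predicting the color of a point from the colors of its K closest neighbours. A minimal sketch of that majority vote follows. The Euclidean distance, the default `k=3` and the sample points are assumptions made for this illustration.

```python
from collections import Counter


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_predict(training, query, k=3):
    """training: list of (point, color) pairs. Returns the majority color of the k nearest points."""
    nearest = sorted(training, key=lambda item: squared_distance(item[0], query))[:k]
    votes = Counter(color for _, color in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled points: red near the origin, blue near (5, 5).
training = [
    ((0.0, 0.0), "red"), ((1.0, 0.5), "red"), ((0.5, 1.0), "red"),
    ((5.0, 5.0), "blue"), ((5.5, 4.5), "blue"), ((4.5, 5.5), "blue"),
]
print(knn_predict(training, (0.5, 0.5)), knn_predict(training, (5.0, 4.5)))
```

There is no training step: all the work happens at prediction time, which is why the method gets slow on large datasets without an index.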
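
Slide 74 describes a decision tree in two steps: find the most "significant" input variable and split value, then split the dataset recursively. The slide does not say how significance is measured. The sketch below assumes Gini impurity, one common choice for classification trees. The function names, the `max_depth` parameter and the sample data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split, or None."""
    best = None
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best


def build_tree(rows, labels, max_depth=3):
    """Return a nested (feature, threshold, left, right) tuple, or a label at a leaf."""
    split = best_split(rows, labels) if max_depth > 0 and len(set(labels)) > 1 else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    f, threshold, _ = split
    left = [(row, l) for row, l in zip(rows, labels) if row[f] <= threshold]
    right = [(row, l) for row, l in zip(rows, labels) if row[f] > threshold]
    return (
        f,
        threshold,
        build_tree([row for row, _ in left], [l for _, l in left], max_depth - 1),
        build_tree([row for row, _ in right], [l for _, l in right], max_depth - 1),
    )


def predict(tree, row):
    """Walk the tree until a leaf label is reached (labels must not be tuples)."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if row[f] <= threshold else right
    return tree


# Hypothetical rows: (age, visits per month); label: did the user buy?
rows = [(20, 1), (35, 2), (50, 3), (22, 8), (41, 9), (58, 7)]
labels = ["no", "no", "no", "yes", "yes", "yes"]
tree = build_tree(rows, labels)
print(tree, predict(tree, (30, 8)))
```

On this data the only split the tree needs is on visits, which is the kind of readable rule that makes simple trees attractive when interpretability matters.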