Dataiku - From Big Data To Machine Learning

This presentation was given to an audience of CIOs to raise awareness of big data in practical terms and of the new uses of machine learning and analytics.


Published in Technology, Education

  • 1. Dataiku, 6/4/2013
  • 2. Hi! Florian Douetteau
    Current life: CEO, Dataiku
    Past life: Criteo, IsCool Entertainment, Exalead
    Tweet about this: @dataiku @club_dsi_gun
    Available on SlideShare today:
    ◦ Concrete feedback on data analytics projects
    ◦ Data team in practice and key technologies
    ◦ Motivate you to start a data science project
    Slide deck allergic? Check:
  • 3. Dataiku: "An open source platform to help you build your data lab"
  • 4. (no text)
  • 5. Collocation
    Big Apple, Big Mama, Big Data
    A familiar grouping of words, especially words that habitually appear together and thereby convey meaning by association.
  • 6. "Big" Data in 1999
    struct Element { Key key; void* stat_data; } ….
    ◦ C
    ◦ Optimized data structures
    ◦ Perfect hashing
    ◦ HP-UNIX servers, 4 GB RAM
    ◦ 100 GB data
    ◦ Web crawler: socket reuse, HTTP 0.9
    1 month
  • 7. Big Data in 2013
    ◦ Hadoop
    ◦ Java / Pig / Hive / Scala / Clojure / …
    ◦ A dozen NoSQL data stores
    ◦ MPP databases
    ◦ Real-time
    1 hour
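  • 8. Data Analytics: The Stakes
    [Chart: data volume and value at stake for several businesses; the pairing of labels and values is not preserved in the transcript]
    Businesses shown: Web Search (1999), Logistics (2004), Banking CRM (2008), Web Search (2010), Social Gaming (2011), Online Advertising (2012), E-Commerce (2013)
    Values shown: 1 TB / ? $, 1 TB / 100M $, 1 TB / 1B $, 100 TB / ? $, 10 TB / 10M $, 1000 TB / 500M $, 50 TB / 1B $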
  • 9. Meet Hal Alowne
    Hal Alowne, BI Manager, Dim's Private Showroom
    Dim Sum, CEO & Founder, Dim's Private Showroom: "Hey Hal! We need a big data platform like the big guys. Let's just do as they do!"
    Big guys:
    ◦ 10B$+ revenue
    ◦ 100M+ customers
    ◦ 100+ data scientists
    European e-commerce web site:
    ◦ 100M$ revenue
    ◦ 1 million customers
    ◦ 1 data analyst (Hal himself)
    Big Data Copy Cat Project
  • 10. Technology is complex
    [Map of the technology landscape]
    Regions: Machine Learning Mystery Land, Scalability Central, NoSQL-Slavia, SQL Columnar Republic, Visualization County, Data Clean Wasteland, Statistician Old House
    Tools: Hadoop, Ceph, Sphere, Cassandra, Spark, Scikit-Learn, Mahout, WEKA, MLBase, RapidMiner, Pandas, D3, Crossfilter, InfiniDB, LucidDB, Impala, Elastic Search, SOLR, MongoDB, Riak, Membase, Pig, Hive, Cascading, Talend, R
  • 11. Statistics and Machine Learning are complex!
    Try to understand myself
  • 12. (Some books you might want to read)
  • 13. Plumbing is not complex (but difficult)
    Data sources:
    ◦ Implicit user data (views, searches…)
    ◦ Content data (title, categories, price, …)
    ◦ Explicit user data (click, buy, …)
    ◦ User information (location, graph…)
    Data sizes shown: 500 TB, 50 TB, 1 TB, 200 GB
    Processing steps: Transformation Matrix, Transformation Predictor, Per User Stats, Per Content Stats, User Similarity, Rank Predictor, Content Similarity
  • 14. MERIT = TIME + ROI
    TIME: 6 MONTHS. Build a lab in 6 months (rather than 18 months)
    ◦ Find the right people (6 months?)
    ◦ Choose the technology (6 months?)
    ◦ Make it work (6 months?)
    ◦ Build the lab (6 months)
    ◦ Train people
    ◦ Reuse working patterns
    ROI: APPS. Deploy apps that actually deliver value
    ◦ Targeted newsletter
    ◦ Recommender systems
    ◦ Adapted product / promotions
    Timeline labels: 2013, 2014
  • 15. The Problem
    It's utterly complex and unreasonable
  • 16. Our Goal
    Change his perspective on data science projects
    (sorry, we couldn't find a picture of Hal smiling)
  • 17. Agenda
    Why and for what?
    ◦ Business theory
    ◦ Concrete projects
    How to organize people and projects?
    ◦ How to start
    ◦ Dedicated team?
    What technologies?
    ◦ Machine learning
    ◦ Architecture
  • 18. Embodiment of Knowledge
  • 19. Example: Launching an app on the App Store
    ◦ Product success driven by quality!
    ◦ Margin / Customer value / Traffic / Acquisition
  • 20. You continue growing ….
    ◦ Margin for new customers might decline …
    ◦ Margin for new features might decline …
    ◦ Is your business really scalable?
  • 21. Where is your core business advantage?
    ◦ Existing customer profiles
    ◦ Existing product assets
    ◦ Existing specific business model
    ◦ And your KNOWLEDGE of it
  • 22. Data Driven Business: What is your value?
    [Chart: "Your value" as a function of number of customers and customer knowledge]
    Customer knowledge increases over time with:
    ◦ Time spent in your app
    ◦ User relationships (network effect)
    ◦ Partner / other apps interactions
  • 23. Data Impact: Not all businesses are equal
    [Table; the cell layout is not preserved in the transcript]
    Businesses: Online Advertising, Telecommunication, Insurance
    Impact areas: Ability to acquire, Margin, New services, Overall
    Entries shown: Subscription market, Infrastructure driver, Selling data, Risk / price optimization, Subscription market, Subscription market
  • 24. From Theory To Practice
  • 25. Freemium Application
    ◦ What should be free in the application?
    ◦ How to optimize conversion?
    ◦ How to plan and create a business model?
    Main pain point: How to plan and optimize pricing in the application?
  • 26. Example (Freemium Application): Freemium Model Optimization
    Approach: business model, user clusters, simulation
    Results:
    ◦ Optimized pricing: margin +23%
    ◦ Business planning capability: 1 month → 9 months
    Setup: R + Python + InfiniDB, on-premise, 1 TB dataset, 5-week project
  • 27. Large E-Retailer
    ◦ Business Intelligence stack has scalability and maintenance issues
    ◦ Back office implements business rules that are challenged
    ◦ Existing infrastructure cannot cope with per-user information
    Main pain point: 23 hours 52 minutes to compute Business Intelligence aggregates for one day.
  • 28. Large E-Retailer: The Datalab
    Goals:
    ◦ Relieve their current DWH and accelerate production of some aggregates/KPIs
    ◦ Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
    ◦ Train existing people around machine learning and segmentation experience
    Results:
    ◦ 1h12 to perform the aggregate, available every morning
    ◦ New home page personalization deployed in a few weeks
    Setup: Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project
  • 29. Large Photo Bank
    ◦ BI performed directly on production databases
    ◦ New reports required the CTO's direct work for design and implementation
    ◦ Each photo tag manually validated and completed
    Main pain point: No visibility on new users' behaviours
  • 30. Large Photo Bank: The Datalab
    Implementing a cloud-based data lab to:
    ◦ centralize all available data, previously scattered between SQL DB and file systems,
    ◦ improve web tracking granularity to enhance customer knowledge via behavior modeling and segmentation,
    ◦ create content-based recommendation engines with keyword clustering and association.
    Result: Automated content filtering and recommendation
    Setup: R + Vertica + Hadoop, Amazon Web Services, 8-week project
  • 31. Large Online Directory
    ◦ Large set of manually crafted linguistic resources for interpreting users' queries
    ◦ New brands, rare terms … hard to maintain
    Main pain point: Ability to maintain a very large ontological knowledge set, with more than 100k concepts
  • 32. Large Online Directory: The Data Lab
    Goals:
    ◦ Analyze clicks, rephrasing and navigation to detect queries that require specific processing
    ◦ Gather web and external data to enrich the existing index
    ◦ Train the team on Hadoop and machine learning
    Results:
    ◦ Continuous relevance monitoring
    ◦ Automated enrichment
    ◦ 2x more productivity
    Setup: Hadoop (48 cores), Python, on-premise, 10-week project
  • 33. Example (E-Application): Marketing Campaign Prediction
    Launch a marketing campaign
    After a few days, PREDICT based on behaviours:
    ◦ Total ARPU for users after 3 months
    ◦ Efficiency of a campaign
    ◦ Continue or not?
  • 34. Example (Social Gaming): Social Gaming Communities
    Community structure: a very large community, some mid-size communities, lots of small clusters (mostly 2 players)
    Correlation
    ◦ between community size and engagement / virality
    Meaningful patterns
    ◦ 2 players / Family / Group
    What is the minimum number of friends to have in the application to get additional engagement?
  • 35. Agenda
    What others do?
    ◦ Concrete projects
    How to organize people and projects?
    ◦ How to start
    ◦ Dedicated team?
    What technologies?
    ◦ Machine learning
    ◦ Architecture
  • 36. (no text)
  • 37. (1) Be Data Driven
    ◦ An A/B test (or equivalent for your business) is the first step to get into a "data-driven" mindset
    ◦ No advanced analytics required; some existing tools can help
    Changing a button color: +21%
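
A lift such as the "+21%" on slide 37 only means something if the test had enough traffic behind it. The sketch below is one common way to check that, a two-proportion z-test using only the Python standard library. The visitor and conversion counts are hypothetical and were chosen to reproduce a 21% relative lift; they do not come from the presentation, and the function name `ab_test` is illustrative.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B test.

    Returns (relative_lift, z_score, two_sided_p_value).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis "A and B convert equally".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (p_b - p_a) / p_a
    return lift, z, p_value


# Hypothetical counts: old button (A) versus new button color (B).
lift, z, p_value = ab_test(conv_a=1000, n_a=10000, conv_b=1210, n_b=10000)
print(f"lift={lift:.1%}  z={z:.2f}  p={p_value:.2g}")
```

With only a few hundred visitors per variant the same relative lift would usually not be statistically significant, which is why the sample size matters as much as the headline number.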
  • 38. (2) Use Excel
    ◦ People
    ◦ Microsoft Excel
  • 39. (3) Build a team
    ◦ Data team
    ◦ Data tools
    The business expert who knows maths
    The analyst who reveals patterns
    The coding guy who is enthusiastic
  • 40. TEAM + TOOLS = LAB
    ◦ data lab (n. m.): a small group with all the expertise, including business-minded people, machine learning knowledge and the right technology
    ◦ A proven organization used by successful data-driven companies over the past few years (eBay, LinkedIn, Walmart…)
  • 41. Organization
    [Diagram; the layout is not preserved in the transcript]
    Roles: Data Product Designer, Business & Marketing, Engineers, User Voice
    Activities: targeted campaigns, price optimization, personalized experience, quality assurance, workload and yield management, user feedback (A/B test), continuous improvement
  • 42. It's just a new team
    Team: short-term focus / long-term drive
    ◦ Business people: optimize margin, … / create new business revenue streams
    ◦ Marketing people: optimize click ratio / brand awareness and impact
    ◦ IT people: make IT work / clean and efficient architecture
    ◦ Data people: get stats right, make predictions / create data-driven features
  • 43. Super Intern
    What is your ability to integrate a new smart guy and give him any data he would need and any computing power he would need to enhance your product?
  • 44. Agenda
    What others do?
    ◦ Concrete projects
    How to organize people and projects?
    ◦ How to start
    ◦ Dedicated team?
    What technologies?
    ◦ Machine learning
    ◦ Architecture
  • 45. An oversimplified view of big data architecture
  • 46. Database, Business Layer, Application
  • 47. (What it really looks like)
  • 48. What kind of scale?
    Database, Business Layer, Application
    Or Data Science App
    Or ?
  • 49. What kind of interaction?
    Database, Business Layer, Application, Data Science App (connections marked "?")
  • 50. Classic Columnar Architecture
    ◦ Some data
    ◦ Some place to pour it in
    ◦ Some tool to do some maths and graphs
  • 51. Classic Columnar Architecture
    ◦ Lots of data: web tracking logs, raw server logs, order / product / customer, Facebook info, open data (weather, currency …)
    ◦ Some place to pour it in
    ◦ Some tool to do some maths and graphs
  • 52. The Corinthian Architecture
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some place to perform rapid calculations
    ◦ Some tools to do some maths and charts
  • 53. Data Storage And Preparation
    Large scale:
    ◦ Hadoop cluster
    ◦ Cassandra
    ◦ MPP SQL columnar
    Medium/large scale:
    ◦ CouchBase
    ◦ MongoDB
    ◦ ….
    Selection drivers: volume, scalability
  • 54. Calculations
    Classic database:
    ◦ PostgreSQL
    ◦ MySQL
    ◦ ….
    MPP SQL database:
    ◦ Vertica, Vectorwise, InfiniDB, GreenplumHD ….
    Hadoop new databases:
    ◦ Impala…
    Selection drivers: speed (interactivity), expressivity
  • 55. The Corinthian Architecture
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some place to perform rapid calculations
    ◦ Some tools to do some maths and charts: statistics, cohorts, regressions, bar charts for marketing, nice infographics for your company board
  • 56. The Corinthian Architecture
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations
    ◦ Some tools to do some maths
    ◦ Some other to do some charts
  • 57. Statistical Tools
    Open source:
    ◦ IPython
    ◦ RStudio
    Commercial:
    ◦ RapidMiner
    ◦ SAS
    ◦ RevolutionR
    Selection drivers: existing know-how, scalability
  • 58. What is a statistical tool?
    ◦ Interact and explore data
    ◦ Some stats capabilities
    ◦ Some graph capabilities
  • 59. Visualization Tools
    Commercial:
    ◦ SpotFire
    ◦ Tableau
    ◦ QlikView
    SaaS:
    ◦ BIME
    ◦ ChartIO
    ◦ RevolutionR
    HTML5 / ad hoc:
    ◦ D3
    ◦ GraphViz
    Selection drivers: how many contributors / readers?, scalability
  • 60. The "one database won't make it all" problem
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations: JOIN / aggregate, rapid GROUP BY computations, direct access to the computed results for production, etc.
    ◦ Some tools to do some maths
    ◦ Some other to do some charts
  • 61. The Roman Social Forum
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations, and some database for graphs
    ◦ Some tools to do some maths
    ◦ Some other to do some charts
  • 62. Graph
    Databases:
    ◦ Neo4J
    ◦ Titan
    ◦ OrientDB
    ◦ InfiniteGraph
    Analytics / visualization:
    ◦ Gephi
    Selection drivers: scalability, what algorithms?, licensing constraints
  • 63. The Key Value Store
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations, some database for graphs, and some distributed key-value store
    ◦ Some tools to do some maths
    ◦ Some other to do some charts
  • 64. NoSQL
    Search:
    ◦ SOLR
    ◦ ElasticSearch
    Document:
    ◦ MongoDB
    ◦ CouchDB
    Key-value:
    ◦ Redis
    ◦ HBase…
    Selection drivers: durability / availability …, performance, ease of use and API, indexing
  • 65. Action requires Prediction
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations, some database for graphs, and some distributed key-value store
    ◦ Some tools to do some maths
    ◦ Some other to do some charts
    Draw a line → for the future
    ◦ What are my real user groups?
    ◦ Should I launch a discount offering or not?
    ◦ To everybody or to specific users only?
  • 66. The Medieval Fairy Land
    ◦ Lots of data
    ◦ Some place to pour it in and clean / prepare it
    ◦ Some database to perform rapid calculations, some database for graphs, and some distributed key-value store
    ◦ Some tools to do some maths
    ◦ Some other to do some charts and some MACHINE LEARNING
  • 67. Predictions
    Java:
    ◦ Mahout (Hadoop)
    ◦ WEKA
    Python:
    ◦ Scikit-Learn
    ◦ PyML
    R
    Commercial:
    ◦ Kxen
    ◦ SAS
    ◦ SPSS…
    Selection drivers: scalability, black box / white box?, data management integration
  • 68. Can be fun
  • 69. Classes of Machine Learning Problems
    Exploratory Data Analysis
    ◦ Identifying and visualizing key patterns and correlations within the dataset
    Unsupervised Learning
    ◦ Create groups of similar observations sharing the same patterns (aka clustering, segmentation)
    Supervised Learning
    ◦ Modeling a variable using independent features (aka scoring, predictive modeling, classification)
    Time Series Forecasting
    ◦ Predict a time-dependent variable using its own history, and sometimes other covariates (variables)
    Graph Analysis
    ◦ Analyzing relationships between a set of "nodes", linked by "edges"
    Associations / Sequences Mining
    ◦ Identifying frequently associated items within transaction / event databases, sometimes ordered over time
    And many more…
  • 70. Mapping ML to Business Questions
    Class: sample business questions
    ◦ Exploratory Data Analysis: What does my dataset look like? What are the key correlations in my data?
    ◦ Unsupervised Learning: Can I create groups of users who share the same purchasing behavior? The same navigation behavior?
    ◦ Supervised Learning: Which users are likely to click on ad X? Which users are likely to convert to paying users? Who is going to leave my service? What is the profile of the users who do X?
    ◦ Time Series Forecasting: What is the forecast for my revenue next month? Given the weather forecast, can I also forecast my sales? Product sale forecast (for overbooking)
    ◦ Graph Analysis: Can I identify influencers in my user community? Can I recommend new friends to my users?
    ◦ Association & Sequences Mining: Which products are frequently bought together? What is the typical navigation path on my website?
  • 71. Machine Learning Methods Detailed
    Analytical task / ML task: sample algorithms [shape of dataset]
    Exploratory Data Analysis
    ◦ Univariate analysis: distribution, frequencies, histogram, boxplots, fit tests... [N obs. (1 row per obs.) * P features]
    ◦ Bivariate analysis: scatterplots, correlations (Pearson, Spearman), GLM, Chi-square... [N obs. (1 row per obs.) * P features]
    ◦ Multivariate analysis: principal components analysis, multi-dimensional scaling, correspondence analysis, factor analysis… [N obs. (1 row per obs.) * P features]
    "Oriented" Data Analysis
    ◦ Unsupervised learning: K-means, K-medoids, hierarchical clustering, Gaussian mixture models, mean shift, DBSCAN, spectral clustering... [N obs. (1 row per obs.) * P features]
    ◦ Supervised learning: linear & logistic regression, decision trees, neural networks, SVM, naïve Bayes, K-NN, random forests… [N obs. (1 row per obs.) * P features]
    ◦ Time series forecasting: ARMA, VARMAX, ARIMA… [time series (rows: time period, columns: measures)]
    ◦ Graph analysis: centrality (closeness, betweenness, PageRank, HITS), modularity (Louvain)… [nodes and edges lists (+ attributes)]
    ◦ Associations & sequences: frequent itemsets, Apriori, market basket… [(timestamped) events or transactions]
  • 72. Unsupervised Method: K-Means
    ◦ Cluster a dataset into K buckets by choosing the "closest" neighbours
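
To make slide 72 concrete, here is a minimal sketch of the usual K-means procedure (Lloyd's algorithm) in plain Python: assign every point to its closest cluster center, move each center to the mean of its points, and repeat until nothing changes. The function names, the random initialisation and the six sample points are assumptions made for illustration; the presentation does not show an implementation, and a library such as Scikit-Learn would normally be used in practice.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centers)
        new_centers = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the center to the mean of the points assigned to it.
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[i])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The result depends on the starting centers, so real implementations typically run several random initialisations and keep the best one.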
  • 73. Supervised: K-Nearest-Neighbours
    ◦ Predict the color of a point depending on the colors of its K closest neighbours
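
The idea on slide 73 fits in a few lines: sort the labelled points by distance to the query point, keep the K closest, and take a majority vote on their colors. This is a sketch under simple assumptions (Euclidean distance, unweighted vote); the training points and the name `knn_predict` are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training points. `train` is a list of (point, label) pairs."""
    neighbours = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical colored points.
train = [
    ((1, 1), "red"), ((2, 1), "red"), ((1, 2), "red"),
    ((8, 8), "blue"), ((9, 8), "blue"), ((8, 9), "blue"),
]
print(knn_predict(train, (2, 2)), knn_predict(train, (7, 9)))
```

There is no training step: all the work happens at prediction time, which is why K-NN becomes slow on large datasets unless an index structure is used.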
  • 74. Supervised: Decision Tree
    ◦ Find the most "significant" input variable and split value
    ◦ Split the dataset recursively
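
The two steps on slide 74 can be sketched as follows. The slide does not say how "significant" is measured; this sketch assumes Gini impurity, a common choice for classification trees, and picks the variable and threshold that give the lowest weighted impurity before recursing on each side. The data (number of visits, basket value) and all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return the (feature_index, threshold) with the lowest weighted Gini
    impurity, or None if no split improves on the unsplit data."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[f] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split the dataset; leaves are the majority label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
            build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth))


def predict(tree, row):
    """Walk down the tree until a leaf (a plain label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical users: (number of visits, basket value) and what they did.
rows = [(1, 10), (2, 80), (3, 15), (8, 20), (9, 90), (10, 30)]
labels = ["leave", "leave", "leave", "buy", "buy", "buy"]
tree = build_tree(rows, labels)
print(tree, predict(tree, (2, 50)), predict(tree, (9, 5)))
```

The resulting tree reads as a set of if/else rules, which is the interpretability advantage the deck attributes to simple decision trees.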
  • 75. Several Paths to Machine Learning
    [Decision flowchart starting from an analytical dataset; the Yes/No branch layout is not preserved in the transcript]
    Exploratory Data Analysis ("I just want to explore", "I'm looking variable by variable, or pairs")
    ◦ Questions: All my variables are numeric? I have a distance matrix?
    ◦ Methods: PCA…, CA…, MDS..., DataViz..., correlation analysis, GLM, parametric and non-parametric stat. tests
    Unsupervised Learning ("I'm looking for clusters")
    ◦ Questions: I know how many groups to look for? Small dataset (<<1K)? Medium dataset (<<100K)? I can sample?
    ◦ Methods: HCA…, partitioning (K-means…), GMM…, DPGMM…, K-means + Gap | Silhouette | …, 2-steps clustering, Affinity Propagation, Mean Shift…
    Supervised Learning* ("I want to predict a variable")
    ◦ Question: I value interpretability? (Yes / Not only)
    ◦ Methods: Generalized Linear Model, Simple Decision Tree, Support Vector Machines, Neural Networks, K-Nearest Neighbors, Ensembles (Random Forest, Gradient Boosted Tree), MARS, Generalized Additive Model
    * Methods generally working for both classification & regression
  • 76. Questions?
    Take away
    ◦ There are new ways to perform data analytics that are within your reach and can bring business value
    Some additional resources
    ◦ Open source projects: Dataiku Cloud Transport Client, Dataiku Web Tracker
    ◦ Our technical blog