[Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics


Watch this recorded webcast and listen to Infochimps CSO and Co-Founder, Dhruv Bansal, and Think Big Analytics Principal Architect, Douglas Moore, share successful use cases and recommendations for building real-time predictive analytics in your enterprise.

Published in: Technology, Business

  • We’ll leave this slide up in the time before we start the webinar.
  • Douglas & Dhruv: Introduce ourselves and our companies.
  • Dhruv
  • Which element of the Big Data stack is most important to you? Options: Hadoop; Databases (SQL, NoSQL); Real-time (Storm, Kafka, Flume, Esper)
  • Douglas: Thanks, Dhruv. Thanks for your time today, everyone. Excited to share how Think Big Analytics can help you make big data come alive and accelerate your time to value. In case you don’t know us, Think Big Analytics provides data science and engineering services that create value from Big Data. We help you IMAGINE your possibilities for big data and identify a plan for profitable projects. We ILLUMINATE your team through hands-on training and education on the latest in big data technologies. We then help you IMPLEMENT your analytics plan with hands-on data scientists and engineers who rapidly build data solutions to deliver value.
  • Douglas: Predictive analytics is helping companies generate competitive advantages and real value. I want to walk you through one project we completed for a 2012 SXSW tech finalist, an online ad publishing company. My comments will show how we used certain big data technologies to enhance the company’s product offering so they could charge their advertisers premium rates. Let me first explain the challenge this company addresses. Let’s say you own a local Mexican restaurant and you want to gear up for Cinco de Mayo by putting ads into your local online newspaper. The paper is running an article about the planned festivities in the nearby park. The restaurant owner buys banner ads with the newspaper, goes to the page about the festival, and looks for her advertisement. She clicks refresh, then hits refresh again and again. She might never see her ad. The company plans to solve this problem by letting the ad buyer glue their ad to a piece of content. She takes ownership of that content, promotes it, and can be satisfied that her family, friends, and customers will see her ad and coupon.
  • Douglas: Creating a research agenda. What features to use? Content or behavior based? What features have predictive value? Which functional forms will work best? How successful is the prediction? Will the approach scale?
  • Douglas: Roll your own: complicated, brittle, scalable, loss of focus. Multi-node scaling (which requires message routing, load balancing, etc.)
  • For this customer, the overall architecture looked like this. Netty edge server. Change DynamoDB to NoSQL.
  • And Storm supported the analytics architecture by…. The result was….
  • Douglas: The theory is that one can analyze and mine a massive historical data set in batch, at leisure, and create a parameterized predictive model. In real time, one collects the latest parameters, and upon request we deliver the prediction by executing the trained model over those parameters. This is not unlike the Lambda Architecture described by Nathan Marz of Twitter.
  • How advanced is your organization’s approach to Big Data? Options: Not Advanced; Advanced; Very Advanced; Not Started
  • Dhruv: Any Data, Any Analytics, Any Cloud
  • Dhruv
  • Dhruv
  • [NOTE: ADD INFOCHIMPS LOGO ON HEADER] If you are engaged in a Big Data project, over what timeframe are you interested in deploying a production solution? Options: Less than 30 days; Less than 90 days; More than 6 months
  • Dhruv & Douglas
  • [Webinar] Measure Twice, Build Once: Real-Time Predictive Analytics
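
The speaker notes describe a split between batch analysis of historical data, which yields model parameters, and real-time prediction that applies those parameters to the latest counts. A minimal Python sketch of that split might look like the following. The function names, the data layout (cumulative impression counts per article, indexed by hour), and the median growth-ratio model are all assumptions made for illustration; the webinar does not state which functional form was actually used.

```python
from statistics import median


def train_growth_ratio(history, t_hours, horizon_hours):
    """Batch step (assumed model): fit one robust growth ratio.

    history is a list of per-article lists, where counts[h] is the
    cumulative impression count h hours after publication. This layout
    is hypothetical, chosen only to keep the example small.
    """
    ratios = [
        counts[horizon_hours] / counts[t_hours]
        for counts in history
        if len(counts) > horizon_hours and counts[t_hours] > 0
    ]
    if not ratios:
        raise ValueError("no usable articles in history")
    # Median rather than mean, so a single viral outlier does not dominate.
    return {
        "t_hours": t_hours,
        "horizon_hours": horizon_hours,
        "ratio": median(ratios),
    }


def get_prediction(params, count_at_t):
    """Real-time step: apply the stored parameters to the latest count."""
    return count_at_t * params["ratio"]


# Hypothetical historical data: three articles, counts at hours 0, 1, 2.
history = [[10, 20, 40], [5, 10, 15], [1, 2, 100]]
params = train_growth_ratio(history, t_hours=1, horizon_hours=2)
print(params["ratio"])             # 2.0 (the mean of the ratios is about 17.8)
print(get_prediction(params, 30))  # 60.0
```

In a deployment along the lines the notes describe, the training function would presumably run in the batch layer and only the small parameter dictionary would be shipped to the serving side.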

    1. 1. Measure Twice, Build Once. #RTanalytics. Douglas Moore, Principal Consultant & Architect; Dhruv Bansal, CSO. May 2013.
    2. 2. About Us. Next generation Big Data stack to power your applications. Data science and engineering services that accelerate time to value. Douglas Moore, Principal Consultant & Architect; Dhruv Bansal, CSO & Co-Founder.
    3. 3. Agenda: Think Big Use Case; Infochimps Cloud: Streams, Queries, Batch; Think Big & Infochimps Together. Measure Twice, Build Once: Understand the problem, Model the solution, Test locally, Grow the infrastructure.
    4. 4. Poll
    5. 5. Poll results: How advanced is your organization's approach to Big Data? Very Advanced 18%; Advanced 19%; Not Advanced 45%; Not Started 18%.
    6. 6. Accelerating Your Time to Value. Leading Provider of Data Science and Engineering Services. IMAGINE: Strategy and Roadmap. ILLUMINATE: Training and Education. IMPLEMENT: Hands-On Data Science and Data Engineering.
    7. 7. The Beauty of Predictive Analytics
       Use Cases
       - Scale batch analysis pipeline
       - Generate lively stats
       - Recommendations
       - Better Predictions
         • # page views in next 30 days?
       Environment
       - AWS
       - Version 1 already in production
       Project Plan
       - 8-9 weeks
       - Combined Data Engineering + Data Science Engagement
       - Staff
         • 1 Arch + 1 PM
         • 1 Data Engineer
         • 2 Data Scientists
         • 3 Client Engineers
    8. 8. Predictive Analytics Process
       Predictive Model Design & Build Process
       - Listening & Learning
       - Discovery (Digging through the data)
       - Creating a Research Agenda
       - Testing & Learning
       Production Quality Predictive Model Development
       - Data Cleansing, Aggregations, Conditioning
       - Predictive Model Training Process
       - Predictive Model Execution Process
       Challenges:
       - What functional forms predict future impression counts given counts up to time T?
       - Robust estimators, like medians rather than means, to cope with outliers
       - How do we distinguish between new articles and old articles we're seeing for the first time?
       - How well do impression counts correspond to real humans?
    9. 9. Why Real-time?
       Better end-user experience
       - View an ad, see the counter move.
       Need to catch fast moving events
       - Content half life measured at 3 hours (H Mason: http://bit.ly/nu7IDw)
       - Path to additional real-time capabilities
       - Example: Trend analysis to recommend ‘hot’ articles.
    10. 10. Overall Architecture (diagram). Storm: Queue Management, Simple Bot Filtering, Real-time Bucketization, Performance Counters, Event Logging. Other diagram labels: Ad Serving, Ad Management, Ad Selling, LB, Edge, Queue, Hadoop, NoSQL, Memcache (Tuple fail tracking), Relational Store, Management Server, S3, DFS, Archive Logs, Monitoring & Alerting (Metrics, Alarms, Notifications), View Ad, Impression, Events, Cleansing, Model Training, Recommendations, Model Parameters, getPrediction, Performance Counters, Impression Buckets.
    11. 11. Analytics Architecture (diagram). Diagram labels: Storm, Web Server, Hadoop, Impression Spout, Simple Bot Annotator, Time Series Bucket Bolt, DFS Adapter, NoSQL Bolt, Impression Bucketization, Predictive Model Training, Predictive Model Parameters, Time Series Buckets (Batch), Time Series Buckets (Realtime), Impression Prediction, Impressions, Time.
    12. 12. Solution Approach. Stages: Analyze Massive Historical Data Set; Analyze Recent Past; Realtime Prediction. Historical Data Set = S3. Analyze = Hadoop + Pig + R. Recent Past = Storm + NoSQL. Analyze = R + Web Service.
    13. 13. Poll
    14. 14. Poll results: Say you are building a Big Data project, which timeframe would you want to build a production solution? Less than 30 days 8%; Less than 90 days 54%; More than 6 months 38%.
    15. 15. Any Data, Any Analytics, Any Cloud
    16. 16. Data Flow Architecture (5/10/2013)
    17. 17. Inside Cloud::Streams
    18. 18. Full Example: Traditional & Social Media Listening Platform. New Media Data Sources: Twitter (Gnip Powertrack), Facebook (Gnip EDC), Blogs (Moreover Metabase). Traditional Media Data Sources: TV Transcription, Radio Transcription, Print Transcription.
    19. 19. Poll
    20. 20. Poll results: Which element of the Big Data stack is most important to you? Hadoop 36%; Queries 35%; Real-time 29%.
    21. 21. Don’t Build it Yourself. 55% of enterprise Big Data projects fail, according to a December 2012 survey of 300 IT organizations by SSWUG. Project Costs by Function (chart). Values as extracted: 5%, 9%, 9%, 77%. Categories as extracted: Compute, Software, Operations Staff, Engineering Staff.
    22. 22. How Do We Compare to the Competition?
        - Speed: Competition, 6+ months to value; Think Big & Infochimps, 30 days to value
        - Experience: Competition, new college grads and few successful implementations; Think Big & Infochimps, advanced degrees & published authors
        - Quality: Competition, offshore; Think Big & Infochimps, onshore, managed service
        - Proven: Competition, learn on your dime; Think Big & Infochimps, blue chip customers
        - Methodology: Competition, waterfall; Think Big & Infochimps, agile, test & learn
    23. 23. Questions? Thank you for participating! #RTanalytics
    24. 24. Let’s continue the conversation! infochimps.com/demo, thinkbiganalytics.com/about/contact
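
Slides 10 and 11 list simple bot filtering and real-time bucketization among the steps handled in Storm. The sketch below illustrates the general idea in plain Python rather than as a Storm bolt. The event layout (article id, Unix timestamp, user agent), the five-minute bucket width, and the user-agent markers are assumptions for illustration and are not details taken from the webinar.

```python
from collections import Counter

# Assumed, deliberately naive markers; real bot detection is more involved.
BOT_MARKERS = ("bot", "crawler", "spider")


def is_probable_bot(user_agent):
    """Flag an impression whose user agent contains a known bot marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BOT_MARKERS)


def bucketize(events, bucket_seconds=300):
    """Count non-bot impressions per (article_id, bucket_start).

    events is an iterable of (article_id, unix_timestamp, user_agent)
    tuples, a hypothetical layout chosen for this example.
    """
    counts = Counter()
    for article_id, timestamp, user_agent in events:
        if is_probable_bot(user_agent):
            continue  # drop probable bot traffic before counting
        bucket_start = timestamp - (timestamp % bucket_seconds)
        counts[(article_id, bucket_start)] += 1
    return counts


events = [
    ("a1", 0, "Mozilla/5.0"),
    ("a1", 299, "Mozilla/5.0"),
    ("a1", 300, "Mozilla/5.0"),
    ("a1", 10, "Googlebot/2.1"),
    ("a2", 601, "Mozilla/5.0"),
]
buckets = bucketize(events)
print(dict(buckets))
# {('a1', 0): 2, ('a1', 300): 1, ('a2', 600): 1}
```

Fixed-width buckets keyed by their start time keep the per-article time series compact, which is one plausible way to feed counts "up to time T" into a prediction step.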