2014.07.01 - New Technologies, New Roles, New Architectures - Singapore Management University - BigData SG


  • Talk track: Traditionally, what has been lacking is enough historical data. Now, with new approaches such as Hadoop, it’s possible to save long-term maintenance histories in a cost-effective way…<CLICK>

    Consider what this may mean for scheduling repairs for a particular piece of equipment. Rather than just knowing overall repair rates and costs, many details can be stored for a particular part...
  • Talk track: And from the field, sensors provide real-time measurements about what is happening for that particular part…<CLICK> <PAUSE>
  • Talk track: When you combine real time sensor data with maintenance histories, you can leverage the value of your data by using machine learning models to inform your actions:

    <click> analyze the records in order to <click> predict maintenance needs for <click> better scheduling of repairs. This saves you money by <click> avoiding downtime and reducing the risk of costly failures.

    Time series data is useful, particularly when saved together with part or equipment specifications. You could, for example, go back <click> and see what happened in the days or months leading up to a part failure and thus better understand how to schedule repairs before problems occur.
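The look-back described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the part names, timestamps, vibration values, and the helper `readings_before_failure` are all hypothetical, and a real system would read from the stored maintenance histories and sensor tables rather than from in-memory lists.

```python
from datetime import datetime, timedelta

# Hypothetical sensor readings: (timestamp, part_id, vibration level).
readings = [
    (datetime(2014, 6, 1), "pump-7", 0.21),
    (datetime(2014, 6, 10), "pump-7", 0.24),
    (datetime(2014, 6, 20), "pump-7", 0.41),
    (datetime(2014, 6, 28), "pump-7", 0.63),
    (datetime(2014, 6, 28), "pump-9", 0.20),
]

# Hypothetical maintenance history: (failure time, part_id).
failures = [(datetime(2014, 6, 30), "pump-7")]


def readings_before_failure(readings, failure, lookback):
    """Return readings for the failed part taken within `lookback` before the failure."""
    failed_at, part_id = failure
    start = failed_at - lookback
    return [r for r in readings if r[1] == part_id and start <= r[0] < failed_at]


# Look at the two weeks leading up to the recorded failure.
window = readings_before_failure(readings, failures[0], timedelta(days=14))
values = [value for _, _, value in window]
```

Windows like this one, extracted for many past failures, are the kind of input a predictive maintenance model could be trained on.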
  • Talk track: Here is a familiar view of what is being done with data. New data can be ingested into a persistence layer or used in real-time processing. What a user such as an analyst would like to do is make a single query against the data.

    How does this work? Let’s think about it in terms of lambda architecture to get a conceptual view of how to combine real time with batch processing…

  • Talk track: Lambda architecture divides all components in a system into 3 basic layers:

    Batch Layer handles persistence and batch oriented computation

    Speed Layer handles real time computation and updates to short term persistence (such as HBase or M7 tables)

    Serving Layer combines the partial batch query results and the partial query results from real time processing.

    Now we can think about our system components in terms of the lambda architecture…
  • Talk track: Long-term data persistence and batch processing are done by components such as Apache Hadoop-based technologies. For the speed layer, there are several choices for real-time processing, including Apache Spark Streaming and Apache Storm. The query can (soon) be carried out using Apache Drill, Apache Hive, Apache Spark’s Shark component, or Impala.
    The serving layer combines long-time and real-time partial query results to provide the final results that the user wants.
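The three layers can be illustrated with a toy Python sketch. It assumes events are simple `(timestamp, part)` pairs and that the query is "how many events per part"; the function names `batch_view`, `speed_view`, and `serve` are made up for this example and do not correspond to any Hadoop, Storm, or Spark API.

```python
from collections import Counter


def batch_view(events, cutoff):
    """Batch layer: counts per part over all events up to the last batch run."""
    return Counter(part for ts, part in events if ts <= cutoff)


def speed_view(events, cutoff):
    """Speed layer: counts per part for events the batch run has not seen yet."""
    return Counter(part for ts, part in events if ts > cutoff)


def serve(batch, speed):
    """Serving layer: merge the two partial query results into one answer."""
    return batch + speed


# Hypothetical event stream; the batch job last ran at timestamp 3.
events = [(1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "b")]
cutoff = 3

merged = serve(batch_view(events, cutoff), speed_view(events, cutoff))
```

The merged result matches what a single pass over all of the data would give, which is the point of the design: the user issues one query and does not need to know which layer supplied which part of the answer.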
  • Talk track: Recommendations are in widespread use, and building a powerful recommendation engine can be easier than you think with certain innovations.
  • Talk track: The first trick is to choose the right data. Instead of looking at ratings or characteristics of the items to recommend, watch people’s behaviors as they interact with items. You discover patterns, and those tell you what to recommend.
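As a rough illustration of recommending from behavior rather than ratings, the Python sketch below counts how often items appear together in users' histories and scores unseen items by those counts. The user histories and the helpers `cooccurrence` and `recommend` are hypothetical, and plain co-occurrence counting is a simplification chosen for this example, not the Mahout analysis used in the talk's music recommender.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical listening histories: user -> set of items they interacted with.
histories = {
    "u1": {"jazz", "blues", "soul"},
    "u2": {"jazz", "blues"},
    "u3": {"blues", "rock"},
}


def cooccurrence(histories):
    """Count, for each item, how often every other item shares a history with it."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    return counts


def recommend(history, counts, top_n=2):
    """Score items the user has not seen by co-occurrence with items they have."""
    scores = defaultdict(int)
    for item in history:
        for other, n in counts[item].items():
            if other not in history:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


counts = cooccurrence(histories)
suggestions = recommend({"jazz"}, counts)
```

Nothing here looks at ratings or item descriptions; the suggestions come entirely from what other users did.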
  • Talk track: We demonstrated this powerful two-stage approach by building a music recommender on the MapR platform. Notice that the intensive part of the computation, the
  • Transcript

    • 1. © 2014 MapR Technologies 1© 2014 MapR Technologies
    • 3. Contact info • Slides online later, relax, enjoy, ask questions, participate • @allenday • …etc
    • 4. Allen’s Scorecard – Data Science Roles • Domain Expertise – Genetics, geospatial, advertising • Data Science – Biostatistics, recommendation systems, persuasion • App Development – R (13yr), Hadoop (7yr), Before: web apps • Operations – Horizontal scaling (e.g. web apps), automation
    • 5. Message How do emerging technologies… …change our roles and …change the way we design systems?
    • 6. Example: Sensor Data from Drilling Rigs Real-time + long-time data use case
    • 7. Powerful Combination: RT Sensor Data + Histories • Internet of Things is resulting in huge quantities of sensor data • New opportunities for fine-grained view: save years instead of months of data • Also analyze in real-time for short term reporting, dashboards, anomaly detection and predictive modeling
    • 8. Which part? When was maintenance performed? Why repaired? If malfunction – what details? Maintenance Data Base
    • 9. What is current status of part? What are the current conditions? Where is it located? How much stress is it under? Real-Time Sensor Data
    • 10. Real-Time Sensor Data + Maintenance Data Base + Machine Learning => Data Models Analyze maintenance records Predict maintenance needs Schedule repairs to reduce costs Reduce damage from unexpected failures
    • 11. How can an application be built to do this?
    • 12. Application: Data Access Real Time Processing Long Term Persistence New Data Query Hadoop Spark Streaming Storm
    • 13. t now The Challenge: Hadoop is Not Very Real-time Unprocessed Data Fully processed Latest full period Hadoop job takes this long for this data
    • 14. t now Hadoop works great back here Spark Streaming or Storm work here Real-time and Long-time together Blended View
    • 15. t now Hadoop works great back here Spark Streaming or Storm work here Real-time and Long-time together Blended View
    • 16. Lambda Architecture New Data SPEED LAYER BATCH LAYER Query SERVING LAYER
    • 17. Query Process Real Time Processing Long Term Persistence & Batch Processing New Data Merge Query Results SPEED LAYER SERVING LAYER BATCH LAYER Query Results Hadoop Spark Streaming Storm Drill Impala Hive Partial Query Results Partial Query Results
    • 18. New designs benefit from overlapping roles: Dev + Ops
    • 19. Production involves real time & long time processing
    • 20. Ongoing Development
    • 21. DevOps View
    • 22. t now Data snapshot for devops and QA Live data for production systems Real-time and Long-time together Step forward
    • 23. Recommendation Systems
    • 24. Recommendations – Data used to train model: interactions between people taking action (users) and items – Goal is to suggest additional interactions – Example applications: movie, music or map-based restaurant choices; suggesting sale items for e-stores or via cash-register receipts
    • 25. Recommendation Behavior of a crowd helps us understand what individuals will do
    • 26. User History Log Files Mahout Analysis Search Technology Item Meta-Data Ingest easily via NFS MapR Cluster via NFS Python Use Python directly via NFS Pig Web Tier Recommendations New User History Example: Real-time recommender using MapR data platform Offline analysis Real-time recommendations Real-time Layer Batch Layer Serving Layer
    • 27. Result: System delivers real-time custom recommendations based on music listening activity
    • 28. Practical Machine Learning: Free e-books • Practical Machine Learning series authored by Ted Dunning and Ellen Friedman, published by O’Reilly (2014) • Provide innovations and advice that make machine learning more accessible and more successful in real world settings • Two titles available now as free e-book download from MapR website: Innovations in Recommendation and A New Look at Anomaly Detection
    • 29. Building data science teams
    • 30. Q: Can I simply hire one rock star data scientist to cover all this kind of work?
    • 31. A: No, interdisciplinary work requires teams. A: Hire leads who can speak the lingo of each required discipline. A: Hire individual contributors who cover 2+ roles, when possible.
    • 32. Good news: you don’t have to do it all at once. Build in steps and repurpose existing expertise
    • 33. Team Process = Needs apps discovery modeling systems help people ask the right questions allow automation to place informed bets deliver products at scale to customers build smarts into product features keep infrastructure running, cost-effective integration
    • 34. Team Process = Needs apps discovery modeling systems integration These are the primary phases of leveraging BigData. Analysts drive from discovery. Engineers drive from systems. Both meet at integration. Effective management of Data Science lives at integration and doesn’t delegate it
    • 35. business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, availability Team Composition = Roles Each role brings different disciplines, opportunities, and risks. It’s a powerful technique to pair people with complementary skills. Blurring roles is very effective with great people, e.g. DevOps. There is danger in blurring boundaries: Don’t try to create rockstars (pushing down / overloading stresses teams)
    • 36. Team Matrix = Needs x Roles business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
    • 37. Team Matrix business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access Conceptual tool for building and managing Data Science teams. Overlay your project requirements (needs) with your team’s strengths (roles). That will show very quickly where to focus. Bring in individuals who cover 2-3 needs, particularly for Team Leads
    • 38. Team Matrix = Needs x Roles business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
    • 39. Allen’s Overlay business process, stakeholder data prep, discovery, modeling, etc. software engineering, automation systems engineering, access
    • 40. Aggressively Proactive Learning • Disrupts old learning and management models – one size fits all – Specialists Hire people who learn and re-learn efficiently Throw Your Life a Curve Whitney Johnson
    • 41. Recap • Scalable storage allows for huge amounts of data • Huge data calls for new system designs Lambda Architecture: conceptual framework to design systems for combining real-time and long-time data • New system designs call for new definitions of roles and teams Building Data Science Teams: conceptual framework for building teams that can effectively work with huge amounts of data
    • 42. Bonus round: What’s MapR? Why care?
    • 43. MapR Data Platform Supports Complete Data Science Lifecycle Filesystem POSIX NFS HBase HDFS MapReduce SAN Storage