Next Generation Analytics: A Reference Architecture



What is the reference architecture for creating an analytics shop in today's environment? This presentation reviews how businesses are mapping out their next generation analytics architecture based on the decision layers within the Hadoop ecosystem. The architecture also considers the roles that data engineering, data science, decision science, and decision support play. The framework recommends how modern analytics departments should structure their technology thinking and locate the gaps. Attendees will gain a better understanding of how to map out their next generation analytics architecture within the Hadoop ecosystem:
- Review standard examples witnessed in the industry and compare the industry standard with the reference architecture. Highlight areas where companies are capable, as well as under-represented areas.
- Explore why the decision science layer is one of the most omitted areas in production analytics departments.
- Review the importance of scale. Predictive methods are required to scale analytics into operations.
- Explain the impact of intelligent systems on technology selection, and the physics of data in flight vs. data at rest.
- Construct model management systems that are implemented in production. Discuss possible solutions to this perplexing problem.

Published in: Technology, Education
  • How many did you find? The correct answer is five. Whether you got the answer right or not, the process in the first case took you a while because it involved attentive processing. The list of numbers did not exhibit any pre-attentive attributes that you could use to distinguish the threes from the other numbers. Much easier the second time, wasn't it? In this figure the threes could easily be distinguished from the other numbers, due to their differing color intensity (one of the pre-attentive attributes discussed in the next slide): the threes are black while all the other numbers are gray, which causes them to stand out in clear contrast. Why couldn't we easily distinguish the threes in the first set of numbers based purely on their unique shape? Because the complex shapes of the numbers are not attributes that we perceive pre-attentively. Simple shapes such as circles and squares are pre-attentively perceived, but the shapes of numbers are too elaborate.
  • The key idea behind Tufte's principles is that a major portion of the graphic should directly correlate to the data being presented. The ink should change only when the data changes, since the major portion of the ink is data-ink. The fluff (non-data ink) should be minimal. Examples of non-data ink are grid lines, axes, etc. In the context of computer systems, we can think in terms of pixels instead of ink (with the background pixels not being considered).
  • Next Generation Analytics: A Reference Architecture

    1. Mu Sigma Confidential. Chicago, IL; Bangalore, India. Proprietary Information: "This document and its attachments are confidential. Any unauthorized copying, disclosure or distribution of the material is strictly forbidden." Do The Math. Next Generation Reference Architecture: Analytics Domain. Hadoop Summit, June 2013, San Jose. Twitter: @crunchdata
    2. Agenda (Reference Design): • Analytics Trends & Big Data • Reference Designs • How to Scale Analytics and Intelligent Systems
    3. Information Ecosystem: The information explosion is leading to an unprecedented ability to store, manage and analyze data; we need to run ahead.
        • Data Source Explosion: digital data in 2011: ~1750 exabytes; Facebook: 100+ million users per day; mobile phone subscriptions: 4.6 billion; 15 million Bing recommendation records; RFID tags: 40 billion
        • Information Cloud: click-stream data; purchase intent data; social media, blogs; email archives; RFID & geospatial information; events
        • Expanding Applications: Opinion Mining; Social Graph Targeting; Influencer Marketing; Offer Optimization; Clinical Data Analysis; Fraud Detection; SKU Rationalization
        • Technology Evolution: Massively large databases; Search Technologies; Complex Event Processing; Cloud Computing; Software as a Service; Interactive Visualization; In Memory Analytics
        • Advanced Analytical Techniques: CART; Adaptive Regression Splines; Machine Learning; Radial Basis Functions; Support Vector Machines; Neural Networks; Geospatial Predictive Modeling
    4. Analytics Trends: Three Key Macro Analytics Trends Targeting Consumption (Macro Trends: Agenda in All Three)
        • Computational Enablement: Scalable Analytics, Big Data & NoSQL technologies, GP-GPU, In-Memory & High Performance Computing; Master Data Management, Cloud, and Federated Data (the end of the central EDW?); Analytics as a Service and the Cloud; Open Source Maturity; Rise of Agile Self Service; Data Model Disruption: ETL vs. ELT
        • Usability and Visualization: Dashboards will Evolve into Cockpits; Improved Interactive Data Visualization & Aesthetics; Augmented / Virtual Reality / Natural User Interfaces; Graph Analytics sparked by Social Data; Discovery with Computational Geometry; Mobile BI and the Integration of Consumer & Enterprise Devices; Gamification: Do you want to play a game?
        • Intelligent Systems ("Anticipation Denotes Intelligence"): Operational Smart Systems are Back Propagating; Pervasive Analytics; Real Time Analytics and Event Streams; Artificial Intelligence (AI) is not what we expected; Don't forget the 4th tier, it's Juicy; Metaheuristics and the Ants; Operational Analytics, You are the Exception; Experiments in the Enterprise and the Quasi; Just Say No to OLS; Energy Informatics is Radiating; Intelligent Search
    5. Who Cares? Why should you care: is this phenomenon all marketing hype or business critical?
        • Data is an asset and growing by leaps and bounds: structured and unstructured (DARK MATTER); within and beyond the firewall
        • A competitive differentiator is the use of data to improve decision making through analytics: "Executives are Professional Gamblers"; need an edge. We are in the information age.
        • Consumption of analytics differentiates: embedded analytics vs. ppt culture
        • The idea is linked to the concept of a digital age or digital revolution, and carries the ramifications of a shift from the traditional industry that the industrial revolution brought through industrialization to an economy based on the manipulation of information, i.e., an information society (computation).
        • It's all about analytics done at "SCALE", enabling scale cost effectively.
    6. Big Data Technology Evolution: Big Data's recent enterprise popularity can be attributed to two key factors.
        • Technology Trigger: robust open source technology for parallel computation on commodity hardware (Hadoop / NoSQL). Faced with storing massive data sets from indexing the web, Google searches for and finds a way to store and retrieve the data on commodity hardware.
        • Frustration Trigger: tyranny of the data model; cost and complexity of data integration; agility; impact; bureaucracy; difficulty and cost of scaling existing EDW technology
        • [Timeline figure, 2003 to 2012. Phases: Early Research; Open source dev momentum; Initial success stories; Commercialization & Adoption. Events: Google Trigger: releases "Map Reduce" & "GFS" paper; Google: Bigtable paper; Lucene subproject; Apache: Hadoop project; Yahoo: 10K core cluster; Cloudera named most promising startup funded in 2009; elastic map-reduce; Teradata buys Aster (map-reduce DB); IBM Acquires; HP Acquires Vertica; EMC Acquires Greenplum; 2012: Oracle, MSOFT, SAS]
    7. Big Data: What is it? A Mindset for Enabling Decision Sciences at Scale: Dataset -> Toolset -> Skillset -> Mindset
        • Skillset: Map-Reduce, HDFS, HIVE, R, PIG, NoSQL, unstructured data; parallel computation & stream processing; Data Scientist and Data Explorer; "…big data job postings on Dice more than tripled year-over-year" (Jan 2013)
        • Mindset: automation and scale; higher utilization of machines for decisions; agile; learning vs. knowing; consumption; algorithmic and heuristic (man & machine)
    8. Agenda (Reference Design): • Analytics Trends & Big Data • Reference Designs • How to Scale Analytics and Intelligent Systems
    9. The Big Data Ecosystem: The Big Data ecosystem is a challenge to keep up with. Mindset: Knowing vs. Learning. "Formulating a right question is always hard, but with big data, it is an order of magnitude harder, because you are blazing the trail (not grazing on the green field)." Source: Gartner. Source: The Big Data Group LLC.
    10. Mindset: Qualitative. Next Generation Analytics Platform (evoked set):
        • Enterprise Architecture (modern best-of-breed componentized services, SOA)
        • Scale (analytics done at scale)
        • Value
        • Commodity (scale horizontally using commodity hardware)
        • Promote Reuse (published API, PMML, BPMN, SQL, CAMEL, CRISP-DM)
        • Build Discipline (model versioning, GIT, unit testing, and coding standards)
        • Batch and Real Time Capable
        • Agile (integration and rapid deployment; analyst)
        • Process Automation
        • Knowledge Management
        • Model Management
        • Open Standards (HTML5, PMML)
        • Quality Assurance (unit testing)
        • Distributed & Parallel Capable (Hadoop, MPI, GPGPU)
        • Inject intelligence into the ecosystem (consumption)
        • Deploy Intelligent Systems & Intelligent Machines
        • Champion / Challenger (test and learn)
        • Robust and Agile ETL (HIVE, HBASE, SQL ETL tools)
        • Beyond the tyranny of the data model
    11. Decision Support Stack: The "decision" stack will lead to bundling and tighter integration of technology components. (Privileged and confidential. Copyright 2012, Wal-Mart Stores, Inc.)
        • Layers: Data Engineering Layer (Technology); Data Science Layer (Math + Tech); Decision Science Layer (Math + Biz + Tech)
        • Capabilities, in the order listed on the slide: Data Acquisition (Data at Rest, Data in Motion); Data Storage (Databases, Big Data, In-memory); Data Processing (ETL, ELT, Event Processing); Workflow Management; Statistical Analysis; Reporting; Data Mining; Visualization; Text Mining / Unstructured Data Analysis; Business Contextualization; Intelligent Analytics Workflows; Insight Generation Tools; Learning Frameworks; Operationalize & Scale
    12. Reference Design: Decision Support Stack for Operationalizing Analytics at Scale. Layers: Math + Business + Technology; Math + Technology; Technology; Dataset ++. Callout: Design Weak Point for Enterprises Today in Scaling Decision Sciences.
    13. Mu Sigma's POV: Big Data architecture design pattern. A typical Big Data architecture design pattern leverages NoSQL as a data store "reservoir".
        • Data sourcing: RDBMS; NoSQL; Traditional EDW; FTP server; Streams; Cloud
        • Import: Sqoop / 3rd party connectors; Flume / Avro / Storm
        • Hadoop cluster: Edge Node; Monitoring; Workflow engine (Oozie); NoSQL DBs; HDFS / MapReduce / (MPI); Hadoop cluster hardware
        • Batch: Hive; Pig; native map/reduce; Hadoop streaming
        • Real time: in memory (Spark, Storm, Shark, Impala, Drill, Stinger); NoSQL (Cassandra, MongoDB, HBase)
        • Export: Sqoop / 3rd party connectors; ETL / ELT; Traditional EDW; EDW for storing the data after analytical processing (star schema, snowflake, etc.)
        • Consumption: BI tools; business users / applications
        • Data Governance: Authentication; Authorization; Kerberos; LDAP / AD; Metadata mgmt.
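The batch path on slide 13 lists native map/reduce among its processing options. As a rough illustration of that programming model only, here is a single-process Python sketch of a word count. It does not use Hadoop or any Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle/sort."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle, and reduce in sequence over an iterable of strings."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big analytics"]))
```

On a real cluster the map and reduce steps would run in parallel across nodes; the sketch only shows how the work is split between the two functions.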
    14. Reference Design: Decision Support Stack for Operationalizing Analytics at Scale. Layers: Math + Business + Technology; Math + Technology; Technology; Dataset ++. Callout: Design Weak Point for Enterprises Today in Scaling Decision Sciences.
    15. Usability and Visualization: Evolution. Trend: Dashboards will evolve to be more actionable and forward looking. Dashboredom? http://www.information- usiness_intelligence-10019101-1.html "As we move from the information age into the intelligence age"..
    16. Usability and Visualization: Evolution. Trend: Interactive visualization improvements and the importance of aesthetics & design. Advanced visualization will continue to increase in depth and relevance to broader audiences. Visionary vendors like Tableau, QlikTech, SAS JMP/Visual Analytics, Panopticon, and Spotfire (now Tibco) made their mark by providing significantly differentiated visualization capabilities compared with the trite bar and pie charts of most BI players' reporting tools. The latest advances in UI technologies such as Microsoft's Silverlight, Adobe Flex, R, Processing, and JavaScript/HTML5 via frameworks like Ext JS and D3, together with multi-touch devices, augur a revolution in state-of-the-art visualization capabilities. (Movie examples: Avengers & Iron Man.)
    17. Consumption Enablement: Tapping the power of visual perception
        • Vision is by far our most powerful sense. Seeing and thinking are intimately connected.
        • I see what you mean. It isn't accidental that when we begin to understand something we say, "I see." Not "I hear" or "I smell," but "I see."
        • Approximately 70% of the sense receptors in our bodies are dedicated to vision [Few, 2006]
    18. Frameworks: Visually Encoding Data for Rapid Perception (Source: Tibco)
        • Pre-attentive processing, the early stage of visual perception that rapidly occurs below the level of consciousness, is tuned to detect a specific set of visual attributes.
        • How many 3's? (The slide shows the same four rows twice.)
          1281736875613897654698450698560498286762
          9809858453822450985645894509845098096585
          9091030209905959595772564675050678904567
          8845789809821677654872664908560932949686
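The exercise on slide 18 can be checked mechanically. The sketch below counts the 3's in the four digit rows from the slide and then marks them with brackets, a crude text stand-in for the color-intensity cue that the speaker notes describe. The helper names `count_digit` and `emphasize` are illustrative and not part of the deck.

```python
# The four rows of digits as printed on the slide.
ROWS = [
    "1281736875613897654698450698560498286762",
    "9809858453822450985645894509845098096585",
    "9091030209905959595772564675050678904567",
    "8845789809821677654872664908560932949686",
]


def count_digit(rows, digit="3"):
    """Return how many times `digit` occurs across all rows."""
    return sum(row.count(digit) for row in rows)


def emphasize(row, digit="3"):
    """Wrap every occurrence of `digit` in brackets so it stands out."""
    return "".join(f"[{ch}]" if ch == digit else ch for ch in row)


if __name__ == "__main__":
    print(count_digit(ROWS))  # prints 5, matching the answer in the speaker notes
    for row in ROWS:
        print(emphasize(row))
```

Scanning the bracketed output is far quicker than scanning the raw rows, which is the point the slide makes about pre-attentive attributes.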
    19. Frameworks: Key Elements for Visual Analysis
        • Data: integrate multiple data sources; enrich your data
        • Add dimensions to your data: color; shape; size; "labels"; graph theory
        • Interactive: ad-hoc ask & answer (self-service); query filters (drill down, hierarchy); linked visualizations (brushing); bookmarks & tabs; movement (animation)
        • Time; Emphasis
        • Pre-attentive attributes of visual perception most applicable to data presentation: form, color, spatial, and motion
        • "Pre-attentive processing is the unconscious accumulation of information from the environment. All available information is pre-attentively processed. Then, the brain filters and processes what is important. Information that has the highest salience (a stimulus that stands out the most) or relevance to what a person is thinking about is selected for further and more complete analysis by conscious (attentive) processing"
    20. Frameworks: Eloquence through Simplicity
        • Edward R. Tufte introduced a concept in his 1983 classic The Visual Display of Quantitative Information that he calls the "data-ink ratio." Tufte defines the data-ink ratio in the following way: "A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented"
        • Reduce the non-data pixels: eliminate all unnecessary non-data pixels; de-emphasize and regularize the non-data pixels that remain
        • Enhance the data pixels: eliminate all unnecessary data pixels; highlight the most important data pixels that remain
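Tufte's data-ink ratio is data-ink divided by the total ink used to print the graphic. Following the slide's suggestion to think in pixels, a minimal sketch of that ratio looks like this. The pixel counts are invented purely for illustration, and `data_pixel_ratio` is a hypothetical helper, not something from the deck.

```python
def data_pixel_ratio(data_pixels, non_data_pixels):
    """Share of non-background pixels that encode data (between 0.0 and 1.0)."""
    if data_pixels < 0 or non_data_pixels < 0:
        raise ValueError("pixel counts must be non-negative")
    total = data_pixels + non_data_pixels
    if total == 0:
        raise ValueError("chart has no non-background pixels")
    return data_pixels / total


# Hypothetical counts for one chart before and after removing heavy grid
# lines and borders; the data pixels themselves are unchanged.
before = data_pixel_ratio(data_pixels=12_000, non_data_pixels=18_000)
after = data_pixel_ratio(data_pixels=12_000, non_data_pixels=4_000)

if __name__ == "__main__":
    print(f"before: {before:.2f}, after: {after:.2f}")  # before: 0.40, after: 0.75
```

Reducing non-data pixels raises the ratio without touching the data, which is the first of the two moves listed on the slide.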
    21. Frameworks: General Principles of Design. What's taught in Art Class?
    22. Example: Good, Bad, and the Ugly
    23. Example: Good, Bad, and the Ugly
    24. Reference Design: Decision Support Stack for Operationalizing Analytics at Scale. Layers: Math + Business + Technology; Math + Technology; Technology; Dataset ++. Callout: Design Weak Point for Enterprises Today in Scaling Decision Sciences.
    25. Example: Knowledge Management. Welcome to DSpace … Decision Sciences Home Page. Each Community has its own Home Page. Recent submissions to this entire Community are featured on the Home Page. Other Links.
    26. Example: Knowledge Management. Decision Science Wiki and the Power of Reproducible Analytics
    27. Example: Knowledge Management. Decision Sciences Wiki and the Power of Reproducible Analytics
    28. Example: Knowledge Management. Decision Sciences Wiki and the Power of Reproducible Analytics
    29. Reference Design: Decision Support Stack for Operationalizing Analytics at Scale. Layers: Math + Business + Technology; Math + Technology; Technology; Dataset ++. Callout: Design Weak Point for Enterprises Today in Scaling Decision Sciences.
    30. Example of Analytics Batch Process Mgmt: Typical Analytics Workflow. [Figure: an 11-step workflow split into 15% / 70% / 15% of the process. Step labels: Time Period Selection; Sister Store Selection; Processing of codes; Basic data pull; Email Request; Client – Analytics Report; Run the optimization process; Set inputs in the code; Processing of codes; Optmodel optimization; Output as excels; Client Report Generation; CPLEX optimization. Tools shown: Teradata, SAS, VBA, Excel.]
    31. Batch Analytics Automation: Utilize "Analytics BPM", a workflow and business process management platform to manage man and machine. BPM is a rapid analytics automation solution that boosts the productivity of analytical workgroups. Analytics BPM allows one to build human and machine workflows and embed them within your enterprise.
        • Nexus BPM v1.5 (open source LGPL; based on JBoss jBPM)
        • muFLOW v2.1 (based on
        • Performance gains across client engagements (columns: Industry, Engagement, Business Impact):
            – Industry: Retail; Technology; Healthcare
            – Engagement: automates reporting framework; optimizes search engine marketing spends; helps create marketing mix solution
            – Business impact: improvement in analyst productivity; increased collaboration and utilization of better process management strategies; improved model granularity; flexible bidding process; increased click-through rates by 8%; easier tool deployment; reduced development time by 15%; supports rapid creation of additional reports
        • Key features: a modeling environment that "glues" together reusable components from multiple technologies and enables automation; a robust and flexible data integration capability; ability to create rich analytics vehicles that can be customized by the client; provision to leverage past knowledge; facilitates efficient process management such as modeling, automation, monitoring, and analysis
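To make the idea of a mixed human and machine workflow on slide 31 concrete, here is a toy Python sketch. It is not the Nexus BPM or muFLOW API. `Step`, `run_workflow`, and the step names are hypothetical, and the human approval is simulated by a function that simply sets a flag.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Step:
    name: str
    kind: str  # "machine" for automated steps, "human" for manual ones
    action: Callable[[Dict], Dict]  # takes the context, returns an updated copy


def run_workflow(steps: List[Step], context: Dict) -> Tuple[Dict, List[str]]:
    """Run the steps in order, threading a context dict through them.

    Returns the final context and a log of "kind:name" entries.
    """
    log = []
    for step in steps:
        context = step.action(dict(context))
        log.append(f"{step.kind}:{step.name}")
    return context, log


# Hypothetical batch process loosely modelled on a data pull, a manual
# sign-off, an optimization run, and report generation.
steps = [
    Step("pull_data", "machine", lambda ctx: {**ctx, "rows": [4, 8, 15]}),
    Step("approve_inputs", "human", lambda ctx: {**ctx, "approved": True}),
    Step("optimize", "machine", lambda ctx: {**ctx, "best": max(ctx["rows"])}),
    Step("build_report", "machine", lambda ctx: {**ctx, "report": f"best={ctx['best']}"}),
]

result, log = run_workflow(steps, {})

if __name__ == "__main__":
    print(log)
    print(result["report"])
```

Because each step is a reusable component with the same signature, steps written against different tools could be chained in one run, which is the "glue" role the slide attributes to the modeling environment.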
    32. Model Management. Professional Shops Require Model Management: "Algo Wars". Key features (Source: SAS Model Manager):
        • Model lifecycle management: "software" parallels; Dev, Test, Stage, Production; versioning; PMML, vendor agnostic
        • Continuous model performance management: champion/challenger; live update and optimization
        • Proactive notification and live switchovers
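The champion/challenger idea on slide 32 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than how SAS Model Manager works: `ModelVersion`, `pick_champion`, the `holdout_auc` metric, and the `min_gain` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelVersion:
    name: str
    version: str
    stage: str  # one of "dev", "test", "stage", "production"
    holdout_auc: float  # assumed performance metric; higher is better


def pick_champion(champion, challenger, min_gain=0.01):
    """Return the model that should serve production traffic.

    The challenger replaces the champion only if it beats the champion's
    metric by at least `min_gain`, which guards against switching on noise.
    """
    if challenger.holdout_auc - champion.holdout_auc >= min_gain:
        return challenger
    return champion


# Hypothetical versions of one churn model.
current = ModelVersion("churn", "1.3", "production", holdout_auc=0.78)
candidate = ModelVersion("churn", "1.4", "stage", holdout_auc=0.80)
winner = pick_champion(current, candidate)

if __name__ == "__main__":
    print(f"serve {winner.name} v{winner.version}")
```

Running the comparison continuously on fresh data, and alerting when the winner changes, corresponds to the "continuous model performance management" and "proactive notification" items above.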
    33. Agenda (Reference Design): • Analytics Trends & Big Data • Reference Designs • How to Scale Analytics & Intelligent Systems
    34. Business Situation: The current business scenario demands automating and injecting intelligence into various enterprise touch points. Add intelligent analytics everywhere you can to unlock the value in information.
        • Customer thoughts: "Yippee! 5% discount"; "Yeah, I think I need these headphones too"; "Yearly bonus: let me buy a new laptop"; "Hey! I have some good deals being provided here"; "Wow! Great offers. Since I have some more money, let me have a look"
        • Touch points: Search ("Buy laptop"); Webpage; Add to Cart; Checkout
        • Behind the scenes: Taxonomy, Associations, Nearest Neighbor; Popup Optimization, Need State Personalization; Collaborative Filtering; Recommender Systems; Offer Optimization; Email; Marketing Mix; CRM; Revenue Management; Customer Segmentation; Propensity Score; Attribution Methods to Optimize Media Spend; Pricing Elasticity
    35. Intelligent Systems: Next Wave of Innovation and Change. We are in the 'Information Revolution' and it's high time that data and analytics capabilities are integrated to scale with machine intelligence. The Next Big Opportunity! Optimal utilization of available information and migration to intelligent systems: convergence of the global industrial system with the power of advanced computing, analytics, low-cost sensing and new levels of connectivity permitted by the Internet.* Elements: Intelligent Machines; Advanced Analytics; People at Work. (* Source: GE, Industrial Internet – Pushing the Boundaries of Minds and Machines)
    36. Current News on Intelligent Systems
        • CEP vendors acquired, June 2013: Streambase (Tibco) and Progress Apama (Software AG)
        • USA Today, September 16, 2012 – probes-math-in-global-trading/57782600/1
        • Two Second Advantage (published 2011): an excellent portrayal of intelligent systems, patterns from the human brain, and mental models
            – Wayne Gretzky's edge: he predicted a little better than anyone else where the hockey puck would be
            – Enterprise 2.0: every transaction became a bit of digital data
            – Enterprise 3.0: every EVENT can become a bit of digital data
            – There will always be value in making projections over days, months, and years using predictive analytics, data mining, and analytics systems on historical data (batch systems)
            – But in today's world, enterprises need something more: an instantaneous, proactive, predictive capability (real time systems)
        • GE Minds and Machines, November 2012: Industrial Internet, Intelligent Systems, and Intelligent Machines
    37. Reference Architecture for Intelligent Systems: High-level design for a robust intelligent system
    38. Real Time Automation. Real Time Computation for Analytics: The Agent Metaphor. Java Agent Development Environment (JADE) is a Java-based middleware that aids agent-oriented programming (AOP) across distributed machines. The JADE platform and framework provide the functionality to easily develop software agents and control the lifecycle of these agents. Agent-based computing is an approach to design and implementation that facilitates the development of sophisticated systems by viewing them as a society of independent communicating agents working together to meet the goals of the system.
        • JADE features: agent communication; distributed platform; agent mobility; monitoring tools like the Remote Monitoring Agent (RMA)
        • Agent attributes: autonomous; goal-directed; sociable
        • RJADE: R agents embedded inside Java, with an R session associated with them
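The three agent attributes on slide 38 (autonomous, goal-directed, sociable) can be shown with a toy message-passing sketch. JADE itself is a Java platform, and this Python code does not use or imitate its API. The `Agent` class, its mailbox, and the numeric goal are assumptions chosen only to keep the example short.

```python
from collections import deque


class Agent:
    """Toy agent that accumulates readings until its goal is reached."""

    def __init__(self, name, goal):
        self.name = name
        self.goal = goal      # goal-directed: target running total
        self.total = 0
        self.inbox = deque()  # sociable: peers deliver messages here

    def send(self, other, value):
        """Deliver a (sender, value) message to another agent's inbox."""
        other.inbox.append((self.name, value))

    def step(self):
        """Autonomous: drain this agent's own inbox and update its state.

        Returns True once the running total has reached the goal.
        """
        while self.inbox:
            _sender, value = self.inbox.popleft()
            self.total += value
        return self.total >= self.goal


sensor = Agent("sensor", goal=0)
collector = Agent("collector", goal=10)
for reading in (3, 4, 5):
    sensor.send(collector, reading)
done = collector.step()

if __name__ == "__main__":
    print(collector.total, done)  # 12 True
```

In a distributed platform the agents would live in separate processes or machines and exchange messages over the network; the sketch keeps everything in one process to show only the interaction pattern.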
    39. Thank You. Chicago, IL; Bangalore, India. 2013. Twitter: @crunchdata