Automated Analytics at Scale

  1. 1. Automated Analytics at Scale Model Management in Streaming Big Data Architectures Chris Kang
  2. 2. • Machine learning allows organizations to proactively discover patterns and predict outcomes for their operations, and improving those insights requires deploying better analytical models on their data. • Finding the best analytical model requires running thousands of hypotheses on various datasets and comparing models in a brute force approach. • Currently a model management framework does not exist - that is, an agnostic tool or framework that manages and orchestrates the entire lifecycle of a model. Real-time Analytics at Scale Copyright © 2016 Accenture All rights reserved. Challenges of Model Management
  3. 3. Model Management Framework operationalizes analytics to ease development and deployment of analytical models The framework provides key benefits to operationalize and democratize access to analytical modeling at scale Copyright © 2016 Accenture All rights reserved. Captures and templates analytical models created by expert data scientists for easy reuse Faster development of analytical models with rapid iteration of training and comparing models using a brute-force approach Presents a champion-challenger view to visually compare and promote trained models Reduces complexity for data scientists to train and deploy models Enables business analysts and others to participate in the modeling process
  4. 4. Model Management Framework is essential for the Internet of Things platform The Internet of Things platform exposes thousands of sensors that require models to be automatically managed and maintained as well as provide easy access to the predicted results Identify desired insights Identify insights for operationalizing devices/machinery for various purposes: anomaly detection, predictive maintenance, budget and resource optimization Collect data Collect various types of data (time series or static) and store them in the databases that best fit the data type Analyze Train the analytical models using the model management framework or other analytical tools such as R, then onboard them to the framework Actuate and optimize Set up rules to act on predicted results from thousands of sensors, e.g. schedule maintenance or lower the temperature on a device Copyright © 2016 Accenture All rights reserved. 4
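The "Actuate and optimize" step can be as simple as rules applied to each model's predicted results. A minimal sketch; the sensor IDs, thresholds, and prediction fields are hypothetical placeholders, not values from the deck:

```python
# A minimal sketch of rule-based actuation on predicted results.
# Sensor IDs, thresholds, and prediction fields are hypothetical placeholders.
def actuate(sensor_id, predicted_failure_prob, predicted_temp):
    actions = []
    if predicted_failure_prob > 0.8:   # schedule maintenance before a likely failure
        actions.append(f"schedule maintenance for {sensor_id}")
    if predicted_temp > 90.0:          # lower temperature on the device
        actions.append(f"lower setpoint on {sensor_id}")
    return actions

# Example: a pump whose model predicts a high failure probability
print(actuate("pump-17", predicted_failure_prob=0.92, predicted_temp=95.5))
```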
  5. 5. Background
  6. 6. Organizations today have an unprecedented amount of data available because of the Internet of Things, the web, and social media In order to take advantage of this massive set of data, organizations must build analytics platforms Copyright © 2016 Accenture All rights reserved. Source: IBM, Big Data Hub, 2013
  7. 7. Traditional analytics platforms use big data technologies to process and analyze large amounts of data “Excited by big data technology capabilities to store more data, more diverse data and more real-time data, (companies) focus on data collection. Rapidly growing data stores put increasing pressure on figuring out what to do with this data. Determining the value of the collected data becomes the top challenge in all industries.” Source: Svetlana Sicular, Gartner, October 30 2015 Example Technologies The steps to derive value out of the data include collecting, processing, and analyzing the data using a variety of big data tools. Analytics and Visualization Data Processing Data Collection Store huge volumes of data in multiple data stores in a variety of data types for processing. Process the data by filtering, transforming, and applying machine learning algorithms using computing engines. Create ad hoc reports on processed data using business intelligence and visualization tools. Copyright © 2016 Accenture All rights reserved. 7
  8. 8. Enterprises need access to both historical and real-time data to gain the most value out of big data analytics • Real-time means data is processed in sub-seconds to seconds from the time the data arrives to when the results are derived. • Batch processing technologies alone are insufficient because in the time it takes to process a batch (hours, days), real-time data has accumulated and is missed, which generates a loss of opportunity for proactive decision making. Storing data in a fault-tolerant, replicated historic store, processing a large batch of data, and storing the processed data using batch writes incur delays that make real-time analytics infeasible Queries are only directed at stale data of up to hours or days. The lack of real-time data limits the analytics to ad-hoc summarizations and aggregations. Because of the batch processing delay, by the time the captured data is available for queries, it is stale Real-time data is missed by the time analytics begins Historic Data Store Batch Batch Write Data Query Storage Processing Serving Real-Time Data Copyright © 2016 Accenture All rights reserved. 8
  9. 9. The Lambda Architecture empowers real-time analytics by handling data at scale and in real-time using a hybrid architecture • The architecture was designed by Nathan Marz, the creator of the Apache Storm project and previously a lead engineer at Twitter; the goal was to build a general architecture to process big data at scale. • The architecture separates batch processing on historical data from stream processing on the real-time flow of data, allowing for analytics on data that combines the most up-to-date data with historical data views. Real-time analytics can now be performed on data that combines the most up-to-date data with historical views BATCH LAYER focuses on processing historical data views for queries SPEED LAYER handles the complexity of real-time data collection and analysis Historic Data Store Batch Batch Write Data Query Storage Processing Serving Queue Speed Random Write Copyright © 2016 Accenture All rights reserved. 9
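To make the batch/speed split concrete, a query merges a precomputed batch view with the speed layer's incremental view. A minimal sketch assuming hypothetical in-memory views and a per-sensor average as the query; none of the names below come from the deck:

```python
# A minimal sketch of the Lambda Architecture's query-time merge: batch views are
# precomputed over historical data, while the speed layer holds incremental results
# for data not yet absorbed by a batch run. Store names and values are hypothetical.
def query_sensor_average(sensor_id, batch_view, speed_view):
    """Combine the precomputed batch view with the real-time increment."""
    batch = batch_view.get(sensor_id, {"sum": 0.0, "count": 0})   # historical view
    recent = speed_view.get(sensor_id, {"sum": 0.0, "count": 0})  # not yet batched
    total_count = batch["count"] + recent["count"]
    if total_count == 0:
        return None
    return (batch["sum"] + recent["sum"]) / total_count

# Example: batch view built hours ago, speed layer holds the last few minutes
batch_view = {"sensor-42": {"sum": 101_250.0, "count": 2_500}}
speed_view = {"sensor-42": {"sum": 412.0, "count": 10}}
print(query_sensor_average("sensor-42", batch_view, speed_view))
```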
  10. 10. In the Internet of Things, predictive modeling on sensor data allows organizations to discover patterns and predict outcomes for their operations Remediation Notification and Alerts Oil & Gas Producer Water Utility Client NoSQL for Unstructured Data Computing Engines and Stream Processors Machine Learning Algorithms Model Runtime Environments Sensors at Field Sites Predictive Results Data Collection Data Processing Predictive Modeling Proactive Decision Making Collects data from over 190,000 sensors Collects data from sensors placed along pipes in a water distribution network Injects 6,000 rows/second and 11 billion rows of data per month – larger analytics platform than Twitter Processes data for water flow rate and pressure Has over 3,500 models analyzing data using various algorithms Apply predictive model to project forward in time to see spikes or falls that exhibit warning signs of failure Enables company to examine huge sets of data, discover trends to predict outcomes in operation and exploration efforts Use results from predictive model to proactively reduce pressure spikes, avoiding leaks, prolonging the longevity of assets, and reducing disruption to customers • The real value of big data is the insight via the analytics, not just the collection of the data. • Predictive modeling is the primary means by which companies can discover trends and make proactive, as opposed to reactive, decisions on data. Copyright © 2016 Accenture All rights reserved. 10
  11. 11. The modeling process is iterative and its lifetime spans both the batch mode model training and real-time prediction In general, a model creates an output for an unknown target value given a defined set of inputs. In a time-series model, the target value also depends on time as an input Copyright © 2016 Accenture All rights reserved. Build Model • Identify required data and how to get it • Design and validate specific analytic models • Verify approach through initial set of insights on particular environments Analyzes a variety of machine learning algorithms and identifies the logistic regression model as the most suitable for the problem. Codes the model as a .JAR file Train Model • Prepare historical data for training • Select model input parameters and runtime environment • Train the model on data from historical batch and/or real-time stream in runtime environment Selects input parameters such as the regularization parameter for the logistic regression model. Submits the model to Spark to train the model on historical data in HDFS Monitor Execution • Monitor the status of training the model in the runtime environment (e.g. running, succeeded, failed) • Troubleshoot issues in the runtime environment if necessary Opens the terminal, SSHes into the Hadoop cluster, and enters the commands to verify the status of the model as it is trained Compare Models • Compare trained models in champion-challenger fashion • Brute force approach to finding best-of-breed model for deploying to live stream After iteratively training many models, selects the best-of-breed model based on the lowest mean square error Operationalize Model • Deploy best model on live stream of data • Generate predicted results for automated or manual proactive decision making • Observe results to feed back and fine-tune the model Submits the model to Spark Streaming to be applied to streaming data ingested from Kafka, and the model predicts in real time whether a sensor will fail Data Scientist: "I want to deploy a model that can detect if a sensor is faulty in real-time" (swim lanes: Data Science, System Administration)
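The "Train Model" and "Compare/Operationalize" steps above map onto a few lines of Spark ML code. A minimal sketch in PySpark, assuming hypothetical HDFS paths, feature columns, and a binary "failed" label; none of these names come from the deck:

```python
# A minimal sketch of training a logistic regression model on historical sensor
# data in HDFS and persisting it for later deployment. Paths, column names, and
# parameter values are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("sensor-failure-model").getOrCreate()

# Train: load historical sensor readings from HDFS (hypothetical path and columns)
historical = spark.read.parquet("hdfs:///data/sensors/historical")
assembler = VectorAssembler(inputCols=["pressure", "flow_rate", "temperature"],
                            outputCol="features")
train_df = assembler.transform(historical)

# Select input parameters such as the regularization parameter (regParam)
lr = LogisticRegression(labelCol="failed", featuresCol="features", regParam=0.01)
model = lr.fit(train_df)

# Compare: inspect a quality metric for this candidate before champion selection
print("training AUC:", model.summary.areaUnderROC)

# Operationalize: persist the trained model so a streaming job can apply it later
model.write().overwrite().save("hdfs:///models/sensor-failure/champion")
```

A real pipeline would loop this over many parameter settings and datasets, which is exactly the brute-force comparison the framework described later automates.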
  12. 12. Copyright © 2016 Accenture All rights reserved. Challenges with Analytical Modeling in the Current State
  13. 13. Building, training, and deploying analytical models require a rare combination of data science and engineering skills The ability to complete the modeling process is limited to specialized individuals who are experts in both data science and engineering “The United States alone faces a shortage of 140,000 to 190,000 people with analytical expertise and 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of big data.” Source: McKinsey Global Institute analysis Traditional Strengths Potential Hurdles with Model Building and Deployment Full Set of Skills Needed for Model Building and Deployment Mathematics, statistics, machine learning, data mining, pattern recognition, predictive algorithms, domain expertise Troubleshooting and running a runtime environment such as Spark requires advanced system engineering skills, which a data scientist may not be trained in. This can potentially lead to slower development and deployment of predictive models. • Understanding of a variety of machine learning algorithms, pattern recognition, as well as expertise in a domain. • Ability to build and code accurate models based on the problem space. • System administrator skills as well as deep understanding of big data systems to deploy models in a runtime environment. Domain expertise, business processes, requirements gathering Traditional business analysts may lack core skills in data science or data engineering because they lack the experience to build, train, or deploy models Combination of data science skills as well as software engineering and system administrator skills for big data systems May lack domain expertise, in which case it may take longer to build and train relevant models for the use case Data Scientist Business Analyst Dual Data Scientist and Engineer Copyright © 2016 Accenture All rights reserved. 13
  14. 14. Analytical models are not easily reusable or shareable, resulting in siloed analytics work There is no standard method for sharing models to let users leverage models created by other data scientists, so the analytics work is siloed. This is true for both freshly built models and models that were already trained on a dataset Predictive models duplicate and sprawl as data scientists build and train their own individual libraries of models that are not shared. No standard for sharing or viewing other data scientists’ models Individual Libraries of Models Data scientists primarily leverage their own libraries of models and previous datasets they worked with to select an algorithm and build a model for the current problem Model Duplication As models are built and trained, the same types of models may be built by more than one data scientist, particularly if the types of models are common in the industry’s use cases Model Sprawl Over time, as more data scientists build and train more models, the models begin to sprawl and duplicate unnecessarily, making the central management of models more difficult Train and deploy individual models Runtime execution environments for model training and deployment Copyright © 2016 Accenture All rights reserved. 14
  15. 15. Without a framework, the current approach is too inflexible to support multiple runtime execution environments It is impractical to scale the number of runtime environments to train and deploy models using a manual approach Spark model with R dependencies Model with R dependencies I have a model, but I don’t know which runtime environment can support it I’m only familiar with R, so I need to learn all the environments to test my model I have a new type of model so I need to learn another runtime environment Runtime environments often cannot support all types of models. As a result, data scientists must spend time learning environments instead of using that time for analytical modeling. Dependencies match and runtime can support model Missing Spark functionality to execute model Missing specific R dependency so cannot support model All R libraries supported and can execute model Data Scientist Update Test Learn • Data scientist needs to acquire the system administration skills to operate the runtime environments • Each runtime environment is unique and requires time and energy • In the worst case, the data scientist must try every runtime environment before successfully finding a match for the predictive model • As more model types are needed, additional runtime environments must be learned • Learning additional environments becomes a time-consuming endeavor Copyright © 2016 Accenture All rights reserved. 15
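The matching the data scientist does by hand here (does this runtime support the model's language and libraries?) is mechanical. A minimal sketch of such a pre-check; the runtime names and library lists are hypothetical, and this is not the framework's actual Runtime Verifier:

```python
# A minimal sketch of pre-checking which runtime environments can support a model
# before submitting it, the manual matching described on the slide above.
# All runtime names and library lists are hypothetical placeholders.
RUNTIMES = {
    "spark-2.0": {"languages": {"scala", "python"}, "libraries": {"mllib"}},
    "r-3.3":     {"languages": {"r"},               "libraries": {"glmnet", "forecast"}},
    "spark-r":   {"languages": {"r"},               "libraries": {"glmnet"}},
}

def compatible_runtimes(model_language, model_libraries):
    """Return runtimes whose language and library support cover the model's needs."""
    return [name for name, rt in RUNTIMES.items()
            if model_language in rt["languages"]
            and set(model_libraries) <= rt["libraries"]]

# An R model that needs the 'forecast' package matches only r-3.3 in this example
print(compatible_runtimes("r", ["forecast"]))   # ['r-3.3']
```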
  16. 16. Lack of engineering abstraction makes it difficult to quickly train predictive models on data Data scientists lose productivity as the process to train models is manual, requiring a manual check for the status of a model in the environment as well as system administration for troubleshooting the model in the environment Need for abstraction grows as the number of types of models and runtime environments increases Wasted productivity – Spending time on data engineering instead of comparing models to find the best-of-breed for deployment No abstractions for training or monitoring models on runtime environments Train model Repeated for hundreds of models on various runtime environments Check status of model Troubleshoot model Train model Check status of model Troubleshoot model Train model Check status of model Troubleshoot model Build many models on various algorithms More time spent on system administration Less time spent on building predictive models Try different input parameters and algorithms to find best-of-breed model ….. Manual Process Data Scientist Build Model Train Model and Monitor Status Copyright © 2016 Accenture All rights reserved. 16
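The manual train/check/troubleshoot cycle is essentially a submit-and-poll loop per model, repeated hundreds of times. A minimal sketch of that loop; submit_job and get_job_status are hypothetical stand-ins for runtime-specific commands (for example spark-submit plus checking the cluster UI or YARN status):

```python
# A minimal sketch of the manual train/check loop that a monitoring service would
# take over. The submit_job and get_job_status callables are hypothetical wrappers
# around runtime-specific tooling; they are not part of any published framework.
import time

def train_and_monitor(model_jar, params, submit_job, get_job_status):
    job_id = submit_job(model_jar, params)     # e.g. a wrapper around spark-submit
    while True:
        status = get_job_status(job_id)        # e.g. a wrapper around a status query
        if status in ("SUCCEEDED", "FAILED"):
            return job_id, status              # troubleshoot manually on FAILED
        time.sleep(30)                         # repeated for hundreds of models
```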
  17. 17. Copyright © 2015 Accenture All rights reserved. Model Management Framework for Automated Analytics at Scale
  18. 18. Model Management Framework simplifies the training, deployment, and management of a large number of models for a Lambda architecture Model management is a framework for data scientists and users to more easily train and deploy analytical models in various runtime environments on the Lambda architecture. It abstracts the system administration, reduces the complexity of training and deployment, and shares models in a way that is consumable by users across your organization, enabling other users such as business analysts to partake in the modeling process. The framework in this reference architecture proposes • Model Store and Trained Model Store: A library of models of commonly used machine learning algorithms that can be trained on the user’s historical datasets, as well as trained models that are available to be deployed. • Model Interface Templates: Interfaces that abstract away the complexity of the machine learning algorithm, allowing users to specify the inputs and outputs of the model. • Deployment and Scheduler: Automatic training, deployment, and scheduling of models on runtime environments so that users do not need to operate the runtime environments themselves. • Runtime Verifier: Ability to determine which runtime environments can support a model prior to execution, enabling faster development of trained models. • Monitoring Service and Metadata Store: The service monitors the status of the model during its execution on the runtime environment for the user and stores metadata about the execution. • API: Exposes functionalities with API endpoints for users to verify, train, deploy, and monitor models on runtime environments. Real Time Analytics Runtime Environments Distributed Computing Scientific Computing Model Management Deployment and Scheduler Runtime Verifier Model Store Metadata Store Trained Model Store Monitoring Service API Model Interface Templates Users Data Scientists Business Analysts Copyright © 2016 Accenture All rights reserved. 18
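The API component listed above exposes verify, train, deploy, and monitor operations as endpoints. A minimal client-side sketch; the host, endpoint paths, and payload fields are hypothetical illustrations of those capabilities, not a published Accenture API:

```python
# A minimal sketch of a client calling hypothetical model-management API endpoints.
# The base URL, paths, and JSON fields are illustrative placeholders only.
import requests

BASE = "http://model-manager.example.com/api/v1"

# Verify which runtime environments can support the model before execution
verify = requests.post(f"{BASE}/models/sensor-failure/verify").json()

# Train the model on a historical dataset in a chosen runtime environment
job = requests.post(f"{BASE}/models/sensor-failure/train", json={
    "runtime": verify["runtimes"][0],
    "dataset": "hdfs:///data/sensors/historical",
    "params": {"regParam": 0.01},
}).json()

# The monitoring service tracks execution status; the client simply polls the API
status = requests.get(f"{BASE}/jobs/{job['id']}").json()["status"]
print(status)
```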
  19. 19. • Design for seamless interfaces is the method of connecting the various stages of the modeling pipeline to support domain experts/data scientists in creating and updating models and business analysts in extracting data insights. • Model management at scale is specific to large-scale data analytics, which requires distributed resource allocation and communicates with various data stores. Model Management Framework provides seamless interfaces along the data analytics pipeline for model creation, deployment, and scheduling The framework in this technical architecture proposes • Runtime Environments: Backend runtime environments such as Spark, MapReduce, R, and more interact with distributed resources (e.g. Hadoop) to train and deploy models • Historical Data Store: Data virtualization interacts with various databases (e.g. Cassandra, Redshift, S3) • Training, Prediction, Model Runtime Services: Framework services interact with the runtime service to deploy and allocate resources for models as well as verify models for execution • APIs: APIs interact with framework services for various functionalities • Online Message Queue: Real-time data is injected into the message queue Copyright © 2016 Accenture All rights reserved. 19 Prediction Service Training Service API User Interface Resource Allocation Service Model Store Results Store Model Metadata Store Historical Data Storage Runtime Environments Model Runtime Service Online Message Queue Data Scientist Business Analyst
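On the prediction path, the online message queue feeds real-time data to a deployed model whose results land in the results store. A minimal sketch using Spark Structured Streaming with Kafka (requires the spark-sql-kafka package on the classpath); the broker address, topic, schema, and model path are hypothetical placeholders, and the console sink stands in for a results store:

```python
# A minimal sketch of applying a trained model to real-time data read from a
# message queue (Kafka). All names, paths, and the schema are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("sensor-prediction-service").getOrCreate()

schema = StructType([StructField("sensor_id", StringType()),
                     StructField("pressure", DoubleType()),
                     StructField("flow_rate", DoubleType()),
                     StructField("temperature", DoubleType())])

# Online message queue: real-time sensor readings arrive on a Kafka topic
readings = (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "sensor-readings")
            .load()
            .select(from_json(col("value").cast("string"), schema).alias("r"))
            .select("r.*"))

# Load the previously trained champion model and apply it to the stream
model = LogisticRegressionModel.load("hdfs:///models/sensor-failure/champion")
features = VectorAssembler(inputCols=["pressure", "flow_rate", "temperature"],
                           outputCol="features").transform(readings)

query = (model.transform(features)
         .select("sensor_id", "prediction")
         .writeStream.format("console").start())   # console stands in for a results store
query.awaitTermination()
```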
  20. 20. Demo Copyright © 2016 Accenture All rights reserved.
  21. 21. Model Management Framework covers a number of features to support various perspectives The framework provides the following features from its services to better serve domain experts/data scientists and business analysts Copyright © 2016 Accenture All rights reserved. • Automatic model deployment on multiple runtime environments: Automatically prepares the trained model to serve real-time data, deploying the saved .jar file to multiple runtime environments with pre-verification prior to execution. • Modeling algorithm library: A library with algorithms for machine learning and statistical learning. • Model metadata: A model profile describing the configuration parameters, paths to input/output data, model version, and resource consumption. • Heterogeneous data stores: Data can be stored in various databases. • Champion-challenger model: Multiple models, with the best-performing model as the champion and the rest as challengers. • Batch mode and real-time mode: A combination of model training and serving the model on real-time data. • Model update: Retraining of the current model or re-selection of the champion model. • Job completion time estimation: Estimate of how soon a job can be completed given the current resources. • Prediction results query and UI: Access to prediction results from applying the trained model to real-time data, for dashboard display. • Algorithm parameter tuning: Automatic fine-tuning of algorithm parameters to achieve the best model quality.
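The champion-challenger feature reduces to selecting, among trained models, the one that optimizes a user-defined metric. A minimal sketch assuming a hypothetical in-memory stand-in for the trained-model and metadata stores, with mean squared error as the metric:

```python
# A minimal sketch of champion-challenger selection: the model with the best
# user-defined metric (lowest MSE here) becomes the champion, the rest remain
# challengers. Model IDs and metric values are hypothetical placeholders.
trained_models = [
    {"id": "lr-regParam-0.01", "mse": 0.042},
    {"id": "lr-regParam-0.10", "mse": 0.051},
    {"id": "rf-100-trees",     "mse": 0.038},
]

champion = min(trained_models, key=lambda m: m["mse"])
challengers = [m for m in trained_models if m["id"] != champion["id"]]

print("champion:", champion["id"])
print("challengers:", [m["id"] for m in challengers])
```

Re-running this selection as new models finish training is what the "Model update" feature describes: either the current champion is retrained, or a better challenger is promoted.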
  22. 22. Deploy Accenture’s Model Management Framework on-premise to operationalize analytics in a big data analytics platform At Accenture Labs, we have a patent-protected invention on the model management framework that showcases the unique capabilities of our framework. If you have analytical models running in a big data analytics platform, we can help deploy our model manager in your environment before problems arise as the number of types of models and runtime environments you need to support increases Copyright © 2016 Accenture All rights reserved. Simplified modeling process for data scientists Abstracts data engineering and presents a champion-challenger view for your data scientists to more quickly train, compare, and promote their models for deployment. Provide analytics for Internet of Things use cases Processing data from heterogeneous data stores allows sending data from thousands of sensors through the modeling pipeline to leverage the existing platform’s analytical capabilities. Enabled for real-time analytics The model manager can deploy prediction jobs that ingest streaming data and apply a trained model for real-time predictions. Greater coverage of runtime environments and models Extends the capability to support additional runtime environments, increasing the number of types of models you can use in your data pipeline. Democratized access to analytics Sharing a library of models created by experts allows other data scientists and business analysts to leverage the models for their use cases.
  23. 23. Contact Information Accenture Labs Teresa Tung Technology Labs Fellow teresa.tung@accenture.com Carl Dukatz R&D Senior Manager carl.m.dukatz@accenture.com Copyright © 2016 Accenture All rights reserved. 23 Chris Kang R&D Associate Principal chris.kang@accenture.com
  24. 24. Appendix Copyright © 2016 Accenture All rights reserved.
  25. 25. The solution: A new Model Management Framework Simplifying model deployment at scale Copyright © 2016 Accenture All rights reserved. A simplified interface RESULTS • Enables a catalog approach to finding analytics • Simplified onboarding of new analytics • Brute-force approach to retraining and comparing models Comprises a model building service, a prediction service, and a resource allocation service Supports end-to-end analytical modeling at scale using the Lambda Architecture Hides the complexity of Lambda and unlocks its power for data scientists, domain experts, and business analysts
  26. 26. Benefits of the new framework Unlocking the power of Lambda for data scientists, domain experts, and business analysts Copyright © 2016 Accenture All rights reserved. Data scientists and domain experts who generate the models can: • Select from already captured modeling approaches or onboard their own • Easily compare models in a champion-challenger fashion Business analysts who rely on the models’ results can select from a catalog of models created by experts
  27. 27. Model Management Framework differs from other approaches in its enablement of big data capability with heterogeneity and scalability Other analytics approaches focus on designing and fine-tuning machine learning algorithms to improve accuracy, with modeling tools that are hard to scale or speed up. For example, the WEKA libraries provide comprehensive machine learning algorithms but lack the capability to integrate with big data or manage thousands of models. For example, Apache Mahout works with Hadoop MapReduce but suffers slowdowns from frequent writes to disk. Comparison Examples Model Management Framework • I want to run my analytics on a distributed dataset of TBs or PBs that is geographically distributed and stored in various databases • I want to deploy multiple models on distributed resources and let the framework automatically select the best model based on the metrics I have defined • I want to specify the prediction interval and query the results by calling API endpoints • I want to always use the up-to-date model by having the framework retrain the current model or select a new champion model Other Model Management • I want to improve my SVM classification algorithm by 3% in terms of accuracy with my 300MB dataset residing on my local disk • I want to try various algorithms and fine-tune parameters to see how the accuracy can be improved • I want to apply the trained model to new data for prediction by calling the modeling method and specifying where to store the results. I need to try multiple prediction intervals to see which works. • I want to see the prediction results by plotting the data from the file where the results are saved Copyright © 2016 Accenture All rights reserved. 27
