Building Predictive Analytics on Big Data Platforms



SoftServe Innovation Conference in Austin, Texas 2013

Building Predictive Analytics on Big Data Platforms presented by Olha Hrytsay (BI Consultant) and Serhiy Shelpuk (Lead Data Scientist)




Building Predictive Analytics on Big Data Platforms: Presentation Transcript

  • A Day of Empowerment: Building Predictive Analytics on Big Data Platforms
  • Agenda:
      1. Opportunity: Big Data
      2. Demystifying Predictive Analytics
      3. Taking advantage of combined power
  • Striving for an “unfair” competitive advantage
  • Old Days
  • New Days
  • Big Data can look like rubbish
  • Until you find a use for it
  • “Data are becoming the new raw material of business” - Craig Mundie, head of research and strategy, Microsoft
  • Use cases: modeling true risk; network data analysis to predict failure; customer churn analysis; threat analysis; recommendations; feature usage analysis; ad targeting; …
  • Collect and Store:
      • Complex data (text files, audio, video, images, …)
      • Multiple sources
      • Lots of data
    Process:
      • Batch processing
      • Parallel execution
      • Cluster solution
    Analyze:
      • Simple visualization (reports, dashboards)
      • Text mining
      • Sentiment analysis
      • Prediction models
      • Collaborative filtering
  • Event sources (log files, Windows Event Log, WMI, SNMP, database, etc.)
      • Event Ingestion: Event Transport; Event Aggregation and Transformation; Event Serialization and Archiving
      • Event Storage: Event DB; Full-text Index
      • Event Processing and Analytics: Query Engine; Full-text Search engine; Rules Engine; Predictive Analytics
      • Presentation: Interactive Search; Reports and Dashboards; Alerts Visualization
      • Alerts: E-mail, SMS, SNMP, etc.
      • Operational Management Tools
      • User
  • Event sources (log files, Windows Event Log, WMI, SNMP, database, etc.), with example technologies per component:
      • Event Ingestion:
          • Event Transport, Event Aggregation and Transformation: Apache Flume
          • Event Serialization and Archiving: Protobuf, Avro, Thrift, MessagePack
      • Event Storage:
          • Event DB: HDFS, HBase, Cassandra
          • Full-text Index
      • Event Processing and Analytics:
          • Query Engine: Impala
          • Full-text Search engine: Solr, ElasticSearch
          • Rules Engine: Drools
          • Predictive Analytics: R
      • Presentation:
          • Interactive Search: custom
          • Reports and Dashboards: JasperSoft, Tableau
          • Alerts Visualization: custom
      • Alerts: E-mail, SMS, SNMP, etc.
      • Operational Management Tools: Cloudera Manager, Apache Ambari
      • User
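As a rough illustration of how events might pass through the aggregation and rules stages named on the architecture slide, here is a minimal Python sketch. The `Event` fields, the function names, and the error threshold are all hypothetical. The sketch uses none of the listed products (Flume, Drools, and so on) and stands in for them with plain standard-library code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; real sources would carry many more fields."""
    source: str    # e.g. "syslog" or "snmp"
    host: str
    severity: str  # "info", "warning", or "error"


def aggregate(events):
    """Aggregation stage: count events per (host, severity) pair."""
    return Counter((e.host, e.severity) for e in events)


def apply_rules(counts, error_threshold=3):
    """Rules stage: return hosts whose error count reaches the threshold."""
    return sorted(
        host
        for (host, severity), n in counts.items()
        if severity == "error" and n >= error_threshold
    )


# Made-up sample events standing in for ingested log data.
events = (
    [Event("syslog", "web-1", "error")] * 3
    + [Event("syslog", "web-2", "error")]
    + [Event("snmp", "web-2", "info")] * 5
)

alerts = apply_rules(aggregate(events))
print(alerts)  # ['web-1']
```

In a deployment along the lines of the slide, the alert list would feed the e-mail, SMS, or SNMP notification channel rather than being printed.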
  • “The idea that the future is unpredictable is undermined every day by the ease with which the past is explained” ― Daniel Kahneman, Thinking, Fast and Slow
  • More data is available to companies. Storage technologies make it possible to store and work with that data. Advanced analytics can be applied to this new data to gain a competitive advantage.
  • Types of analytics, progressing from hindsight through insight to foresight:
      • Descriptive: What happened?
      • Diagnostic: Why did it happen?
      • Predictive: What is going to happen?
      • Prescriptive: What should we do about that?
  • Decision conditions, ranging from senior (executive) management through middle management to junior (line) management:
      • Ambiguity: the goals to be achieved or the problem to be solved are unclear, alternatives are difficult to define, and information about outcomes is unavailable.
      • Uncertainty: managers know which goals they wish to achieve, but information about alternatives and future events is incomplete.
      • Risk: a decision has clear goals and good information is available, but the future outcomes associated with each alternative are subject to chance.
      • Certainty: all of the information the decision maker needs is fully available.
  • Define objective:
      • Increase customer satisfaction level
      • Identify prospective customers
      • Identify cross-selling opportunities
      • Decrease time to market
      • Decrease costs of marketing campaigns
    Identify data sets:
      • Historical data on customers from the CRM system
      • Geographical location data
      • Smartphone data
      • Social network data
      • Text data from Internet pages
      • Image data from medical sources
    Design the model:
      • Classification model that identifies what an Internet user is interested in
      • Adaptive control models for managing IT and network infrastructure
      • Probabilistic model for determining creditworthiness
    Design the solution:
      • Data storage type
      • Logical database design
      • Availability and scalability of the solution
      • Integration into the corporate information environment
      • Solution deployment model
    Implement the solution:
      • Add new functionality to the existing corporate BI platform
      • Implement a new BI solution
      • Enrich an existing business system (CRM, ERP) with predictive analytics functionality
  • Business tasks, model families, and algorithms:
      • Classification
          • Business tasks: define prospective customers; define traffic jams in the city; recommend restaurants and menus; adjust the UI to the particular user; classify the body part on an X-ray image
          • Algorithms: Naïve Bayes; logistic regression; Support Vector Machines; neural networks
      • Clustering
          • Business tasks: define a market niche; define influencers in social networks; define similar customers or projects in a portfolio; define informal groups in the organization
          • Algorithms: K-Means; K nearest neighbor; self-organizing maps; mixture of Gaussians
      • Anomaly Detection
          • Business tasks: detect fraudulent bank transactions; detect network intrusion attempts; provide automatic aircraft engine testing; provide automatic IT infrastructure monitoring; provide clinical test analysis
          • Algorithms: mixture of Gaussians; self-learning anomaly detection
      • Optimization
          • Business tasks: define the best price for goods or services to maximize profits; define the best working schedule for a store; define the best amount of production; define the best business rules
          • Algorithms: gradient descent; simplex method; Newton’s method; normal equations; genetic algorithms
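To make two of the listed items concrete, the following minimal sketch trains a logistic regression classifier (from the Classification family) with batch gradient descent (from the Optimization family). It is an illustration only: the single standardized feature, the labels, the learning rate, and the epoch count are made-up assumptions, and a real project would more likely use an established library.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.5, epochs=2000):
    """Fit weight w and bias b for one feature by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction errors for the current parameters.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean log loss with respect to w and b.
        grad_w = sum(err * x for err, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


def predict(w, b, x):
    """Classify as 1 when the estimated probability is at least 0.5."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0


# Hypothetical toy data: one standardized feature, label 1 = "prospective customer".
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_logistic_regression(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(predictions)
```

The same loop structure carries over to several features by replacing the scalar `w` with a weight per feature.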
  • In the news:
      • Google to Buy Waze for $1.3 Billion
      • Xerox plans to clear traffic on I-10
      • The promise of better data has MetLife investing $300M in new tech
      • Gracenote built a whole business on recommending music
      • Obama’s data scientists built a volunteer army on Facebook
  • Credit Score
    Description: Cloud-based service that provides more accurate estimates of creditworthiness (loan scoring) using publicly available data from social networks. The service is intended for use by banks.
    Technologies: Amazon EC2, MySQL, SAP HANA, R, Java
  • Credit Score data flow:
      1. Facebook, Twitter, and LinkedIn APIs
      2. Preprocessing in MySQL (data filtering, data cleansing)
      3. Processing in SAP HANA (scoring model)
      4. Credit scoring API
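The preprocessing and scoring stages of this flow can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the record fields, the weights, and the logistic form of the scoring model are hypothetical and are not taken from the presented service, which used MySQL, SAP HANA, and R.

```python
import math

# Hypothetical raw records standing in for data pulled from social network APIs.
raw_profiles = [
    {"user_id": "u1", "connections": "250", "account_age_years": "6"},
    {"user_id": "u2", "connections": None, "account_age_years": "2"},  # incomplete
    {"user_id": "u3", "connections": " 40 ", "account_age_years": "1"},
]


def preprocess(records):
    """Filtering and cleansing: drop incomplete records, convert fields to numbers."""
    cleaned = []
    for record in records:
        if record["connections"] is None or record["account_age_years"] is None:
            continue  # filtering step: skip records with missing values
        cleaned.append({
            "user_id": record["user_id"],
            "connections": int(record["connections"].strip()),
            "account_age_years": float(record["account_age_years"]),
        })
    return cleaned


# Illustrative weights only; a real model would be fitted to historical loan data.
WEIGHTS = {"connections": 0.01, "account_age_years": 0.5}
BIAS = -3.0


def score(profile):
    """Scoring step: logistic model returning a value between 0 and 1."""
    z = BIAS + sum(weight * profile[name] for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


scores = {p["user_id"]: score(p) for p in preprocess(raw_profiles)}
print(scores)
```

Keeping cleansing separate from scoring mirrors the split on the slide, where preprocessing and the scoring model run in different systems.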
  • Description: Computer-aided diagnostic system that recognizes the human body part in an X-ray image and detects broken or fractured bones.
    Technologies: Matlab/Octave, Python, PyBrain, NumPy, SciPy
    Flow: X-ray image, then the analytical engine, then the result (“This is a hand. Broken bone detected”).
  • Technology Expertise Services
  • Big Data and NoSQL, Data Warehouse, Data Integration, BI Platforms
  • Big Data Analytics, Predictive Analytics, Data Science Service, Data Integration, Data Warehousing, Data Visualization and Analysis