Big data analytics presented at meetup big data for decision makers


Presentation on data science techniques as presented at the Washington DC meet-up on Big Data for Decision Makers.

Published in: Technology, Business

  • Think about the access to top talent and how crowd sourcing is allowing organizations to put a bounty on solutions to hard problems.
  • Think about graph analysis and the work being done with SNA today.
  • Think about common patterns and pattern discovery. For example, in cargo, if a ship stops at certain ports, is the probability higher or lower that it picked up some illegal substances on the way?
  • This is a really great example of how different techniques can be combined and reused. It is driving the need for an enterprise analytic data set, as you can start to chain analytics together to do many types of operations.
  • Think about automation of analysis tasks. If I’ve figured out how to bucket things, I may be able to triage the data better according to priorities in my organization.
  • Clustering is really BIG in the big data world right now due to the wide applicability.

    1. 1. The Science Behind Data Science
        Presented at Big Data for Decision Makers
        Ruhollah Farchtchi – Director of Big Data
        December 5, 2013
    2. 2. Agenda
        • Introductions
        • Big Data Analytics Overview
        • Use Cases – Examples of Data Products
        • Building Blocks
        • Data Mining
        • Technologies
        • Operational Models
        © 2013 Unisys Corporation. All rights reserved.
    3. 3. So we’ve got a lot of data…
        • What can we get out of it?
        • How does it help with our business decision making?
        • How is this complex landscape changing?
        • Multiple Types: Tabular / Structured, Documents, Unstructured, Emails, Pictures, Video
        • Multiple Sources: Sensors, Networks, Cyber Infrastructure; Web, Email, Social Media; Enterprise Applications; Mobile Devices, GPS, and many more!
        • Multiple Domains (Defense, Health, Finance, Other): Logistics / Workforce analytics; Cyber and EW; Intelligence Analysis; Drug Discovery; EHR; Epidemic/pandemic prediction; Fraud Detection; Identity Resolution; Customer Support; Supply/Demand Forecasting; MTTB Prediction; Context-based IR
    4. 4. Source: And we’ve got a lot of tools…
    5. 5. Big Data and Data Analytics – A Unisys Point of View
        • Unisys Point of View: Today’s big data is tomorrow’s normal data
          – What remains is the need to extract insights and value out of the data
        • Data Analytics is often the goal or end-product of what organizations want to get out of their data (Big or otherwise)
          – Focused around the capabilities of:
            • Efficient Data Processing – get data in and processed in time to make use of it and in a tenable manner
            • Effective Information Management – ability to make the data accessible and to manage the downstream data products as assets
            • and Expressive Analytics – make sense of the data in a format that is easily digested and incorporated into decision making, i.e., if you need a PhD to interpret the results, you still have work to do here
          – With the aim to increase business value
        • It’s about understanding the data and what you can get out of it
          – “…40% of business leaders had no response when asked what types of information would transform their industries over the next 10 years.” [1]
        [1] Anne Lapkin, 2012. Hype Cycle for Big Data, 2012, Gartner.
    6. 6. Data Analytics is the culmination of Analytics and IT
        [Figure: diagram with two axes. Complexity runs from Backward-looking (Forensic) to Forward-looking (Predictive). Data Volume runs from Low Volume, Variety, Velocity to High Volume, Variety, Velocity, with a Multi-TB Turning Point. Arrow labels: Scale-out; Extend; Leverage for large-scale analytics and data mining; Leverage for large-scale application development & information management.]
        • Modeling and Forecasting: Pattern Recognition, Linear Programming, Global Optimization, Classification, Machine Learning, Simulation
        • Business Intelligence & Data Warehousing: STAR Schema, OLAP, RDBMS, SQL, ETL
        • Big Data & NoSQL: Hadoop, Google BigTable, Map/Reduce, Splunk, Dynamo, Hive, MongoDB, Cassandra, EMC Greenplum, HBase
        • Data Analytics is at the intersection of high volume data processing and advanced analysis. The tools and methodologies here represent a mix of both worlds and there is currently no ‘killer app’.
    7. 7. Challenges
        • Misaligned IT, Analytics, and Business Strategies
        • Ineffective Data Management Strategy
        • Ineffective/inefficient storage and security platforms
        • Inaccessible or siloed analytics (“Cylinders of Excellence”)
        • Untrusted analytic products or analytics that are not timely, accurate, or repeatable (untested)
        • Inability to scale analytic generation (lack of training)
    8. 8. Analytic Environment That Supports Data Processing, Enhances Information Management and Improves Decision Making
        [Figure labels: Data Products, Analyst Focused, IT Focused, Raw Data]
        Building Analytic Environment
        1. Work with business leaders and decision makers to understand and quantify data value chain
        2. View data as an enterprise asset
        3. Innovate through creation of new data products and services
        4. Retrain staff and/or acquire Data Scientist skills
        5. Integrate teams across big data, data warehousing, and business analysis
        6. Revise information management strategies to incorporate big data
        7. Develop new ways of capturing information, e.g., mobile and streaming data
        8. Identify and leverage previously unused internal and external data
    9. 9. Creation of data products is key to analytic reuse
        • What are Data Products?
          – Essentially this is the output of a data science or data mining activity
          – Non-trivial; more than a simple query
          – Requires a platform for processing
        • They can manifest themselves as many things
          – Analytical "engines" running in a larger application (Amazon's recommender engine is a great Data Product)
          – Lists (e.g., Top 10 things I need to know today)
          – Entire applications (e.g., customer baseball cards)
        • However, once they are defined, one thing is true for all
          – It takes a combination of domain agnostic analytic techniques together with domain specific knowledge to produce something relevant and consumable that can be monetized or operationalized.
    10. 10. Examples of Data Products
    11. 11. Use Case #1 - Netflix Recommendation
        • Netflix is about connecting people to the movies they love by leveraging their movie recommendation system: Cinematch℠
        • Cinematch℠ initially was a linear model that helped to predict the user's choices
        • The predictions are used to make personal movie recommendations based on a customer's unique tastes
          – Challenge: Can the recommendation engine be improved upon?
          – Resolution: Set the improvement accuracy level (10%) and create a contest with a $1 million prize
        • Crowdsourcing: Teams merged together in an internet-enabled approach to improve results
        • Netflix provided a training dataset of 100+ million ratings that 480,000 users gave to 17K movies, in which each record is a quadruplet of the form (user, movie, date of grade, grade)
          – Goal is to predict grade
          – Example of Supervised Machine Learning
          – Submitted predictions are scored against the true grades in terms of Root Mean Squared Error (RMSE)
          – RMSE is a frequently used measure of the difference between values predicted by a model and the values observed (i.e., residuals)
          – Similarity is determined by a distance measure such as Jaccard or Cosine distance
        Source: … and Mining of Massive Datasets by Anand Rajaraman and Jeffrey Ullman
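The RMSE scoring described on this slide can be illustrated with a minimal sketch. The grades and both prediction lists below are made-up values for illustration, not Netflix Prize data, and the function name `rmse` is an assumption of this example.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed grades."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical true grades (1 to 5 stars) and two sets of predictions.
true_grades = [4, 3, 5, 1]
baseline = [3, 3, 3, 3]   # always predicts the middle grade
improved = [4, 3, 4, 2]   # closer to the true grades

baseline_rmse = rmse(baseline, true_grades)
improved_rmse = rmse(improved, true_grades)

# Relative improvement, the quantity a contest threshold such as 10% refers to.
relative_gain = (baseline_rmse - improved_rmse) / baseline_rmse
print(baseline_rmse, improved_rmse, relative_gain)
```

A lower RMSE means the predictions sit closer to the observed grades, so a contest entry is compared by how far it reduces the baseline's RMSE.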
    12. 12. Use Case #2 - Google PageRank
        • Google wanted to be able to measure and rank the importance of Web Pages.
          – Challenge: Identify and rank the pages that users would want to view in terms of their relevance
          – Resolution: Develop an algorithm that leverages link analysis and implement it as part of Google’s infrastructure
        • The PageRank algorithm considers a webpage to be important if many other webpages point to it. The linking webpages that point to a given page aren’t treated equally
        • The algorithm takes into account both the importance (PageRank) of the linking pages and the number of outgoing links they have
          – Similar to Social Network Analysis
        • Linking pages with higher PageRank are given more weight while pages with more outgoing links are given less weight.
        • Example of Un-Supervised Machine Learning
        • Link Matrix for the four-page example (Page 1, Page 2, Page 3, Page 4):
          $$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{pmatrix}$$
        Source: The Elements of Statistical Learning, Trevor Hastie, Robert Tibshirani and Jerome Friedman
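The weighting idea on this slide can be sketched as a simple fixed-point iteration over the slide's four-page link matrix. Several things here are assumptions rather than slide content: the convention that entry (i, j) is 1 when page j links to page i (as in the cited textbook), the damping factor 0.85, the update rule p_i = (1 - d) + d * sum_j (L_ij / c_j) * p_j, and the function name `pagerank`. It is an illustration, not Google's production algorithm.

```python
def pagerank(link, damping=0.85, iterations=100):
    """Iterate p_i = (1 - d) + d * sum_j (L_ij / c_j) * p_j to a fixed point."""
    n = len(link)
    # c_j: number of outgoing links of page j (sum of column j).
    out_links = [sum(link[i][j] for i in range(n)) for j in range(n)]
    ranks = [1.0] * n
    for _ in range(iterations):
        ranks = [
            (1 - damping) + damping * sum(
                link[i][j] / out_links[j] * ranks[j]
                for j in range(n) if out_links[j] > 0
            )
            for i in range(n)
        ]
    return ranks


# Link matrix from the slide; LINK[i][j] == 1 is assumed to mean
# "page j+1 links to page i+1".
LINK = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]

ranks = pagerank(LINK)
print([round(r, 2) for r in ranks])  # approximately [1.49, 0.78, 1.58, 0.15]
```

Under these assumptions Page 3 ranks highest because three pages link to it, while Page 4, which nothing links to, keeps only the baseline value of 0.15.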
    13. 13. Use Case #3 - Walmart Data Driven Value Chain
        • Walmart is the leading and largest retailer in the world.
        • Walmart has been a catalyst for technology adoption amongst its suppliers, including requiring partners to leverage RFID technology to track and coordinate inventories.
        • They have a great cross section of data from individual Social Security Information, Geographic detail and product purchases
        • They utilize econometric and marketing mix modeling (multiplicative, log-log, power additive, adstocks, lags and powers) for a number of their key analyses
        • Walmart mines their data to get their product mix correct under different and changing environment conditions.
          – Challenge: Identify the correct product mix in order to protect the firm from too much or not enough inventory
          – Resolution: Mine their multiple data sources for data products that will help tighten and improve operational forecasts
        • For impending hurricane warnings, Walmart found that:
          – Pop Tarts increase in sales (7 times their normal rate)
          – Identified that the top selling premium item was beer
          – Allows the firm to get the supply to the store ahead of time
        [Figure: Sales by Item (Beer, Pop Tarts); model labels GAs = a + b(TV) and GAs = a + b(TV)G]
        Source: What Walmart Knows about Customer Habits, New York Times
    14. 14. Use Case #4 - Amazon Targeted Marketing
        • Amazon is the world's largest online retailer and is known for its e-commerce Web Site, where it uses input about a customer’s interests to generate a list of recommendations.
        • Similar to Netflix they use recommendation algorithms, but they do targeted marketing for items that a customer would want to buy based on their previous purchase patterns
        • The recommendation algorithms personalize the online store for each customer, and the store changes radically based on the customer's interests
          – Challenge(s): Analyze massive amounts of data, deliver results in real time, new customers have very little data, and customer data is very volatile
          – Resolution: Cluster modeling, search based methods and Item to Item Collaborative filtering
        • Cluster Modeling: Identify customers similar to the user by dividing the customer base into segments and treat the task as a classification problem. Typically uses an unsupervised learning algorithm such as K-Means or Hierarchical clustering
        • Search Based Methods: Treats the recommendations problem as a search for related items. Given a user's purchases and rated items, the algorithm constructs a search query to find other popular items by the same author, artist or director, or with similar keywords
        • Item to Item Collaborative Filtering: Customized algorithm that is able to scale to massive data sets and produces high quality recommendations in real time. This algorithm matches each of the user's purchased and rated items to similar items and then combines those similar items into a recommendation list. It uses offline and online components to increase performance
        Source: Recommendations: Item to Item Collaborative Filtering. Greg Linden, Brent Smith and Jeremy York
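The item-to-item idea on this slide (match each purchased item to similar items, then combine them into one list) can be sketched as follows. The purchase data, the function names, and the choice of cosine similarity over binary "who bought it" vectors are assumptions of this sketch, not details confirmed by the slide.

```python
import math
from collections import defaultdict

# Hypothetical purchase history: customer -> set of purchased items.
PURCHASES = {
    "alice": {"book_a", "book_b", "dvd_x"},
    "bob": {"book_a", "book_b"},
    "carol": {"book_b", "dvd_x"},
    "dave": {"book_a", "cd_y"},
}


def item_similarities(purchases):
    """Offline step: cosine similarity between items, based on shared buyers."""
    buyers = defaultdict(set)
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sims = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a == b:
                continue
            common = len(buyers[a] & buyers[b])
            if common:
                sims[a][b] = common / math.sqrt(len(buyers[a]) * len(buyers[b]))
    return sims


def recommend(customer, purchases, sims, top_n=3):
    """Online step: sum similarity scores of items the customer does not own yet."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, score in sims.get(item, {}).items():
            if other not in owned:
                scores[other] += score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]


SIMS = item_similarities(PURCHASES)
bob_recs = recommend("bob", PURCHASES, SIMS)
print(bob_recs)
```

Splitting the work this way mirrors the offline and online components mentioned on the slide: the similarity table is expensive but can be precomputed, while the per-customer lookup stays cheap.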
    15. 15. Unisys Big Data Analytics Building Blocks
    16. 16. Big Data Analytics Methodology: Modeling Components
        • Decision Making & Forecasting: Provide actionable intelligence into the future state
        • Models: Statistical model applied to input data that separates the portion of volume due to each of the variables or factors. We use the term model because it is a simplification of reality.
        • Data: Internal Data, Demographic Data, 3rd Party Data
    17. 17. Data Mining
    18. 18. Data Mining - Motivations
        • We’ve covered big data
          – There’s a lot of it!
        • New Modus Operandi
          – Gather whatever data you can, whenever and wherever possible
        • New Expectation
          – Data gathered will have value, either for the purpose it was collected or for a purpose not yet envisioned
        • Challenge: There will never be enough analysts to sift through it all
    19. 19. Data Mining Definitions
        • Non-trivial extraction of implicit, previously unknown and potentially useful information from data (normally large databases)
        • Exploration & analysis, by automatic or semiautomatic means, of large quantities of data in order to discover meaningful patterns.
        • Part of the Knowledge Discovery in Databases Process.
        Source:
    20. 20. Data Mining Tasks
        Prediction Methods: Use some variables to predict unknown or future values of other variables
        • Classification
          – For a given set of attributes, apply a model for the class (what you want to predict) as a function of the attributes
        • Regression
          – Predict a value of a given continuous valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency
        • Deviation Detection
          – Detect significant deviations from normal behavior
        Description Methods: Find human interpretable patterns that describe the data.
        • Clustering
          – Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
            • Data points in one cluster are more similar to one another
            • Data points in separate clusters are less similar to one another
        • Association Rule Discovery
          – Given a set of records each of which contain some number of items from a given collection:
            • Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
        • Sequential Pattern Discovery
          – Given a set of sequences and support threshold, find the complete set of frequent subsequences
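Of the tasks listed on this slide, deviation detection is the simplest to show in a few lines. The sketch below flags values that sit far from the mean in units of standard deviation; the z-score rule, the threshold of 2.0, the function name, and the sample numbers are all assumptions chosen for illustration, and the slides do not prescribe a particular method.

```python
import statistics


def deviations(values, z_threshold=2.0):
    """Return the values whose distance from the mean exceeds z_threshold standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > z_threshold]


# Hypothetical daily transaction totals with one unusual day.
daily_totals = [102, 98, 101, 99, 100, 97, 103, 250]
outliers = deviations(daily_totals)
print(outliers)
```

A single extreme value inflates both the mean and the standard deviation, so on real data a more robust statistic such as the median may be a better baseline.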
    21. 21. Classification - Example: Tax Fraud
        Training Data Set
          Tid  Refund  Marital Status  Taxable Income  Cheat
          1    Yes     Single          125k            No
          2    No      Married         100k            No
          3    No      Single          70k             No
          4    Yes     Married         120k            No
          5    No      Divorced        95k             Yes
          6    No      Married         60k             No
          7    Yes     Divorced        220k            No
          8    No      Single          85k             Yes
          9    No      Married         75k             No
          10   No      Single          90k             Yes
        Test Data Set
          Refund  Marital Status  Taxable Income  Cheat
          Yes     Single          125k            ?
          No      Married         100k            ?
          No      Single          70k             ?
          Yes     Married         120k            ?
        [Figure labels: Training Data Set, Learn Classifier, Model, Test Data Set]
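The train-then-apply flow on this slide can be sketched with the slide's own rows. The tree shape used here (refund first, then marital status, then an income cut-off) is an assumption written by hand, not a model taken from the slides; only the income threshold is learned from the training rows. All function names are hypothetical.

```python
# Training rows from the slide: (refund, marital_status, income_k, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test rows from the slide; the class label is unknown.
TEST = [
    ("Yes", "Single", 125),
    ("No", "Married", 100),
    ("No", "Single", 70),
    ("Yes", "Married", 120),
]


def learn_threshold(rows):
    """Pick the income cut-off with the fewest errors among no-refund, unmarried rows."""
    leaf = sorted(
        (income, label)
        for refund, status, income, label in rows
        if refund == "No" and status != "Married"
    )
    # Candidate cut-offs: midpoints between neighbouring incomes.
    candidates = [(a[0] + b[0]) / 2 for a, b in zip(leaf, leaf[1:])]

    def errors(threshold):
        return sum(
            ("Yes" if income >= threshold else "No") != label
            for income, label in leaf
        )

    return min(candidates, key=errors)


def predict_cheat(refund, marital_status, income_k, threshold_k):
    """Hand-written tree shape (an assumption): refund, then status, then income."""
    if refund == "Yes" or marital_status == "Married":
        return "No"
    return "Yes" if income_k >= threshold_k else "No"


threshold = learn_threshold(TRAINING)
training_accuracy = sum(
    predict_cheat(r, m, i, threshold) == label for r, m, i, label in TRAINING
) / len(TRAINING)
test_predictions = [predict_cheat(*row, threshold) for row in TEST]
print(threshold, training_accuracy, test_predictions)
```

Fitting the training rows perfectly says little about unseen data; a real project would hold out labeled rows to estimate accuracy before trusting the model.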
    22. 22. Classification – Your Turn
        • Fraud Detection
        • Goal: Predict fraudulent cases in credit card transactions.
        • Approach:
          – What kind of data will you try to get?
          – Can you say something about the characteristics of the data?
          – Estimate the size of the data.
          – What kind of pitfalls might you run into?
    23. 23. Fraud Detection
        • Goal: Predict fraudulent cases in credit card transactions.
        • Approach:
          – Use credit card transactions and the information on the accountholder as attributes.
          – When does a customer buy, what does he buy, how often does he pay on time, etc.
          – Label past transactions as fraud or fair transactions. This forms the class attribute.
          – Learn a model for the class of the transactions.
          – Use this model to detect fraud by observing credit card transactions on an account.
    24. 24. Clustering - Example
        • Document Clustering:
          – Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
          – Approach: To identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
          – Gain: Search tools can utilize the clusters to relate a new document or search term to clustered documents.
        • Clustering Points: 3204 Articles of Los Angeles Times.
        • Similarity Measure: How many words are common in these documents (after some word filtering).
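The similarity measure on this slide (words in common after some filtering) can be sketched on a toy scale. The four headlines, the stop-word list, the threshold of two shared words, and the simple single-link grouping are all assumptions of this example; the slide does not say which clustering algorithm was used on the newspaper articles.

```python
import re

# Small, hypothetical stop-word filter.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "for", "on"}

# Made-up headlines standing in for articles.
DOCS = {
    "d1": "The city council approved the budget for new schools",
    "d2": "Council vote on the schools budget delayed",
    "d3": "Dodgers win the series in extra innings",
    "d4": "Extra innings thriller gives Dodgers the series lead",
}


def terms(text):
    """Lower-case words of a document with stop words removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def common_words(a, b):
    """Similarity measure: number of filtered words two documents share."""
    return len(terms(a) & terms(b))


def cluster(docs, min_common=2):
    """Put a document into the same group as any document it shares enough words with."""
    clusters = []
    for name, text in docs.items():
        matching = [
            c for c in clusters
            if any(common_words(text, docs[other]) >= min_common for other in c)
        ]
        merged = {name}.union(*matching)
        clusters = [c for c in clusters if c not in matching] + [merged]
    return clusters


groups = cluster(DOCS)
print(sorted(sorted(g) for g in groups))
```

Counting raw shared words favors long documents; weighting terms by frequency, as the slide's approach suggests, is the usual next refinement.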
    25. 25. Clustering - Illustration
        Seems straightforward for a small number of dimensions… what if there were more?
    26. 26. Clustering - Illustration Source: We [human beings] have a limited ability to visualize and reason over a large number of dimensions – clustering helps
    27. 27. Association Rules
        • Classic Association Rule Example:
          – If a customer buys diapers and milk, then he is very likely to buy beer.
        • Applications: Supermarket shelf management.
          – Goal: To identify items that are bought together by sufficiently many customers.
          – Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
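Whether a rule such as "diaper and milk, then beer" holds for "sufficiently many customers" is usually judged with two numbers, support and confidence. The sketch below computes both on five made-up baskets; the data and function names are assumptions for illustration.

```python
# Hypothetical point-of-sale baskets.
TRANSACTIONS = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"diaper", "beer"},
]


def support(itemset, transactions):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"diaper", "milk", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper", "milk"}, {"beer"}, TRANSACTIONS)
print(rule_support, rule_confidence)
```

In this toy data the rule appears in 2 of 5 baskets, and 2 of the 3 baskets with diaper and milk also contain beer; a shelf-management analysis would keep only rules above chosen minimum values for both.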
    28. 28. Technologies
    29. 29. Hadoop -- So what is Hadoop, Really? - Dilbert It’s just a framework
    30. 30. Hadoop and MapReduce
        • Hadoop is an open-source framework (written in Java) to store and process gobs of data across many commodity computers
        • Hadoop is designed to solve a different problem: the fast, reliable analysis of structured, unstructured and complex data.
        • Hadoop and related software are designed for the 3V’s:
          – (1) Volume: Commodity hardware and open source software lower cost and increase capacity
          – (2) Velocity: Data ingest speed aided by append-only and schema-on-read design
          – (3) Variety: Multiple tools to structure, process, and access
        • Hadoop consists of two elements: reliable, very large, low-cost data storage using the Hadoop Distributed File System (HDFS), and a high-performance parallel/distributed data processing framework called MapReduce.
        • HDFS is self-healing, high-bandwidth clustered storage. MapReduce is essentially fault tolerant distributed computing.
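The MapReduce processing model named on this slide can be illustrated with a single-process word count. This is plain Python that imitates the map, shuffle, and reduce phases; it is not the Hadoop API, and the function names are assumptions of this sketch. On a real cluster the framework distributes the map and reduce calls across nodes and reruns failed tasks.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big insights", "data products"])
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, the work can be split across machines without the functions needing to know about each other.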
    31. 31. The Hadoop Stack
        • Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers.
        • You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server.
        • Hadoop is self-healing. It can deliver data, and can run large-scale, high-performance processing batch jobs, in spite of system changes or failures.
        The four primary areas in which to use Hadoop:
        1) To aggregate “data exhaust”: messages, posts, blog entries, photos, video clips, maps, web graph…
        2) To give data context: friends networks, social graphs, recommendations, collaborative filtering…
        3) To keep apps running: web logs, system logs, system metrics, database query logs…
        4) To deliver novel mashup services: mobile location data, clickstream data, SKUs, pricing…
    32. 32. Operational Models
    33. 33. Data Products Become the Drivers to Identify New Insights, Cost Savings and Increased Efficiencies
        [Figure labels: Your Customers, Feedback, Internal Data Sets, External Data Sets, Data Analytics Environment, Knowledge Repository, Populate, Analytics Engine]
        • Decreased time to analytics
        • Reuse of analytics tools
        • Focus on analytic vs. IT integration
        • More self-service
        • Incorporation of external data
        • Ability to scale to analytic needs
        • Supports analytics lifecycle
    34. 34. Thank you