Big Data: It’s all about the Use Cases


Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems, and hopefully set off light bulbs in your head about how big data can help your organization make better business decisions.

Published in: Technology

  1. 1. Big Data: It’s all about the use cases James Serra Big Data Evangelist Microsoft
  2. 2. About Me
      • Business Intelligence Consultant, in IT for 30 years
      • Microsoft, Big Data Evangelist
      • Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
      • Been perm, contractor, consultant, business owner
      • Presenter at PASS Business Analytics Conference and PASS Summit
      • MCSE: Data Platform and Business Intelligence
      • MS: Architecting Microsoft Azure Solutions
      • Blog at
      • Former SQL Server MVP
      • Author of book “Reporting with Microsoft SQL Server 2012”
  3. 3. Use Cases (theory) | Use Cases (practice) | Popular Technologies
  4. 4. Popular Technologies
  5. 5. What is Big Data?
      • Harness the growing and changing nature of data: structured, unstructured, streaming
      • The challenge is combining transactional data stored in relational databases with less structured data
      • Big Data = All Data
      • Get the right information to the right people at the right time in the right format
  6. 6. IoT = sensor-acquired data
      • Things, Connectivity, Data, Analytics
  7. 7. Using a Data Lake: Modern Architecture
      • All data sources are considered
      • Leverages the power of on-prem technologies and the cloud for storage and capture
      • Native formats, streaming data, big data
      • Extract and load, no/minimal transform
      • Storage of data in near-native format
      • Orchestration becomes possible
      • Streaming data accommodation becomes possible
      • Refineries transform data on read
      • Produce curated data sets to integrate with traditional warehouses
      • Users discover published data sets/services using familiar tools
      Diagram:
      • DATA SOURCES: OLTP, ERP, CRM, LOB, non-relational data, future data sources
      • EXTRACT AND LOAD into the DATA LAKE
      • DATA REFINERY PROCESS (TRANSFORM ON READ): transform relevant data into data sets
      • DATA WAREHOUSE: star schemas, views, other read-optimized structures
      • BI AND ANALYTICS: discover and consume predictive analytics, data sets and other reports
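
The "extract and load, transform on read" idea on the data lake slide can be illustrated with a small sketch. This is only an illustration under stated assumptions: the record layout (`device`, `temp_c`, `ts`) and the `refine` function are hypothetical and do not come from the deck. Raw lines are stored untouched, and typing and validation happen only when a curated data set is read out.

```python
import json

# Raw events land in the "lake" unchanged (extract and load, no transform).
# The field names below are hypothetical examples.
RAW_LAKE = [
    '{"device": "pump-1", "temp_c": "71.5", "ts": "2015-06-01T10:00:00"}',
    '{"device": "pump-2", "temp_c": "68.0", "ts": "2015-06-01T10:00:05", "extra": 1}',
    'not valid json',
]


def refine(raw_lines):
    """Transform on read: parse, validate, and type only the fields we need."""
    curated = []
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed rows stay in the lake but are skipped here
        if not isinstance(record, dict):
            continue
        if "device" not in record or "temp_c" not in record:
            continue
        curated.append({"device": record["device"],
                        "temp_c": float(record["temp_c"])})
    return curated


curated = refine(RAW_LAKE)
```

Because the raw lines are never modified, a different refinery can later read the same lake with a different schema, for example one that also keeps `ts`.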
  8. 8. What is Hadoop?
      • Distributed, scalable system on commodity HW
      • Composed of a few parts:
          • HDFS – Distributed file system
          • MapReduce – Programming model
          • Other tools: Hive, Pig, SQOOP, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
      • Main players are Hortonworks, Cloudera, MapR
      • WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
      Diagram: Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware
      • Layers shown: Core Services (HDFS, YARN, MapReduce), Operational Services, Data Services
      • Components shown: SQOOP, Flume, NFS, WebHDFS (load & extract), Oozie, Ambari, Hive & HCatalog, Pig, HBase, Falcon
      • Hadoop cluster: many nodes, each providing compute & storage
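
The slide names MapReduce as Hadoop's programming model. As a rough, single-process sketch of that model (not Hadoop's actual API; the function names here are made up for illustration), a word count splits into a map phase that emits key/value pairs, a shuffle that groups values by key, and a reduce phase that combines each group:

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data lake"])
```

On a real cluster the map and reduce calls run on different nodes next to the data in HDFS, which is why the model suits batch work better than real-time analysis.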
  9. 9. Can I use the cloud with my DW?
      • Public and private cloud
      • Cloud-born data vs on-prem born data
      • Transfer cost from/to cloud and on-prem
      • Sensitive data on-prem, non-sensitive in cloud
      • Look at hybrid solutions
  10. 10. MPP Logical Architecture
      Diagram: one “Control” node (SQL) and four “Compute” nodes (SQL), each compute node with balanced storage; a DMS (Data Movement Service) instance runs on every node.
      1) User connects to the appliance (control node) and submits query
      2) Control node query processor determines best *parallel* query plan
      3) DMS distributes sub-queries to each compute node
      4) Each compute node executes query on its subset of data
      5) Each compute node returns a subset of the response to the control node
      6) If necessary, control node does any final aggregation/computation
      7) Control node returns results to user
      Queries run in parallel, each on a subset of the data, using separate pipes and effectively making the pipe larger.
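
The seven MPP steps amount to a scatter-gather pattern. The sketch below is a toy illustration, assuming threads as stand-ins for compute nodes and a made-up `sales` table distributed by `customer_id`; it is not how PDW/APS is implemented, only the shape of the flow: distribute rows, run the same sub-query on each subset, then aggregate partial results on the control node.

```python
from concurrent.futures import ThreadPoolExecutor


def distribute(rows, node_count):
    """Spread rows across compute nodes by a distribution key (customer_id)."""
    nodes = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row["customer_id"] % node_count].append(row)
    return nodes


def compute_node_query(rows):
    """Step 4: each node aggregates only its own subset of the data."""
    partial = {}
    for row in rows:
        partial[row["region"]] = partial.get(row["region"], 0) + row["amount"]
    return partial


def control_node_query(nodes):
    """Steps 3, 5 and 6: fan out sub-queries, collect partials, aggregate."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(compute_node_query, nodes))
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total


# Hypothetical data: total sales amount per region.
sales = [
    {"customer_id": 1, "region": "east", "amount": 100},
    {"customer_id": 2, "region": "west", "amount": 50},
    {"customer_id": 3, "region": "east", "amount": 25},
    {"customer_id": 4, "region": "west", "amount": 75},
]
result = control_node_query(distribute(sales, node_count=2))
```

The final merge on the control node is cheap because each compute node has already reduced its subset to one row per region.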
  11. 11. NoSQL databases
      • Non-relational databases (semi-structured data)
      • Types: Document, Key-value, Column, Graph
      • Examples: MongoDB, Cassandra, HBase, DocumentDB, Riak
      • Large-scale OLTP (e.g. a popular web application)
      • Scale-out solution
      • High availability
      • JSON data
      • Cons (areas of weakness): data consistency, joining data, using SQL, quick mass updates, available skillset
      • Bad solution for a data warehouse, but can have a place in a big data solution
      • Polyglot persistence: use the right tool for the job
  12. 12. Use Cases (theory)
  13. 13. Speed/Real-time | Batch/Traditional | Hybrid
  14. 14. Modern Data Warehouse: The Dream
  15. 15. The Reality
  16. 16. Let’s set off light bulbs in your head
  17. 17. Data Analytics is needed everywhere
      • Recommendation engines
      • Smart meter monitoring
      • Equipment monitoring
      • Advertising analysis
      • Life sciences research
      • Fraud detection
      • Healthcare outcomes
      • Weather forecasting for business planning
      • Oil & Gas exploration
      • Social network analysis
      • Churn analysis
      • Traffic flow optimization
      • IT infrastructure & Web App optimization
      • Legal discovery and document archiving
      • Intelligence gathering
      • Location-based tracking & services
      • Pricing analysis
      • Personalized insurance
  18. 18. The Internet of Things – Manufacturing
      GLOBAL OPERATIONS (roles shown: Management, R&D, Field Service)
      • I can see my production line status and recommend adjustments to better manage operational cost.
      • I know when to deploy the right resources for predictive maintenance to minimize equipment failures and reduce service cost.
      • I gain insight into usage patterns from multiple customers and track equipment deterioration, enabling me to reengineer products for better performance.
      MANUFACTURING PLANT
      • Aggregate product data, customer sentiment, and other third-party syndicated data to identify and correct quality issues.
      • Manage equipment remotely, using temperature limits and other settings to conserve energy and reduce costs.
      • Monitor production flow in near-real time to eliminate waste and unnecessary work-in-process inventory.
      GLOBAL FACILITY INSIGHT
      • Implement condition-based maintenance alerts to eliminate machine downtime and increase throughput.
      THIRD-PARTY LOGISTICS
      • Provide cross-channel visibility into inventories to optimize supply and reduce shared costs in the value chain.
      CUSTOMER SITE
      • Transmits operational information to the partner (e.g. OEM) and to field service engineers for remote process automation and optimization.
  19. 19. The Internet of Things – Oil & Gas
      Stages: 1. Exploration, 2. Development, 3. Drilling, 4. Production
      Diagram labels: Geologist, Production Manager, Onsite personnel, Operations Control Center, North Shore Production
      • Find new hydrocarbon reservoirs quicker with seismic data uploaded to the cloud and prepared for analysis
      • Utilize advanced 3D and 4D visualizations based on analytic algorithms to model subsurface geology
      • Consolidate data from surveys, drill logs, and external sources to generate advanced reservoir models and production forecasts
      • Combine near real-time drilling and seismic data to optimize drilling trajectories and recovery potential, while minimizing environmental risk
      • Maximize recovery by monitoring near real-time production data and generating alerts for conditional maintenance needs
      • Establish near real-time communication and automatically publish events and alarms to the field to guide and protect onsite personnel and assets
      • Integrate all upstream data onto a unified platform to facilitate analytics, information sharing, and organizational transition
  20. 20. The Internet of Things – Pharma
      Diagram labels: Pharmacy, Customer Service, R&D, Healthcare Provider, Patient Home, Distribution, Manufacturing
      • Monitor device data to make more timely health decisions, such as adjusting dosages
      • Enable advanced product tracking and authentication to prevent counterfeits
      • Develop better products, faster, informed by a much larger data set based on patient outcomes
      • Anticipate medical device maintenance needs, and alert patients to schedule a doctor visit for replacement or repair
      • Monitor medical device functionality for better customer service, reduced risk, and insight to improve product designs
      • Manage equipment remotely, using appropriate KPIs
      • Reduce machine downtime with condition-based maintenance alerts
      • Aggregate and correlate data from disparate medical devices with medications and health outcomes for advanced insight
  21. 21. Microsoft Azure services for IoT
      • Producers: heterogeneous client agents, external data sources
      • Event ingestion: Event Hubs (Service Bus)
      • Storage: SQL Database, Table/Blob Storage, DocumentDB, external data sources
      • Transformation: Machine Learning, HDInsight, Stream Analytics, Cloud Services
      • Presentation & action: Azure Websites, Mobile Services, Notification Hubs, Power BI, external services
  22. 22. Use Cases (practice)
  23. 23. Manufacturing
  24. 24. Manufacturer of Automobiles
      Part 1: What They Did | Produces Internet of Things insights for their automobiles
      Industry: Manufacturer | Use case: Internet of Things
      One of the leading multinational automobile corporations and one of the largest companies in the world by revenue. They manufacture over 10 million vehicles a year.
      Challenge
      • Needed to analyze the telemetry being emitted from their luxury car line in real time
      • Wanted to build a scalable, reliable, and highly available solution that can receive and process a large volume of vehicle information and maintenance events
      Solution
      • Use Azure Blob, HDInsight, Storm in HDInsight, HBase in HDInsight, Event Hubs, DocumentDB, Machine Learning, and Power BI
      • Collect IoT data from automobiles:
          • Telemetry data comes in real time
          • Able to process and generate insights around vehicle information and maintenance events
  25. 25. Manufacturer of Automobiles
      Part 2: How They Did It | Produces Internet of Things insights for automobiles
      Use case: Internet of Things
      Collect data from automobiles
      • Send events in real time to Event Hubs
      • Stored into Azure Blobs
      Retrieve reference data and do predictive analytics
      • Get reference data stored in HBase
      • Run ML algorithms on the telemetry to predict outcomes
      Store into queryable store DocumentDB
      • Stored in DocumentDB for Power BI to display as a dashboard
      • Trigger Apache Storm in HDInsight to process and return results back to the vehicles
      Diagram components: Event Hubs, Azure Blob (HDFS store), HBase, Azure ML, DocumentDB (NoSQL store), Power BI (live dashboard), Apache Storm on HDInsight
  26. 26. Power and Utilities & Oil and Gas
  27. 27. Industrial automation company partnering with multinational oil company
      Part 1: What They Did | IoT internet-connected sensors to generate analytics for proactive maintenance
      Industry: Oil and Gas | Use case: IoT, Analytics
      A leading industrial automation company that employs over 20,000 people, partnering with a leading multinational oil and gas company (one of the six oil and gas supermajors) that employs over 90,000 people.
      Challenge
      • Manage sites used for dispensing liquefied natural gas (clean fuel for commercial customers who do heavy-duty road transportation)
      • Built LNG refueling stations along US interstate highways
      • Stations are unmanned, so they built 24x7 remote management and monitoring to track diagnostics of each station for maintenance or tuning
      • Built internet-connected sensors embedded in 350 dispenser sites worldwide, generating tens of thousands of data points per second (temperature, pressure, vibration, etc.)
      • Data needs outgrew the company’s internal datacenter and data warehouse
      Solution
      • Chose Azure HDInsight, Data Factory, SQL Database, Machine Learning
      • Dashboards used to detect anomalies for proactive maintenance:
          • Changes in performance of the components
          • Energy consumption of components
          • Component downtime and reliability
      • Future: goal is to expand the program to hundreds of thousands of dispensers
  28. 28. Industrial automation company partnering with multinational oil company
      Part 2: How They Did It | IoT internet-connected sensors to generate analytics for proactive maintenance
      Use case: IoT, Analytics
      Collect data from internet-connected sensors
      • Tens of thousands of data points per second
      • Interpolate time series prior to analysis
      • Stored raw sensor data in Blobs every 5 minutes
      Use Hadoop to execute scripts and Data Factory to orchestrate
      • Hive and Pig scripts orchestrated by Data Factory
      • Data resulting from scripts loaded in SQL Database
      • Queries detect site anomalies to indicate maintenance/tuning
      Produced dashboards with role-based reporting
      • Azure Machine Learning, SSRS, Power BI for O365
      • Provide users with customizable interface
      • View current and historical data (day-to-day operations, asset performance over time, etc.)
      • Leveraged Azure Mobile Notification Hub for real-time notifications, alarms, or important events
      Use Azure ML to predict
      • Understand which pumps, run at what speeds, maximize water supply while minimizing energy use
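
Two steps on this slide, interpolating the time series before analysis and querying for anomalies, can be sketched in a few lines. This is a simplified illustration, not the customer's Hive or Pig logic: it assumes linear interpolation, at least two readings with strictly increasing timestamps, and a fixed allowed range as the anomaly rule. The sample readings and limits are invented.

```python
def interpolate(readings, step):
    """Linearly resample irregular (timestamp, value) readings onto a regular grid.

    Assumes at least two readings with strictly increasing timestamps.
    """
    readings = sorted(readings)
    grid = []
    t = readings[0][0]
    i = 0
    while t <= readings[-1][0]:
        # Advance to the pair of readings that brackets time t.
        while readings[i + 1][0] < t:
            i += 1
        (t0, v0), (t1, v1) = readings[i], readings[i + 1]
        grid.append((t, v0 + (v1 - v0) * (t - t0) / (t1 - t0)))
        t += step
    return grid


def flag_anomalies(grid, low, high):
    """Return the grid points whose value falls outside the allowed range."""
    return [(t, v) for t, v in grid if not low <= v <= high]


# Hypothetical pressure readings at seconds 0, 4 and 10.
raw = [(0, 10.0), (4, 18.0), (10, 30.0)]
grid = interpolate(raw, step=2)
alerts = flag_anomalies(grid, low=0.0, high=25.0)
```

Resampling onto a regular grid first means readings from different sensors line up on the same timestamps, which keeps later joins and comparisons simple.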
  29. 29. Government
  30. 30. Secretary of Finance and Public Credit - Government
      Part 1: What They Did | Fraud and Money Laundering Detection
      Industry: Government | Use case: Fraud Detection
      Government organization that handles finances, taxes, budget, income, and national debt for their country.
      Challenge
      • The government passed a law requiring all invoices to be submitted in electronic format
      • The tax department allows clients to upload their digital documents (pay stubs, expenditure slips) and now has 4 billion documents uploaded
      • Want to get insights into the data to do analysis, identify trends and fraud, and ensure compliance with tax obligations
      Solution
      • Built an electronic digital invoicing solution to upload invoices (pay stubs, expenditure slips)
      • Use HDInsight to run queries and to process the electronic invoices to gain insights
      • Needed to scale to a peak of 150+ million invoices uploaded per day
      • Do fraud detection by understanding what people are doing to detect anomalies (e.g. tax fraud, money laundering)
      • Output of the system saved to SQL Server on-premises databases to run ad hoc queries
  31. 31. Secretary of Finance and Public Credit - Government
      Part 2: How They Did It | Fraud and Money Laundering Detection
      Use case: Fraud Detection
      Store electronic digital invoices as XML documents in Azure Blobs
      • Store approximately 4 billion invoices total
      • Store 40 million – 180 million files every day
      • Data is stored as XML files with metadata information
      • Average size of each XML document is 5-10KB
      Use Azure HDInsight (>140 node clusters)
      • Do batch querying
      • Use Hive, Pig, and MapReduce
      • Hive external tables to make files queryable
      • Run once per day
      • Detect anomalies / fraud
      Send to SQL Server in IaaS VM and then to SQL Server on-premises
      • SQOOP data from Azure Blobs to SQL Server VMs
      • ETL to SQL Server on-premises
      • Do BI on top of SQL Server as a data mart
      Diagram components: website to submit electronic documents
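
The figures on this slide imply a storage footprint that is easy to work out. The short calculation below is a back-of-the-envelope sketch; it assumes decimal units (1 KB = 1000 bytes, 1 TB = 1000^4 bytes), which the slide does not specify, and it ignores metadata and replication overhead.

```python
KB = 1000        # assumption: decimal kilobyte
TB = 1000 ** 4   # assumption: decimal terabyte


def size_range_tb(doc_count, min_kb=5, max_kb=10):
    """Return (low, high) storage in TB for doc_count XML files of 5-10 KB each."""
    return doc_count * min_kb * KB / TB, doc_count * max_kb * KB / TB


# Approximately 4 billion invoices in total.
total_low, total_high = size_range_tb(4_000_000_000)

# 40 million to 180 million new files per day.
daily_low, _ = size_range_tb(40_000_000)
_, daily_high = size_range_tb(180_000_000)
```

Under these assumptions the archive is roughly 20 to 40 TB, growing by about 0.2 to 1.8 TB per day, which is consistent with using a large cluster for a once-a-day batch run rather than a single database server.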
  32. 32. Entertainment and Gaming
  33. 33. Game Development Company
      Part 1: What They Did | In-game Analytics
      Industry: Gaming | Use case: In-game Analytics
      A predominantly mobile-based game development company. While they are a mid-sized organization, they have partnered with media giants on various gaming projects.
      Challenge
      • As a game development studio, they wanted to do in-game analytics to understand their players better and what they do in the games
      Solution
      • Chose Azure HDInsight (MapReduce and Storm) and Service Bus, and also use SQL Server for reporting
      • Switched from Amazon AWS EMR
      • Collects telemetry and logging data to gain in-game analytics:
          • How many players are using the game
          • How many players invited their friends
          • How far along players got in the tutorial
          • How many attempts they made on one level/stage
  34. 34. Game Development Company
      Part 2: How They Did It | In-game Analytics
      Use case: In-game Analytics
      Collect data from games in Azure Blobs
      • Game sends telemetry/logging data as JSON files
      • Contains every action of the user in the game
      • Data is pushed to Azure Service Bus in real time
      • Tens of gigabytes of data captured daily
      HDInsight picks up real-time data and processes it
      • From Service Bus, HDInsight processes using Apache Storm and MapReduce
      • Constantly running experiments to determine insight
      • A/B testing
      • In-game metrics and analytics
      • Spin up 32-node cluster nightly for four hours
      Output sent to SQL Server for BI
      • Transfer data to SQL Server for BI
      Diagram components: Service Bus, SQL Server on-premises
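
The in-game metrics listed for this customer (player counts, invites, tutorial progress, attempts per level) can be derived from JSON telemetry with a simple aggregation. The sketch below is a single-process illustration only; the event names and fields (`player`, `event`, `step`, `level`) are hypothetical, since the deck says only that the game sends telemetry as JSON.

```python
import json
from collections import Counter, defaultdict

# Hypothetical telemetry lines; the real schema is not given in the deck.
events_json = [
    '{"player": "p1", "event": "tutorial_step", "step": 3}',
    '{"player": "p1", "event": "level_attempt", "level": 1}',
    '{"player": "p1", "event": "level_attempt", "level": 1}',
    '{"player": "p2", "event": "tutorial_step", "step": 1}',
    '{"player": "p2", "event": "invite_sent"}',
]


def summarize(lines):
    """Aggregate raw events into the four metrics named on the slide."""
    players = set()
    inviters = set()
    tutorial_progress = defaultdict(int)   # furthest tutorial step per player
    attempts = Counter()                   # attempts per (player, level)
    for line in lines:
        e = json.loads(line)
        players.add(e["player"])
        if e["event"] == "invite_sent":
            inviters.add(e["player"])
        elif e["event"] == "tutorial_step":
            tutorial_progress[e["player"]] = max(
                tutorial_progress[e["player"]], e["step"])
        elif e["event"] == "level_attempt":
            attempts[(e["player"], e["level"])] += 1
    return {
        "players": len(players),
        "players_who_invited": len(inviters),
        "tutorial_progress": dict(tutorial_progress),
        "attempts_per_level": dict(attempts),
    }


summary = summarize(events_json)
```

At tens of gigabytes per day the same aggregation would be expressed as a MapReduce or Storm job, with the player or level as the grouping key.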
  35. 35. Non-Profit
  36. 36. JustGiving, Non-Profit
      Part 1: What They Did | Recommendation Engine
      Industry: Non-profit | Use case: Recommendation
      JustGiving is a global online social platform for giving. It's a financial service (not a charity) that lets you "raise money for a cause you care about" through your network of friends. Their goal is to become the "Facebook of Giving".
      Challenge
      • They wanted to identify what was personal and relevant to people and what they cared about, so that they could suggest further causes that may inspire continual involvement
      • With 22 million customers, this meant storing and processing huge amounts of data that their existing infrastructure simply couldn’t support
      Solution
      • Chose SQL Server on-premises, Azure HDInsight, Blobs, Tables, Cache, and Service Bus
      • Deployed a network of “social giving” so that supporting a cause becomes a group activity:
          • Built a way to inform givers of a charity goal based on a person’s position in their social graph
          • Help identify causes that a user might be interested in (based on demographics and their social graph)
          • Recommend people to add to their social graph as well as other charitable causes
  37. 37. JustGiving, Non-Profit
      Part 2: How They Did It | Recommendation Engine
      Use case: Recommendation
      Collect data in Azure Blobs
      • Move data from SQL Server through an Agent to Azure Blobs
      HDInsight processes data for insights
      • Input data is 20-30GB / job
      • Use MapReduce jobs to create a graph
      • Further job to denormalize activity feeds for all users
      • Generates an activity recommendation
      Generates a real-time recommendation
      • Real-time activity feeds/events coming in from Service Bus (~50 events/second)
      • Activity recommendation coming out of daily HDInsight job
      • Sent to website
      Diagram components: SQL Server on-premises, Agent, Azure Blobs, Azure HDInsight, Activity Feeds, Give Graph, Azure Tables, Azure Cache, Service Bus, Web API, Website + Event store (serves results)
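
A recommendation based on a user's social graph can be illustrated with a very small sketch. This is not JustGiving's algorithm, which the deck does not describe in detail; it assumes a simple rule (rank causes by how many of a user's friends support them, excluding causes the user already supports), and the users and cause names are invented.

```python
from collections import Counter


def recommend_causes(user, friends, supports, top_n=2):
    """Rank causes by how many of the user's friends support them."""
    already = supports.get(user, set())
    votes = Counter()
    for friend in friends.get(user, set()):
        for cause in supports.get(friend, set()):
            if cause not in already:
                votes[cause] += 1
    # Sort by vote count (descending), then by name for a stable order.
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    return [cause for cause, _ in ranked[:top_n]]


# Hypothetical social graph and giving history.
friends = {"ann": {"bob", "cat", "dan"}}
supports = {
    "ann": {"shelter"},
    "bob": {"shelter", "literacy"},
    "cat": {"literacy", "wildlife"},
    "dan": {"wildlife", "literacy", "clean-water"},
}
picks = recommend_causes("ann", friends, supports)
```

In the architecture on the slide, the graph itself is built by the daily batch job, while a lookup like this one would run at request time against the precomputed results.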
  38. 38. Resources
      • The Modern Data Warehouse
      • Should you move your data to the cloud?
      • Presentation slides for Modern Data Warehousing
      • Presentation slides for Building an Effective Data Warehouse Architecture
      • Hadoop and Data Warehouses
  39. 39. Q & A
      James Serra, Big Data Evangelist
      • Email me at:
      • Follow me at: @JamesSerra
      • Link to me at:
      • Visit my blog at: (where this slide deck will be posted)