
Internet of Things & Predictive Analytics


How data captured by the Internet of Things can be imported with tools such as Hive to build a database for analytics


2. INTERNET OF THINGS
• Each "thing" or connected device is part of the digital shadow of a person.
• For there to be a market in the Internet of Things, two things must be true:
  1) The "thing" in question must provide utility to the human, and
  2) The digital shadow must provide value to an enterprise.
3. MARKET
• The "market" is made up of many parts, from wearables to drivables to home devices, plus industrial sensors and controllers.
• Each part is made up of segments: innovators, early adopters, pragmatists, conservatives, and laggards, across many industries.
4. PREDICTIVE ANALYTICS
• From the data streams that implement the "digital shadows" of people, we can use predictive analytics to understand their needs and behavior better than ever before.
• Every new dimension of data increases the predictive power, enabling enterprises to answer the question "what does the human want?"
5. INTERNET OF THINGS & PREDICTIVE ANALYTICS
• Making the Internet of Things and its sibling, predictive analytics, programmable by the same labor pool that developed the apps which drove the mobile revolution makes basic economic sense.
• Data generated by the Internet of Things is coupled with data analysis and data discovery tools and techniques to help business leaders identify emerging developments, such as:
  - machines that might need maintenance, to prevent costly breakdowns, or
  - sudden shifts in customer or market conditions that might signal some action a company should take.
6.
• Through the Internet of Things, the physical world will become a networked information system, with sensors and actuators embedded in real physical objects and linked through wired and wireless networks via the Internet Protocol.
• This holds special value for manufacturing: the potential for connected physical systems to improve productivity in the production process and the supply chain is huge.
• Consider processes that govern themselves, where smart products can take corrective action to avoid damage and where individual parts are automatically replenished.
• Such technologies already exist and could drive the fourth industrial revolution, following the steam engine, the conveyor belt (the assembly line, think Ford Model T), and the first phase of IT and automation technology.
7. EG 1: AUTO INSURANCE
• The first-order vector was a connected accelerometer offered to drivers to improve their insurance rates based on proven "safe driving" habits.
• Through this digital shadow, the insurance provider can make much better actuarial predictions than through the coarse-grained data they had before: age, gender, and traffic violations.
• This is interesting in the same way the BlackBerry was interesting: a basic capability adopted for basic business improvement.
8.
• The second-order vector is much stronger: the ability to transform the insurance market to better meet the needs of customers while changing the rules of competition.
• Based on real-time driving information, insurance companies can move to a real-time spot-pricing model driven by an exchange (not unlike the stock exchange), bidding on drivers and providing insurance on demand.
  - Not driving today? Don't pay for insurance.
  - Need to drive fast tomorrow? Pay a little more, but don't worry about your "permanent record."
• These outcomes are all based on tying the Internet of Things to predictive analytics.
9. EG 2: HEALTH CARE
• The first-order vector is similar: a wearable accelerometer offered to patients to improve traceability of their compliance with their exercise prescription, enabling better outcomes for cardiac patients.
• Unlike prescription refills, exercise compliance was untraceable before, so this digital shadow is a breakthrough for medicine.
• Similar developments exist in digestible sensors within medications, which activate only on contact with stomach acid, providing higher truth and better granularity than a monthly refill.
10.
• In the second-order vector in health care, the ability to combine multiple streams of information that were previously invisible has the potential to drive better health outcomes through provably higher patient compliance.
• Sorting these data streams at scale will allow health providers and health insurance companies to rapidly iterate health protocols across a population of humans, augmenting human expertise with predictive analytics.
• Outcome-based analysis based on predictive models built from data can reduce waste, error rates, and lawsuits while driving better margins.
• Larger exchanges of this type of data will tend to perform better, creating a more effective market and a better pool of empirical research for science.
11. EG 3: AUTO COMPANIES
• Automakers have installed thousands of "black boxes" inside their prototype and field-testing vehicles to capture second-by-second data from the dozens of control units that manage today's automobiles.
• These boxes simply plug into the vehicle's on-board diagnostics (OBD) port, typically located under the front dashboard.
• They collect 500-750 different vehicle performance parameters that add up to terabytes of data in hours.
12.
• The automakers' intent in installing these boxes is to collect data that their engineers can later analyze to fix bugs and improve on existing designs.
• For example, one car manufacturer learned from this data why its minivan batteries were headed toward a recall.
  - The problem was an underpowered alternator: it could not fully recharge the batteries because the most common drive cycle for this particular minivan (think soccer mom taking a kid to practice) was less than 3 miles.
  - As a result, there were many complaints about dead batteries, and the company was potentially facing the recall of millions of minivans that had this alternator.
  - The boxes collect information about driving cycles, and this data was very useful in understanding the real reason behind the dead batteries: the test vehicles with short drive cycles were the ones that reported dead batteries.
  - Simply switching to a higher-capacity alternator fixed the problem, and it was an easy fix to extend this solution to the entire fleet.
13. ENDLESS OPPORTUNITY
• The opportunities are nearly endless, ranging from early fault detection (predicting when a particular component is likely to fail) to automatically adjusting a driving route based on traffic pattern predictions.
• The ultimate test of predictive analytics in the Internet of Things is, of course, fully autonomous systems, such as the Nissan car of 2020 or the Google self-driving car of today.
• In the end, all autonomous systems will need the ability to build predictive capabilities; in other words, machines must learn machine learning!
14. EG 4: GOOGLE'S SELF-DRIVING CAR
• Google claims that its self-driving car of today has logged more than 300,000 miles with almost zero accidents.
• The one time a minor crash did occur, the car was rear-ended by a human-driven car!
• So, when the technology is fully mature, it is not just parking valets who become obsolete; other higher-paying professions, such as automotive safety systems experts, may also need to look for other options.
• Predictive analytics is the enabler that will make this happen.
15. EG 5: JET AIRLINER
• A jet airliner generates 20 terabytes of diagnostic data per hour of flight.
• The average oil platform has 40,000 sensors, generating data 24/7.
• Machine-to-machine (M2M) communication is now generating enormous volumes of data and is testing the capabilities of traditional database technologies.
• To extract rich, real-time insight from the vast amounts of machine-generated data, companies will have to build a technology foundation with speed and scale, because raw data, whatever the source, is only useful after it has been transformed into knowledge through analysis.
• Investigative analytics tools enable interactive, ad-hoc querying on complex big data sets to identify patterns and insights, and can perform analysis at massive scale with precision even as machine-generated data grows beyond the petabyte scale.
16.
• With investigative analytics, companies can take action in response to events in real time and identify patterns to either capitalize on or prevent an event in the future.
• This is especially important because most failures result from a confluence of multiple factors, not just a single red flag.
• To fully address the influx of M2M data generated by the increasingly connected Internet of Things landscape, companies can deploy a range of technologies to leverage distributed processing frameworks like Hadoop and NoSQL and improve the performance of their analytics, including enterprise data warehouses, analytic databases, data visualization, and business intelligence tools.
• These can be deployed in any combination of on-premise software, appliances, or the cloud.
17. FINDING THE RIGHT ANALYTICS DATABASE TECHNOLOGY
• To find the right analytics database technology to capture, connect, and drive meaning from data, companies should consider the following requirements:
  - Real-time analysis: Businesses can't afford for data to get stale. Data solutions need to load quickly and easily, and must dynamically query, analyze, and communicate M2M information in real time, without huge investments in IT administration, support, and tuning.
  - Flexible querying and ad-hoc reporting: When intelligence needs change quickly, analytic tools can't be constrained by data schemas that limit the number and type of queries that can be performed. This type of deeper analysis also cannot be constrained by tinkering or time-consuming manual configuration (such as indexing and managing data partitions) to create and change analytic queries.
18.
  - Efficient compression: Efficient data compression is key to enabling M2M data management within a network node, a smart device, or a massive data center cluster. Better compression allows for less storage capacity overall, as well as tighter data sampling and longer historical data sets, increasing the accuracy of query results.
  - Ease of use and cost: Data analysis must be affordable, easy to use, and simple to implement in order to justify the investment. This demands low-touch solutions that are optimized to deliver fast analysis of large volumes of data, with minimal hardware, administrative effort, and customization needed to set up or change query and reporting parameters.
19. EG 6: UNION PACIFIC RAILROAD
• The railroad is using sensor and analytics technologies to predict and prevent train derailments.
• For example, the company has placed infrared sensors every 20 miles along its tracks to gather 20 million temperature readings of train wheels each day, looking for overheating, which is a sign of impending failure.
• Meanwhile, trackside microphones are used to pick up "growling" bearings in the wheels.
• Data from these physical measurements is sent via fiber-optic lines to Union Pacific's data centers.
• Complex pattern-matching algorithms and analytics are used to identify irregularities, allowing Union Pacific experts to determine, within minutes of capturing the data, whether a driver should pull a train over for inspection or reduce its speed until it reaches the next station for repairs.
20. HOW TO ANALYZE MACHINE AND SENSOR DATA
• This tutorial describes how to refine data from heating, ventilation, and air conditioning (HVAC) systems in 20 large buildings around the world using the Hortonworks Data Platform, and how to analyze the refined sensor data to maintain optimal building temperatures.
• Sensor data: A sensor is a device that measures a physical quantity and transforms it into a digital signal. Sensors are always on, capturing data at low cost and powering the "Internet of Things."
• Potential uses of sensor data. Sensors can collect data from many sources, for example:
  - To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or airplane engines. This data can be used for predictive analytics, to repair or replace these items before they break.
  - To monitor natural phenomena such as meteorological patterns, underground pressure during oil extraction, or patient vital statistics during recovery from a medical procedure.
21.
• Prerequisites:
  - Hortonworks Sandbox (installed and running)
  - Hortonworks ODBC driver (installed and configured)
  - Microsoft Excel 2013 Professional Plus
• Notes:
  - In this tutorial, the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual machine (VM); your screens may be different.
  - Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
  - This tutorial uses the Power View feature in Microsoft Excel 2013 to visualize the sensor data. Power View is currently only available in Microsoft Office Professional Plus and Microsoft Office 365 Professional Plus.
  - Other versions of Excel will work, but the visualizations will be limited to charts. You can also connect any other visualization tool you like.
22.
• Overview. To refine and analyze HVAC sensor data:
  - Download and extract the sensor data files.
  - Load the sensor data into the Hortonworks Sandbox.
  - Run two Hive scripts to refine the sensor data.
  - Access the refined sensor data with Microsoft Excel.
  - Visualize the sensor data using Excel Power View.
23. STEP 1: DOWNLOAD AND EXTRACT THE SENSOR DATA FILES
• Download the sample sensor data, which is provided as a compressed (.zip) folder.
• Save the file to your computer, then extract the files. You should see a SensorFiles folder that contains the following files:
  - hvac.csv: contains the targeted building temperatures, along with the actual (measured) building temperatures. The building temperature data was obtained using Apache Flume. Flume can be used as a log aggregator, collecting log data from many diverse sources and moving it to a centralized data store. In this case, Flume was used to capture the sensor log data, which we can now load into the Hadoop Distributed File System (HDFS).
  - building.csv: contains the "building" database table. Apache Sqoop can be used to transfer this type of data from a structured database into HDFS.
24. STEP 2: LOAD THE SENSOR DATA INTO THE HORTONWORKS SANDBOX
• Open the Sandbox Hue and click the HCatalog icon in the toolbar at the top of the page, then click Create a new table from a file.
• On the "Create a new table from a file" page, type "hvac" in the Table Name box, then click Choose a file under the Input File box.
25.
• On the "Choose a file" pop-up, click Upload a file.
• Use the file upload dialog to browse to the SensorFiles folder that was extracted previously.
• Select the hvac.csv file, then click Open.
26.
• On the "Choose a file" pop-up, click the hvac.csv file.
• The default settings on the "Create a new table from a file" page are correct for this file. Scroll down to the bottom of the page and click Create Table.
27.
• A progress indicator appears while the table is being created.
• When the table has been created, it appears in the HCatalog table list.
28.
• Repeat the previous steps to create a "building" table by uploading the building.csv file.
• Now let's take a look at the two data tables.
• On the HCatalog table list page, select the check box next to the "hvac" table, then click Browse Data.
• You can see that the "hvac" table includes columns for the date, time, target temperature, actual temperature, system identifier, system age, and building ID.
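For readers who prefer the Hive command line to the Hue HCatalog UI, an equivalent "hvac" table can also be created with a short DDL script. This is a sketch only: the column names, types, and the local file path are assumptions based on the column list above, not taken from the tutorial itself.

```sql
-- Sketch: creating the hvac table directly in Hive instead of via Hue.
-- Column names/types and the file path are assumed, not from the tutorial.
CREATE TABLE IF NOT EXISTS hvac (
  record_date STRING,
  record_time STRING,
  targettemp  INT,
  actualtemp  INT,
  system      INT,
  systemage   INT,
  buildingid  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');

-- Load the extracted CSV from the local filesystem into the table.
LOAD DATA LOCAL INPATH '/tmp/SensorFiles/hvac.csv' INTO TABLE hvac;
```

The same pattern applies to the building.csv file and the "building" table.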
29.
• Navigate back to the HCatalog table list page.
• Select the check box next to the "building" table, then click Browse Data.
• You can see that the "building" table includes columns for the building identifier, the building manager, the building age, the HVAC product in the building, and the country in which the building is located.
30. STEP 3: RUN TWO HIVE SCRIPTS TO REFINE THE SENSOR DATA
• Now use two Hive scripts to refine the sensor data. We hope to accomplish three goals with this data:
  - Reduce heating and cooling expenses.
  - Keep indoor temperatures in a comfortable range between 65 and 70 degrees.
  - Identify which HVAC products are reliable, and replace unreliable equipment with those models.
• First, identify whether the actual temperature was more than five degrees different from the target temperature. In the Sandbox Hue, click the Beeswax (Hive UI) icon in the toolbar at the top of the page to display the Query Editor.
31.
• Paste the following script in the Query Editor box, then click Execute.
• To view the data generated by the script, click Tables in the menu at the top of the page, select the checkbox next to hvac_temperatures, and then click Browse Data.
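The script itself does not survive in this transcript. A plausible reconstruction, consistent with the five-degree threshold and the temprange/extremetemp columns described on the next slide, is sketched below; the column names targettemp and actualtemp are assumptions based on the hvac table description on slide 28.

```sql
-- Hypothetical reconstruction of the first refinement script.
-- Derives temprange (NORMAL / COLD / HOT) and extremetemp (1 / 0)
-- from the gap between the target and actual temperatures.
CREATE TABLE IF NOT EXISTS hvac_temperatures AS
SELECT *,
       targettemp - actualtemp AS temp_diff,
       CASE
         WHEN (targettemp - actualtemp) > 5  THEN 'COLD'   -- actual is 5+ degrees below target
         WHEN (targettemp - actualtemp) < -5 THEN 'HOT'    -- actual is 5+ degrees above target
         ELSE 'NORMAL'
       END AS temprange,
       CASE WHEN ABS(targettemp - actualtemp) > 5 THEN 1 ELSE 0 END AS extremetemp
FROM hvac;
```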
32.
• On the Query Results page, scroll to the right. Notice that two new attributes appear in the hvac_temperatures table. The data in the "temprange" column indicates whether the actual temperature was:
  - NORMAL: within five degrees of the target temperature.
  - COLD: more than five degrees colder than the target temperature.
  - HOT: more than five degrees warmer than the target temperature.
• If the temperature is outside of the normal range, "extremetemp" is assigned a value of 1; otherwise, its value is 0.
33.
• Next, combine the "hvac" and "hvac_temperatures" data. In the Sandbox Hue, click the Beeswax (Hive UI) icon in the toolbar at the top of the page to display the Query Editor.
• Paste the following script in the Query Editor box, then click Execute:

create table if not exists hvac_building as
select h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
from building b join hvac_temperatures h on b.buildingid = h.buildingid;
34.
• To view the data generated by the script, click Tables in the menu at the top of the page, select the checkbox next to hvac_building, and then click Browse Data.
• The hvac_building table is displayed on the Query Results page.
35. STEP 4: ACCESS THE REFINED SENSOR DATA WITH MICROSOFT EXCEL
• In this section, use Microsoft Excel Professional Plus 2013 to access the refined sensor data.
• In Windows, open a new Excel workbook, then select Data > From Other Sources > From Microsoft Query.
• On the Choose Data Source pop-up, select the Hortonworks ODBC data source that was installed previously, then click OK. The Hortonworks ODBC driver enables you to access Hortonworks data with Excel and other business intelligence (BI) applications that support ODBC.
36.
• After the connection to the Sandbox is established, the Query Wizard appears. Select the "hvac_building" table in the Available tables and columns box, then click the right arrow button to add the entire "hvac_building" table to the query. Click Next to continue.
• On the Filter Data screen, click Next to continue without filtering the data.
37.
• On the Sort Order screen, click Next to continue without setting a sort order.
• Click Finish on the Query Wizard Finish screen to retrieve the query data from the Sandbox and import it into Excel.
38.
• On the Import Data dialog box, click OK to accept the default settings and import the data as a table.
• The imported query data appears in the Excel workbook.
39. STEP 5: VISUALIZE THE SENSOR DATA USING EXCEL POWER VIEW
• Now that the refined sensor data has been successfully imported into Microsoft Excel, you can use the Excel Power View feature to analyze and visualize the data.
• Begin the data visualization by mapping the buildings that are most frequently outside of the optimal temperature range.
• In the Excel worksheet with the imported "hvac_building" table, select Insert > Power View to open a new Power View report.
40.
• The Power View Fields area appears on the right side of the window, with the data table displayed on the left. Drag the handles or click the Pop out icon to maximize the size of the data table.
• In the Power View Fields area, select the checkboxes next to the country and extremetemp fields, and clear all of the other checkboxes. You may need to scroll down to see all of the checkboxes.
41.
• In the Fields box, click the down arrow at the right of the extremetemp field, then select Count (Not Blank).
• Click Map on the Design tab in the top menu.
42.
• The map view displays a global view of the data.
• You can see that the office in Finland had 814 sensor readings where the temperature was more than five degrees higher or lower than the target temperature.
• In contrast, the German office is doing a better job of maintaining ideal office temperatures, with only 363 readings outside of the ideal range.
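The same per-country comparison shown on the map can also be computed directly in Hive, without Excel. A minimal sketch, assuming the hvac_building table created in step 3:

```sql
-- Sketch: count out-of-range readings per country, mirroring the
-- "count of extremetemp" aggregation used in the Power View map.
SELECT country,
       SUM(extremetemp) AS extreme_readings
FROM hvac_building
GROUP BY country
ORDER BY extreme_readings DESC;
```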
43.
• Hot offices can lead to employee complaints and reduced productivity. Let's see which offices run hot.
• In the Power View Fields area, clear the extremetemp checkbox and select the temprange checkbox.
• Click the down arrow at the right of the temprange field, then select Add as Size.
44.
• Drag temprange from the Power View Fields area to the Filters box, then select the HOT checkbox.
• You can see that the buildings in Finland and France run hot most often.
45.
• Cold offices cause elevated energy expenditures and employee discomfort.
• In the Filters box, clear the HOT checkbox and select the COLD checkbox.
• You can see that the buildings in Finland and Indonesia run cold most often.
46.
• The data set includes information about the performance of five brands of HVAC equipment, distributed across many types of buildings in a wide variety of climates. Use this data to assess the relative reliability of the different HVAC models.
• Open a new Excel worksheet, then select Data > From Other Sources > From Microsoft Query to access the hvac_building table.
• Follow the same procedure as before to import the data, but this time only select the "hvacproduct" and "extremetemp" columns.
47.
• In the Excel worksheet with the imported "hvacproduct" and "extremetemp" columns, select Insert > Power View to open a new Power View report.
• Click the Pop out icon to maximize the size of the data table. In the Fields box, click the down arrow at the right of the extremetemp field, then select Count (Not Blank).
48.
• Select Column Chart > Stacked Column in the top menu.
• Click the down arrow next to "sort by hvacproduct" in the upper-left corner of the chart area, then select "count of extremetemp."
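The reliability ranking behind this chart can likewise be expressed as a Hive query. A minimal sketch, assuming the hvac_building table from step 3; dividing by the total reading count normalizes models that appear in different numbers of buildings:

```sql
-- Sketch: rate of out-of-range readings per HVAC model,
-- lowest rate first (most reliable model at the top).
SELECT hvacproduct,
       SUM(extremetemp)            AS extreme_readings,
       COUNT(*)                    AS total_readings,
       SUM(extremetemp) / COUNT(*) AS extreme_rate
FROM hvac_building
GROUP BY hvacproduct
ORDER BY extreme_rate ASC;
```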
49.
• You can see that the GG1919 model seems to regulate temperature most reliably, whereas the FN39TG failed to maintain the appropriate temperature range 9% more often than the GG1919.
• This tutorial has shown how the Hortonworks Data Platform (HDP) can store and analyze sensor data.
• With real-time access to massive amounts of temperature and other types of data on HDP, a facilities department can initiate data-driven strategies to reduce energy expenditures and improve employee comfort.