Transcript of "Internet of things & predictive analytics"
INTERNET OF THINGS
PRASAD NARASIMHAN – TECHNICAL ARCHITECT
INTERNET OF THINGS
• Each “thing” or connected device is part of the digital shadow of a person
• For there to be a market in the internet of things, two things must be true:
1) The “thing” in question must provide utility to the human, and
2) The digital shadow must provide value to an enterprise.
• The “market” is made up of many parts :
From wearable to drivable to home and
Industrial sensors and controllers, and
• Each part is made up of segments :
Laggards across many industries.
• From the data streams that implement the “digital shadows” of people, we can
use predictive analytics to understand their needs and behavior better than ever before.
• Every new dimension of data increases the predictive power, enabling
enterprises to answer the question “what does the human want?”
INTERNET OF THINGS
• Transforming the internet of things and its sibling, predictive analytics, to be
programmable by the same labor pool that has developed the apps which drove
the mobile revolution makes basic economic sense.
• The types of data generated by the internet of things are coupled with data discovery tools and
techniques to help business leaders identify emerging developments, such as:
machines that might need maintenance to prevent costly breakdowns, or
sudden shifts in customer or market conditions that might signal some action a company should take.
• With the internet of things, the physical world will become a networked information
system through sensors and actuators embedded in real physical objects and
linked through wired and wireless networks via the Internet Protocol.
• This holds special value for manufacturing: the potential for connected physical systems to
improve productivity in the production process and the supply chain is huge.
• Consider processes that govern themselves, where smart products can take
corrective action to avoid damages and where individual parts are automatically
• Such technologies already exist and could drive the fourth industrial
revolution, following the steam engine, the conveyor belt (assembly line;
think Ford Model T), and the first phase of IT and automation technology.
EG 1 : AUTO INSURANCE
• The first-order vector was a connected accelerometer offered to drivers :
to improve their insurance rates based on proven “safe driving” habits.
• Through this digital shadow, the insurance provider can make much better
actuarial predictions than through the coarse-grained data they had before.
• This is interesting in the same way the BlackBerry was interesting: a basic
capability adopted for basic business improvement.
• The second-order vector is much stronger :
the ability to transform the insurance market to better meet the needs of customers
while changing the rules of competition.
Based on real-time driving information, insurance companies can:
move to a real-time spot-pricing model driven by an exchange (not unlike the stock
bidding on drivers and
providing insurance on demand. Not driving today? Don’t pay for insurance. Need to drive
fast tomorrow? Pay a little more but don’t worry about your “permanent record”.
• These outcomes are all based on tying the internet of things to predictive analytics.
EG 2 : HEALTH CARE
• The first-order vector is similar: a wearable accelerometer offered to patients
to improve traceability of their compliance with their exercise prescription,
enabling better outcomes for cardiac patients.
Unlike prescription refills, exercise compliance has been untraceable before, so this
digital shadow is a breakthrough for medicine.
• Similar developments exist in ingestible sensors within medications,
which activate only on contact with stomach acid,
providing a more trustworthy record and
better granularity than a monthly refill.
• The second-order vector in health care is the ability to combine multiple streams of
information that were previously invisible, which has the potential to drive better health
outcomes through provably higher patient compliance.
• Sorting these data streams at scale will allow health providers and health insurance
companies to rapidly iterate health protocols across a population of humans,
augmenting human expertise with predictive analytics.
• Outcome-based analysis based on predictive models built from data can reduce
error rates and
lawsuits while driving better margins.
• Larger exchanges of this type of data will tend to
create a more effective market and
a better pool of empirical research for science.
EG 3 : AUTO COMPANIES
• Automakers have installed thousands of "black boxes" inside their prototype and field
testing vehicles to capture second-by-second data from the dozens of control
units which manage today's automobiles.
• These boxes simply plug into the vehicle's on-board diagnostic (OBD) port,
which is typically located under the front dashboard of the car.
• They collect 500-750 different vehicle performance parameters that add up to
terabytes of data in hours!
• The intent of the automakers for installing these boxes is to collect data which their
engineers can later analyze to fix bugs and improve on existing designs.
• For example, one car manufacturer found out from this data that their minivan
batteries would end up in a recall.
The problem was an underpowered alternator - it was not able to fully recharge the
batteries because the most common drive cycle for this particular minivan (think soccer
mom taking kid to practice) was less than 3 miles.
As a result, there appeared to be a lot of complaints about dead batteries and the
company was potentially facing the recall of millions of minivans which had this
The boxes collect information about driving cycles and this data was really useful in
understanding the real reason behind the dead batteries.
The test vehicles which had short drive cycles were the ones which reported dead
batteries! Simply changing to a higher-capacity alternator could fix the problem.
Now it was an easy fix to extend this solution to the entire fleet.
• The opportunities are vast,
ranging from early fault detection (predicting when a particular component is likely to fail)
to automatically adjusting driving routes based on traffic pattern predictions.
• The ultimate test of predictive analytics in the internet of things is, of course, fully
autonomous systems, such as:
the Nissan car of 2020 or
the Google self-driving car of today.
• In the end, all autonomous systems will need the ability to build predictive
capabilities. In other words, machines must learn machine learning!
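The minivan battery diagnosis described above boils down to comparing failure rates between short and long drive cycles. A minimal sketch of that comparison follows; the vehicle records are invented for illustration, and only the 3-mile threshold comes from the slides.

```python
# Hypothetical per-vehicle summaries: (typical trip length in miles, dead battery reported?)
# These values are made up for illustration; they are not the manufacturer's data.
vehicles = [
    (2.1, True), (2.8, True), (1.9, False), (12.5, False),
    (8.0, False), (2.5, True), (15.2, False), (6.4, False),
]

SHORT_TRIP_MILES = 3.0  # the "less than 3 miles" drive cycle mentioned in the slides


def dead_battery_rate(rows):
    """Return the fraction of vehicles in rows that reported a dead battery."""
    if not rows:
        return 0.0
    return sum(1 for _, dead in rows if dead) / len(rows)


# Split the fleet by drive cycle length and compare the two failure rates.
short_cycle = [v for v in vehicles if v[0] < SHORT_TRIP_MILES]
long_cycle = [v for v in vehicles if v[0] >= SHORT_TRIP_MILES]

short_rate = dead_battery_rate(short_cycle)
long_rate = dead_battery_rate(long_cycle)
print(f"short cycles: {short_rate:.0%} dead, long cycles: {long_rate:.0%} dead")
```

A large gap between the two rates points at the drive cycle, rather than the battery itself, as the thing to investigate.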
EG 4 : GOOGLE’S SELF DRIVING CAR
• Google claims that their self-driving car of today has logged more than 300,000
miles with almost zero incidence of accidents.
• The one time a minor crash did occur was when the car was rear-ended by a
• So, when the technology is fully mature, it is not just parking valets who may
become obsolete; other higher-paying professions, such as automotive safety
systems experts, may also need to look for other options!
• Predictive analytics is the enabler that will make this happen.
EG 5 : JET AIRLINER
• A jet airliner generates 20 terabytes of diagnostic data per hour of flight.
• The average oil platform has 40,000 sensors, generating data 24/7.
• M2M is now generating enormous volumes of data and is testing the capabilities
of traditional database technologies.
• To extract rich, real-time insight from the vast amounts of machine-generated
data, companies will have to build a technology foundation with speed and scale
because raw data, whatever the source, is only useful after it has been
transformed into knowledge through analysis.
• Investigative analytics tools enable interactive, ad-hoc querying on complex big
data sets to identify patterns and insights and can perform analysis at massive
scale with precision even as machine-generated data grows beyond the
• With investigative analytics, companies can take action
In response to events in real-time and
Identify patterns to either capitalize on or
Prevent an event in the future.
• This is especially important because most failures result from a confluence of
multiple factors, not just a single red flag.
• To fully address the influx of M2M data generated by the increasingly connected
internet of things landscape, companies can deploy a range of technologies to
leverage distributed processing frameworks like Hadoop and NoSQL
and improve the performance of their analytics,
including enterprise data warehouses,
data visualization, and
business intelligence tools.
• These can be deployed in any combination of :
in the cloud.
FINDING RIGHT ANALYTICS DATABASE
• To find the right analytics database technology to capture, connect, and drive meaning from
data, companies should consider the following requirements:
Real-time Analysis: Businesses can’t afford for data to get stale. Data solutions need to:
load quickly and easily,
dynamically query, and
communicate M2M information in real time, without huge investments in IT administration, support,
Flexible Querying and Ad-hoc Reporting: When intelligence needs change quickly, analytic tools
cannot be constrained by data schemas that limit the number and
type of queries that can be performed.
This type of deeper analysis also cannot be constrained by tinkering or time-consuming manual
configuration (such as indexing and managing data partitions) to create and change analytic
Efficient Compression: Efficient data compression is key to enabling M2M data management within
a network node,
smart device, or
massive data center cluster.
Better compression allows
for less storage capacity overall,
as well as tighter data sampling and
longer historical data sets,
increasing the accuracy of query results.
Ease Of Use And Cost : Data analysis must be :
Simple to implement in order to justify the investment.
This demands low-touch solutions that are optimized to deliver :
Fast analysis of large volumes of data,
With minimal hardware,
Administrative effort, and
Customization needed to set up or
Change query and reporting parameters.
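To make the compression requirement above concrete, here is a minimal sketch of one common approach for slowly changing sensor readings: delta encoding followed by general-purpose compression. The choice of technique, the function names, and the synthetic readings are assumptions for illustration; the slides do not prescribe a specific method.

```python
import struct
import zlib


def delta_encode(values):
    """Keep the first value, then store only the change from one reading to the next."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original readings by accumulating the stored differences."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values):
    """Serialize integers as little-endian 32-bit values."""
    return struct.pack(f"<{len(values)}i", *values)


def unpack(blob):
    """Inverse of pack()."""
    return list(struct.unpack(f"<{len(blob) // 4}i", blob))


# Synthetic readings (hundredths of a degree) that drift slowly, as many sensors do.
readings = [6500 + (i // 10) for i in range(1000)]

raw = pack(readings)
compressed = zlib.compress(pack(delta_encode(readings)))
restored = delta_decode(unpack(zlib.decompress(compressed)))

print(len(raw), "bytes raw,", len(compressed), "bytes compressed")
```

Because consecutive readings differ very little, the delta stream is mostly zeros and small integers, which is the kind of input a general-purpose compressor handles well.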
EG 6 : UNION PACIFIC RAILROAD
• The railroad is using sensor and analytics technologies to predict and prevent train
• For example, the company has placed infrared sensors every 20 miles along its tracks to
gather 20 million temperature readings of train wheels each day to look for signs of
overheating, which can indicate impending failure.
• Meanwhile, trackside microphones are used to pick up “growling” bearings in the wheels.
• Data from such physical measurements are sent via fiber optic lines to Union Pacific’s data
• Complex pattern-matching algorithms and analytics are used to identify irregularities,
allowing Union Pacific experts to determine within minutes of capturing the data whether a
driver should pull a train over for inspection or reduce its speed until it reaches the next
station to be repaired.
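The slides do not describe Union Pacific's actual algorithms. As a rough illustration of the idea of spotting an overheating wheel, the sketch below flags any wheel that runs much hotter than the median of the other wheels on the same train. The function name, the 30-degree margin, and the sample readings are all hypothetical.

```python
from statistics import median


def flag_hot_wheels(readings, margin=30.0):
    """Return ids of wheels whose temperature exceeds the train's median by more than margin.

    readings maps a wheel id to its measured temperature. Both the margin and the
    median-baseline rule are illustrative assumptions, not the railroad's method.
    """
    if not readings:
        return []
    baseline = median(readings.values())
    return sorted(wheel for wheel, temp in readings.items() if temp - baseline > margin)


# Hypothetical infrared readings from one trackside sensor pass.
train = {
    "car1-w1": 95.0,
    "car1-w2": 97.5,
    "car2-w1": 96.0,
    "car2-w2": 141.0,  # noticeably hotter than its neighbours
    "car3-w1": 94.0,
}

hot = flag_hot_wheels(train)
print(hot)
```

Comparing against the train's own median, rather than a fixed limit, keeps the check meaningful across hot and cold weather.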
HOW TO ANALYZE MACHINE AND SENSOR DATA
• This tutorial describes how to refine data from heating, ventilation, and air conditioning (HVAC)
systems in 20 large buildings around the world using the Hortonworks Data Platform, and how
to analyze the refined sensor data to maintain optimal building temperatures.
• Sensor data
A sensor is a device that measures a physical quantity and transforms it into a digital signal.
Sensors are always on, capturing data at a low cost, and powering the “internet of things.”
• Potential uses of sensor data
Sensors can be used to collect data from many sources, such as:
To monitor machines or infrastructure such as ventilation equipment, bridges, energy meters, or
airplane engines. This data can be used for predictive analytics, to repair or replace these items
before they break.
To monitor natural phenomena such as meteorological patterns, underground pressure during oil
extraction, or patient vital statistics during recovery from a medical procedure.
Hortonworks Sandbox (installed and running)
Hortonworks ODBC driver (installed and configured)
Microsoft Excel 2013 Professional Plus
In this tutorial, the Hortonworks Sandbox is installed on an Oracle VirtualBox virtual
machine (VM); your screens may be different.
Install the ODBC driver that matches the version of Excel you are using (32-bit or 64-bit).
In this tutorial, use the Power View feature in Microsoft Excel 2013 to visualize the
sensor data. Power View is currently only available in Microsoft Office Professional
Plus and Microsoft Office 365 Professional Plus.
Note that other versions of Excel will work, but the visualizations will be limited to
charts. One can also connect to any other visualization tool of one's choice.
To refine and analyze HVAC sensor data:
Download and extract the sensor data files.
Load the sensor data into the Hortonworks Sandbox.
Run two Hive scripts to refine the sensor data.
Access the refined sensor data with Microsoft Excel.
Visualize the sensor data using Excel Power View.
STEP 1: DOWNLOAD AND EXTRACT THE SENSOR DATA FILES
• Download the sample sensor data contained in a compressed (.zip) folder from
• Save the sensorfiles.zip file to the computer, then extract the files. One should see a
sensorfiles folder that contains the following files:
hvac.csv – contains the targeted building temperatures, along with the actual (measured)
The building temperature data was obtained using Apache Flume.
Flume can be used as a log aggregator, collecting log data from many diverse sources and
moving it to a centralized data store.
In this case, Flume was used to capture the sensor log data, which we can now load into the
Hadoop Distributed File System (HDFS).
building.csv – contains the “building” database table.
Apache Sqoop can be used to transfer this type of data from a structured database into HDFS.
STEP 2: LOAD THE SENSOR DATA INTO THE HORTONWORKS SANDBOX
• Open the Sandbox Hue and click the HCatalog icon in the toolbar at the top of the page,
then click Create a new table from a file.
• On the “Create a new table from a file” page, type “hvac” in the Table Name box, then
click Choose a file under the Input File box.
• On the “Choose a file” pop-up, click Upload a file.
• Use the file upload dialog to browse to the sensorfiles folder that was
• Select the hvac.csv file, then click
• On the “Choose a file” pop-up, click the hvac.csv file.
• The default settings on the “Create a new table from a file” page are correct for this
file, so scroll down to the bottom of the page and click create
• A progress indicator appears while the table is being created.
• When the table has been created, it appears in the HCatalog table list.
• Repeat the previous steps to create a “building” table by uploading the building.csv file.
Now let’s take a look at the two data tables.
On the HCatalog table list page, select the check box
next to the “hvac” table, then click Browse Data.
One can see that the “hvac” table includes :
columns for date,
the target temperature,
the actual temperature,
the system identifier,
the system age, and
the building id.
• Navigate back to the HCatalog table list page.
• Select the check box next to the “building” table, then click Browse Data.
• One can see that the “building” table includes columns for the building
identifier, the building manager, the building age, the HVAC product in the
building, and the country in which the building is located.
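For readers who want to inspect the same files outside the Sandbox, the sketch below parses rows shaped like the “hvac” table described above. The CSV header names and the two sample rows are assumptions made for illustration; check the header row of the real hvac.csv before relying on them.

```python
import csv
import io
from typing import NamedTuple


class HvacReading(NamedTuple):
    """One row of the "hvac" table, following the columns listed in the slides."""
    date: str
    target_temp: int
    actual_temp: int
    system: int
    system_age: int
    building_id: int


# Hypothetical sample in the assumed layout; the header names are not taken from the source.
SAMPLE = (
    "date,targettemp,actualtemp,system,systemage,buildingid\n"
    "6/1/13,66,58,13,20,4\n"
    "6/2/13,69,68,3,20,17\n"
)


def load_hvac(text):
    """Parse CSV text into HvacReading records, converting numeric columns to int."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append(HvacReading(
            date=rec["date"],
            target_temp=int(rec["targettemp"]),
            actual_temp=int(rec["actualtemp"]),
            system=int(rec["system"]),
            system_age=int(rec["systemage"]),
            building_id=int(rec["buildingid"]),
        ))
    return rows


readings = load_hvac(SAMPLE)
print(readings[0])
```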
STEP 3: RUN TWO HIVE SCRIPTS TO REFINE THE SENSOR DATA
• Now use two hive scripts to refine the sensor data.
• We hope to accomplish three goals with this data :
Reduce heating and cooling expenses.
Keep indoor temperatures in a comfortable range of 65-70 degrees.
Identify which HVAC products are reliable, and replace unreliable equipment with
First, identify whether the actual temperature was more than five degrees different
from the target temperature. In the Sandbox Hue, click the Beeswax (Hive UI) icon in
the toolbar at the top of the page to display the Query Editor.
Paste the following script in the Query Editor box,
then click Execute:
To view the data generated by the script, click Tables in the menu at the top of the page,
select the checkbox next to hvac_temperatures, and then click
• On the Query Results page, scroll to the right. Two new attributes appear in the
hvac_temperatures table. The data in the “temprange” column indicates whether the actual
NORMAL – within 5 degrees of the target temperature.
COLD – more than 5 degrees colder than the target temperature.
HOT – more than 5 degrees warmer than the target temperature.
• If the temperature is outside of the normal range, “extremetemp” is assigned a value of 1;
otherwise its value is 0.
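The classification rule just described can be restated in a few lines of Python. This is a sketch of the rule as worded in the slides (NORMAL, COLD, HOT, and the extremetemp flag), not the Hive script itself; the function name and the tolerance parameter are illustrative.

```python
def classify(target, actual, tolerance=5):
    """Return (temprange, extremetemp) for one reading.

    A reading is HOT or COLD only when it differs from the target by MORE than
    the tolerance, so a difference of exactly 5 degrees still counts as NORMAL.
    """
    diff = actual - target
    if diff > tolerance:
        temprange = "HOT"
    elif diff < -tolerance:
        temprange = "COLD"
    else:
        temprange = "NORMAL"
    extremetemp = 0 if temprange == "NORMAL" else 1
    return temprange, extremetemp


print(classify(66, 58))  # eight degrees below target
print(classify(69, 68))  # one degree below target
```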
• Next, combine the “building” and “hvac_temperatures” data sets. In the Sandbox Hue, click the
Beeswax (Hive UI) icon in the toolbar at the top of the page to display the Query Editor.
• Paste the following script in the Query Editor box, then click Execute:
    CREATE TABLE IF NOT EXISTS hvac_building AS
    SELECT h.*, b.country, b.hvacproduct, b.buildingage, b.buildingmgr
    FROM building b JOIN hvac_temperatures h ON b.buildingid = h.buildingid;
• To view the data generated by the script, click Tables in the menu at the top of the page,
select the checkbox next to hvac_building, and then click
The hvac_building table is displayed on the Query Results page.
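The join that produces hvac_building can be mirrored in plain Python to see what it does row by row. This is a sketch: the column names follow the Hive script in the slides, while the sample rows and the function name are invented.

```python
def join_building(readings, buildings):
    """Inner-join readings to buildings on buildingid, like the hvac_building script."""
    by_id = {b["buildingid"]: b for b in buildings}
    joined = []
    for r in readings:
        b = by_id.get(r["buildingid"])
        if b is None:
            continue  # an inner join drops readings with no matching building
        joined.append({
            **r,
            "country": b["country"],
            "hvacproduct": b["hvacproduct"],
            "buildingage": b["buildingage"],
            "buildingmgr": b["buildingmgr"],
        })
    return joined


# Hypothetical rows for illustration only.
buildings = [
    {"buildingid": 4, "country": "Finland", "hvacproduct": "GG1919",
     "buildingage": 17, "buildingmgr": "M4"},
]
readings = [
    {"buildingid": 4, "temprange": "COLD", "extremetemp": 1},
    {"buildingid": 99, "temprange": "NORMAL", "extremetemp": 0},  # no such building
]

joined = join_building(readings, buildings)
print(joined)
```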
STEP 4: ACCESS THE REFINED SENSOR DATA WITH MICROSOFT EXCEL
• In this section, use Microsoft Excel Professional Plus 2013 to access the refined sensor data.
• In Windows, open a new Excel workbook, then select Data > From Other Sources > From
Microsoft Query.
• On the Choose Data Source pop-up, select the Hortonworks ODBC data source that was
installed previously, then click OK. The Hortonworks ODBC driver enables access to
Hortonworks data from Excel and other business intelligence (BI) applications that
• After the connection to the Sandbox is established, the Query Wizard appears. Select the
“hvac_building” table in the Available Tables and Columns box, then click the right arrow
button to add the entire “hvac_building” table to the query. Click Next to continue.
• On the Filter Data screen, click Next to continue without filtering the data.
• On the Sort Order screen, click Next to continue without setting a sort order.
• Click Finish on the Query Wizard Finish screen to retrieve the query data from the
Sandbox and import it into Excel.
• On the Import Data dialog box, click OK to accept the default settings and import the
data as a table.
• The imported query data appears in the Excel workbook.
STEP 5: VISUALIZE THE SENSOR DATA USING EXCEL POWER VIEW
• Now that the refined sensor data has been successfully imported into Microsoft Excel, one
can use the Excel Power View feature to analyze and visualize the data.
• Begin the data visualization by mapping the buildings that are most frequently outside of
the optimal temperature range.
• In the Excel worksheet with the imported “hvac_building” table, select Insert > Power View
to open a new Power View report.
• The Power View Fields area appears on the right side of the window, with the data table
displayed on the left. Drag the handles or click the pop out icon to maximize the size of
the data table.
• In the Power View Fields area, select the checkboxes next to the country and extremetemp
fields, and clear all of the other checkboxes. One may need to scroll down to see all of
the checkboxes.
• In the Fields box, click the down-arrow at the right of the extremetemp field, then select
Count (Not Blank).
• Click Map on the Design tab in the
• The map view displays a global view of the data.
• One can see that the office in Finland had 814 sensor readings where the
temperature was more than five degrees higher or lower than the target temperature.
• In contrast, the German office is doing a better job maintaining ideal office
temperatures, with only 363 readings outside of the ideal range.
• Hot offices can lead to employee complaints and reduced productivity.
• Let’s see which offices run hot.
• In the Power View Fields area, clear the extremetemp checkbox and select the
• Click the down-arrow at the right of the temprange field, then select add as
• Drag temprange from the Power View Fields area to the Filters box, then select
the HOT checkbox.
• One can see that the buildings in Finland and France run hot most often.
• Cold offices cause elevated energy expenditures and employee discomfort.
• In the Filters box, clear the HOT checkbox and select the COLD checkbox.
• One can see that the buildings in Finland and Indonesia run cold most often.
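The per-country counts shown on the map can also be reproduced without Excel. The sketch below counts readings flagged as extreme, optionally limited to HOT or COLD, which is an assumption about what the map is meant to show based on the wording above. The sample rows and the function name are hypothetical.

```python
from collections import Counter


def count_by_country(rows, temprange=None):
    """Count extreme readings per country, optionally limited to one temprange value."""
    counts = Counter()
    for r in rows:
        if r["extremetemp"] != 1:
            continue  # skip readings inside the normal range
        if temprange is not None and r["temprange"] != temprange:
            continue
        counts[r["country"]] += 1
    return counts


# Hypothetical rows from the hvac_building table.
rows = [
    {"country": "Finland", "temprange": "HOT", "extremetemp": 1},
    {"country": "Finland", "temprange": "COLD", "extremetemp": 1},
    {"country": "Germany", "temprange": "NORMAL", "extremetemp": 0},
    {"country": "France", "temprange": "HOT", "extremetemp": 1},
]

all_extreme = count_by_country(rows)
hot_only = count_by_country(rows, temprange="HOT")
print(all_extreme.most_common())
```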
• The data set includes information about the performance of five brands of HVAC
equipment, distributed across many types of buildings in a wide variety of climates.
• Use this data to assess the relative reliability of the different HVAC models.
• Open a new Excel worksheet, then select Data > From Other Sources > From Microsoft
Query to access the hvac_building table.
• Follow the same procedure as before to import the data, but this time only select the
“hvacproduct” and “extremetemp” columns.
• In the Excel worksheet with the imported “hvacproduct” and “extremetemp” columns, select
Insert > Power View to open a new Power View report.
• Click the pop out icon to maximize the size of the data table. In the Fields box, click
the down-arrow at the right of the extremetemp field, then select Count (Not Blank).
• Select Column Chart > Stacked Column in the top menu.
• Click the down-arrow next to sort by hvacproduct in the upper left corner of the chart
area, then select count of
• One can see that the GG1919 model seems to regulate temperature most reliably, whereas the
FN39TG failed to maintain the appropriate temperature range 9% more frequently than the
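One way to compare products on equal footing is to divide each product's extreme readings by its total readings, so that models installed in more buildings are not penalized for having more data. The sketch below shows that calculation; the sample rows are invented and the resulting rates are not the tutorial's figures.

```python
from collections import defaultdict


def extreme_rate_by_product(rows):
    """Return, per hvacproduct, the fraction of readings with extremetemp == 1."""
    totals = defaultdict(int)
    extremes = defaultdict(int)
    for r in rows:
        totals[r["hvacproduct"]] += 1
        extremes[r["hvacproduct"]] += r["extremetemp"]
    return {product: extremes[product] / totals[product] for product in totals}


# Hypothetical readings; the model names come from the slides, the values do not.
rows = [
    {"hvacproduct": "GG1919", "extremetemp": 0},
    {"hvacproduct": "GG1919", "extremetemp": 0},
    {"hvacproduct": "GG1919", "extremetemp": 0},
    {"hvacproduct": "GG1919", "extremetemp": 1},
    {"hvacproduct": "FN39TG", "extremetemp": 1},
    {"hvacproduct": "FN39TG", "extremetemp": 1},
    {"hvacproduct": "FN39TG", "extremetemp": 0},
    {"hvacproduct": "FN39TG", "extremetemp": 0},
]

rates = extreme_rate_by_product(rows)
for product, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"{product}: {rate:.0%} of readings outside the normal range")
```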
• This tutorial has shown how the Hortonworks Data Platform (HDP) can store and analyze
• With real-time access to massive amounts of temperature and other types of data on HDP, a
facilities department can initiate data-driven strategies to reduce energy expenditures and
improve employee