© 2015 IBM Corporation
June 2016
Modeling Water Problems using Zeppelin,
Spark, R and System ML
P.S. “Arvind” Aravind
(psaravind@us.ibm.com)
IBM Analytics
(http://www-01.ibm.com/software/data/services/stampede.html)
Agenda
• Water Problems in Big Cities: Implications and Causes (10 Mins)
• Water Problems in New York City (10 Mins)
• Data Sources For Modeling Water Problems (10 Mins)
• Fast Data Analytics Technologies For Modeling Water Problems (10 Mins)
• Demonstration of Modeling Water Problems using New York City Data Sources (60 Mins)
• Q & A (15 Mins)
Water Efficiency Enhances People’s Lives
Water efficiency analytics enhances the performance of city water systems, improving the longevity of infrastructure while saving energy and reducing water loss.
What Are the Typical Reasons for Water-Related Problems?
• No Water
• Water Quality
  - Hardness (a measure of dissolved calcium and magnesium)
  - Dirty, Cloudy or Milky, Rusty Brown Color
  - Contains particles, insects or worms
  - Contains grease, oil, or gasoline
  - Bad smell or taste
  - Chlorine odors
  - Fluoride levels
• Water Pressure High/Low
• Illness caused by drinking water
• Building/Neighborhood
How Can Analytics Help in Addressing Water Problems?
• Correlating disparate data sources easily
• Supporting potential cause analysis based on area, timing, building types, etc.
• Predicting areas and times with a higher potential for water problems
• Predicting sudden spikes ahead of time using near real-time data
• Encouraging more efficient water management
• Better service and an improved quality of life for residents
Water Complaints in New York City
Weekly Water Complaints in Manhattan for a Year
Data Source 1: Complaint Data
• Water complaints
• Lat/long, address and zip code of each complaint
• Every incident is recorded with date and time
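A minimal sketch of how timestamped complaint records like these could be rolled up into weekly counts per zip code. The field names (`created`, `zip_code`, `lat`, `lon`) and the sample rows are hypothetical and are not taken from the deck's dataset; plain Python stands in here for the Spark jobs used in the demonstration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical complaint records; field names and values are illustrative only.
complaints = [
    {"created": "2015-01-05 08:30:00", "zip_code": "10001", "lat": 40.7506, "lon": -73.9972},
    {"created": "2015-01-07 14:10:00", "zip_code": "10001", "lat": 40.7510, "lon": -73.9965},
    {"created": "2015-01-13 09:00:00", "zip_code": "10002", "lat": 40.7170, "lon": -73.9870},
]

def weekly_counts(records):
    """Count complaints per (ISO year, ISO week, zip code)."""
    counts = Counter()
    for rec in records:
        ts = datetime.strptime(rec["created"], "%Y-%m-%d %H:%M:%S")
        iso = ts.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1], rec["zip_code"])] += 1
    return counts

counts = weekly_counts(complaints)
for key in sorted(counts):
    print(key, counts[key])
```

Keying on the ISO year and week keeps weeks that straddle a year boundary from being split, which matters when plotting a full year of weekly counts.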
Data Source 2: Building Characteristics Data (PLUTO Data)
Data Source 3: American Community Survey Data
B25040_002 B25040_003 B25040_004 B25040_005 B25040_006 B25040_007 B25040_008 B25040_009 B25040_010
Total:% Utility gas | Total:% Bottled, tank, or LP gas | Total:% Electricity | Total:% Fuel oil, kerosene, etc. | Total:% Coal or coke | Total:% Wood | Total:% Solar energy | Total:% Other fuel | Total:% No fuel used
56946717 5797150 40920801 7444637 133994 2398110 42747 501131 1041515
52246234 1472137 32148632 5132176 41908 556577 28394 277385 948678
4700483 4325013 8772169 2312461 92086 1841533 14353 223746 92837
90943 63166 94925 10909 1048 60246 259 5904 1343
20 39 85 196 0 20 0 2 0
493038 96026 330745 1233 69 25676 29 9012 1135
6779 2256 6898 731 74 2382 12 129 46
33368 1699 6403 33895 185 7972 7 417 104
77687 32142 257776 6680 53 3737 58 449 1401
69 187 1871 0 0 11 129 6 4985
54799584 4423476 38158739 6895313 109643 1749625 39312 439702 1021324
50567541 3153050 33523735 6047144 69061 1156049 32770 358224 881791
21853598 387343 13730688 1684305 8108 112594 12447 128261 456510
28713943 2765707 19793047 4362839 60953 1043455 20323 229963 425281
4232043 1270426 4635004 848169 40582 593576 6542 81478 139533
2203453 83860 1456649 125797 1674 42583 1130 14460 33945
2028590 1186566 3178355 722372 38908 550993 5412 67018 105588
2147133 1373674 2762062 549324 24351 648485 3435 61429 20191
6379176 2644100 7397066 1397493 64933 1242061 9977 142907 159724
10882812 731313 2693478 5880150 94396 500891 5363 135950 76454
17886422 2056053 4779587 433462 11483 600704 4597 175051 83946
13401826 2095157 25494496 852565 19222 582580 6491 78569 202137
14775657 914627 7953240 278460 8893 713935 26296 111561 678978
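As a small worked example, the counts in one row of the table can be turned into shares per heating fuel. The sketch below uses the first data row shown above; the slide does not say which geography that row represents, and the short fuel labels are abbreviations of the column headings. The shares are computed only across the nine categories listed, which is an assumption about the appropriate denominator.

```python
# Column labels follow the B25040_002 .. B25040_010 headings above.
fuels = [
    "Utility gas", "Bottled, tank, or LP gas", "Electricity",
    "Fuel oil, kerosene, etc.", "Coal or coke", "Wood",
    "Solar energy", "Other fuel", "No fuel used",
]

# First data row of the table (counts, in column order).
row = [56946717, 5797150, 40920801, 7444637, 133994, 2398110, 42747, 501131, 1041515]

total = sum(row)
# Share of each fuel among the nine listed categories.
shares = {fuel: count / total for fuel, count in zip(fuels, row)}

for fuel, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{fuel:28s} {share:6.1%}")
```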
Approach For Modeling
Step 1: Merge building addresses of water complaints for NY City
Step 4a: Model complaint counts of NYC 311 to find possible relationships with water characteristics such as building location, zip code, etc.
Predict: Water complaint risk for zip codes
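The merge-then-score idea in these steps can be sketched as follows. This is only a simple baseline under stated assumptions, not the model used in the demonstration: the zip codes, building counts, and the function name `complaint_rates` are all hypothetical, and a complaints-per-1,000-buildings rate stands in for a fitted risk model.

```python
from collections import Counter

# Hypothetical inputs: one zip code per complaint, and building counts per zip code.
complaint_zips = ["10001", "10001", "10002", "10003", "10001", "10003"]
buildings_per_zip = {"10001": 1500, "10002": 4000, "10003": 500}

def complaint_rates(zips, buildings, per=1000):
    """Merge complaint counts with building counts.

    Returns complaints per `per` buildings for each zip code that has buildings.
    """
    counts = Counter(zips)
    return {
        z: counts.get(z, 0) * per / n
        for z, n in buildings.items()
        if n > 0
    }

rates = complaint_rates(complaint_zips, buildings_per_zip)
# Rank zip codes from highest to lowest complaint rate.
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

Normalizing by building count keeps densely built zip codes from dominating the ranking purely because they contain more addresses.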
Fast Data Analytics Stack for Developing Analytics in an Agile and Iterative Way
[Diagram: Fast Data Analytics Stack]
• Notebook/Workbench, Batch Processes/Workflows, Reports, Online Applications
• APIs: Java, SQL, Scala, R, Python
• Application Component Frameworks
  - SQL Components: Spark SQL
  - Streaming Components: Spark Streaming
  - Modeling Components: System ML, Spark R, Spark MLLib, Spark ML Pipeline
  - Graph Components: Spark GraphX
• Distributed Processing Frameworks: Spark Core
• Resource Management: Mesos, Yarn, Standalone
• Persistence Storage: HDFS, SAN, S3, Swift, Others
Technology Stack Used for This Demonstration
[Diagram: Fast Data Analytics Stack]
• Zeppelin Notebook, Batch Processes/Workflows, Reports, Online Applications
• APIs: Java, SQL, Scala, R, Python
• Application Component Frameworks
  - SQL Components: Spark SQL
  - Streaming Components: Spark Streaming
  - Modeling Components: System ML, Spark R, Spark MLLib, Spark ML Pipeline
  - Graph Components: Spark GraphX
• Distributed Processing Frameworks: Spark Core
• Resource Management: Mesos, Yarn, Standalone
• Persistence Storage: HDFS, SAN, S3, Swift, Others
Apache Spark
Spark Core underlies the following components:

Spark SQL: Unified Data Access (fast, familiar query language for all data)
• Query your structured data sets with SQL or other dataframe APIs
• Data mining, BI, and insight discovery
• Get results faster due to performance

Spark Streaming: Stream Processing (near real-time data processing & analytics)
• Micro-batch event processing for near real-time analytics
• Process live streams of data (IoT, Twitter, Kafka)
• No multi-threading or parallel processing required

MLlib (machine learning): Machine Learning (fast and easy to deploy algorithms)
• Predictive and prescriptive analytics, and smart application design, from statistical and algorithmic models
• Algorithms are pre-built

GraphX: Graph Analytics (fast and integrated graph computation)
• Represent data in a graph
• Represent/analyze systems represented by nodes and interconnections between them

SparkR: Support for R (data processing & machine learning using R syntax)
• Explore and analyze data using R syntax
• SQL-like syntax using R
• Machine learning (using MLlib)

Log processing: TBD
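The micro-batch idea mentioned under stream processing can be illustrated in a few lines: incoming events are grouped into fixed-width time windows, and each window is then processed as a small batch. This is a plain-Python illustration of the concept only, not Spark Streaming's API; the function name `micro_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, payload) events into fixed-width time batches.

    Returns a list of (batch_start, payloads) tuples in time order.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Align each event to the start of the window it falls in.
        batch_start = (ts // batch_seconds) * batch_seconds
        batches[batch_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]

# Hypothetical event stream: (seconds since start, payload).
events = [(0, "a"), (3, "b"), (10, "c"), (12, "d"), (25, "e")]
result = micro_batches(events, batch_seconds=10)
for start, payloads in result:
    print(start, len(payloads))
```

A shorter batch interval lowers latency at the cost of more, smaller batches to schedule, which is the usual trade-off in a micro-batch design.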
R
• An interpreted language
• Open-source implementation of the S language (1976)
• Best suited for statistical analysis and modeling
  - Data exploration and manipulation
  - Descriptive statistics
  - Predictive analytics and machine learning
  - Visualization
  - And more
• Can produce “publication quality graphics”
• State-of-the-art algorithms
  - Statistical researchers often provide their methods as R packages
  - New techniques available without delay
  - Commercial packages usually behind the curve
• 4700+ packages as of today
• Active and vibrant user community
Apache Zeppelin
• A web-based notebook
• Supports multiple programming paradigms within a single notebook instance
• Dynamic form generation
• Supports other technologies apart from Spark: Flink, Hive, etc.
• We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)
Apache SystemML
• A distributed machine learning platform
• Supports multiple runtimes: Spark, Hadoop MR, Single Node
• Users can create custom algorithms using R-like syntax
• Optimizes the performance of algorithms automatically depending on the choice of platform
IOP – IBM Open Platform with Apache Hadoop
• Fully open source distribution of a Big Data platform
• Packages different types of Big Data technologies within a single distribution: Hadoop, Spark, Kafka, Solr, etc.
• Compliant with the industry standard for Big Data platforms, ODPi: The Open Ecosystem of Big Data (https://www.odpi.org/)
BigInsights for Apache Hadoop in IBM Bluemix
• Managed cloud service with 100% open source Apache Hadoop through the IBM Open Platform.
• Includes Ambari, YARN, Spark, Knox, HBase, Hive, Solr, and an encrypted HDFS
• High value Hadoop analytics features such as Big SQL, BigSheets, Text Analytics, Big R, and Machine Learning to gain insight faster.
• Key components of the platform, including the infrastructure, are proactively monitored by a 24x7 cloud operations team.
• Free trial available for 30 days
Technology Specific Details
• Spark
  - Spark 1.5.2
  - Spark Core, Spark MLlib, Spark SQL, SparkR
• Apache SystemML
  - SystemML 0.9
  - Running on Spark
• Zeppelin
  - Zeppelin 0.5.5
  - We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)
• IBM Open Platform with Apache Hadoop (IOP)
  - IOP 4.1
  - HDFS, Yarn, Hive Metastore and Spark
• IBM Bluemix
  - BigInsights for Apache Hadoop in IBM Bluemix
  - A cluster with 3 Data Nodes and 4 Management Nodes
The Overall Deployment Architecture with Key Components
[Diagram: IOP cluster on IBM Bluemix, accessed from the user's browser]
• Slave Nodes (three shown), each running: Spark Slave Process, HDFS Data Node, Yarn
• Management Nodes: Yarn Master, HDFS Name Node, Hive Server 2, Hive Meta Store
• Edge Management Nodes: Zeppelin Process, Spark Driver, R, System ML
• User Browser
Thank You
