CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2

© 2015 IBM Corporation
June 2016
Modeling Water Problems using Zeppelin, Spark, R and System ML
P.S. “Arvind” Aravind
(psaravind@us.ibm.com)
IBM Analytics
(http://www-01.ibm.com/software/data/services/stampede.html)

Agenda
• Water Problems in Big Cities: Implications and Causes (10 Mins)
• Water Problems in New York City (10 Mins)
• Data Sources For Modeling Water Problems (10 Mins)
• Fast Data Analytics Technologies For Modeling Water Problems (10 Mins)
• Demonstration of Modeling Water Problems using New York City Data Sources (60 Mins)
• Q & A (15 Mins)

Water Efficiency Enhances People’s Lives
Water efficiency analytics enhances the performance of city water systems, improving the longevity of infrastructure while saving energy and reducing water loss.

What Are the Typical Reasons for Water-Related Problems?
• No Water
• Water Quality
  - Hardness (measure of dissolved calcium and magnesium)
  - Dirty, Cloudy or Milky, Rusty Brown Color
  - Contains particles, insects or worms
  - Contains grease, oil, or gasoline
  - Bad smell or taste
  - Chlorine odors
  - Fluoride levels
• Water Pressure High/Low
• Illness caused by drinking water
• Building/Neighborhood

How Can Analytics Help Address Water Problems?
• Correlating disparate data sources easily
• Supporting potential-cause analysis based on area, timing, building type, etc.
• Predicting areas and times with a higher potential for water problems
• Predicting sudden spikes ahead of time using near-real-time data
• Encouraging efficient water management
• Better service and improved quality of life

Water Complaints in New York City

Weekly Water Complaints in Manhattan for a Year

Data Source 1: Complaint Data
• Water complaints
• Latitude/longitude, address and zip code of each complaint
• Every incident is recorded with date and time
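The complaint records described above can be modeled as a small record type and rolled up into weekly counts, the kind of aggregation behind a weekly complaints chart. This is a minimal sketch only: the field names (`created`, `address`, `zip_code`, `latitude`, `longitude`) and the sample rows are hypothetical, since the slides do not give the actual column names of the complaint data set.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Complaint:
    # Hypothetical field names; the real data set's columns may differ.
    created: datetime
    address: str
    zip_code: str
    latitude: float
    longitude: float


def weekly_counts(complaints):
    """Count complaints per (ISO year, ISO week number)."""
    counts = Counter()
    for c in complaints:
        iso = c.created.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Made-up sample rows, for illustration only.
sample = [
    Complaint(datetime(2015, 1, 5, 9, 30), "1 Example St", "10001", 40.75, -73.99),
    Complaint(datetime(2015, 1, 7, 14, 0), "2 Example Ave", "10002", 40.71, -73.98),
    Complaint(datetime(2015, 1, 13, 8, 15), "1 Example St", "10001", 40.75, -73.99),
]
weekly = weekly_counts(sample)
```

Keying on the ISO year and week keeps weeks that straddle a year boundary in a single bucket.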

Data Source 2: Building Characteristics Data (PLUTO Data)

Data Source 3: American Community Survey Data
Column legend:
B25040_002  Total: Utility gas
B25040_003  Total: Bottled, tank, or LP gas
B25040_004  Total: Electricity
B25040_005  Total: Fuel oil, kerosene, etc.
B25040_006  Total: Coal or coke
B25040_007  Total: Wood
B25040_008  Total: Solar energy
B25040_009  Total: Other fuel
B25040_010  Total: No fuel used
B25040_002 B25040_003 B25040_004 B25040_005 B25040_006 B25040_007 B25040_008 B25040_009 B25040_010
56946717 5797150 40920801 7444637 133994 2398110 42747 501131 1041515
52246234 1472137 32148632 5132176 41908 556577 28394 277385 948678
4700483 4325013 8772169 2312461 92086 1841533 14353 223746 92837
90943 63166 94925 10909 1048 60246 259 5904 1343
20 39 85 196 0 20 0 2 0
493038 96026 330745 1233 69 25676 29 9012 1135
6779 2256 6898 731 74 2382 12 129 46
33368 1699 6403 33895 185 7972 7 417 104
77687 32142 257776 6680 53 3737 58 449 1401
69 187 1871 0 0 11 129 6 4985
54799584 4423476 38158739 6895313 109643 1749625 39312 439702 1021324
50567541 3153050 33523735 6047144 69061 1156049 32770 358224 881791
21853598 387343 13730688 1684305 8108 112594 12447 128261 456510
28713943 2765707 19793047 4362839 60953 1043455 20323 229963 425281
4232043 1270426 4635004 848169 40582 593576 6542 81478 139533
2203453 83860 1456649 125797 1674 42583 1130 14460 33945
2028590 1186566 3178355 722372 38908 550993 5412 67018 105588
2147133 1373674 2762062 549324 24351 648485 3435 61429 20191
6379176 2644100 7397066 1397493 64933 1242061 9977 142907 159724
10882812 731313 2693478 5880150 94396 500891 5363 135950 76454
17886422 2056053 4779587 433462 11483 600704 4597 175051 83946
13401826 2095157 25494496 852565 19222 582580 6491 78569 202137
14775657 914627 7953240 278460 8893 713935 26296 111561 678978
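Raw household counts like these are easier to compare across areas once each row is converted to shares. The following is a small sketch of that arithmetic, using the first data row of the table above. The function name `fuel_shares` is hypothetical, and treating each row as a set of counts that can be normalized is an assumption, since the row labels are not available in these slides.

```python
# Column order follows ACS codes B25040_002 .. B25040_010 in the table above.
FUELS = [
    "Utility gas", "Bottled, tank, or LP gas", "Electricity",
    "Fuel oil, kerosene, etc.", "Coal or coke", "Wood",
    "Solar energy", "Other fuel", "No fuel used",
]


def fuel_shares(counts):
    """Convert one row of counts into fractions that sum to 1."""
    if len(counts) != len(FUELS):
        raise ValueError("expected one count per fuel column")
    total = sum(counts)
    if total == 0:
        return {fuel: 0.0 for fuel in FUELS}
    return {fuel: n / total for fuel, n in zip(FUELS, counts)}


# First data row of the table above.
first_row = [56946717, 5797150, 40920801, 7444637, 133994,
             2398110, 42747, 501131, 1041515]
shares = fuel_shares(first_row)
dominant = max(shares, key=shares.get)
```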

Approach For Modeling
Step 1: Merge building addresses of water complaints for New York City
Step 4a: Model NYC 311 complaint counts to find possible relationships with water characteristics such as building location, zip code, etc.
Predict: Water complaint risk for zip codes
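The merge-then-count flow above can be sketched in plain Python. This is an illustration under stated assumptions, not the model used in the demonstration: the dictionary keys (`address`, `zip_code`, `year_built`), the helper names, and the sample rows are all hypothetical, and ranking zip codes by raw complaint count is only a naive stand-in for a fitted risk model.

```python
from collections import Counter


def merge_on_address(complaints, buildings):
    """Attach building attributes to each complaint with a matching address."""
    by_address = {b["address"].strip().upper(): b for b in buildings}
    merged = []
    for c in complaints:
        b = by_address.get(c["address"].strip().upper())
        if b is not None:
            # Complaint fields take precedence over building fields.
            merged.append({**b, **c})
    return merged


def rank_zip_codes(rows):
    """Rank zip codes by complaint count, highest first (a naive risk proxy)."""
    counts = Counter(r["zip_code"] for r in rows)
    return [zip_code for zip_code, _ in counts.most_common()]


# Made-up sample rows, for illustration only.
complaints = [
    {"address": "1 Example St", "zip_code": "10001"},
    {"address": "1 example st ", "zip_code": "10001"},
    {"address": "2 Example Ave", "zip_code": "10002"},
    {"address": "9 Unknown Rd", "zip_code": "10003"},
]
buildings = [
    {"address": "1 EXAMPLE ST", "year_built": 1930},
    {"address": "2 Example Ave", "year_built": 1985},
]
merged = merge_on_address(complaints, buildings)
ranking = rank_zip_codes(merged)
```

Addresses are normalized (trimmed, upper-cased) before matching because free-text addresses from different sources rarely agree exactly. Real address matching usually needs more than this.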

Fast Data Analytics Stack for Developing Analytics in an Agile and Iterative Way
Applications: Notebook/Workbench, Batch Processes/Workflows, Reports, Online Applications
Fast Data Analytics Stack
• APIs: Java, SQL, Scala, R, Python
• Application Component Frameworks
  - SQL Components: Spark SQL
  - Streaming Components: Spark Streaming
  - Modeling Components: SystemML, SparkR, Spark MLlib, Spark ML Pipeline
  - Graph Components: Spark GraphX
• Distributed Processing Frameworks: Spark Core
• Resource Management: Mesos, YARN, Standalone
• Persistence Storage: HDFS, SAN, S3, Swift, Others

Technology Stack Used for This Demonstration
Applications: Zeppelin Notebook, Batch Processes/Workflows, Reports, Online Applications
Fast Data Analytics Stack
• APIs: Java, SQL, Scala, R, Python
• Application Component Frameworks
  - SQL Components: Spark SQL
  - Streaming Components: Spark Streaming
  - Modeling Components: SystemML, SparkR, Spark MLlib, Spark ML Pipeline
  - Graph Components: Spark GraphX
• Distributed Processing Frameworks: Spark Core
• Resource Management: Mesos, YARN, Standalone
• Persistence Storage: HDFS, SAN, S3, Swift, Others

Apache Spark
Spark Core
Spark SQL: Unified Data Access (fast, familiar query language for all data)
• Query your structured data sets with SQL or other DataFrame APIs
• Data mining, BI, and insight discovery
• Get results faster due to performance
Spark Streaming: Stream Processing (near real-time data processing & analytics)
• Micro-batch event processing for near real-time analytics
• Process live streams of data (IoT, Twitter, Kafka)
• No multi-threading or parallel processing required
MLlib (machine learning): Machine Learning (fast and easy to deploy algorithms)
• Predictive and prescriptive analytics, and smart application design, from statistical and algorithmic models
• Algorithms are pre-built
GraphX: Graph Analytics (fast and integrated graph computation)
• Represent data in a graph
• Represent/analyze systems represented by nodes and interconnections between them
SparkR: Support for R (data processing & machine learning using R syntax)
• Explore and analyze data using R syntax
• SQL-like syntax using R
• Machine learning (using MLlib)
Log processing: TBD
R
• An interpreted language
• Open-source implementation of the S language (1976)
• Best suited for statistical analysis and modeling
  - Data exploration and manipulation
  - Descriptive statistics
  - Predictive analytics and machine learning
  - Visualization
• Strengths (+++)
  - Can produce “publication quality graphics”
  - State of the art algorithms
  - Statistical researchers often provide their methods as R packages
  - New techniques available without delay
  - Commercial packages usually behind the curve
  - 4700+ packages as of today
  - Active and vibrant user community

Apache Zeppelin
• A web-based notebook
• Supports multiple programming paradigms within a single notebook instance
• Dynamic form generation
• Supports other technologies apart from Spark: Flink, Hive, etc.
• We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)

Apache SystemML
• A distributed machine learning platform
• Supports multiple runtimes: Spark, Hadoop MR, single node
• Users can create custom algorithms using R-like syntax
• Automatically optimizes algorithm performance depending on the choice of platform

IOP: IBM Open Platform with Apache Hadoop
• Fully open source distribution of a Big Data platform
• Packages different Big Data technologies within a single distribution: Hadoop, Spark, Kafka, Solr, etc.
• Compliant with the industry standard for Big Data platforms, ODPi: The Open Ecosystem of Big Data (https://www.odpi.org/)

BigInsights for Apache Hadoop in IBM Bluemix
• Managed cloud service with 100% open source Apache Hadoop through the IBM Open Platform.
• Includes Ambari, YARN, Spark, Knox, HBase, Hive, Solr, and an encrypted HDFS
• High-value Hadoop analytics features such as Big SQL, BigSheets, Text Analytics, Big R, and Machine Learning to gain insight faster.
• Key components of the platform, including the infrastructure, are proactively monitored by a 24x7 cloud operations team.
• Free trial available for 30 days

Technology Specific Details
• Spark
  - Spark 1.5.2
  - Spark Core, Spark MLlib, Spark SQL, SparkR
• Apache SystemML
  - SystemML 0.9
  - Running on Spark
• Zeppelin
  - Zeppelin 0.5.5
  - We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)
• IBM Open Platform with Apache Hadoop (IOP)
  - IOP 4.1
  - HDFS, YARN, Hive Metastore and Spark
• IBM Bluemix
  - BigInsights for Apache Hadoop in IBM Bluemix
  - A cluster with 3 Data Nodes and 4 Management Nodes
The Overall Deployment Architecture with Key Components
• User Browser
• IBM Bluemix (IOP)
  - Edge Management Nodes: Zeppelin Process, Spark Driver, R, System ML
  - Management Nodes: Yarn Master, HDFS Name Node, Hive Server 2, Hive Meta Store
  - Slave Nodes (three shown), each running: Spark Slave Process, HDFS Data Node, Yarn
