CognitiveAnalyticsWithSparkAndZeppelinMeetup-v0.2
- 1. © 2015 IBM Corporation
June 2016
Modeling Water Problems using Zeppelin,
Spark, R and System ML
P.S. “Arvind” Aravind
(psaravind@us.ibm.com)
IBM Analytics
(http://www-01.ibm.com/software/data/services/stampede.html)
- 2. Agenda
Water Problems in Big Cities – Implications and Causes (10 Mins)
Water Problems in New York City (10 Mins)
Data Sources For Modeling Water Problems (10 Mins)
Fast Data Analytics Technologies For Modeling Water Problems (10 Mins)
Demonstration of Modeling Water Problems using New York City Data Sources (60 Mins)
Q & A (15 Mins)
- 4. Water Efficiency Enhances People’s Lives
Water efficiency analytics enhances the performance of city water systems, improving the longevity of infrastructure while saving energy and reducing water loss.
- 5. What Are the Typical Reasons for Water-Related Problems?
No Water
Water Quality
Hardness (measure of dissolved calcium and magnesium)
Dirty, Cloudy or Milky, Rusty Brown Color
Contains particles, insects or worms
Contains grease, oil, or gasoline
Bad smell or taste
Chlorine odors
Fluoride levels
Water Pressure High/Low
Illness caused by drinking water
Building/Neighborhood
- 6. How Can Analytics Help Address Water Problems?
Correlating disparate data sources easily
Supporting potential-cause analysis based on area, timing, building types, etc.
Predicting areas and times with a higher potential for water problems
Predicting sudden spikes ahead of time using near-real-time data
Encouraging more efficient water management
Better service and improved quality of life
- 8. Water Complaints in New York City
- 9. Weekly Water Complaints in Manhattan for a Year
- 11. Data Source 1: Complaint Data
Water Complaints
Lat/Long, address, and zip code of each complaint
Every incident is recorded with date and time
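As a rough illustration of how records like these can feed the weekly complaint chart shown earlier, the sketch below groups timestamped complaints by ISO week and zip code. The field names (`created`, `zip`, `lat`, `lon`) and the sample rows are made up for the example; they are not the actual schema of the complaint data set.

```python
from collections import Counter
from datetime import datetime

# Hypothetical complaint records: field names and values are illustrative
# only, not the real complaint data schema.
complaints = [
    {"created": "2015-01-05 08:15:00", "zip": "10001", "lat": 40.750, "lon": -73.997},
    {"created": "2015-01-06 17:40:00", "zip": "10001", "lat": 40.751, "lon": -73.996},
    {"created": "2015-01-13 09:05:00", "zip": "10002", "lat": 40.716, "lon": -73.986},
    {"created": "2015-01-14 11:30:00", "zip": "10001", "lat": 40.749, "lon": -73.998},
]

def weekly_counts(records):
    """Count complaints per (ISO year, ISO week, zip code)."""
    counts = Counter()
    for rec in records:
        ts = datetime.strptime(rec["created"], "%Y-%m-%d %H:%M:%S")
        iso_year, iso_week, _ = ts.isocalendar()
        counts[(iso_year, iso_week, rec["zip"])] += 1
    return counts

counts = weekly_counts(complaints)
for key in sorted(counts):
    print(key, counts[key])
```

Keying on the ISO week keeps every bucket exactly seven days long, which makes week-over-week comparisons straightforward.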
- 12. Data Source 2: Building Characteristics Data (PLUTO Data)
- 13. Data Source 3: American Community Survey Data
B25040_002 B25040_003 B25040_004 B25040_005 B25040_006 B25040_007 B25040_008 B25040_009 B25040_010
Total:% Utility gas | Total:% Bottled, tank, or LP gas | Total:% Electricity | Total:% Fuel oil, kerosene, etc. | Total:% Coal or coke | Total:% Wood | Total:% Solar energy | Total:% Other fuel | Total:% No fuel used
56946717 5797150 40920801 7444637 133994 2398110 42747 501131 1041515
52246234 1472137 32148632 5132176 41908 556577 28394 277385 948678
4700483 4325013 8772169 2312461 92086 1841533 14353 223746 92837
90943 63166 94925 10909 1048 60246 259 5904 1343
20 39 85 196 0 20 0 2 0
493038 96026 330745 1233 69 25676 29 9012 1135
6779 2256 6898 731 74 2382 12 129 46
33368 1699 6403 33895 185 7972 7 417 104
77687 32142 257776 6680 53 3737 58 449 1401
69 187 1871 0 0 11 129 6 4985
54799584 4423476 38158739 6895313 109643 1749625 39312 439702 1021324
50567541 3153050 33523735 6047144 69061 1156049 32770 358224 881791
21853598 387343 13730688 1684305 8108 112594 12447 128261 456510
28713943 2765707 19793047 4362839 60953 1043455 20323 229963 425281
4232043 1270426 4635004 848169 40582 593576 6542 81478 139533
2203453 83860 1456649 125797 1674 42583 1130 14460 33945
2028590 1186566 3178355 722372 38908 550993 5412 67018 105588
2147133 1373674 2762062 549324 24351 648485 3435 61429 20191
6379176 2644100 7397066 1397493 64933 1242061 9977 142907 159724
10882812 731313 2693478 5880150 94396 500891 5363 135950 76454
17886422 2056053 4779587 433462 11483 600704 4597 175051 83946
13401826 2095157 25494496 852565 19222 582580 6491 78569 202137
14775657 914627 7953240 278460 8893 713935 26296 111561 678978
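The column codes belong to ACS table B25040 (house heating fuel). The slide does not label the rows, so the sketch below assumes each row holds household counts per fuel type for one geography. Under that assumption, one simple way to make rows of very different size comparable is to convert each row into shares of its own total. The function name `fuel_shares` is hypothetical.

```python
# Column labels as they appear on the slide, in column order.
FUELS = [
    "Utility gas", "Bottled, tank, or LP gas", "Electricity",
    "Fuel oil, kerosene, etc.", "Coal or coke", "Wood",
    "Solar energy", "Other fuel", "No fuel used",
]

# First data row of the table above (assumed to be household counts).
first_row = [56946717, 5797150, 40920801, 7444637, 133994,
             2398110, 42747, 501131, 1041515]

def fuel_shares(counts):
    """Convert raw counts into each category's share of the row total."""
    total = sum(counts)
    return [c / total for c in counts]

shares = dict(zip(FUELS, fuel_shares(first_row)))
for fuel, share in shares.items():
    print(f"{fuel}: {share:.1%}")
```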
- 14. Approach For Modeling
Step 1: Merge building addresses of water complaints for New York City
Step 4a: Model complaint counts of NYC 311 to find possible relationships with water characteristics such as building location, zip code, etc.
Predict: Water complaint risk for zip codes
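The merge-then-model flow can be sketched in plain Python. This is only an illustration under stated assumptions: the sample zip codes and addresses are invented, and a complaints-per-building rate stands in for the risk score. It is a crude proxy, not the model built in the demonstration.

```python
from collections import Counter

# Hypothetical inputs: one zip code per complaint, and building records
# that carry a zip code. All values are made up for illustration.
complaint_zips = ["10001", "10001", "10002", "10001", "10003", "10002"]
buildings = [
    {"zip": "10001", "address": "1 Example St"},
    {"zip": "10001", "address": "2 Example St"},
    {"zip": "10002", "address": "3 Example St"},
    {"zip": "10003", "address": "4 Example St"},
    {"zip": "10003", "address": "5 Example St"},
    {"zip": "10003", "address": "6 Example St"},
    {"zip": "10003", "address": "7 Example St"},
]

def complaint_rate_by_zip(complaint_zips, buildings):
    """Join complaints to buildings on zip code and return complaints per building."""
    complaints = Counter(complaint_zips)
    building_counts = Counter(b["zip"] for b in buildings)
    return {z: complaints.get(z, 0) / n for z, n in building_counts.items()}

rates = complaint_rate_by_zip(complaint_zips, buildings)
# Zip codes ordered from highest to lowest complaint rate.
ranked = sorted(rates, key=rates.get, reverse=True)
print(ranked)
```

Normalizing by building count keeps a zip code with many buildings from ranking high on volume alone.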
- 16. Fast Data Analytics Stack for developing Analytics in an Agile and iterative way
Fast Data Analytics Stack
Notebook/Workbench, Batch Processes/Workflows, Reports, Online Applications
APIs: Java, SQL, Scala, R, Python
Application Component Frameworks
SQL Components: Spark SQL
Streaming Components: Spark Streaming
Modeling Components: System ML, Spark R, Spark MLlib, Spark ML Pipeline
Graph Components: Spark GraphX
Distributed Processing Frameworks: Spark Core
Resource Management: Mesos, Yarn, Standalone
Persistence Storage: HDFS, SAN, S3, Swift, Others
- 17. Technology Stack Used for This Demonstration
Fast Data Analytics Stack
Zeppelin Notebook, Batch Processes/Workflows, Reports, Online Applications
APIs: Java, SQL, Scala, R, Python
Application Component Frameworks
SQL Components: Spark SQL
Streaming Components: Spark Streaming
Modeling Components: System ML, Spark R, Spark MLlib, Spark ML Pipeline
Graph Components: Spark GraphX
Distributed Processing Frameworks: Spark Core
Resource Management: Mesos, Yarn, Standalone
Persistence Storage: HDFS, SAN, S3, Swift, Others
- 18. Apache Spark
Log processing TBD
Spark Core
Spark SQL (Unified Data Access): Fast, familiar query language for all data
• Query your structured data sets with SQL or other dataframe APIs
• Data mining, BI, and insight discovery
• Get results faster due to performance
Spark Streaming (Stream Processing): Near real-time data processing & analytics
• Micro-batch event processing for near real-time analytics
• Process live streams of data (IoT, Twitter, Kafka)
• No multi-threading or parallel processing required
MLlib (Machine Learning): Fast and easy to deploy algorithms
• Predictive and prescriptive analytics, and smart application design, from statistical and algorithmic models
• Algorithms are pre-built
GraphX (Graph Analytics): Fast and integrated graph computation
• Represent data in a graph
• Represent/analyze systems represented by nodes and interconnections between them
SparkR (Support for R): Data Processing & Machine Learning using R syntax
• Explore and analyze data using R syntax
• SQL-like syntax using R
• Machine learning (using MLlib)
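To make the micro-batch idea concrete, here is a minimal plain-Python sketch. It does not use the Spark Streaming API; it only imitates the concept of cutting a stream of events into fixed-length time windows that are then processed as small batches. The function name `micro_batches` and the sample events are hypothetical.

```python
from collections import defaultdict

def micro_batches(events, batch_seconds):
    """Group (timestamp_seconds, payload) events into fixed-length batches.

    Returns a list of (batch_start, payloads) pairs ordered by batch start.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Every event falls into the window that starts at a multiple
        # of batch_seconds.
        batch_start = int(ts // batch_seconds) * batch_seconds
        batches[batch_start].append(payload)
    return [(start, batches[start]) for start in sorted(batches)]

# Illustrative event stream: (seconds since start, complaint type).
events = [
    (0.5, "low pressure"), (1.2, "no water"), (5.1, "cloudy water"),
    (9.9, "no water"), (10.0, "low pressure"),
]

result = micro_batches(events, batch_seconds=5)
for start, payloads in result:
    print(start, len(payloads), payloads)
```

A shorter batch interval lowers latency at the cost of more, smaller batches to schedule.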
- 19. R
An interpreted language
Open-source implementation of the S language (1976)
Best suited for statistical analysis and modeling
Data exploration and manipulation
Descriptive statistics
Predictive analytics and machine learning
Visualization
+++
Can produce “publication quality graphics”
State of the art algorithms
Statistical researchers often provide their methods as R packages
New techniques available without delay
Commercial packages usually behind the curve
4700+ packages as of today
Active and vibrant user community
- 20. Apache Zeppelin
A web-based notebook
Supports multiple programming paradigms within a single notebook instance
Dynamic form generation
Supports other technologies apart from Spark: Flink, Hive, etc.
We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)
- 21. Apache SystemML
A distributed machine learning platform
Supports multiple runtimes: Spark, Hadoop MR, single node
Users can create custom algorithms using R-like syntax
Optimizes the performance of algorithms automatically depending on the choice of platform
- 22. IOP – IBM Open Platform with Apache Hadoop
Fully open source distribution of a Big Data platform
Packages different types of Big Data technologies within a single distribution: Hadoop, Spark, Kafka, Solr, etc.
Compliant with the industry standard for Big Data platforms, ODPi: The Open Ecosystem of Big Data (https://www.odpi.org/)
- 23. BigInsights for Apache Hadoop in IBM Bluemix
Managed cloud service with 100% open source Apache Hadoop through the IBM Open Platform.
Includes Ambari, YARN, Spark, Knox, HBase, Hive, Solr, and an encrypted HDFS.
High-value Hadoop analytics features such as Big SQL, BigSheets, Text Analytics, Big R, and Machine Learning to gain insight faster.
Key components of the platform, including the infrastructure, are proactively monitored by a 24x7 cloud operations team.
Free trial available for 30 days
- 24. Technology Specific Details
Spark
Spark 1.5.2
Spark Core, Spark MLlib, Spark SQL, SparkR
Apache SystemML 0.9
Running on Spark
Zeppelin
Zeppelin 0.5.5
We’ll use PR 208 of Zeppelin for the R interpreter (https://github.com/apache/incubator-zeppelin/pull/208)
IBM Open Platform with Apache Hadoop (IOP)
IOP 4.1
HDFS, Yarn, Hive Metastore and Spark
IBM Bluemix
BigInsights for Apache Hadoop in IBM Bluemix
A cluster with 3 Data Nodes and 4 Management Nodes
- 25. The Overall Deployment Architecture with Key Components
[Architecture diagram]
User Browser
IBM Bluemix, IOP
Edge Management Nodes: Zeppelin Process, Spark Driver, R, System ML
Management Nodes: Yarn Master, HDFS Name Node, Hive Server 2, Hive Meta Store
Slave Nodes (three shown), each with: Spark Slave Process, HDFS Data Node, Yarn