Kirk Haslbeck gave a presentation on data science at scale using Apache Spark. He discussed how Spark can handle large, distributed datasets and supports multiple programming languages. Spark addresses limitations of single-machine analysis and allows horizontal scaling. Haslbeck demonstrated how to build machine learning models for credit card fraud detection using Spark and showed visualizations created with R and Matplotlib in Apache Zeppelin.
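As a rough sketch of the workflow described (this is not Haslbeck's code; the file path and column names such as "amount" and "is_fraud" are made up for illustration), training a fraud classifier with Spark's DataFrame-based ML API looks like this:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("fraud-demo").getOrCreate()

# Reading from distributed storage lets the same code scale horizontally.
df = spark.read.csv("hdfs:///data/transactions.csv", header=True, inferSchema=True)

# Spark ML expects the inputs gathered into a single feature-vector column.
assembler = VectorAssembler(inputCols=["amount", "merchant_risk", "hour"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(labelCol="is_fraud", featuresCol="features").fit(train)
print(model.summary.areaUnderROC)  # quick sanity check on the fitted model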
How to Create 80% of a Big Data Pilot Project - Greg Makowski
When evaluating open source software, or other software of a certain size or complexity, organizations frequently want to conduct a pilot project, or proof of concept (POC). This talk describes a process for shortening the pilot by reusing configurations from performance testing as the POC's starting configurations.
LUISS - Deep Learning and data analyses - 09/01/19 - Alberto Paro
The document provides an overview of a presentation on data analysis, mobility, proximity and app-based marketing. The presentation covers topics including big data concepts, artificial intelligence/machine learning, and architectures for data flow and machine learning. It discusses technologies like Elasticsearch, Kafka, and columnar databases. Example applications of AI in areas like retail, banking, and manufacturing are also presented.
Production model lifecycle management 2016 09 - Greg Makowski
This talk walks through the various stages of building data mining models, putting them into production, and eventually replacing them. A common theme throughout is three attributes of predictive models: accuracy, generalization and description. I assert you can have it all, and that having all three is important for managing the lifecycle. A subtle point is that this is a step toward developing embedded, automated data mining systems which can figure out for themselves when they need to be updated.
Kamanja: Driving Business Value through Real-Time Decisioning Solutions - Greg Makowski
This is a first presentation of Kamanja, a new open-source real-time software product which integrates with other big-data systems. See http://www.meetup.com/SF-Bay-ACM/events/223615901/ and http://Kamanja.org for downloads, documentation, and community support. For the YouTube video, see https://www.youtube.com/watch?v=g9d87rvcSNk (you may want to start at minute 33).
Deutsche Telekom and T-Systems are large European telecommunications companies. Deutsche Telekom has revenue of $75 billion and over 230,000 employees, while T-Systems has revenue of $13 billion and over 52,000 employees providing data center, networking, and systems integration services. Hadoop is an open source platform that provides more cost-effective storage, processing, and analysis of large amounts of structured and unstructured data than traditional data warehouse solutions. Hadoop can help companies gain value from all their data by allowing them to ask bigger questions.
This document summarizes a presentation about the relationships between operations research (OR), data science, and business analytics. It begins by defining OR as applying analytical methods to help make better decisions, noting its broad scope. OR traditionally uses techniques like optimization, simulation, and forecasting. Data science also uses these techniques and focuses on descriptive, predictive, and prescriptive models. While OR and data science practitioners use similar methods, data scientists tend to have stronger software skills. The presentation argues that to be effective, OR practitioners need to expand their skills to work with new data types and technologies, and ensure their work is embedded within organizations to drive prescriptive analytics and cultural change. Bringing together soft OR methods, hard analytics techniques, ...
Big Data Hadoop, Business Analytics & Data Warehousing Online Training - sarala vanga
Big Data Hadoop training & certification online. Clear the CCA175 & CCAH exams. 9 real-life Big Data projects, led by industry experts. This online business analytics course introduces quantitative methods to analyze data and make better decisions. Our self-paced Data Warehousing training helps you master Data Warehousing tools and concepts. The course also earns you a Data Warehousing ... Job Assistance. Enroll now: +1 (210) 503-7100. For more information: http://radiantits.com/.
The document is a report on the big data industry in 2011. It provides an overview of key big data technologies like Hadoop and NoSQL databases. It examines the major players in the space, both established companies looking to adopt these technologies and startups focused on Hadoop. The report also provides a market forecast for the big data industry from 2011-2015 and makes recommendations for vendors, users, investors, and others on engaging with emerging big data opportunities.
AI-SDV 2021: Francisco Webber - Efficiency is the New Precision - Dr. Haxel Consult
The global data sphere, consisting of machine data and human data, is growing exponentially reaching the order of zettabytes. In comparison, the processing power of computers has been stagnating for many years. Artificial Intelligence – a newer variant of Machine Learning – bypasses the need to understand a system when modelling it; however, this convenience comes with extremely high energy consumption.
The complexity of language makes statistical Natural Language Understanding (NLU) models particularly energy hungry. Since most of the zettabyte data sphere consists of human data, such as texts or social networks, we face four major obstacles:
1. Findability of Information – when truth is hard to find, fake news rule
2. Von Neumann Gap – when processors cannot process faster, then we need more of them (energy)
3. Stuck in the Average – when statistical models generate a bias toward the majority, innovation has a hard time
4. Privacy – if user profiles are created “passively” on the server side instead of “actively” on the client side, we lose control
The current approach to overcoming these limitations is to train on larger and larger data sets across more and more processing nodes. Instead, AI algorithms should be optimized for efficiency rather than precision; by that measure, statistical modelling is disqualified as a brute-force approach for language applications. As a replacement for statistical modelling and arithmetic, set theory and geometry seem a much better choice, since they allow the direct processing of words instead of their occurrence counts, which is exactly what the human brain does with language, using only 7 watts!
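As a toy illustration of that set-theoretic idea (the words and feature indices below are invented, not Webber's actual semantic fingerprints), words can be represented as sparse sets of feature indices and compared by overlap rather than by occurrence counts:

# Each word is a small set of semantic feature indices (a "fingerprint").
fingerprints = {
    "car":   {3, 17, 42, 99, 256},
    "truck": {3, 17, 42, 128, 300},
    "poem":  {7, 88, 199, 256, 412},
}

def similarity(a, b):
    # Jaccard overlap: shared features divided by all features.
    return len(fingerprints[a] & fingerprints[b]) / len(fingerprints[a] | fingerprints[b])

print(similarity("car", "truck"))  # high: the fingerprints share many features
print(similarity("car", "poem"))   # low: the fingerprints are nearly disjoint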
A Big Data Telco Solution by Dr. Laura Wynter - wkwsci-research
Presented during the WKWSCI Symposium 2014
21 March 2014
Marina Bay Sands Expo and Convention Centre
Organized by the Wee Kim Wee School of Communication and Information at Nanyang Technological University
HPCC Systems - Open source, Big Data Processing & Analytics - HPCC Systems
This document summarizes HPCC Systems, an open source big data processing and analytics platform. It provides high-performance computing capabilities to integrate vast amounts of data from multiple sources and enable real-time queries and analysis. The platform uses the ECL programming language which allows for declarative, implicitly parallel programming optimized for data-intensive applications. It also describes LexisNexis' use of HPCC Systems and related technologies like SALT and LexID to link and analyze large datasets to derive insights for risk assessment and fraud detection across various industries.
Transform Banking with Big Data and Automated Machine Learning 9.12.17 - Cloudera, Inc.
Banks are rich in valuable data and can build and maintain a competitive advantage by identifying and executing on high-value machine learning projects that leverage it. This webinar will describe use cases fit for big data and machine learning in the banking sector (commercial, consumer, regulatory, and markets) and the impact they can have for your organization.
3 things to learn:
* How to create a next generation data platform and why it is important
* How to monetize big data using predictive modeling and machine learning
* What is needed for automated machine learning as a sustainable, cost-effective, and efficient solution
This document describes 7 predictive analytics, Spark, and streaming use cases:
1) Live train time tables reduced spread by 40% for Dutch Railways
2) Intelligent equipment saved $40M/year for oil and gas companies
3) Algorithmic loyalty found products customers didn't know they needed for North Face
4) Predictive risk compliance avoided $440M loss in 40 minutes for ConvergEx
5) Live flight optimization helped get passengers home on time for United Airlines
6) Continuous transaction optimization monitored 20,000 systems for Morgan Stanley
7) IoT parcel tracking improved real-time tracking from 20% to 100% for Royal Mail
AI-SDV 2020: Special Hypertext Information Treatment in is Special Hypertext ... - Dr. Haxel Consult
With all the new technologies and intelligence available, one may think that all information issues will be solved in the (near) future. However, one of the most fundamental issues at hand is that without reliable, quality information there is no usable output to work with in the first place. This presentation looks at the global challenges relating to content that we still face today and that will keep us from truly intelligent discovery in the future if nothing is done.
The document discusses how DataRobot provides an automated machine learning platform to help address the shortage of data scientists. It notes that while demand for data scientists is increasing due to the growth of data and need to extract insights, the supply of data scientists cannot keep up. DataRobot aims to turn more data-focused resources into effective data scientists and make existing data scientists more productive by automating workflows like data preparation, model training, evaluation and deployment. This helps organizations capitalize on their data and gain business value from AI and machine learning applications.
A Practical-ish Introduction to Data Science - Mark West
In this talk I will share insights and knowledge that I have gained from building up a Data Science department from scratch. This talk will be split into three sections:
1. I'll begin by defining what Data Science is, how it is related to Machine Learning and share some tips for introducing Data Science to your organisation.
2. Next up, we'll run through some commonly used Machine Learning algorithms used by Data Scientists, along with examples of use cases where these algorithms can be applied.
3. The final third of the talk will be a demonstration of how you can quickly get started with Data Science and Machine Learning using Python and the Open Source scikit-learn Library.
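As a flavor of that third section (a minimal sketch, not Mark West's actual demo), a scikit-learn quick start can be as short as this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Train a classifier on the bundled iris dataset and check held-out accuracy.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))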
This document provides an overview of Hadoop and big data use cases. It discusses the evolution of business analytics and data processing, as well as the architecture of traditional RDBMS systems compared to Hadoop. Examples of how companies have used Hadoop include a bank improving risk modeling by combining customer data, a telecom reducing churn by analyzing call logs, and a retailer targeting promotions by analyzing point-of-sale transactions. Hadoop allows these companies to gain valuable business insights from large and diverse data sources.
First in Class: Optimizing the Data Lake for Tighter Integration - Inside Analysis
The Briefing Room with Dr. Robin Bloor and Teradata RainStor
Live Webcast October 13, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=012bb2c290097165911872b1f241531d
Hadoop data lakes are emerging as peers to corporate data warehouses. However, successful data management solutions require a fusion of all relevant data, new and old, which has proven challenging for many companies. With a data lake that’s been optimized for fast queries, solid governance and lifecycle management, users can take data management to a whole new level.
Register for this episode of The Briefing Room to learn from veteran Analyst Dr. Robin Bloor as he discusses the relevance of data lakes in today’s information landscape. He’ll be briefed by Mark Cusack of Teradata, who will explain how his company’s archiving solution has developed into a storage point for raw data. He’ll show how the proven compression, scalability and governance of Teradata RainStor combined with Hadoop can enable an optimized data lake that serves as both reservoir for historical data and as a “system of record” for the enterprise.
Visit InsideAnalysis.com for more information.
Fit For Purpose: Preventing a Big Data Letdown - Inside Analysis
The Briefing Room with Dr. Robin Bloor and RedPoint Global
Live Webcast October 6, 2015
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=9982ad3a2603345984895f279e849d35
Gartner recently placed Big Data in its “trough of disillusionment,” reflective of many leaders’ struggle to prove the value of Hadoop within their organization. While the promise of enhanced data integration and enrichment is obvious, measurable results have remained elusive. This episode of The Briefing Room will outline how to successfully tie Big Data to existing business applications, preventing your next Hadoop project from being another “Big Data letdown.”
Register today to learn from veteran Analyst Dr. Robin Bloor as he discusses the importance of converging enterprise data integration with intelligence and scalability. He’ll be briefed by George Corugedo of RedPoint Global, who will provide concrete examples of how the convergence of scalable cloud platforms, ever-expanding data sources and intelligent execution can turn the Big Data hype into demonstrable business value.
Visit InsideAnalysis.com for more information.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
As the Big Data market has evolved, the focus has shifted from data operations (storage, access and processing of data) to data science (understanding, analyzing and forecasting from data). And as new models are developed, organizations need a process for deploying analytics from research into the production environment. In this talk, we'll describe the five stages of real-time analytics deployment:
Data distillation
Model development
Model validation and deployment
Model refresh
Real-time model scoring
We'll review the technologies supporting each stage, and how Revolution Analytics software works with the entire analytics stack to bring Big Data analytics to real-time production environments.
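To make the hand-off between those stages concrete (a minimal sketch in Python rather than the Revolution Analytics R stack the talk covers; the toy data is invented), model development produces a serialized artifact that the real-time scoring stage loads once and applies per event:

import pickle
from sklearn.linear_model import LogisticRegression

# Model development (offline): toy features stand in for distilled data.
X = [[0, 1], [1, 1], [2, 0], [3, 0]]
y = [0, 0, 1, 1]
artifact = pickle.dumps(LogisticRegression().fit(X, y))  # the deployment artifact

# Real-time scoring (online): deserialize once, then score each incoming event.
model = pickle.loads(artifact)

def score(event):
    return model.predict_proba([event])[0][1]  # probability of the positive class

print(score([2, 1]))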
Predictive Analytics - Big Data Warehousing Meetup - Caserta
Predictive analytics has always been about the future, and the age of big data has made that future an increasingly dynamic place, filled with opportunity and risk.
The evolution of advanced analytics technologies and the continual development of new analytical methodologies can help to optimize financial results, enable systems and services based on machine learning, obviate or mitigate fraud and reduce cybersecurity risks, among many other things.
Caserta Concepts, Zementis, and guest speaker from FICO presented the strategies, technologies and use cases driving predictive analytics in a big data environment.
For more information, visit www.casertaconcepts.com or contact us at info@casertaconcepts.com
Integrating Structure and Analytics with Unstructured Data - DATAVERSITY
How can you make sense of messy data? How do you wrap structure around non-relational, flexibly structured data? With the growth in cloud technologies, how do you balance the need for flexibility and scale with the need for structure and analytics? Join us for an overview of the marketplace today and a review of the tools needed to get the job done.
During this hour, we'll cover:
- How big data is challenging the limits of traditional data management tools
- How to recognize when tools like MongoDB, Hadoop, IBM Cloudant, R Studio, IBM dashDB, CouchDB, and others are the right tools for the job.
Apache Zeppelin and Spark for Enterprise Data Science - Bikas Saha
Apache Zeppelin and Spark are turning out to be useful tools in the toolkit of the modern data scientist when working on large scale datasets for machine learning. Zeppelin makes Big Data accessible with minimal effort using web browser based notebooks to interact with data in Hadoop. It enables data scientists to interactively explore and visualize their data and collaborate with others to develop models. Zeppelin has great integration with Apache Spark that delivers many machine learning algorithms out of the box to Zeppelin users as well as providing a fast engine to run custom machine learning on Big Data. The talk will describe the latest in Zeppelin and focus on how it has been made ready for the enterprise. With support for secure Hadoop clusters, LDAP/AD integration, user impersonation and session separation, Zeppelin can now be confidently used in secure and multi-tenant enterprise domains.
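For example, a single Zeppelin paragraph (assuming a Spark session and a registered "transactions" table, both hypothetical here) is enough to query data in Hadoop and chart the result:

%pyspark
# Explore interactively, then render with Zeppelin's built-in visualization.
df = spark.sql("SELECT region, count(*) AS n FROM transactions GROUP BY region")
z.show(df)  # z is Zeppelin's context object; the output becomes a chartable table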
IBM Netezza - The data warehouse in a big data strategy - IBM Sverige
Big Data - trends and reality in Information Management.
This presentation was given at IBM Data Server Day on 22 May in Stockholm by Jacques Milman, Data Warehouse Architecture Leader, IBM.
Data Pipelines and Telephony Fraud Detection Using Machine Learning - Eugene
This document discusses data pipelines and machine learning for telephony fraud detection. It first covers data pipelines: call detail records (CDRs), SIP messages, and local routing numbers are routed through Kafka for reliable delivery and landed in Cassandra and Postgres for storage and analysis. It then discusses fraud detection: collecting CDR data, processing it asynchronously at scale using Spark Streaming and Cassandra, detecting anomalies both statically and dynamically, and alerting. Key challenges discussed are idempotency, partitioning, and consistency models for distributed systems.
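A minimal sketch of the ingest side (the topic name, field names, and threshold are placeholders, and the talk's production pipeline is more elaborate) using the kafka-python client:

import json
from kafka import KafkaConsumer  # kafka-python package

consumer = KafkaConsumer(
    "cdrs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:
    cdr = msg.value
    # Static rule for illustration: very long international calls are a classic fraud signal.
    if cdr.get("international") and cdr.get("duration_sec", 0) > 3600:
        print("ALERT:", cdr["caller"], "->", cdr["callee"])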
The document discusses a presentation on fraud detection using database platforms. The session objectives are to understand why and how fraud patterns can be detected digitally, understand statistical approaches for quantification, identify tools and techniques, and understand how pattern detection fits in. The agenda includes sessions on managing fraud risk, statistical approaches, databases, demonstration, and Q&A. The presentation will cover fraud patterns detectable through digital analysis and data-driven approaches to fraud detection.
This document discusses Apache Zeppelin, an open-source web-based notebook that allows for interactive data analytics. It can be used for data exploration, visualization, collaboration and publishing. Zeppelin has deep integration with Apache Spark and supports multiple languages including Scala, Python, and SQL. It provides a Spark interpreter that allows users to analyze data using Spark without having to configure Spark themselves. The document demonstrates Zeppelin's functionality through examples and encourages readers to try it out and get involved in the community.
An introduction to Spark MLlib from the Apache Spark with Scala course available at https://www.supergloo.com/fieldnotes/portfolio/apache-spark-scala/. These slides present an overview on machine learning with Apache Spark MLlib.
For more background on machine learning see my other uploaded presentation "Machine Learning with Spark".
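As a taste of the API (shown in PySpark rather than the course's Scala; the points are invented), MLlib's RDD-based k-means takes only a few lines:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="mllib-kmeans")
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # one center per cluster
print(model.predict([0.5, 0.5]))   # cluster index assigned to a new point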
Presented at the MLConf in Seattle, this presentation offers a quick introduction to Apache Spark, followed by an overview of two novel features for data science.
AWS re:Invent 2016: Fraud Detection with Amazon Machine Learning on AWS (FIN301) - Amazon Web Services
In this session, we provide programmatic guidance on building tools and applications to detect and manage fraud and unusual activity specific to financial services institutions. Payment fraud is an ongoing concern for merchants and credit card issuers alike and these activities impact all industries, but are specifically detrimental to Financial Services. We provide a step-by-step walkthrough of a reference solution to detect and address credit card fraud in real time by using Apache Apex and Amazon Machine Learning capabilities. We also outline different resource and performance optimization options and how to work data security into the fraud detection workflow.
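For flavor, a real-time prediction call against the (since-deprecated) Amazon Machine Learning service looks roughly like this with boto3; the model ID, endpoint URL, and record fields are placeholders, not values from the session:

import boto3

ml = boto3.client("machinelearning", region_name="us-east-1")

response = ml.predict(
    MLModelId="ml-EXAMPLEMODELID",                        # hypothetical model ID
    Record={"amount": "250.00", "merchant": "web-123"},   # one transaction's features
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(response["Prediction"]["predictedLabel"])  # e.g. "1" for suspected fraud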
Real-Time Fraud Detection in Payment Transactions - Christian Gügi
This document discusses building a real-time fraud detection system using big data technologies. It outlines the cyber threat landscape, what anomalies and fraud detection are, and proposes an architecture with a data layer to integrate various sources and an analytics layer using stream processing, rules engines, and machine learning to score transactions in real-time and detect fraud. The system aims to scalably and reliably detect threats for increased security.
This document provides an overview of Apache Spark's MLlib machine learning library. It discusses machine learning concepts and terminology, the types of machine learning techniques supported by MLlib like classification, regression, clustering, collaborative filtering and dimensionality reduction. It covers MLlib's algorithms, data types, feature extraction and preprocessing capabilities. It also provides tips for using MLlib such as preparing features, configuring algorithms, caching data, and avoiding overfitting. Finally, it introduces ML Pipelines for constructing machine learning workflows in Spark.
Credit Fraud Prevention with Spark and Graph Analysis - Jen Aman
This document discusses using Spark and graph analysis to prevent credit card fraud in real-time. It describes how fraud costs billions annually and affects millions of people. Common fraud types are outlined. The solution involves combining multiple data sources using Spark and a graph database to score applications for fraud in real-time. A demo is shown using sample fraudulent data and a fraud prediction model. Performance metrics are provided for the Databricks and Visallo platforms used to ingest data and detect fraud.
Practical Machine Learning Pipelines with MLlib - Databricks
This talk from 2015 Spark Summit East discusses Pipelines and related concepts introduced in Spark 1.2, which provide a simple API for users to set up complex ML workflows.
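The core idea, shown here as a minimal PySpark sketch with toy data (modern Spark syntax, not the 1.2-era code from the talk), is that feature transformers and an estimator chain into a single unit that is fit and applied as one:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-demo").getOrCreate()
train = spark.createDataFrame([("spark is fast", 1.0), ("slow mail spam", 0.0)], ["text", "label"])

tokenizer = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)

model = Pipeline(stages=[tokenizer, tf, lr]).fit(train)  # fits all stages in order
model.transform(train).select("text", "prediction").show()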
On December 5, 2013, Ron Steinkamp, principal, government advisory services at Brown Smith Wallace, presented at the 2013 MIS Training Institute Governance, Risk & Compliance Conference. Ron focused on the following keys to fraud prevention, detection and reporting:
1. Anti-fraud culture
2. Fraud policy
3. Fraud awareness/training
4. Hotline
5. Assess fraud risks
6. Review/investigation
7. Improved controls
How to Apply Big Data Analytics and Machine Learning to Real Time Processing ... - Codemotion
The world gets more connected every year due to Mobile, Cloud and the Internet of Things. "Big Data" is currently a big hype. Large amounts of historical data are stored in Hadoop to find patterns, e.g. for predictive maintenance or cross-selling. But how do you increase revenue or reduce risk in new transactions? "Fast Data" via stream processing is the solution for embedding patterns into future actions in real time. This session discusses how machine learning and analytic models built with R, Spark MLlib, H2O, etc. can be integrated into real-time event processing. A live demo concludes the session.
Data Science lifecycle with Apache Zeppelin and Spark by Moonsoo Lee - Spark Summit
This document discusses Apache Zeppelin, an open-source notebook for interactive data analytics. It provides an overview of Zeppelin's features, including interactive notebooks, multiple backends, interpreters, and a display system. The document also covers Zeppelin's adoption timeline, from its origins as a commercial product in 2012 to becoming an Apache Incubator project in 2014. Future projects involving Zeppelin like Helium and Z-Manager are also briefly described.
This document provides an overview of machine learning concepts and techniques using Apache Spark. It begins with introducing machine learning and describing supervised and unsupervised learning. Then it discusses Spark and how it can be used for large-scale machine learning tasks through its MLlib library and GraphX API. Several examples of machine learning applications are presented, such as classification, regression, clustering, and graph analytics. The document concludes with demonstrating machine learning algorithms in Spark.
This document discusses building a machine learning model for credit card fraud detection using a connected data platform. It describes limitations of traditional modeling approaches, how modern distributed tools can handle large data volumes, and security concerns around extracting data. The document then outlines requirements for a credit card fraud detection model, including detecting fraud within 2 seconds. It discusses rule-based logic, statistics, and machine learning approaches. Finally, it provides an overview of the technologies that could be used to build such a model at scale, including Spark, Storm, Kafka and HBase.
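One hedged sketch of the streaming leg (using Spark Structured Streaming; the topic, schema, and model path are placeholders, and the stack described above could equally use Storm and HBase) reads transactions from Kafka and applies a model trained offline:

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, DoubleType
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("fraud-stream").getOrCreate()
schema = (StructType()
          .add("card_id", StringType())
          .add("amount", DoubleType())
          .add("merchant_risk", DoubleType()))

txns = (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("t"))
        .select("t.*"))

model = PipelineModel.load("hdfs:///models/fraud")  # trained and saved offline
alerts = model.transform(txns).filter(col("prediction") == 1.0)
alerts.writeStream.format("console").start().awaitTermination()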
This workshop will provide a hands on introduction to basic Machine Learning techniques with Apache Spark ML using the cloud.
Format: A short introductory lecture on select important supervised and unsupervised Machine Learning techniques followed by a demo, lab exercises and a Q&A session. The lecture will be followed by lab time to work through the lab exercises and ask questions.
Objective: To provide a quick and short hands-on introduction to Machine Learning with Spark ML. In the lab, you will use the following components: Apache Zeppelin (a “Modern Data Science Toolbox”) and Apache Spark. You will learn how to analyze the data, structure the data, train Machine Learning models and apply them to answer real-world questions.
Pre-requisites: Registrants must bring a laptop that can run the Hortonworks Data Cloud.
At this Crash Course everyone will have a cluster assigned to them to try several workloads using Machine Learning, Spark and Zeppelin on the cloud.
Speakers: Robert Hryniewicz
Hortonworks - IBM Cognitive - The Future of Data Science - Thiago Santiago
The document discusses Hortonworks and IBM's partnership around data management and analytics. It highlights how their combined platforms can power the modern data architecture with solutions for data at rest and in motion. Examples are provided of how customers like Merck and JPMC have leveraged Hortonworks' technologies to gain insights from their data and drive business outcomes. Industries that are investing in data science are also listed.
Enabling the Real Time Analytical Enterprise - Hortonworks
This document discusses enabling real-time analytics in the enterprise. It begins with an overview of the challenges of real-time analytics due to non-integrated systems, varied data types and volumes, and data management complexity. A case study on real-time quality analytics in automotive is presented, highlighting the need to analyze varied data sources quickly to address issues. The Hortonworks/Attunity solution is then introduced using Attunity Replicate to integrate data from various sources in real-time into Hortonworks Data Platform for analysis. A brief demonstration of data streaming from a database into Kafka and then Hortonworks Data Platform is shown.
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughton - Synerzip
Making AI real-time to meet mission-critical system demands puts a new spin on your architecture. Delivering AI-based applications that will scale as your data grows takes a new approach, one where the data doesn’t become the bottleneck. We all know that the deeper the data, the better the results and the lower the risk. However, doing thousands of computations on big data requires new data structures and messaging to be used together to deliver real-time AI. During this session we will look at real reference architectures and review the new techniques that were needed to make AI real-time.
This document discusses methods for harnessing big data. It describes how sensors collect Internet of Things (IoT) data and how Volvo applies analytics. It also summarizes three methods: 1) The US Air Force uses an integrated data warehouse and geospatial analysis to track assets globally. 2) Siemens uses data discovery processes to predict train failures by analyzing sensor and failure report data. 3) Yahoo uses Hadoop as a data lake to store and analyze large amounts of user data from various sources like social media and clickstreams. The document emphasizes that no single technology is a silver bullet for big data.
Introduction: This workshop will provide a hands-on introduction to Machine & Deep Learning.
Format: An introductory lecture on several supervised and unsupervised Machine Learning techniques followed by light introduction to Deep Learning. Both Apache Spark as well as TensorFlow will be introduced with relevant code samples that users can run in the cloud and explore.
Objective: To provide a quick and short hands-on introduction to Machine Learning with Spark Machine Learning library (MLlib) and Deep Learning with TensorFlow. In the lab, you will use the following components: Apache Zeppelin and Jupyter notebooks with Apache Spark and TensorFlow processing engines (respectively). You will learn how to analyze and structure data, train Machine Learning models and apply them to answer real-world questions. You will also learn how to select, train, and test Deep Learning models.
Prerequisites: Registrants must bring a laptop with a Chrome or Firefox web browser installed (with proxies disabled, i.e. must show venue IP to access cloud resources). These labs will be done in the cloud. At this Crash Course everyone will be assigned a cluster to try several workloads using Apache Spark and TensorFlow in Zeppelin and Jupyter notebooks (respectively) hosted in the cloud.
Apache Hadoop and its role in Big Data architecture - Himanshu Bari - jaxconf
In today’s world of exponentially growing big data, enterprises are becoming increasingly more aware of the business utility and necessity of harnessing, storing and analyzing this information. Apache Hadoop has rapidly evolved to become a leading platform for managing and processing big data, with the vital management, monitoring, metadata and integration services required by organizations to glean maximum business value and intelligence from their burgeoning amounts of information on customers, web trends, products and competitive markets. In this session, Hortonworks' Himanshu Bari will discuss the opportunities for deriving business value from big data by looking at how organizations utilize Hadoop to store, transform and refine large volumes of this multi-structured information. He will also discuss the evolution of Apache Hadoop and where it is headed, the component requirements of a Hadoop-powered platform, as well as solution architectures that allow for Hadoop integration with existing data discovery and data warehouse platforms. In addition, he will look at real-world use cases where Hadoop has helped to produce more business value, augment productivity or identify new and potentially lucrative opportunities.
10 Lessons Learned from Meeting with 150 Banks Across the Globe - DataWorks Summit
This document summarizes 10 practical lessons learned from companies about their big data and analytics journeys:
1. There are clear leaders in each market who are gaining substantial benefits from using big data and machine learning, widening the gap with other companies.
2. Real transformation requires buy-in from top executives, as reflected by new innovation centers, roles, and organizations.
3. Projects should have clear revenue impact objectives and be selected based on estimated return, with pre- and post-implementation measurements.
4. While cost reduction brings the fastest ROI, new revenue opportunities can transform a business more lastingly if the projects address real customer and business needs.
This document provides an overview and crash course on Apache Spark and related big data technologies. It discusses the history and components of Spark including Spark Core, SQL, Streaming, and MLlib. It also discusses data sources, challenges of big data, and how Spark addresses them through its in-memory computation model. Finally, it introduces Apache Zeppelin for interactive notebooks and the Hortonworks Data Platform sandbox for experimenting with these technologies.
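In that spirit, a first Spark SQL paragraph of the kind run in Zeppelin (column names invented for illustration) shows both the SQL interface and the in-memory caching the course highlights:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("crash-course").getOrCreate()
df = spark.createDataFrame([("alice", 34), ("bob", 41)], ["name", "age"])

df.createOrReplaceTempView("people")
df.cache()  # keep the data in memory across subsequent queries
spark.sql("SELECT name FROM people WHERE age > 35").show()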
It is almost impossible to escape the topic of Data Science. While the core of Data Science has remained the same over the last decade, its emergence to the forefront is spurred by both the availability of new data types and a true realization of the value that it delivers. In this session, we will provide an overview of data science and the different classes of machine learning algorithms, and deliver an end-to-end demonstration of performing Machine Learning using Hadoop. Audience: Developers, Data Scientists, Architects and System Engineers.
Recording: https://hortonworks.webex.com/hortonworks/lsr.php?RCID=4175a7421d00257f33df146f50c41af8
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana... - DataWorks Summit
The energy industry is well known as a laggard adopter of new technology. However, industry challenges such as aging assets and workforce, increased regulatory scrutiny, renewable energy sources, depressed commodity prices, changing customer expectations, and growing data volumes are pushing companies to explore new technologies to help solve these problems. Learn how energy companies are leveraging Hortonworks Open and Connected Data Platforms to provide the predictive analysis and data insights needed to optimize performance for the energy industry.
Speaker
Kenneth Smith, General Manager, Energy, Hortonworks
This document discusses 10 emerging data analytics trends and 5 cooling trends based on an analysis of current technologies and strategies. Emerging trends include self-service BI tools, mobile dashboards, deep learning frameworks like TensorFlow and MXNet, and cloud storage and analysis. Cooling trends include Hadoop due to complexity, batch processing due to lag, and IoT due to security issues. R, Scikit-learn and Jupyter Notebooks are also highlighted as growing in importance.
MammothDB is the first inexpensive enterprise analytics database, offered in the cloud or on-premises.
It's pointless to have big, or even medium sized data, if you don't have the ability to easily use and understand that data. We're making enterprise analytics accessible to every company in the world, particularly the under-served 88% of global companies that don't have enterprise analytics/business intelligence today.
Dr. Stefan Radtke gave a presentation on the journey to big data analytics. He discussed how analytics is affecting many industries and the evolution of analytic questions from descriptive to predictive to prescriptive. He emphasized the need to collect all potential data from both traditional and new sources. A strategic approach was presented that aligns business and IT goals, identifies strategic opportunities, prioritizes use cases, and recommends an analytics roadmap. Dell EMC offers various services to help customers with their big data and analytics initiatives and solutions.
The Impact of SMACT on the Data Management Stack - SnapLogic
This presentation introduces the concept of the "Integrator's Dilemma" and reviews some of the challenges faced by traditional data and application integration technologies when it comes to keeping up with the new enterprise data, application and API connectivity and management requirements. We review the landscape and share examples of the steps more and more IT organizations are taking to improve business alignment through faster access to trusted data.
To learn more, visit http://www.snaplogic.com/ipaas
With the rise of IoT and the increasing complexity of applications, clouds, networks and infrastructure, the battle to keep your data and your infrastructure safe from attackers is getting harder. As groups of bad actors collaborate, sharing information and offering illegal access and botnets as a service, terabits of attack traffic can be launched cheaply. Meanwhile, it’s hard to find enough security analysts to catch and prevent these attacks.
This is where community collaboration and open source efforts like Apache Metron come in. Metron presents a comprehensive framework for application and network security, built on the highly scalable data management and processing stacks of Apache Hadoop and open-source streaming analytics tools (i.e. Apache NiFi, Apache Kafka). Advanced features like profiling, machine learning, and visualization work with real-time streaming detection to make your SOC analysts more efficient, while the intrinsic extensibility of open source helps your data scientists get security insights out of the lab and into production fast.
We will discuss and demonstrate how some real-world businesses and managed service providers are using Apache Metron to identify and solve security threats at scale, and some approaches and ideas for how the platform can fit into your security architecture.
Speaker: Laurence Da Luz, Senior Solutions Architect, Hortonworks
Bitkom Cray presentation - on HPC affecting big data analytics in FS - Philip Filleul
High-value analytics in FS are being enabled by graph, machine learning, and Spark technologies. To make these real at production scale, HPC technologies are more appropriate than commodity clusters.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Nunit vs XUnit vs MSTest Differences Between These Unit Testing Frameworks.pdfflufftailshop
When it comes to unit testing in the .NET ecosystem, developers have a wide range of options available. Among the most popular choices are NUnit, XUnit, and MSTest. These unit testing frameworks provide essential tools and features to help ensure the quality and reliability of code. However, understanding the differences between these frameworks is crucial for selecting the most suitable one for your projects.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered:
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc.
- Practical examples and best practices to implement right away
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack - shyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Driving Business Innovation: Latest Generative AI Advancements & Success Story - Safe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Generating privacy-protected synthetic data using Secludy and Milvus - Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
A Comprehensive Guide to DeFi Development Services in 2024 - Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers - akankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Skybuffer SAM4U tool for SAP license adoption - Tatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, a free SAP software asset management tool for customers.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf - Chart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
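As a rough sketch of what such a query can look like (a hedged example, not taken from the presentation: the connection string, database, collection, index name, and embedding field below are all assumed placeholders), MongoDB Atlas exposes vector search through the $vectorSearch aggregation stage:

```python
# Hedged sketch of an Atlas Vector Search query via the aggregation pipeline.
# The connection string, names, and the tiny query vector are illustrative
# placeholders, not values from the presentation.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@cluster0.example.mongodb.net")
coll = client["demo_db"]["articles"]

query_embedding = [0.12, -0.03, 0.47]  # normally produced by your embedding model

results = coll.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",        # assumed Atlas vector index name
        "path": "embedding",            # assumed document field holding vectors
        "queryVector": query_embedding,
        "numCandidates": 100,           # breadth of the approximate search
        "limit": 5,                     # top-k results returned
    }},
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
])
for doc in results:
    print(doc)
```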
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
We have other engines out there, plenty of them, and we have SQL / Java.
Language trends, thirst for answers.
Example: each task takes a certain amount of time, but we want to buffer that time and leave room for Murphy’s law, i.e., that something will go wrong or take a bit longer. Add 20% more time. Each unit is independent.
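As a tiny worked example of that buffering rule (the task names and estimates below are invented for illustration), the 20% buffer is just a multiplier applied per independent task:

```python
# Illustrative only: pad each independent task estimate by 20% for Murphy's law.
task_hours = {"ingest": 5.0, "model": 8.0, "report": 3.0}  # made-up estimates
buffered = {name: hours * 1.2 for name, hours in task_hours.items()}
print(buffered, "total:", round(sum(buffered.values()), 1))
```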
Concise and declarative, but it also provides a greater description to the CPU of how to handle the problem.
“Anywhere” is the real answer. Here are a few examples that have been bubbling up among our customers.
Speaker: pick one of these and describe, briefly, an example that you know about in 4-5 sentences. One ‘close’ to the audience is, of course, best. There’s a separate use-case deck by industry on the workshop wiki page that you can use for ideas.
Example from movie: “IDENTITY THEFT”
Some common applications of outlier detection include:
Fraud detection: The purchasing behavior of a credit card owner usually changes when the card is stolen, and the abnormal buying patterns can indicate fraud.
Medicine: Unusual test results may indicate an underlying health issue
Sports: Exceptional players may appear as outliers in particular parameters and be placed in positions where the team can benefit most
Gas, Convenience, retail,
Before we can detect an outlier, we have to define it. The most intuitive definition I’ve seen is from a 1980 book Identification of Outliers (Hawkins):
“An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism”
Speaker: It’s worth repeating that all of these topics are deep enough to fill entire university library shelves with books. We’re just skimming the surface.
Anomaly detection is related to clustering, but almost its inverse.
If points that are similar to each other cluster together (representing “normal” behavior or patterns), we want to find the points that are NOT in any cluster.
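A minimal sketch of that idea (the slides show no code, so scikit-learn and synthetic data are assumed here): DBSCAN assigns the label -1 to points that fall in no cluster, and those are exactly the outlier candidates:

```python
# Anomaly detection as the "inverse" of clustering: cluster the dense regions,
# then treat everything DBSCAN leaves unclustered (label -1) as an outlier.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # "normal" behavior
outliers = rng.uniform(low=-8, high=8, size=(5, 2))     # generated differently
X = np.vstack([normal, outliers])

labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("outlier candidates:\n", X[labels == -1])
```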
What: Identify what group a new observation belongs in.
This is a simplified visual example, taken from Andrew Ng’s ML class. Plotted are only 2 dimensions (features) from the full feature vector – age and tumor size.
The instances provided are “labeled” – in this case marked in red/x vs. blue/circle.
Note that some red instances are inside the blue cluster, which reflects the fact that training sets sometimes contain noise, making the learning task non-trivial.
Similarly some blue points are inside the red cluster.
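A minimal sketch of this supervised setup (the feature values and labels below are invented, and scikit-learn is assumed; the slides use a plot, not code): fit a classifier on labeled [age, tumor size] rows, then ask it to label a new observation:

```python
# Supervised classification: rows are instances, columns are features,
# and every row comes with a label (0 = benign/blue, 1 = malignant/red).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[35, 1.2], [52, 4.8], [61, 5.5], [44, 2.0],
              [70, 6.1], [30, 0.9], [58, 3.9], [49, 1.5]])  # [age, tumor size]
y = np.array([0, 1, 1, 0, 1, 0, 1, 0])                      # the labels

clf = LogisticRegression().fit(X, y)
print(clf.predict([[55, 4.2]]))  # which group does a new observation belong in?
```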
Regression is supervised learning where instead of predicting a category (like malignant or benign from the previous examples) we predict a “value” – a number.
In this example (again from Andrew Ng’s class) we are trying to predict the price of a house given a single variable: size in square feet.
Clearly more complex models in multiple dimensions would be better; for example, we could use other features like “age of house”, “number of previous owners”, “geographic location”, or “score for closest public school”.
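A minimal sketch of that regression setup (the sizes and prices below are made up; scikit-learn assumed): fit a line to (size, price) pairs and predict a value for an unseen size. Extra features would simply be extra columns in the matrix:

```python
# Supervised regression: predict a number (price) rather than a category.
import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[800], [1200], [1500], [2000], [2400]])      # sq. feet
price = np.array([150_000, 210_000, 265_000, 340_000, 400_000])

model = LinearRegression().fit(size, price)
print(model.predict([[1800]]))  # predicted price for a house of 1800 sq. feet
```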
With unsupervised learning, we again have as input a feature matrix with rows as instances and columns as variables, but NO LABELS.
Now the goal is to find a label (cluster number) for each instance, but we are not learning a given function to match; rather, we are trying to figure out the natural way instances may be grouped together.
Note that we are usually NOT given the desired number of clusters (often called “K”), and may need to determine this on our own.
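A minimal sketch of picking K when it is not given (synthetic data and scikit-learn assumed; the silhouette score is one common heuristic among several):

```python
# Unsupervised clustering: no labels in, cluster numbers out. Try several K
# and keep the one with the best silhouette score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in (0, 4, 8)])  # 3 blobs

best_k = max(
    range(2, 7),
    key=lambda k: silhouette_score(X, KMeans(n_clusters=k, n_init=10).fit_predict(X)),
)
print("chosen K:", best_k)  # the "natural" grouping the data suggests
```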
Over-fitting means the model performs very well on the training set but does not generalize well, so results on unseen data are poor.
As shown in the diagram, this means the model learned the specific granular details of the training set and not the generic function it was meant to learn.
This is why we “evaluate” on the validation set (and not the training set), because if we measured error on the training set we may get a false sense of performance if the model is over-fitting.
Under-fitting means the model doesn’t have enough degrees of freedom to learn the needed function, and usually has high bias.
Underfitting is often the result of an excessively simple model. In practice you won’t encounter underfitting very often: data sets used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any predictive model, you should use validation or cross-validation to assess predictive accuracy and avoid these problems. Here we may have many observations but too few features (the matrix is tall and narrow).
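A minimal sketch of diagnosing both failure modes at once (synthetic data and scikit-learn assumed): fit polynomials of increasing degree and compare training error with validation error. Under-fitting shows up as both errors being high; over-fitting as a low training error with a much higher validation error:

```python
# Fit polynomial models of degree 1 (under-fit), 4 (reasonable), 15 (over-fit)
# and measure error on the training set vs. a held-out validation set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 3, 60)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 60)  # noisy target to learn

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),    # training error
          mean_squared_error(y_val, model.predict(X_val)))  # validation error
```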