New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data Spain 2014

NEW USAGE MODEL FOR REAL-TIME ANALYTICS
WILLIAM L. BAIN
CEO AT SCALEOUT SOFTWARE, INC.

Using In-Memory Models of
Real-World Systems for
Operational Intelligence
Big Data Hispano
November 17, 2014
Bill Bain, CEO (wbain@scaleoutsoftware.com)
Copyright © 2014 by ScaleOut Software, Inc.

Agenda
• What Is Operational Intelligence?
• Example: Tracking Cable Viewers
• Implementing OI Using an In-Memory Data Grid:
• Distributing the Data Across a Cluster
• Integrating Data-Parallel Analysis
• Building an In-Memory Model
• More Examples of In-Memory Models
• Comparison to Spark and Storm
• Implementing an Example in Financial Services
• Using In-Memory Hadoop MapReduce for OI

About the Speaker
• Dr. William Bain, Founder & CEO
• Career focused on parallel computing – Bell Labs, Intel, Microsoft
• 3 prior start-ups, last acquired by Microsoft and product now ships as
Network Load Balancing in Windows Server
• ScaleOut Software develops and markets In-Memory Data Grids,
software middleware for:
• Scaling application performance and
• Providing operational intelligence using
• In-memory data storage and computing
• Nine years in the market, 400 customers,
10,000 servers

Online Systems Need Operational
Intelligence
Goal: Provide immediate feedback to a system handling live data.
A few examples:
• Ecommerce: for personalized, real-time recommendations
• Equity trading: to minimize risk during a trading day
• Reservations systems: to identify issues, reroute, etc.
• Credit cards & wire transfers: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues

Example: Track Cable TV Viewers
• Goals:
• Make real-time, personalized upsell offers.
• Immediately respond to service issues.
• Track aggregate behavior to identify patterns, e.g.:
• Total instantaneous incoming event rate
• Most popular programs and # viewers by zip code
• Requirements:
• Track events from 10M cable boxes with 25K events/sec (2.2B/day).
• Correlate, cleanse, and enrich events per rules (e.g. ignore fast channel
switches, match channels to programs).
• Be able to feed enriched events to recommendation engine within 5 sec.
• Immediately examine any cable box (e.g., box status) & track statistics.
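The cleansing rule "ignore fast channel switches" and the daily event volume can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the `ChannelEvent` and `EventCleanser` names and the 5-second dwell threshold are hypothetical and do not come from the talk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event type: one channel change reported by a cable box.
class ChannelEvent {
    final long boxId;
    final int channel;
    final long timestampMs;

    ChannelEvent(long boxId, int channel, long timestampMs) {
        this.boxId = boxId;
        this.channel = channel;
        this.timestampMs = timestampMs;
    }
}

class EventCleanser {
    // Assumed threshold: a channel watched for less than 5 s counts as a fast switch.
    static final long MIN_DWELL_MS = 5_000;

    // Takes one box's events in time order and drops every event that is followed
    // by another switch within MIN_DWELL_MS. The newest event is always kept,
    // because its dwell time is not known yet.
    static List<ChannelEvent> dropFastSwitches(List<ChannelEvent> events) {
        List<ChannelEvent> kept = new ArrayList<>();
        for (int i = 0; i < events.size(); i++) {
            ChannelEvent current = events.get(i);
            boolean isNewest = (i == events.size() - 1);
            if (isNewest
                    || events.get(i + 1).timestampMs - current.timestampMs >= MIN_DWELL_MS) {
                kept.add(current);
            }
        }
        return kept;
    }

    // Daily event volume implied by a steady per-second rate.
    static long eventsPerDay(long eventsPerSecond) {
        return eventsPerSecond * 86_400L;
    }

    public static void main(String[] args) {
        System.out.println(eventsPerDay(25_000)); // 2160000000, i.e. about 2.2B/day
    }
}
```

In an IMDG the same filter would run against the event list held inside each cable-box object, so no event has to leave the server that stores the box.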

The Result: An OI Platform
Based on a simulated
workload for San Diego
metropolitan area:
• Continuously correlates and
enriches telemetry from 10M
simulated set-top boxes (from
synthetic load generator).
• Processes more than 30K
events/second.
• Enriches events with program
information every second.
• Tracks aggregate statistics
(e.g., top 10 programs by zip
code) every 10 secs.
Real-Time Dashboard

Real-Time vs. Batch Analytics
Big Data Analytics divides into two modes:
Real-time (“Operational Intelligence”):
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Seconds to minutes
• Best uses: tracking live data; immediately identifying trends and capturing opportunities; providing immediate feedback
• Products shown: Analytics Server, hServer
Batch (“Business Intelligence”):
• Static data sets
• Petabytes
• Disk storage
• Minutes to hours
• Best uses: analyzing warehoused data; mining for long-term trends
• Products shown: Hadoop, IBM, Teradata, SAS, SAP

Integrated View of Analytics
• Operational intelligence can co-exist with business intelligence:
• Processes streaming data close to its sources.
• Provides real-time, “tactical” feedback (e.g., recommendations, alerts).
• Transforms data for storage in the data warehouse (ETL).
• Data warehouse provides “strategic” guidance.
• Using the same tool set (e.g., Hadoop MapReduce) lowers TCO:
• Leverages common skill set.
• Simplifies design (e.g., loading data into HDFS).

Challenges for Operational Intelligence
• To keep up with fast-growing
“live” workloads &
maintain fast response times:
• Track state of entities within a
live system.
• Reliably process updates to
data set in real-time.
• To identify and respond to
trends in fast-changing data:
• Enrich & evaluate “live” data set
in real time.
• Respond to identified
patterns within seconds.
[Chart: Growth in Web Servers, in millions, 1995 to 2010. Source: Netcraft]
[Chart: Growth in “Big Data”, in exabytes, 2000 to 2013]
“More data has been created in the past three years than in the past 40,000.”

In-Memory Architecture for
Operational Intelligence
• In-memory data grid
(IMDG) holds active
entities undergoing
state changes in
memory.
• Backing store
optionally holds large
population of entities.
• IMDG processes
incoming stream of
state changes.
• Analytics engine
examines entities in real
time and generates
alerts within seconds
as needed.

In-Memory Data Grid
In-Memory Data Grid (IMDG) stores “live” data in a cluster:
• Fits in the business logic layer:
• Follows object-oriented view of data
(vs. relational view).
• Stores collections of Java/.NET/C++
objects shared by multiple clients.
• Uses create/read/update/delete
and query APIs to access data.
• Implemented across a cluster of
servers or VMs:
• Scales storage and throughput
by adding servers.
• Provides high availability
in case a server fails.

IMDGs Use Object-Oriented Model
• IMDG’s collections of objects act like
process collections:
• Unstructured, typically instances of a class
(stored as serialized blobs)
• Individually accessible / update-able
• IMDG adds attributes:
• Accessible by global key
• Query-able by properties
• Highly available
• Optional timeouts
• Distributed locking
• Integration with a backing store
• Optional dependency relationships
• Asynchronous event handling
Basic “CRUD” APIs (each object is accessed by its key):
• Create(key, obj, tout)
• Read(key)
• Update(key, obj)
• Delete(key)
and:
• Lock(key)
• Unlock(key)
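To make the shape of these calls concrete, here is a minimal single-process sketch in Java. It is not ScaleOut's API: the `LocalGrid` class, its method signatures, and the explicit `nowMs` clock parameter are assumptions made for illustration. A real IMDG distributes the objects across servers, replicates them for high availability, and keeps its own time.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical in-process stand-in for one IMDG object collection.
class LocalGrid<V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMs;

        Entry(V value, long expiresAtMs) {
            this.value = value;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final ConcurrentHashMap<String, Entry<V>> objects = new ConcurrentHashMap<>();
    private final ConcurrentHashMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Create(key, obj, tout): fails if the key already holds a live object.
    // A timeout of 0 or less means the object never expires.
    boolean create(String key, V obj, long timeoutMs, long nowMs) {
        read(key, nowMs); // purges the entry if it has already timed out
        long expiresAtMs = timeoutMs > 0 ? nowMs + timeoutMs : Long.MAX_VALUE;
        return objects.putIfAbsent(key, new Entry<>(obj, expiresAtMs)) == null;
    }

    // Read(key): returns null if the object is missing or has timed out.
    V read(String key, long nowMs) {
        Entry<V> entry = objects.get(key);
        if (entry == null) {
            return null;
        }
        if (entry.expiresAtMs <= nowMs) {
            objects.remove(key, entry);
            return null;
        }
        return entry.value;
    }

    // Update(key, obj): replaces a live object and keeps its original expiry.
    boolean update(String key, V obj, long nowMs) {
        Entry<V> entry = objects.get(key);
        if (entry == null || entry.expiresAtMs <= nowMs) {
            return false;
        }
        return objects.replace(key, entry, new Entry<>(obj, entry.expiresAtMs));
    }

    // Delete(key): returns true if an object was removed.
    boolean delete(String key) {
        return objects.remove(key) != null;
    }

    // Lock(key) / Unlock(key): per-key mutual exclusion between threads.
    void lock(String key) {
        locks.computeIfAbsent(key, k -> new ReentrantLock()).lock();
    }

    void unlock(String key) {
        ReentrantLock lock = locks.get(key);
        if (lock != null) {
            lock.unlock();
        }
    }
}
```

A caller that needs a consistent read-modify-write would wrap `read` and `update` between `lock(key)` and `unlock(key)`, which is the role distributed locking plays in the grid.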

In-Memory, Data-Parallel Computing
• Integrates with IMDG data storage to minimize data motion.
• Ex.: Parallel Method Invocation (PMI), an object-oriented version
of data-parallel computing from the HPC community:
• Selects objects using a parallel query on data hosted in the IMDG.
• Runs user-defined methods in parallel across the cluster and merges
results.
[Diagram: the in-memory data grid runs the data-parallel computation in two steps: Analyze Data (Eval), then Combine Results (Merge).]
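The query, eval, and merge pattern can be imitated on a single machine with Java parallel streams. This is a rough sketch, not the grid's implementation: `LocalPmi` and its `invoke` signature are hypothetical, and a parallel stream stands in for the cluster-wide execution engine.

```java
import java.util.Collection;
import java.util.Optional;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;

// Hypothetical single-machine imitation of a parallel method invocation.
class LocalPmi {
    // Selects objects with `query`, applies `eval` to each selected object together
    // with the shared parameter, and combines the results pairwise with `merge`.
    // `merge` must be associative and `eval` must not return null.
    // The result is empty when the query selects nothing.
    static <T, P, R> Optional<R> invoke(Collection<T> objects,
                                        Predicate<T> query,
                                        P param,
                                        BiFunction<T, P, R> eval,
                                        BinaryOperator<R> merge) {
        return objects.parallelStream()
                .filter(query)
                .map(obj -> eval.apply(obj, param))
                .reduce(merge);
    }
}
```

For example, `LocalPmi.invoke(numbers, x -> x % 2 == 0, 3, (x, p) -> x * p, Integer::sum)` multiplies every even number by 3 and sums the products. In the grid, each server would evaluate only the objects it stores, so only the merged results travel over the network.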

Achieving Linear Speedup
Avoid data motion (network or disk I/O), which limits throughput.

In-Memory Model of “Live” Entities
Object-oriented model tracks and analyzes real-world entities:
[Diagram layers: real-time data-parallel analysis; in-memory state in the IMDG; NoSQL storage.]

Example: Cable Set-Top Boxes
• Each cable box is represented as an object in the IMDG:
• Object holds raw & enriched event streams, viewer parameters, and
statistics.
• IMDG captures
incoming events by
updating objects.
• IMDG uses data-parallel
computation to:
• immediately enrich box objects
and generate alerts for the
recommendation engine, and
• continuously collect and report
global statistics.
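A global statistic such as "most popular programs by zip code" can be computed from the box objects with one data-parallel pass. The sketch below is illustrative only: `BoxState`, `ViewerStats`, and the sample field names are hypothetical, and a Java parallel stream stands in for the grid's data-parallel engine.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical slice of a cable-box object: where the box is and what it is showing.
class BoxState {
    final String zip;
    final String program;

    BoxState(String zip, String program) {
        this.zip = zip;
        this.program = program;
    }
}

class ViewerStats {
    // Counts viewers per program within each zip code. The parallel collect
    // builds partial maps on separate threads and merges them, which mirrors
    // the eval-then-merge structure of a data-parallel computation.
    static Map<String, Map<String, Long>> viewersByZip(Collection<BoxState> boxes) {
        return boxes.parallelStream().collect(
                Collectors.groupingBy((BoxState b) -> b.zip,
                        Collectors.groupingBy((BoxState b) -> b.program,
                                Collectors.counting())));
    }

    // Returns the n most-watched programs; ties are broken alphabetically.
    static List<String> topPrograms(Map<String, Long> viewersByProgram, int n) {
        return viewersByProgram.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

Calling `topPrograms(viewersByZip(boxes).get(zip), 10)` on a timer would give a "top 10 by zip code" report of the kind the dashboard refreshes every 10 seconds.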

Example in Ecommerce: Inventory
Management
Fast map/reduce reconciles inventory and order systems
for an online retailer:
• Challenge: Inventory and online
order management are handled
by different applications.
• Reconciled once per day.
• Inaccurate orders reduce margins.
• Solution:
• Host SKUs in IMDG updated in real
time by order & inventory systems.
• Use MapReduce to reconcile in two minutes.
• Enables real-time reconciliation to ensure accurate orders.

Example: Web Shopping
• IMDG holds customer
information for active
Web users.
• IMDG saves/retrieves
customer information
from backing store.
• Web browsers send
activity information to
analytics engine.
• IMDG updates customer history and
preferences.
• Analytics engine identifies browsing and
buying patterns.
• Analytics engine makes suggestions in
real-time. Also sends email follow-ups.

Example: Retail Shopping
Brick and mortar stores use OI to compete with online experience:
• IMDG tracks opt-in customers to make recommendations.
• RFID tags identify product selection and availability in showroom.
• Analytics engine sends real-time advisories to sales staff via tablet.

Comparison: IMDGs to Spark
Focus: accelerating business intelligence
using in-memory computing:
• In-memory computing to accelerate and extend
Hadoop MapReduce using data-parallel operators
in Scala.
• Stores data as “resilient
distributed datasets” (RDDs):
• Distributed across cluster
• Immutable
• Hold data from/output to HDFS.
• Manages data stream as a sequence of RDDs.
• Comparison to IMDG:
• Not designed for operational systems:
• Lacks high availability (uses lineage).
• Intended for data-parallel operations:
• Lacks CRUD APIs on individual objects.

Comparison to Storm
• Focus: continuous processing of input streams
• Storm implements pipelined execution of tasks by “bolts” on
incoming data streams.
• Streams can be distributed to bolts with configurable mappings.
• Developer controls the number of tasks per bolt.
• Storm uses a centralized master node
and Zookeeper for fault-tolerance.
• Issues:
• Managing global state
• Minimizing data motion
• Complexity / tuning

Implementing an Example in FinServ
• Hedge fund tracks a set of hedging strategies:
• Strategies can cover various market
sectors, such as high-tech, automotive,
energy, consumer, real estate, etc.
• Each strategy contains list of holdings
and rules for managing the holdings
(such as target allocations).
• Updates to market data
continuously arrive during
the trading day.
• The challenge: update and analyze a large population of
hedging strategies to immediately alert traders.

In-Memory Model
• The IMDG holds hedging strategies as an object-oriented collection.
• Updates to market data
are managed as a series of
snapshot objects.
• The IMDG performs
repeated data-parallel
analysis on hedging
strategies to generate
alerts.
• Merges alerts and feeds to
traders in real time.
• IMDG automatically and dynamically
scales its throughput to handle new
hedging strategies by adding servers.

Implementing the Analysis
Step 1: Select all objects using parallel query of strategy
objects:
• Query spec matches data’s object-oriented properties.
• Selected objects are fed to the analysis engine on each local server.

Java Example: Parallel Query
public class Portfolio {
    private long id;
    private Set<Stock> longPositions;
    private Set<Stock> shortPositions;
    private double totalValue;
    private Region region;
    private boolean alerted; // alert for trading

    @SossIndexAttribute // query-able property
    public double getTotalValue() {…}

    @SossIndexAttribute // query-able property
    public Region getRegion() {…}

    public Set<Long> evalPositions(MarketSnapshot ms) {…}
}

NamedCache pset = CacheFactory.getCache("portfolios");
Set<Portfolio> res = pset.queryObjects(Portfolio.class,
    and(greaterThan("totalValue", 1000000),
        equals("region", Region.US)));

Implementing the Analysis
Step 2: Create parallel methods to update and analyze the
queried collection of hedging strategies:
• “Eval” method applies market snapshot to an instance of a strategy
object:
• Compare to a MapReduce mapper; adds an input parameter.
• Updates the strategy object’s positions.
• Analyzes the positions for a deviation from allowed rules.
• Optionally generates an alert.
• “Merge” method combines alerts across the collection of strategies:
• Compare to a MapReduce combiner.
• Uses binary combining.
• Is applied globally to the object collection by the IMDG (unlike a MapReduce
reducer).
• Note: both methods access hydrated objects, avoiding the need for CRUD access.

Java Example: Parallel Method Invocation
• Create method to analyze a queried portfolio and another method to
pair-wise merge the result sets of alerted portfolios:
public class PortfolioAnalysis implements
        Invokable<Portfolio, MarketSnapshot, Set<Long>>
{
    public Set<Long> eval(Portfolio p, MarketSnapshot ms)
            throws InvokeException {
        // update portfolio and return id if alerted:
        return p.evalPositions(ms);
    }

    public Set<Long> merge(Set<Long> set1, Set<Long> set2)
            throws InvokeException {
        set1.addAll(set2);
        return set1; // merged set of alerted portfolio ids
    }
}

Java Example: Parallel Method Invocation
• Run a parallel method invocation on a queried set of portfolios and
return set of ids for alerted portfolios:
NamedCache pset = CacheFactory.getCache("portfolios");
InvokeResult alertedPortfolios = pset.invoke(
    PortfolioAnalysis.class,
    Portfolio.class,
    and(greaterThan("totalValue", 1000000), // query spec
        equals("region", Region.US)),
    marketSnapshot, // parameters
    ...
);
System.out.println("The alerted portfolios are " +
    alertedPortfolios.getResult());

Running the Analysis
• IMDG ships user’s code and libraries to its servers.
• IMDG automatically schedules analysis operations across all grid
servers and cores:
• The analysis runs on all objects selected
by the parallel query.
• Each grid server analyzes its locally stored
objects to minimize data motion.
• Parallel execution ensures fast
completion time:
• IMDG automatically distributes
workload across servers/cores.
• Scaling the IMDG automatically
handles larger data sets.

Merging the Results
• The IMDG automatically merges all analysis results:
• The IMDG first merges all results within each grid server in parallel.
• It then merges results across all grid servers to create one combined
result.
• Efficient parallel merge
minimizes the delay in
combining all results.
• The IMDG delivers the
combined result to the
invoking application as
one object.
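The two-level merge described above can be sketched as a pairwise (binary) combining tree. The `TreeMerge` class is hypothetical and runs sequentially; in a grid, the merges within one round are independent and would run in parallel, so n partial results need about log2(n) rounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BinaryOperator;

// Hypothetical sketch of hierarchical result merging.
class TreeMerge {
    // Combines partial results pairwise. Each pass over the list halves the
    // number of partial results until a single combined result remains.
    static <R> R combine(List<R> partials, BinaryOperator<R> merge) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no results to merge");
        }
        List<R> level = new ArrayList<>(partials);
        while (level.size() > 1) {
            List<R> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                if (i + 1 < level.size()) {
                    next.add(merge.apply(level.get(i), level.get(i + 1)));
                } else {
                    next.add(level.get(i)); // odd element moves up unchanged
                }
            }
            level = next;
        }
        return level.get(0);
    }

    // Step 1: merge the results held by each server.
    // Step 2: merge the per-server results into one combined result.
    static <R> R mergeAcrossGrid(List<List<R>> resultsPerServer, BinaryOperator<R> merge) {
        List<R> perServer = new ArrayList<>();
        for (List<R> local : resultsPerServer) {
            if (!local.isEmpty()) {
                perServer.add(combine(local, merge));
            }
        }
        return combine(perServer, merge);
    }
}
```

The merge function must be associative for the tree order not to matter, which holds for set union (as in the alerted-portfolio example) and for sums.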

Output: Real-Time Alerts
• In-memory analysis delivers a set of
alerts to traders every 300 msec.
• Enables the trader to examine strategy details in real time:

Sample Performance Results for PMI
• Measured a similar financial services application (back testing stock
trading strategies on stock histories)
• Hosted IMDG in Amazon EC2 using 75 servers holding 1 TB of stock
history data in memory
• IMDG handled a continuous stream of updates (1.1 GB/s)
• Results: analyzed 1 TB in 4.1 seconds (250 GB/s) with linear scaling
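The throughput figure can be checked with simple arithmetic. The sketch below assumes 1 TB = 1024 GB; the per-server number is derived here from the reported totals and is not a figure from the talk.

```java
// Hypothetical helper that reproduces the reported aggregate throughput.
class PmiThroughput {
    static double gbPerSecond(double gigabytes, double seconds) {
        return gigabytes / seconds;
    }

    public static void main(String[] args) {
        double aggregate = gbPerSecond(1024.0, 4.1); // assumes 1 TB = 1024 GB
        double perServer = aggregate / 75;           // 75 EC2 servers in the test
        System.out.println("aggregate GB/s: " + Math.round(aggregate));
        System.out.println("per-server GB/s: " + perServer);
    }
}
```

The aggregate rounds to 250 GB/s, which works out to roughly 3.3 GB/s per server, a rate that is plausible only because each server scans data already held in its own memory.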

In-Memory MapReduce
Benefits:
• Enables use of standard Hadoop MapReduce for operational
intelligence.
• Accelerates data access by holding data in memory.
• Analyzes and updates “live” data.
• Reduces overheads of standard
Hadoop distributions:
• Batch scheduling
• Disk access
• Data shuffling
• Mandatory key sorting
• Enables new features, e.g.:
• Global combining, optional sorting

Running MapReduce on an IMDG
• A Hadoop distribution does not have to be installed unless HDFS is used.
• The developer starts MapReduce applications from a remote workstation.
• The IMDG automatically builds a reusable “invocation grid” of JVMs on the
grid’s servers for PMI and ships the application’s jars.
• Results are stored in the IMDG, HDFS, or optionally globally merged and
returned to the remote workstation.

Run In-Memory MR with YARN
• YARN transparently integrates batch and in-memory MapReduce into a
single execution framework with shared access to HDFS.
• For example, IMDG can transparently run Apache Hive in-memory.
[Figures: example of ScaleOut hServer with Hortonworks; example of Hive running on the IMDG]

Implementing MapReduce
Run MapReduce as two PMI
phases:
• Data can be input from either the
IMDG or an external data source.
• Works with any input/output format
compatible with the Apache
distribution.
• IMDG uses its data-parallel
execution engine (PMI) to invoke
the mappers and the reducers.
• Eliminates batch scheduling
overhead.
• Intermediate results are stored
within the IMDG.
• Minimizes data motion between the
mappers and reducers.
• Allows optional sorting.
• Output of a single reducer/combiner
optionally can be globally merged.
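The two-phase structure can be illustrated with a word count whose intermediate results stay in memory. This is a single-threaded sketch, not hServer code: the `InMemoryWordCount` class is hypothetical, and in the grid both phases would be invoked in parallel on the servers that hold the data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of MapReduce run as two in-memory phases.
class InMemoryWordCount {
    // Phase 1 (map): every word emits the value 1, and the emitted values are
    // grouped by key in memory instead of being written to disk.
    static Map<String, List<Integer>> mapPhase(List<String> lines) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // produced by leading whitespace
                }
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        return grouped;
    }

    // Phase 2 (reduce): sums the values of each key. Sorting the keys is
    // optional, so the cost is paid only when the caller asks for it.
    static Map<String, Integer> reducePhase(Map<String, List<Integer>> grouped,
                                            boolean sortKeys) {
        Map<String, Integer> counts = sortKeys
                ? new TreeMap<String, Integer>()
                : new HashMap<String, Integer>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }
}
```

Calling `reducePhase(mapPhase(lines), false)` skips key sorting entirely, which standard Hadoop does not allow.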

Accessing IMDG Data for M/R
• IMDG adds grid input format for
accessing key/value pairs held in
the IMDG.
• MapReduce programs optionally
can output results to IMDG with
grid output format.
• Grid Record Reader optimizes
access to key/value pairs to
eliminate network overhead.
• Applications can access and
update key/value pairs as
operational data during analysis.

Optional Caching of HDFS Data
• IMDG adds Dataset Record Reader (wrapper) to cache HDFS
data during program execution.
• Hadoop automatically retrieves data from IMDG on subsequent
runs.
• Dataset Record Reader
stores and retrieves data
with minimum network
and memory overheads.
• Tests with the Terasort
benchmark have
demonstrated 11X
faster access than
HDFS without the IMDG.

In-Memory Storage Models
IMDG needs multiple in-memory
storage models:
• Named cache, optimized for
rich semantics on large
objects:
• Property-based query
• Distributed locking
• Access from remote grids
• Named map, optimized for
efficient storage and bulk
analysis (e.g., MapReduce):
• Highly efficient object storage
• Pipelined, bulk-access
mechanisms

In-Memory Storage Optimizations
In-Memory Concurrent Map:
• Stores key/value pairs in chunks.
• Allows CRUD operations on kvps.
• Automatically organizes chunks into
splits.
• Uses per-split hash table to access
keys and manage multi-valued
keys.
• Stores shuffled data set between
mappers and reducers.
• Pipelines chunks to mappers and
from reducers.
• Optionally uses memory mapped
files to reduce access latency.
• Provides support for sorting keys.
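The split and per-split hash table ideas can be sketched as follows. This is a simplified, hypothetical `SplitMap`, not the product's data structure: it assigns keys to splits by hash code, omits chunking, pipelining, and memory-mapped files, and is not thread-safe.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: key/value pairs partitioned into splits, each with its
// own hash table that keeps all values of a multi-valued key together.
class SplitMap<K, V> {
    private final List<Map<K, List<V>>> splits = new ArrayList<>();

    SplitMap(int splitCount) {
        if (splitCount < 1) {
            throw new IllegalArgumentException("need at least one split");
        }
        for (int i = 0; i < splitCount; i++) {
            splits.add(new HashMap<K, List<V>>());
        }
    }

    // Picks the split that owns a key (assumption: plain hash partitioning).
    private Map<K, List<V>> splitFor(K key) {
        return splits.get(Math.floorMod(key.hashCode(), splits.size()));
    }

    // Adds a value; repeated keys accumulate values (multi-valued keys).
    void add(K key, V value) {
        splitFor(key).computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Returns all values stored under a key, or an empty list.
    List<V> get(K key) {
        List<V> values = splitFor(key).get(key);
        return values == null
                ? Collections.<V>emptyList()
                : Collections.unmodifiableList(values);
    }

    // Removes a key and all of its values.
    boolean delete(K key) {
        return splitFor(key).remove(key) != null;
    }

    int splitCount() {
        return splits.size();
    }

    // Read-only view of one split: the unit a mapper or reducer would process.
    Map<K, List<V>> split(int index) {
        return Collections.unmodifiableMap(splits.get(index));
    }
}
```

Because every value of a key lands in the same split, a reducer can consume one split at a time without consulting the others.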

In-Memory M/R Optimizations
• MapReduce optimizations:
• Optional sorting
• Optional multicast of parameters to mappers
• Optional O(logN) global combining (avoids
single, sequential reducer)
• Optional HDFS caching
• Optional reuse of JVMs across jobs
• Measured performance:
• Startup times reduced to a few milliseconds
• Word count benchmark shows 20X speedup.
• Real-world example shows >40X speedup.
• Current limitations:
• No specific security for multi-tenancy
• Intermediate data must fit in the IMDG

Accelerating Start-Up Times
• Re-use in-memory context across MapReduce jobs:
public static void main(String argv[]) throws Exception {
    // Configure and load the invocation grid
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
        // Add JAR files as IG dependencies
        .addJar("main-job.jar").addJar("first-library.jar")
        // Add classes as IG dependencies
        .addClass(MyMapper.class).addClass(MyReducer.class)
        // Define custom JVM parameters
        .setJVMParameters("-Xms512M -Xmx1024M")
        .load();
    // Run 10 jobs on the same invocation grid
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as the parameter to the job
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        //......Configure the job here.........
        // Run the job
        job.waitForCompletion(true);
    }
    // Unload the invocation grid when we are done
    grid.unload();
}

Recap
• Online systems need operational
intelligence on “live” data for
immediate feedback.
• Operational intelligence can be
implemented using an IMDG
integrated with data-parallel
analysis.
• IMDGs track “live” state:
• Model real-world entities as a
highly available object collection.
• Enable updates to track changes.
• Use data-parallel computation for
immediate feedback with low
latency.
• Can run standard MapReduce.

Parallel Query Example (C#)
• Mark class properties as indexes for query:
class Stock {
    [SossIndex]
    public string Ticker { get; set; }
    public decimal TotalShares { get; set; }
    public decimal Price { get; set; }
}
• Define a query using these properties:
NamedCache cache = CacheFactory.GetCache("Stocks");
var q = from s in cache.QueryObjects<Stock>()
        where s.Ticker == "GOOG" || s.Ticker == "ORCL"
        select s;
Console.WriteLine("{0} Stocks found", q.Count());

Example of Analysis Code (C#)
• Create method to analyze each queried stock object:
static decimal eval(Stock stock, StockCalcParams calcParams)
{
    return stock.Price * stock.TotalShares;
}
• Create method to pair-wise merge the results:
static decimal merge(decimal r1, decimal r2)
{
    return r1 + r2;
}

Invoking the Parallel Analysis (C#)
• Run a parallel method invocation:
NamedCache cache = CacheFactory.GetCache("Stocks");
decimal valueOfSelectedStocks =
    (from s in cache.QueryObjects<Stock>()
     where s.Ticker == "GOOG" || s.Ticker == "ORCL"
     select s)
    .Invoke(new StockCalcParams(…),
            new Func<Stock, StockCalcParams, decimal>(eval))
    .Merge(new Func<decimal, decimal, decimal>(merge));
Console.WriteLine("The value of selected stocks is {0}",
    valueOfSelectedStocks);

17TH ~ 18th NOV 2014
MADRID (SPAIN)
