This document discusses enabling real-time analytics using Hadoop MapReduce on an in-memory data grid (IMDG). It describes implementing MapReduce using parallel method invocation on an IMDG to eliminate batch scheduling overhead and analyze live data. Sample use cases are presented for applications in financial services, ecommerce, and other industries that require real-time analysis of large, changing datasets.
2. Agenda
• Quick Review of In-Memory Data Grids
• The Need for Real-Time Analytics: Two Use Cases
• Data-Parallel Computation on an IMDG Using Parallel Method Invocation (PMI)
• Implementing MapReduce Using PMI: ScaleOut hServer™
• Sample Use Cases
• Video Demo
• Comparison to Spark
3. About ScaleOut Software
• Develops and markets In-Memory Data Grids: software middleware for:
  • Scaling application performance and
  • Performing real-time analytics using
  • In-memory data storage and computing
• Dr. William Bain, Founder & CEO
  • Career focused on parallel computing – Bell Labs, Intel, Microsoft
  • 3 prior start-ups, last acquired by Microsoft; product now ships as Network Load Balancing in Windows Server
• Eight years in the market; 400 customers, 9,000 servers
• Sample customers: (customer logos shown in the original slide)
4. What is an In-Memory Data Grid?
In-memory storage for fast updates and retrieval of live data
• Fits in the business logic layer:
  • Follows an object-oriented view of data (vs. a relational view).
  • Stores collections of Java/.NET objects shared by multiple clients.
  • Uses create/read/update/delete and query APIs to access data (see the sketch below).
• Implemented across a cluster of servers or VMs:
  • Scales storage and throughput by adding servers.
  • Provides high availability in case a server fails.
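To make this object-oriented access model concrete, here is a minimal, single-process sketch of the create/read/update/delete and query pattern. GridCache is a hypothetical stand-in for a distributed IMDG client API, not ScaleOut's actual interface; a real grid would partition the store across servers and replicate it for high availability.

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical stand-in for an IMDG client API (illustration only).
// A real grid partitions this store across servers and replicates it;
// here a single ConcurrentHashMap models the shared object collection.
class GridCache<K, V> {
    private final Map<K, V> store = new ConcurrentHashMap<>();

    void create(K key, V value) { store.put(key, value); }   // create
    V read(K key)               { return store.get(key); }   // read
    void update(K key, V value) { store.put(key, value); }   // update
    void delete(K key)          { store.remove(key); }       // delete

    // Parallel query: select objects whose properties match a predicate.
    List<V> query(Predicate<V> spec) {
        return store.values().parallelStream()
                    .filter(spec)
                    .collect(Collectors.toList());
    }
}

public class ImdgSketch {
    // A grid-hosted domain object with object-oriented properties.
    record Stock(String ticker, String sector, double lastPrice) {}

    public static void main(String[] args) {
        GridCache<String, Stock> stocks = new GridCache<>();
        stocks.create("AMZN", new Stock("AMZN", "Tech", 312.0));          // create
        Stock s = stocks.read("AMZN");                                    // read
        stocks.update("AMZN", new Stock(s.ticker(), s.sector(), 315.5));  // update
        stocks.delete("AMZN");                                            // delete
    }
}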
5. Our Focus: Real-Time Analytics
Real-time analytics:
• Live data sets
• Gigabytes to terabytes
• In-memory storage
• Minutes to seconds
• Best uses ("operational intelligence"): tracking live data; immediately identifying trends and capturing opportunities
Batch analytics:
• Static data sets
• Petabytes
• Disk storage
• Hours to minutes
• Best uses ("business intelligence"): analyzing warehoused data; mining for long-term trends
(Figure: the big data analytics spectrum, from a real-time analytics server (ScaleOut hServer) to batch platforms such as Hadoop, IBM, Teradata, SAS, and SAP.)
6. Online Systems Need Real-Time Analysis
A few examples:
• Equity trading: to minimize risk during a trading day
• Ecommerce: to optimize real-time shopping activity
• Reservations systems: to identify issues, reroute, etc.
• Credit cards: to detect fraud in real time
• Smart grids: to optimize power distribution & detect issues
7. Integrate MapReduce into IMDG for Real-Time Analytics
Benefits:
• Enables use of widely used Hadoop MapReduce APIs.
• Accelerates data access by staging data in memory.
• Eliminates batch scheduling and data shuffling overheads of standard Hadoop distributions.
• Analyzes and updates live data.
• Enables Hadoop deployment in live systems.
• Hadoop MapReduce programs run without change.
• ScaleOut's implementation is called ScaleOut hServer™.
8. Data-Parallel Analysis Is Not New
• 1980's: Special Purpose Hardware: "SIMD" (image: Thinking Machines Connection Machine 5)
• 1990's: General Purpose Parallel Supercomputers: "Domain Decomposition", "SPMD" (images: Intel iPSC/2, IBM SP1)
9. Data-Parallel Analysis Is Not New
• 1990's – early 2000's: HPC on Clusters: "MPI" (image: HP blade servers)
• Since 2003: Clusters, the Cloud, and IMDGs: "MapReduce" (image: Amazon EC2, Windows Azure)
10. Parallel Method Invocation
• Basic, well-understood model of data-parallel computation
• Implemented for use on objects hosted in IMDGs:
  • Executes the user's code in parallel across the grid.
  • Uses a parallel query to select objects for analysis.
(Diagram: the in-memory data grid runs the data-parallel analysis: Analyze Data (Eval), then Combine Results (Merge). A sketch of this eval/merge model follows.)
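A minimal sketch of the eval/merge model, assuming a simplified PMI signature (this local rendering is illustrative, not ScaleOut's actual API): eval runs once per selected object (on the server that stores it, in the real grid), and merge combines partial results pairwise into one final result.

import java.util.List;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Simplified model of Parallel Method Invocation (hypothetical API, not
// ScaleOut's actual interface). eval() is applied to each object selected
// by the parallel query; merge() combines partial results pairwise into a
// single result. A parallel stream stands in for the grid's distributed,
// per-server execution.
class Pmi {
    static <T, R> R invoke(List<T> selected,         // objects chosen by parallel query
                           Function<T, R> eval,      // per-object analysis step
                           BinaryOperator<R> merge)  // pairwise combining step
    {
        return selected.parallelStream()
                       .map(eval)
                       .reduce(merge)
                       .orElseThrow(() -> new IllegalStateException("no objects selected"));
    }
}

// Example, reusing the hypothetical Stock sketch above: find the highest
// last price across the selected objects:
//   double top = Pmi.invoke(selected, Stock::lastPrice, Math::max);

Because merge is associative, partial results can be combined first within each grid server and then across servers, which is exactly the two-level merge described in Step 3 below.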
11. Running the Analysis
The parallel analysis executes in three steps:
• Step 1: The application first selects all relevant objects in the collection with a parallel query run on all grid servers.
  • Note: The query spec matches the data's object-oriented properties (illustrated in the snippet below).
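For illustration, reusing the hypothetical GridCache and Stock types from the earlier sketch, a query spec keyed to object properties might look like this:

// Select the stock objects relevant to the analysis; in a real grid each
// server evaluates the predicate against its locally stored objects.
List<ImdgSketch.Stock> selected = stocks.query(
        s -> s.sector().equals("Financials") && s.lastPrice() > 100.0);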
12. Running the Analysis: Step 2
• Step 2: The IMDG automatically schedules analysis operations across all grid servers and cores.
  • The analysis runs on all objects selected by the parallel query.
  • Each grid server analyzes its locally stored objects to minimize data motion.
• Parallel execution ensures fast completion time:
  • The IMDG automatically distributes the workload across servers/cores.
  • Scaling the IMDG automatically handles larger data sets.
12
ScaleOut Software, Inc.
13. Running the Analysis: Step 3
• Step 3: The IMDG automatically merges all analysis results.
  • The IMDG first merges all results within each grid server in parallel.
  • It then merges results across all grid servers to create one combined result.
  • An efficient parallel merge minimizes the delay in combining all results.
  • The IMDG delivers the combined result to the trader’s display as one object.
14. Sample Performance Results for PMI
Optimizing a stock trading platform with real-time analysis:
• IMDG hosted in the Amazon cloud using 75 servers.
• IMDG holds 1 TB of stock history data in memory.
• IMDG handles a continuous stream of updates (1.1 GB/s).
• IMDG performs real-time analysis on live data.
• Entire data set analyzed in 4.1 seconds (250 GB/s).
• IMDG scales linearly as the workload grows.
15. Implementing Real-Time MapReduce
• Goal: run MapReduce applications from a remote workstation.
• The IMDG automatically builds an “invocation grid” of JVMs on the grid’s servers for PMI and ships the application’s JARs.
• The invocation grid can be reused to shorten startup time (see the code in slide 25).
• PMI is used to implement MapReduce.
16. Accelerating MapReduce Execution
PMI is the foundation of fast execution time:
• Data can be input from either the IMDG or an external data source.
  • Works with any input/output format compatible with the Apache distribution.
• The ScaleOut IMDG uses its data-parallel execution engine (PMI) to invoke the mappers and the reducers.
  • Eliminates batch scheduling overhead.
• Intermediate results are stored within the IMDG.
  • Minimizes data motion between the mappers and reducers.
  • Allows optional sorting.
17. Only One-Line Code Change
ScaleOut hServer subclasses the Hadoop Job class, so the same word-count program runs on either engine; only the Job construction changes:

// This job will run using the Hadoop job tracker:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.waitForCompletion(true);
}

// This job will run using ScaleOut hServer:
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new HServerJob(conf, "wordcount"); // the one-line change
    // ...the remaining job configuration is identical to the version above...
    job.waitForCompletion(true);
}
18. Accessing IMDG Data for M/R
• The IMDG adds a grid input format for accessing key/value pairs held in the IMDG.
• MapReduce programs optionally can output results to the IMDG with a grid output format.
• A Grid Record Reader optimizes access to key/value pairs to eliminate network overhead.
• Applications can access and update key/value pairs as operational data during analysis.
(A configuration sketch follows.)
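A sketch of wiring a job to the IMDG’s key/value storage. The deck names a grid input format and a grid output format but not their classes, so GridInputFormat and GridOutputFormat below are hypothetical placeholder names:

// Hypothetical sketch: read the job's input from the IMDG and write its
// results back to the IMDG. GridInputFormat and GridOutputFormat are
// placeholder names, not confirmed hServer classes.
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "gridJob");
job.setInputFormatClass(GridInputFormat.class);   // input: IMDG key/value pairs
job.setOutputFormatClass(GridOutputFormat.class); // output: IMDG key/value pairs
job.waitForCompletion(true);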
19. Optimized In-Memory Storage
Multiple in-memory storage models (see the sketch below):
• Named cache, optimized for rich semantics:
  • Property-based query
  • Distributed locking
  • Access from remote grids
• Named map, optimized for efficient storage and bulk analysis:
  • Highly efficient object storage
  • Pipelined, bulk-access mechanisms
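A brief sketch contrasting the two storage models in Java. NamedCache and CacheFactory appear elsewhere in this deck; NamedMap, NamedMapFactory, and the Stock constructor are assumptions made for illustration:

// Named cache: rich semantics (property-based query, distributed locking).
NamedCache cache = CacheFactory.getCache("Stocks");
cache.put("GOOG", new Stock("GOOG", 1102.50, 1000000));

// Named map: compact storage tuned for bulk, pipelined access.
// NamedMapFactory and its signature are assumed names for illustration.
NamedMap<String, Stock> tickMap = NamedMapFactory.getMap("StockTicks");
tickMap.put("GOOG", new Stock("GOOG", 1102.50, 1000000));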
20. Example: Ecommerce: Inventory Management
Fast map/reduce reconciles the inventory and order systems for an online retailer:
• Challenge: inventory and online order management are handled by different applications.
  • Reconciled once per day.
  • Inaccurate orders reduce margins.
• Solution:
  • Host SKUs in an IMDG updated in real time by the order and inventory systems.
  • Use PMI to reconcile in two minutes (a sketch follows).
• Results: real-time reconciliation ensures accurate orders.
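A minimal sketch of how this reconciliation could be expressed with the PMI pattern shown at the end of this deck (an Invokable with eval and merge methods). The Sku class and its accessors are hypothetical; the deck does not show this code:

// Hypothetical PMI sketch: count SKUs whose inventory and order records
// disagree. Sku, getInventoryCount(), and getOrderSystemCount() are
// illustrative names, not from the deck.
public class SkuReconciliation implements Invokable<Sku, Void, Integer> {
    public Integer eval(Sku sku, Void unused) throws InvokeException {
        // Flag SKUs where the two systems' counts disagree.
        return sku.getInventoryCount() == sku.getOrderSystemCount() ? 0 : 1;
    }
    public Integer merge(Integer first, Integer second) throws InvokeException {
        return first + second; // pair-wise combine discrepancy counts
    }
}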
21. Example in Financial Services
Integrate analysis into a stock trading platform:
• The IMDG holds market data and hedging strategies.
• Updates to market data continuously flow through the IMDG.
• The IMDG performs repeated map/reduce analysis on hedging strategies and alerts traders in real time.
• The IMDG automatically and dynamically scales its throughput to handle new hedging strategies by adding servers.
23. Comparison to Spark
• Spark is intended to accelerate data analysis using in-memory computing.
• ScaleOut’s IMDG provides standard MapReduce for “live” systems.

Feature                 | Spark                      | ScaleOut IMDG
New MapReduce engine    | Yes                        | Yes
In-memory data storage  | Resilient Distr. Datasets  | Distributed Objects
Load/store from HDFS    | Yes                        | Yes
Avoid disk access       | Yes                        | Yes
CRUD on live data       | No                         | Yes
Query on properties     | No                         | Yes
High availability       | Rebuild on failure         | Replication and failover
Extensibility           | Additional operators       | PMI methods
Open source             | Yes                        | Hybrid
24. Summary
• Online systems need to analyze “live” data in real time.
• MapReduce has traditionally focused on analyzing large, static (offline) datasets held in file systems.
• An in-memory data grid (IMDG) can accelerate MapReduce applications, enabling real-time analytics:
  • Enables the application to analyze and update live data.
  • Leverages the IMDG’s load-balanced placement of data.
  • Avoids batch-scheduled startup delays.
  • Avoids data motion from secondary storage.
• MapReduce can be implemented using standard data-parallel computing techniques (“parallel method invocation”):
  • Tightly integrates the Map/Reduce engine with the IMDG.
  • Accelerates Map/Reduce execution by >20X in benchmark tests.
25. Accelerating Start-Up Times
• The invocation grid can be re-used across MapReduce jobs:

public static void main(String[] argv) throws Exception {
    // Configure and load the invocation grid.
    InvocationGrid grid = HServerJob.getInvocationGridBuilder("myGrid")
        // Add JAR files as IG dependencies.
        .addJar("main-job.jar")
        .addJar("first-library.jar")
        // Add classes as IG dependencies.
        .addClass(MyMapper.class)
        .addClass(MyReducer.class)
        // Define custom JVM parameters.
        .setJVMParameters("-Xms512M -Xmx1024M")
        .load();

    // Run 10 jobs on the same invocation grid.
    for (int i = 0; i < 10; i++) {
        Configuration conf = new Configuration();
        // The preloaded invocation grid is passed as a parameter to the job.
        Job job = new HServerJob(conf, "Job number " + i, false, grid);
        // ...configure the job here...
        // Run the job.
        job.waitForCompletion(true);
    }

    // Unload the invocation grid when done.
    grid.unload();
}
26. Targeted Use Cases
Run continuous Hadoop
on live data, while it’s
being updated.
Accelerate Hadoop on
static data with a one
line code change.
Quickly prototype
Hadoop code.
26
“Capture perishable business
opportunities and identify issues.”
Real-time risk
analysis
Credit card fraud
detection
...
“Speed-up Hadoop execution by >10X for
faster business insights.”
Financial
modeling
Process
simulations
...
“Validate your Hadoop code before it
goes into batch processing.”
No need to install
Hadoop stack
ScaleOut Software, Inc.
Fast-turn debug
and tuning
...
27. The Need for Real-Time Analytics
Many use cases:
• Authorizations / Payment Processing / Mobile Payments
• Operational Risk Compliance
• Financial: Risk, P&L, Pricing
• Execution Rules
• Market Feed / Event Handlers
• Churn Management
• Situational Awareness
• Fraud Detection
• Real-Time Tracking
• Sensor Data / SCADA
• Inventory Management
• Service Activation

Across key industries:
• Health Care
• Government
• Life Sciences
• IC / DoD
• Logistics
• Manufacturing
• Utilities
• Retail
• Telco
• Financial
• CPG
• Law Enforcement
28. Problem: Hadoop Cannot Efficiently Perform Real-Time Analytics
• Typically used for very large, static, offline datasets.
• Data must be copied from disk-based storage (e.g., HDFS) into memory for analysis.
• Hadoop Map/Reduce adds lengthy batch scheduling and data shuffling overhead.
29. Hadoop Users Need Real-Time Analytics
• ScaleOut Software conducted an informal survey at the Strata 2013 Conference (Santa Clara).
• Based on 150 responses:
  • 78% of organizations generate fast-changing data.
  • 60% use Hadoop, and 78% plan to expand usage of Hadoop within 12 months.
  • Only 42% consider Hadoop to be an effective platform for real-time analysis, but…
  • 93% would benefit from real-time data analytics.
  • 71% consider a 10X improvement in performance meaningful.
• Take-away: Hadoop users need real-time analytics.
30. Optional Caching of HDFS Data
• ScaleOut hServer adds a Dataset Record Reader (wrapper) to cache HDFS data during program execution.
• Hadoop automatically retrieves the data from the ScaleOut IMDG on subsequent runs.
• The Dataset Record Reader stores and retrieves data with minimal network and memory overhead.
• Tests with the Terasort benchmark have demonstrated 11X faster data access than HDFS without the IMDG.
(A configuration sketch follows.)
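A sketch of enabling this caching wrapper. The deck names the Dataset Record Reader but not its configuration API, so the DatasetInputFormat class and setUnderlyingInputFormat call below are assumptions made for illustration:

// Hypothetical sketch: wrap the job's input format so HDFS reads are
// cached in the IMDG and served from memory on subsequent runs.
// DatasetInputFormat and setUnderlyingInputFormat are assumed names.
Configuration conf = new Configuration();
Job job = new HServerJob(conf, "wordcount");
job.setInputFormatClass(DatasetInputFormat.class);
DatasetInputFormat.setUnderlyingInputFormat(job, TextInputFormat.class);
job.waitForCompletion(true);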
31. Java Example: Parallel Method Invocation
• Create a method to analyze each queried stock object and another method to pair-wise merge the results:

public class StockAnalysis implements Invokable<Stock, StockCalcParams, Double> {
    // Analyze one selected stock: compute its market value.
    public Double eval(Stock stock, StockCalcParams param) throws InvokeException {
        return stock.getPrice() * stock.getTotalShares();
    }
    // Pair-wise combine two partial results into one.
    public Double merge(Double first, Double second) throws InvokeException {
        return first + second;
    }
}
32. Java Example: Parallel Method Invocation
• Run a parallel method invocation on the query results:

NamedCache cache = CacheFactory.getCache("Stocks");
InvokeResult valueOfSelectedStocks =
    cache.invoke(
        StockAnalysis.class,                                   // class with eval/merge methods
        Stock.class,                                           // type of objects to analyze
        or(equal("ticker", "GOOG"), equal("ticker", "ORCL")),  // parallel query filter
        new StockCalcParams());                                // parameters passed to eval
System.out.println("The value of selected stocks is " +
    valueOfSelectedStocks.getResult());