A big-data architecture for real-time analytics

Tao Zhong
Kshitij A. Doshi
Xi Tang
Ting Lou
Zhongyan Lu
Hong Li
Presented by: Raminder Kaur
Wayne State University

• Introduction
• Motivation and Background
• Architecture
• Framework
• Results
• Future work
• Conclusion
• Index terms
• References

This paper describes:
• A few key additional requirements that arise from having to support
in-memory processing of data while updates proceed concurrently
• RAF (Real-time Analytics Foundation)
• Two RAF-based solutions (discussed later)

A few examples of information in motion, which may be just seconds old and
not yet well categorized or linked to other data:
- GPS-based navigation: to reduce wasted energy, accidents, delays, and
emergencies
- A credit card company: to detect and intercept suspicious transactions
- A metropolitan or regional power grid: to modulate power generation,
perform load balancing, direct repair actions, and take policy enforcement
steps
• An essential feature of the above examples is the need to integrate new
transactions into analysis results within a very short time, sometimes as
short as a few tens of milliseconds.

RDDs make in-memory solutions less failure-prone. RAF enhances the RDD
approach so that resiliency is blended with a few additional characteristics,
listed below:
• Efficient allocation and control of memory resources
• Resilient update of information at much finer resolution
• Flexible and highly efficient concurrency control
• Replication and partitioning of data transparent to clients
Architecturally, RAF elevates memory across an entire cluster to a first-class
storage entity and defines high-level mechanisms by which applications on RAF
can orchestrate distributed actions upon objects stored in cluster memory.
To promote responsible and transparent use of memory, RAF opts for a
programming language such as C or C++ over mixed-language environments in
which garbage collection is opaque.

Data has a lot of value when mined. As data continues to compound at brisk
rates, institutions need to grapple with two broad demands:
• accumulating, processing, synopsizing, and utilizing information in a
timely manner
• storing the refined data resiliently while keeping it accessible at high
speed
The term Big Data itself is elastic and serves well as a description of the scale
or volume of these solutions, but it does not define a constraining principle for
organizing storage.

Requirements for low-latency, high-throughput analytics on
datasets:
• In-memory structures and storage
• Resiliency
• Sharing data through memory
• Uniform interaction with storage
• Minimizing memory recycling
• Efficient integration of CRUD
• Synchronizing efficiently
• Searching efficiently

Translation of the eight requirements into five design elements:
• C- and C++-based programming for efficient sharing of data
through memory
• Resilient storing of new content
• Efficient concurrency
• Processing information in motion
• Fast, general, ad hoc searches

• This framework targets the execution of complex queries at
very low latency.
• The information on which queries operate may be available on
some storage medium or generated dynamically as a result of
ongoing transactional activity.
• RAF provides a distributed computing environment that is
integrated with a memory-centric, distributed storage system,
in which one application can pass data to another so that the
data is shared in memory.

• RDD: stores information in the memory of one or more machines in such a
way that, if one or more machines fail, the RDD can be reconstructed.
• Transformations: operations on an RDD that generate new datasets. RAF
transformations include join, map, and union.
• Filter: a particular type of transformation that produces a dataset whose
contents satisfy a specified condition.
• Delegate: a bridging module. Its purpose is to create a version of the
datastore at a particular time and present it as a memory-resident RDD.
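
The four terms above can be illustrated with a small, single-process sketch. It is not RAF's API: the names ToyRDD, ToyDelegate, snapshot, and lose_memory are hypothetical, and the "failure" is simulated by discarding a cached result. The sketch only shows the ideas of lineage-based reconstruction, transformations that produce new datasets, and a delegate that presents a point-in-time version of a mutable datastore as an RDD.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Immutable dataset that remembers how to rebuild itself (its lineage)."""

    def __init__(self, compute: Callable[[], List]):
        self._compute = compute              # lineage: recipe for the contents
        self._cache: Optional[List] = None   # in-memory copy, filled lazily

    def collect(self) -> List:
        if self._cache is None:              # rebuild from lineage if needed
            self._cache = self._compute()
        return list(self._cache)

    def lose_memory(self) -> None:
        """Simulate a failure that drops the in-memory copy."""
        self._cache = None

    # Transformations return new datasets; the parent is never modified.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def union(self, other: "ToyRDD") -> "ToyRDD":
        return ToyRDD(lambda: self.collect() + other.collect())


class ToyDelegate:
    """Bridges a mutable datastore to RDDs via point-in-time versions."""

    def __init__(self, datastore: Dict[str, int]):
        self._datastore = datastore

    def snapshot(self) -> ToyRDD:
        frozen = sorted(self._datastore.items())   # copy taken at call time
        return ToyRDD(lambda: list(frozen))


# Hypothetical data: per-subscriber usage counters that keep changing.
store = {"alice": 120, "bob": 45}
delegate = ToyDelegate(store)
v1 = delegate.snapshot()
store["carol"] = 300                 # update arrives after the first snapshot
v2 = delegate.snapshot()

heavy = v2.filter(lambda kv: kv[1] > 100).map(lambda kv: kv[0])
first = heavy.collect()
heavy.lose_memory()                  # drop the cached result
rebuilt = heavy.collect()            # recomputed from lineage
```

Here v1 still reflects the datastore before the update, while v2 and everything derived from it include the new entry. A real system would keep the snapshot itself resilient across machines, which this sketch does not attempt.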

• Efficient storage sharing using Delegate
• Memory-centric storage operation
- Reliability
• Data and storage types
- Structured data
- Storage types (replicated store and partitioned store)
• Distributed execution of analytics tasks
- Analytics task interface
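
The difference between the two storage types can be sketched as follows. This is an illustrative model only, not RAF's implementation: the class names, the put/get interface, and the use of CRC32 to choose an owner node are all assumptions made for the example. Both stores expose the same interface, in the spirit of keeping replication and partitioning transparent to clients.

```python
import zlib
from typing import Dict, List, Optional


class PartitionedStore:
    """Each key lives on exactly one node, chosen by hashing the key."""

    def __init__(self, num_nodes: int):
        self.nodes: List[Dict[str, str]] = [{} for _ in range(num_nodes)]

    def _owner(self, key: str) -> int:
        # crc32 is stable across runs, unlike Python's built-in hash() for str
        return zlib.crc32(key.encode("utf-8")) % len(self.nodes)

    def put(self, key: str, value: str) -> None:
        self.nodes[self._owner(key)][key] = value

    def get(self, key: str) -> Optional[str]:
        return self.nodes[self._owner(key)].get(key)


class ReplicatedStore:
    """Every node holds a full copy, so any node can serve a read."""

    def __init__(self, num_nodes: int):
        self.nodes: List[Dict[str, str]] = [{} for _ in range(num_nodes)]

    def put(self, key: str, value: str) -> None:
        for node in self.nodes:          # write to all replicas
            node[key] = value

    def get(self, key: str, node_id: int = 0) -> Optional[str]:
        return self.nodes[node_id].get(key)


# Hypothetical subscriber records, stored both ways on three nodes.
partitioned = PartitionedStore(num_nodes=3)
replicated = ReplicatedStore(num_nodes=3)
for k, v in [("sub-1", "prepaid"), ("sub-2", "postpaid"), ("sub-3", "prepaid")]:
    partitioned.put(k, v)
    replicated.put(k, v)

copies_partitioned = sum(len(n) for n in partitioned.nodes)
copies_replicated = sum(len(n) for n in replicated.nodes)
```

The partitioned store keeps one copy of each record, so capacity grows with the number of nodes; the replicated store keeps one copy per node, trading memory for local reads everywhere.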

• Unit testing:
- Scalability testing results (how well update operations scale)
- Latency relative to Hive/HDFS (how long it takes to
complete a query)
NOTE: These unit test results show the advantage of RAF's in-memory,
distributed-processing-oriented design.
• Solution-level implementation and testing:
- Telecommunications subscriber management
- Safe City solution

• Motivated by the high degree of familiarity that many developers have
with database interfaces, we are incrementally introducing
SQL-92/JDBC/ODBC-like interfaces on top of RAF. A number of optimizations
are also being added.
These optimizations include:
• application-requested indexing, to accelerate searches
• blending in column-store capabilities where appropriate (for example, for
rarely written data)
• compression, to reduce the data transported between nodes
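
Two of these optimizations, indexing and compression, can be shown in a small sketch. It is not taken from RAF: the record layout, the build_index helper, and the choice of JSON plus zlib for the transported payload are assumptions for illustration only.

```python
import json
import zlib
from collections import defaultdict
from typing import Dict, List

# Hypothetical in-memory records.
records = [
    {"id": i, "city": "Detroit" if i % 4 == 0 else "Chicago", "plan": "prepaid"}
    for i in range(1000)
]


def scan(rows: List[dict], field: str, value) -> List[dict]:
    """Unindexed search: touches every row."""
    return [r for r in rows if r[field] == value]


def build_index(rows: List[dict], field: str) -> Dict[object, List[int]]:
    """Index built on request: maps each field value to row positions."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[field]].append(pos)
    return index


city_index = build_index(records, "city")
via_index = [records[pos] for pos in city_index["Detroit"]]   # direct lookup
via_scan = scan(records, "city", "Detroit")                   # full pass

# Compress a result set before sending it to another node.
payload = json.dumps(via_index).encode("utf-8")
compressed = zlib.compress(payload)
restored = json.loads(zlib.decompress(compressed).decode("utf-8"))
```

The index returns the same rows as the scan without visiting non-matching records, and the compressed payload decompresses back to the identical result on the receiving side.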

• Discussed RAF, an architectural approach that meshes memory-centric,
non-relational query processing for low-latency analytics with
memory-centric update processing to accommodate high volumes of updates.
• Introduced Delegate, which participates as a special type of content
transformer in a hierarchy of RDD transformations.
• In RAF, protocol buffers are used for data abstraction and efficient
conveyance among applications, giving applications a high degree of
independence in the location, representation, and transmission of data.
• Described a lightweight but expressive interface for RAF.
• Using unit tests, showed high cluster scaling capability for transactions and
an order-of-magnitude latency improvement for query processing.
• Discussed two real-world usage scenarios in which RAF is being used.

• RDD: Resilient Distributed Dataset
• RAF: Real-time Analytics Foundation
• CRUD: Create/Retrieve/Update/Delete
• HDFS: Hadoop Distributed File System

• Apache Hadoop: http://hadoop.apache.org/
• Apache HBase: http://hbase.apache.org/
• Memcached: http://www.memcached.org/
• Oracle Coherence: http://www.oracle.com/technetwork/middleware/coherence/
• H. Plattner, A. Zeier, In-Memory Data Management.
• Protobuf: http://code.google.com/p/protobuf/
• Redis: http://www.redis.io/
• SQLStream: http://www.sqlstream.com/
• Vertica: http://www.vertica.com/
• VoltDB: http://www.voltdb.com
