Apache Phoenix with Actor Model (Akka.io) for real-time Big Data Programming Stack
Why do we still need SQL for Big Data?
How can we make Big Data more responsive and faster?
Tech Lead at eClick team - FPT Online
1. What is Big data and Why ?
2. When a standard relational database (Oracle, MySQL, ...) is
not good enough
3. Common problems in big data system
4. Introducing open-source tools in Big Data System
a. Apache Phoenix for ad-hoc query
b. Actor Model and Akka.io for reactive data processing
What Does Big Data Actually Mean?
“Big data means data
that cannot fit easily into
a standard relational database.”
Hal Varian - Chief Economist, Google
When a standard relational database
(Oracle, MySQL, ...) is not good enough
Example: the "analytics system" - a MySQL database at a
startup, tracking all user actions in mobile games on
iOS, Android, ...
Definition from the crowd
“Big data is a term describing the storage
and analysis of large and or complex
data sets using a series of techniques
including, but not limited to: NoSQL,
MapReduce and machine learning.”
Jonathan Stuart Ward and Adam Barker
“Chaotic” fact and the demand
80% of that data is unstructured or “chaotic”
Photos, videos and social media posts - data that says so much
about us - but cannot be analyzed via traditional methods
“Finding order among chaos”
3 common problems in Big Data System
1. Size: the volume of the dataset is a critical factor.
2. Complexity: the structure, behaviour and
permutations of the dataset are a critical factor.
3. Technologies: the tools and techniques
which are used to process a sizable or
complex dataset are a critical factor.
Introducing open-source tools in the Big Data System
Apache Phoenix as SQL ad-hoc query engine
Actor Model (Akka.io) as nano-service
for reactive data
in the dawn of "Fast Data"
Some innovative tools were born
in the dawn of Big Data Age
Interesting features of Apache Phoenix
● Embedded JDBC driver implements the majority of java.sql interfaces,
including the metadata APIs.
● Allows columns to be modeled as a multi-part row key or key/value cells.
● Full query support with predicate push down and optimal scan key formation
● DDL support: CREATE TABLE, DROP TABLE, and ALTER TABLE for adding/removing columns
● Versioned schema repository. Snapshot queries use the schema that was
in place when data was written.
● DML support: UPSERT VALUES for row-by-row insertion, UPSERT
SELECT for mass data transfer between the same or different tables, and
DELETE for deleting rows.
● Limited transaction support through client-side batching.
● Single table only - no joins yet, and secondary indexes are a work in progress
● Follows ANSI SQL standards whenever possible
● Requires HBase v 0.94.2 or above
● 100% Java
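The DDL/DML features above can be sketched in Phoenix SQL. The table and column names here are illustrative, not from the talk; note that Phoenix uses UPSERT instead of INSERT, and that the PRIMARY KEY constraint maps columns onto a multi-part HBase row key:

```sql
-- Columns modeled as a multi-part row key (host + event_date) plus key/value cells
CREATE TABLE game_events (
    host VARCHAR NOT NULL,
    event_date DATE NOT NULL,
    action VARCHAR,
    cnt BIGINT
    CONSTRAINT pk PRIMARY KEY (host, event_date));

-- Row-by-row insertion: Phoenix's UPSERT VALUES
UPSERT INTO game_events VALUES ('ios-01', CURRENT_DATE(), 'click', 1);

-- Predicate push down: the WHERE clause on leading row-key columns
-- becomes an HBase scan key rather than a full-table scan
SELECT action, SUM(cnt)
FROM game_events
WHERE host = 'ios-01'
GROUP BY action;
```

Because Phoenix ships an embedded JDBC driver, these statements can be run from plain java.sql code against the HBase cluster.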
http://youtu.be/TnLiEWglqHk - Google I/O 2014 - The dawn of "Fast Data"
The paper: MillWheel: Fault-Tolerant Stream
Processing at Internet Scale
What is actor model ?
● Carl Hewitt defined the Actor
Model in 1973 as a mathematical
theory that treats "Actors" as the
universal primitives of concurrent computation
● A fitting model for heavily-parallel
processing in a cloud environment
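The core idea - an actor owns its state and processes one message at a time from a mailbox, so no locks are needed - can be sketched in plain Java with stdlib concurrency primitives. This is a minimal illustration of the model, not Akka's actual API; class and method names are invented for the example:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal actor sketch: a mailbox drained by a single dedicated thread,
// so the actor's internal state is never touched concurrently.
public class CounterActorDemo {

    static final class CounterActor {
        private final BlockingQueue<Integer> mailbox = new LinkedBlockingQueue<>();
        private int total = 0;  // touched only by the actor's own thread
        private final Thread worker = new Thread(() -> {
            try {
                while (true) {
                    int msg = mailbox.take();  // block until a message arrives
                    if (msg < 0) break;        // negative message = poison pill
                    total += msg;              // messages processed one at a time
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        CounterActor() { worker.start(); }

        // Fire-and-forget send: callers never touch the actor's state directly
        void tell(int msg) { mailbox.add(msg); }

        int stopAndGet() throws InterruptedException {
            mailbox.add(-1);  // ask the actor to stop
            worker.join();    // join() also makes 'total' safely visible here
            return total;
        }
    }

    static int runDemo() throws InterruptedException {
        CounterActor counter = new CounterActor();
        counter.tell(1);
        counter.tell(2);
        counter.tell(3);
        return counter.stopAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("total = " + runDemo()); // prints "total = 6"
    }
}
```

Akka provides the same message-passing discipline, plus supervision, routing and clustering, which is what makes it fit heavily-parallel "Fast Data" pipelines.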
Using Akka (Rfx) and Apache Phoenix
for Social Media Real-time Analytics
Links for self-study and research
Actor Model and Programming:
Big Data and Data Science
● http://www.mc2ads.com and http://www.mc2ads.org