Lambda Architecture
in Real-time Big Data
● Concepts & Techniques “Thinking with Lambda”
● Case studies in Practice
Trieu Nguyen - http://nguyentantrieu.info/blog or @tantrieuf31
Lead Engineer, eClick Data Analytics team at FPT Online
All content and opinions in these slides are my own, compiled with material
from open source communities
Just a little introduction
● 2008: Java developer, built a Social
Trading Network for a small startup (Yopco)
● 2011: software engineer at FPT Online
on the Banbe project, RESTful API for the VnExpress
mobile app
● 2012: joined Greengar Studios for 6 months,
scaling the backend API for mobile games (iOS, Android)
● 2013: back at FPT Online, R&D on Big Data
& Analytics, developing the new core
Analytics Platform (on the JVM)
Stupid questions
● Does Big Data just mean storing big logs?
● I just installed Hadoop, and it works! Do we
really have a big data solution now?
● We have lots of data, so let’s play with cool big
data technologies x, y, z! Do we get any profit
from that?
● We can hire or outsource a professional
team to build a big data solution, but can they
answer what problem we actually have?
Contents for this talk
● A little introduction about Lambda in history
● Trends of Now and the Future
● Why is the lambda architecture a correct solution
for big data?
● Lambda in practice: case studies from
Greengar Studios and eClick
● Lessons
● Questions & Answers
History
Is the best way to predict the future
to look at the past and the present?
Lambda is the symbol that denotes:
● The Half-Life game?
● An anonymous function, aka a closure?
● Functional computation/programming?
● A scalable system?
http://en.wikipedia.org/wiki/Lambda
When did I study “lambda”?
I studied Haskell in 2007 with Dr. Peter Gammie (http://peteg.org/) during an
internship at DRD (a non-profit organization).
● Imperative programs will always be vulnerable to data races because
they contain mutable variables.
● There are no data races in purely functional languages because they
don't have mutable variables.
http://stackoverflow.com/questions/6087834/how-scalable-is-mapreduce-in-the-original-functional-languages
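The two points above are what make the map/reduce style easy to parallelize. As a minimal sketch (not taken from the linked discussion, and not how Hadoop itself is implemented), the word count below uses two pure functions; `map_chunk`, `merge_counts` and `word_count` are hypothetical names chosen for illustration. Because neither function touches shared mutable state, the map step can run on several workers without locks.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Pure map step: count words in one chunk without touching shared state."""
    return Counter(word for line in lines for word in line.lower().split())


def merge_counts(left, right):
    """Pure reduce step: combine two partial counts into a new Counter."""
    return left + right


def word_count(chunks, workers=4):
    """Map every chunk in parallel, then fold the partial results together."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    chunks = [["big data big"], ["data lambda"]]
    print(word_count(chunks))
```

Each worker returns its own partial result, and the partial results are only combined afterwards, so there is nothing for two workers to race on.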
How did Google scale their search engine ?
How does Hadoop really work ?
Closures in JavaScript,
running on billions of websites!
λ (Lambda) is everywhere!
Trends of Now and the Future
● Big Data
● Data Analytics
● Reactive Programming
● Functional Programming
● Streaming Computation
=> All are just special cases of Lambda
Question:
Are mobile apps
generating more
data than the
traditional web?
Question:
Is it open source big data solutions like Hadoop that make big
data more popular with enterprises and startups?
2008: Cloudera, a big-data startup, was founded!
What is the λ
(Lambda)
Architecture ?
The Lambda Architecture:
● applies the (λ) Lambda philosophy to the design of big data
systems
● is built on the equation “query = function(all data)”, which is the basis
of all data systems
● was proposed by Nathan Marz (http://nathanmarz.com/), a
software engineer from Twitter, in his “Big Data” book
● is based on three main design principles:
○ human fault-tolerance – the system must not lose or corrupt data when
people make mistakes (BUGS?), because at scale the damage could be irreparable.
○ data immutability – store data in its rawest form, immutable and in
perpetuity. (INSERT/SELECT/DELETE but no UPDATE!)
○ recomputation – with the two principles above it is always possible to
(re)compute results by running a function on the raw data.
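To make the three principles concrete, here is a minimal in-memory sketch, not the design from the “Big Data” book or the eClick system. All names (`Event`, `MasterLog`, `count_clicks`, `batch_view`, `realtime_view`, `query`) are hypothetical. It assumes the usual split into a batch view computed up to some horizon and a real-time view for everything after it; both are produced by the same pure function over immutable raw events, so any view can be thrown away and recomputed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable raw fact, e.g. a click on an ad campaign."""
    timestamp: int
    campaign: str
    action: str


class MasterLog:
    """Append-only store: events are inserted and read, never updated."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def all_events(self):
        # Return a tuple so callers cannot modify the stored events.
        return tuple(self._events)


def count_clicks(events):
    """The 'function' in query = function(all data): clicks per campaign."""
    return Counter(e.campaign for e in events if e.action == "click")


def batch_view(log, horizon):
    """Batch side: recompute from raw events with timestamp < horizon."""
    return count_clicks(e for e in log.all_events() if e.timestamp < horizon)


def realtime_view(log, horizon):
    """Real-time side: cover the events the batch view has not seen yet."""
    return count_clicks(e for e in log.all_events() if e.timestamp >= horizon)


def query(log, horizon, campaign):
    """Answer a query by merging the batch and real-time views."""
    return batch_view(log, horizon)[campaign] + realtime_view(log, horizon)[campaign]


if __name__ == "__main__":
    log = MasterLog()
    log.append(Event(1, "c1", "click"))
    log.append(Event(5, "c1", "click"))
    print(query(log, horizon=3, campaign="c1"))
```

If a bug is found in `count_clicks`, the fix is to correct the function and rerun it over the raw log; the raw events themselves were never changed, which is where the human fault-tolerance comes from.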
“lambda architecture”
proposed by @nathanmarz
We, at FPT Online, have applied
the lambda architecture since
April, 2013
Lambda In Practice
Two case studies from my experience
Case Study 1:
Greengar Studios
API Backend Monitor + Statistics
http://www.greengar.com/
Backend System at Greengar Studio
I applied “Lambda” here
The data and its size: not too big for a small
startup!
Where is the lambda ?
I used Groovy + GPars (Groovy Parallel Systems) + MongoDB for fast
parallel computation (actor model) on statistical data
http://gpars.codehaus.org/
The GPars framework offers Java developers intuitive and safe ways to handle
Java or Groovy tasks concurrently.
It supports:
● Dataflow concurrency
● Actor programming model
● CSP
● Agent - a thread-safe reference to mutable state
● Concurrent collection processing
● Composable asynchronous functions
● Fork/Join
● STM (Software Transactional Memory)
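The actor model mentioned above can be illustrated without GPars. The sketch below is a plain Python analogue using only the standard library; it is not the GPars API and not the Greengar code. `StatsActor` is a hypothetical name. The actor owns its counters privately, and other threads can only send it messages through a mailbox, so no lock is needed around the statistics.

```python
import queue
import threading


class StatsActor:
    """Owns its counters privately; other threads only send it messages."""

    _STOP = object()  # sentinel message that ends the actor loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self._counts = {}
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, endpoint):
        """Report one API call; safe to call from any thread."""
        self._mailbox.put(endpoint)

    def _run(self):
        # Only this thread ever reads or writes self._counts.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self._counts[message] = self._counts.get(message, 0) + 1

    def stop(self):
        """Process everything already queued, then return a copy of the counts."""
        self._mailbox.put(self._STOP)
        self._thread.join()
        return dict(self._counts)


if __name__ == "__main__":
    actor = StatsActor()
    for _ in range(3):
        actor.send("/login")
    print(actor.stop())
```

The design choice is the same one the slides argue for: mutable state is confined to one place, and everything else communicates by passing values.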
Mobile Apps => Backend APIs =>
Statistics => Find the Trends & Insights?
Case Study 2:
eClick Ad-Network
● Real-time Data Analytics
● Monitoring Stream Data (Reactive)
http://eclick.vn
At eClick we have
● 30~40 GB of streaming logs
● 10~20 GB of bandwidth
just for tracking user actions (click, impression, ...)
in ONE day!
At eClick we must
check campaigns in
near-real-time
(seconds)!
At eClick we have many types of logs (video, web,
mobile, system logs, ad-campaign, articles, ...)
Our big-data system
Leveraging Open Source Projects
● Netty (http://netty.io/): a framework using the reactive programming
pattern, which makes scaling HTTP systems easier
● Kafka (http://kafka.apache.org/): publish-subscribe messaging
rethought as a distributed commit log
● Storm (http://storm-project.net/): a distributed
realtime computation system
● Redis (http://redis.io/): an advanced in-memory key-value NoSQL
database; all fast statistical computations happen here
● Groovy: the scripting layer, for dynamic queries on Redis + RDBMSs
● Hadoop ecosystem: HDFS, Hive, HBase for batch processing
● RxJava (https://github.com/Netflix/RxJava): a library for
composing asynchronous and event-based programs
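As a rough illustration of the "fast statistical computations" step, the sketch below keeps per-second counters in an in-memory dictionary that stands in for Redis; it does not use Kafka, Storm or a Redis client, and it is not the eClick implementation. `WindowedCounters` and `click_through_rate` are hypothetical names, and the key layout (campaign, action, second) is an assumption made for this example.

```python
from collections import defaultdict


class WindowedCounters:
    """In-memory stand-in for counters keyed by (campaign, action, second)."""

    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, timestamp, campaign, action):
        """Called once per tracked event as it arrives from the stream."""
        self._counts[(campaign, action, int(timestamp))] += 1

    def total(self, campaign, action, start, end):
        """Sum the counters for every second in the range [start, end)."""
        return sum(
            self._counts.get((campaign, action, second), 0)
            for second in range(start, end)
        )


def click_through_rate(counters, campaign, start, end):
    """Clicks divided by impressions in the window, or 0.0 with no impressions."""
    impressions = counters.total(campaign, "impression", start, end)
    clicks = counters.total(campaign, "click", start, end)
    return clicks / impressions if impressions else 0.0


if __name__ == "__main__":
    counters = WindowedCounters()
    counters.record(100.2, "summer-sale", "impression")
    counters.record(100.9, "summer-sale", "click")
    print(click_through_rate(counters, "summer-sale", 100, 101))
```

Because every event only increments a small counter, a campaign can be checked over the last few seconds by summing a handful of keys instead of scanning raw logs.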
Some new ideas for the future:
Connecting the active functor pattern + reactive programming
+ stream computation + in-memory computing to make:
● real-time data analytics easier
● better recommendation systems
● more profitable big data solutions
More Information:
● http://activefunctor.blogspot.com/ (a special case of Lambda
that actively searches for the best connections to form an optimal
topology), based on ideas from my internship at DRD with my
advisor.
● Can a function be persistent (stored as data), distributed in
a cluster (cloud), and reactive to the right data (the best value in
the network)?
“We can't solve problems
by using the same kind of
thinking we used when we
created them.”
- Albert Einstein
Think more Lambda and Reactive
How could we see the "user interest graph" in our users' database?
● Social Graph
=> Keep existing connections
● Interest Graph
=> Make new connections
=> recommendation platform
Source: http://en.wikipedia.org/wiki/Interest_graph
Lessons
What I have learned from Lambda and the Big
Data world
What I have learned
● Keep it as simple as possible, but no simpler!
● Ask the right questions => deep analytics => profit
● Reactive and Lambda for your data products
● Implement it! Just use the right tools for the right jobs.
● Turn your data into things everyone can
"look & feel"
How to build profitable big data solutions?
=> read these Behavioral Economics Books
http://www.goodreads.com/shelf/show/behavioral-economics
Stay focused, keep innovating
Big Data is not profitable if you do not know
what you want and do not ask the right questions
“Logic will get you from A to Z;
imagination will get you
everywhere.” - Albert Einstein
Use your imagination with data analytics, not
just logic