Lambda Architecture and open source technology stack for real time big data

Concepts & Techniques “Thinking with Lambda”
Case studies in Practice using Lambda architecture

Transcript

  • 1. Lambda Architecture and Open Source Tools for Real-time Big Data ● Concepts & Techniques “Thinking with Lambda” ● Case studies in practice. Trieu Nguyen (http://nguyentantrieu.info or @tantrieuf31), Principal Engineer, eClick Data Analytics team, FPT Online. All contents and thoughts in these slides are my own subjective ideas or compiled from the community.
  • 2. Just a little introduction ● 2008: Java developer; developed a social trading network for a small startup (Yopco) ● 2011: worked at FPT Online as a software engineer on the Banbe project and the RESTful API for the VnExpress mobile app ● 2012: joined Greengar Studios for 6 months, scaling the backend API for mobile games (iOS, Android) ● 2013: back at FPT Online, doing R&D on big data and analytics and developing the new core analytics platform (on the JVM)
  • 3. Contents for this talk ● The lessons from history ● Problems in practice ● What is the Lambda Architecture? ● Why Lambda Architecture for real-time big data? ● Open source technology stack ● Lambda in practice (mobile data and web data) ● Lessons I have learned ● Questions & answers
  • 4. History? Is the best way to predict the future to look at the past and the present?
  • 5. Big data is a buzzword for old problems
  • 6. Explaining Big Data http://www.youtube.com/watch?v=7D1CQ_LOizA
  • 7. Learning ?
  • 8. Working ?
  • 9. Big Data + Old History http://www.youtube.com/watch?v=tp4y-_VoXdA
  • 10. This is Big Data. This is the most valuable thing!
  • 11. “We can't solve problems by using the same kind of thinking we used when we created them.” - Albert Einstein. Think more with Lambda and Reactive.
  • 12. Where Big Data can be used
  • 13. BBC Horizon 2013 The Age of Big Data http://www.youtube.com/watch?v=RE0ITQ7XQjM
  • 14. Google’s mission is to organize the world’s information and make it universally accessible and useful.
  • 15. Organize the world’s information?
  • 16. How did Google scale their search engine ? How does Hadoop really work ?
  • 17. http://stackoverflow.com/questions/6087834/how-scalable-is-mapreduce-in-the-original-functional-languages
  • 18. Trends of now and the future ● MapReduce programming ● Reactive programming ● Functional programming ● Streaming computation => All are just special cases of Lambda
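One way to read the claim on slide 18 is that MapReduce is ordinary function application: a pure "map" function applied to each record, followed by an associative "reduce" function that merges partial results. The following is a minimal sketch in Python, not code from the talk; the function names and the sample log lines are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_phase(line):
    """Map: turn one log line into partial counts (pure, no shared state)."""
    return Counter(line.split())

def reduce_phase(left, right):
    """Reduce: merge two partial counts. Merging is associative,
    so partial results can be combined in any grouping."""
    return left + right

def word_count(lines):
    """The whole job is just reduce(f, map(g, data))."""
    return reduce(reduce_phase, map(map_phase, lines), Counter())

# Hypothetical sample data standing in for tracking logs.
logs = ["click impression click", "impression click"]
counts = word_count(logs)
print(dict(counts))
```

Because `map_phase` touches only its own input, each line could in principle be processed on a different machine; the sketch runs everything in one process for simplicity.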
  • 19. So what is the λ (Lambda) Architecture ?
  • 20. The Lambda Architecture: ● applies the λ (Lambda) philosophy to designing big data systems ● builds on the equation “query = function(all data)”, which is the basis of all data systems ● was proposed by Nathan Marz (http://nathanmarz.com/), a software engineer from Twitter, in his “Big Data” book ● is based on three main design principles: ○ human fault-tolerance – the system must not be susceptible to data loss or data corruption, because at scale such damage could be irreparable. (BUGS?) ○ data immutability – store data in its rawest form, immutable and in perpetuity. (INSERT/SELECT/DELETE but no UPDATE!) ○ recomputation – with the two principles above it is always possible to (re)compute results by running a function on the raw data.
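The equation “query = function(all data)” and the immutability and recomputation principles on slide 20 can be sketched in a few lines of Python. This is an illustration under assumptions, not the author's implementation: the `Event` and `MasterDataset` classes and the `clicks_per_user` query are hypothetical names, and an in-memory list stands in for real storage.

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    user: str
    action: str
    ts: int

class MasterDataset:
    """Append-only store: events are added, never updated in place."""
    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)

    def all_data(self):
        # Hand out an immutable snapshot of everything recorded so far.
        return tuple(self._events)

def clicks_per_user(all_data):
    """query = function(all data): derive a view purely from raw events."""
    counts = {}
    for event in all_data:
        if event.action == "click":
            counts[event.user] = counts.get(event.user, 0) + 1
    return counts

store = MasterDataset()
store.append(Event("u1", "click", 1))
store.append(Event("u2", "impression", 2))
store.append(Event("u1", "click", 3))

# If a bug corrupts a derived view, it can be thrown away and recomputed,
# because the raw events were never modified.
view = clicks_per_user(store.all_data())
print(view)
```

The design choice worth noting is that the view is disposable: only the raw events need to be protected, which is what makes the system tolerant of human error in the query code.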
  • 21. Lambda In Practice 2 case studies from my experiences
  • 22. Case Study 1: Mobile Data Monitor API Backend + System KPI
  • 23. Problem: within “mobile data”, what is the most valuable piece of information?
  • 24. I applied “Lambda” here: the backend system for a mobile app
  • 25. Web vs Mobile App ● Web: Visitors, Visits, Pageviews, Events ● Mobile App: Users, Sessions, Events
  • 26. Metrics: Cause and Effect ● Screen Size => App Design, UI/UX, Usability ● App version => Deployment, Marketing ● Connectivity => Code, User Experience ● Location => Marketing, User Behaviour ● OS => Marketing, Cost, Development ● Memory => User Experience ● Feature Session => How to engage app users
  • 27. The data and its size: not too big for a small startup! Where is the lambda? I used Groovy + GPars (Groovy Parallel Systems) + MongoDB for fast parallel computation (actor model) on statistical data. http://gpars.codehaus.org/ The GPars framework offers Java developers intuitive and safe ways to handle Java or Groovy tasks concurrently. It supports: ● Dataflow concurrency ● Actor programming model ● CSP ● Agent - a thread-safe reference to mutable state ● Concurrent collection processing ● Composable asynchronous functions ● Fork/Join ● STM (Software Transactional Memory)
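The “concurrent collection processing” idea from slide 27 can be illustrated without GPars. The sketch below is a Python stand-in using the standard library's `concurrent.futures`, not the Groovy code used in the project; the session records, the `os` field, and the function names are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_os(sessions):
    """Compute statistics for one partition; reads its input only."""
    return Counter(session["os"] for session in sessions)

def parallel_os_stats(partitions, workers=4):
    """Process each partition concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_os, partitions)
        return sum(partials, Counter())

# Hypothetical session records, split into two partitions.
partitions = [
    [{"os": "iOS"}, {"os": "Android"}],
    [{"os": "iOS"}],
]
stats = parallel_os_stats(partitions)
print(dict(stats))
```

Since each worker only reads its own partition and returns a new `Counter`, no locking is needed; the merge happens in the calling thread. In CPython, threads mainly help with I/O-bound work, so this shows the structure of the computation rather than a speed-up.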
  • 28. Mobile Apps => Backend APIs => Statistics => Find the Trends & Insights?
  • 29. Reactive Data Analytics for Mobile Apps means real-time recommendation by: ➔ context (location, time) ➔ user profile (preferences, level, ...)
  • 30. Big Data on Small Devices: Data Science goes Mobile http://strataconf.com/strata2013/public/schedule/detail/27605
  • 31. Case Study 2: Web Data ● Real-time Data Analytics ● Monitoring Stream Data (Reactive) http://eclick.vn
  • 32. At eClick we must check campaigns in near real time (seconds)! At eClick we have 30~40 GB of logs in the stream and 10~20 GB of bandwidth just for tracking user actions (click, impression, ...) in ONE day! At eClick we have many types of logs (video, web, mobile, system logs, ad campaigns, articles, …)
  • 33. “lambda architecture” proposed by @nathanmarz
  • 34. The open-source lambda architecture at eClick (diagram components: Internet, Netty HTTP Server, TCP Connection, Kafka, Akka Workers, Hadoop Tools, Storm, Redis, KPI Report)
  • 35. The big-data technology stack ● Netty (http://netty.io/): a framework that uses a reactive programming pattern to make scaling HTTP systems easier, by JBoss (http://www.jboss.org) ● Kafka (http://kafka.apache.org/): publish-subscribe messaging rethought as a distributed commit log, open-sourced by LinkedIn ● Storm (http://storm-project.net/): a framework for distributed real-time computation, by Twitter ● Redis (http://redis.io/): an advanced in-memory key-value NoSQL database; all fast statistical computations happen here ● Groovy: scripting layer on the JVM, for ad-hoc queries on Redis ● Hadoop ecosystem (HDFS, Hive, HBase): batch processing ● RxJava (https://github.com/Netflix/RxJava): a library for composing asynchronous and event-based programs ● Hystrix (https://github.com/Netflix/Hystrix): latency and fault tolerance for distributed systems
  • 36. My new ideas for the future: connecting the active functor pattern + reactive programming + stream computation + in-memory computing to: ● make real-time data analytics easier ● build better recommendation systems ● build more profit from big data. More information: ● http://activefunctor.blogspot.com/ (a special case of Lambda that actively searches for the best connections to form an optimal topology), from ideas developed during my internship at DRD with my advisor ● Can a function be persistent (stored as data), distributed in a cluster (cloud), and reactive to the right data (best value in the network)? ● http://www.reactivemanifesto.org/ (reactive pattern)
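The architecture on slide 34 combines a streaming path with a batch path. How a query might merge the two can be sketched as follows. This is a simplified illustration, not the eClick implementation: plain dictionaries stand in for Redis, the class and method names are hypothetical, and the sketch assumes each batch run covers every event seen so far, so the real-time counters can simply be reset.

```python
class LambdaViews:
    """Holds a batch view and a real-time view and merges them at query time."""

    def __init__(self):
        self.batch_view = {}      # recomputed periodically from all raw logs
        self.realtime_view = {}   # incremented as events stream in

    def on_stream_event(self, campaign_id):
        """Speed layer: update the counter as soon as an event arrives."""
        self.realtime_view[campaign_id] = self.realtime_view.get(campaign_id, 0) + 1

    def on_batch_complete(self, recomputed):
        """Batch layer: swap in the recomputed view.
        Assumption: the batch run covered all events seen so far,
        so the real-time increments are now redundant and are dropped."""
        self.batch_view = dict(recomputed)
        self.realtime_view = {}

    def query(self, campaign_id):
        """Serving: the answer is the batch result plus recent increments."""
        return (self.batch_view.get(campaign_id, 0)
                + self.realtime_view.get(campaign_id, 0))

views = LambdaViews()
views.on_stream_event("c1")
views.on_stream_event("c1")
before_batch = views.query("c1")

views.on_batch_complete({"c1": 2})   # batch recomputation catches up
after_batch = views.query("c1")

views.on_stream_event("c1")          # a new click arrives after the batch run
latest = views.query("c1")
print(before_batch, after_batch, latest)
```

A production system would need to track which events a batch run actually covered rather than resetting the real-time view wholesale; that bookkeeping is left out here to keep the merge logic visible.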
  • 37. Lessons: what I have learned from Lambda and the Big Data world
  • 38. What I have learned ● Study lambda and read some books ● Ask questions => analytics => profit & value ● Collect any data you can and learn what is inside! ● Implement it! Use the right tools for the right jobs. ● Turn your data into things everyone can "look & feel"
  • 39. read papers
  • 40. Study the “lambda”: I studied Haskell in 2007 with Dr. Peter Gammie (http://peteg.org/) during an internship at DRD (a non-profit organization). ● Imperative programs are always vulnerable to data races because they contain mutable variables. ● Purely functional languages have no data races because they have no mutable variables.
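The point on slide 40 about mutable state can be illustrated, loosely, in Python. Python does not enforce purity the way Haskell does, so this is only a sketch of the style: an immutable tuple is shared between threads and each thread runs a function that writes no shared state. The names and sample data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def total_clicks(events):
    """Pure function: reads an immutable tuple and writes no shared state."""
    return sum(1 for kind in events if kind == "click")

# A tuple cannot be modified in place, so sharing it across threads is safe.
events = ("click", "impression", "click")

with ThreadPoolExecutor(max_workers=4) as pool:
    # Eight concurrent calls over the same data; there is nothing to race on.
    results = list(pool.map(total_clicks, [events] * 8))

print(results)
```

Every call returns the same answer regardless of scheduling, because no call can observe another call's writes.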
  • 41. Reading some books
  • 42. Improve your business knowledge ! => read the Behavioral Economics Books http://www.goodreads.com/shelf/show/behavioral-economics
  • 43. Collect the data ?
  • 44. Using your imagination matters more than just the knowledge you have
  • 45. Think more about Butterfly Effect!
  • 46. “Logic will get you from A to Z; imagination will get you everywhere.” - Albert Einstein. Use your imagination with data analytics, not just logic. Learn data visualization.
  • 47. Questions & Answers. The link to these slides is here: ● http://nguyentantrieu.info/blog/lambda-architecture-and-open-source-tools-for-real-time-big-data/ More useful resources: ● http://nguyentantrieu.info/blog ● http://www.mc2ads.com