Building a modern end-to-end open source Big Data reference application

In this talk, Edgar Orendain walks through a modern real-time streaming application serving as a reference framework for developing a big data pipeline, complete with a broad range of use cases and powerful reusable core components.

Modern applications can ingest data and apply analytics in real time. These analytics rely on machine learning models that are typically built from historical big data. This reference application shows how to connect data-in-motion analytics to an application built on big data.
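
As a rough illustration of that split between historical training and real-time scoring, here is a minimal sketch in Python. It is not the model used in the talk: the "model" is just the mean and standard deviation of past truck speeds, and the names `fit_speed_model`, `score`, and the sample readings are hypothetical.

```python
from statistics import mean, stdev


def fit_speed_model(historical_speeds):
    """Summarise historical readings as (mean, sample standard deviation)."""
    return mean(historical_speeds), stdev(historical_speeds)


def score(model, speed, threshold=3.0):
    """Return True when a live reading lies more than `threshold`
    standard deviations away from the historical mean."""
    mu, sigma = model
    return abs(speed - mu) > threshold * sigma


# Hypothetical historical data (batch side).
historical = [58, 60, 62, 61, 59, 60, 63, 57]
model = fit_speed_model(historical)

# Hypothetical live readings (streaming side), scored one at a time.
live_events = [61, 95, 59]
flags = [score(model, s) for s in live_events]
print(flags)
```

In a real pipeline the fitted model would be trained offline on a much larger dataset and loaded by the stream processor; only the scoring step runs per event.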

We review code, best practices and considerations involved when integrating different components into a complete data platform. From IoT sensor data collection, to flow management, real-time stream processing and analytics, through to machine learning and prediction, this reference project aims to help developers seed their own open source solutions – fast.
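
The stages listed above can be sketched end to end in a few lines of Python. This is only an illustration under stated assumptions: the event shapes, the `traffic` event type, and the functions `route` and `average_speed_per_truck` are hypothetical stand-ins for the flow-management and stream-processing components, not code from the reference project.

```python
from collections import defaultdict


def route(events):
    """Split a mixed stream by event type, loosely mimicking what a
    flow-management layer does before data reaches downstream services."""
    routed = defaultdict(list)
    for event in events:
        routed[event["type"]].append(event)
    return routed


def average_speed_per_truck(truck_events):
    """Aggregate one batch (window) of truck events into a per-truck average speed."""
    totals = defaultdict(lambda: [0.0, 0])  # truck_id -> [speed sum, count]
    for event in truck_events:
        totals[event["truck_id"]][0] += event["speed"]
        totals[event["truck_id"]][1] += 1
    return {truck_id: total / count for truck_id, (total, count) in totals.items()}


# Hypothetical sensor readings arriving on one mixed stream.
events = [
    {"type": "truck", "truck_id": 1, "speed": 60},
    {"type": "traffic", "route_id": 7, "congestion": 40},
    {"type": "truck", "truck_id": 1, "speed": 70},
    {"type": "truck", "truck_id": 2, "speed": 55},
]

routed = route(events)
averages = average_speed_per_truck(routed["truck"])
print(averages)
```

Keeping routing and aggregation as separate functions mirrors the separation of concerns the talk argues for: each stage can be swapped out without touching the others.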

Published in: Technology

  1. Building a Modern End-to-End Open Source Big Data Reference Application
     Edgar Orendain, Hortonworks (UC Berkeley, Computer Science)
     @edgarorendain, eorendain@hortonworks.com
     © Hortonworks Inc. 2011 – 2017 All Rights Reserved
  2. Data Everywhere
     • Data is now more real, rich and relevant
     • Everyday gadgets can now sense, communicate and compute
     • 25 billion IoT devices by 2020, not including phones/tablets/computers (source: Gartner, IHS, Ericsson)
     bit.ly/truckingiot
  3. Seriously Big Data
     • Internet of Anything: 4 ZB data; 44 ZB data by 2020
  4. Real, Complete, Modern Use-Case: Vehicle Dispatch Company
     Problem:
     • Cargo needs to be delivered safely yet quickly
     • Recent accidents have meant higher insurance premiums
     • Service is taking a hit; competition is heating up
     • Changes should be driven by real data
     Solution:
     • Leverage real-time data: sensors on trucks
     • Analytical processing / visualization for actionable insights
     • Machine learning using real and historical information
     • Many endpoints, must route streams intelligently and reliably
     • Not cost-prohibitive, open source
  5. Existing Resources
     • Code samples are isolated
     • No big picture view
     • No swappable components
     • Tech still relevant or supported?
  6. What If A Reference Application ...
     • Built on top of true open source platforms
     • Fully sourced and supported
     • Followed a single use-case, end-to-end
     • Documented with tutorials and demos
     • Followed best practices
     • Evolved over its lifetime
  7. Demo
  8. Architecture Overview
  9. Apache NiFi
     • Why? … Because routing is not trivial
     • Dataflow management system
     • Visual flow building and templating support
     • Guaranteed content delivery with buffering
     • Reusable custom processors and services
     • My custom processors: < 30 lines of code
  10. Schema Registry
      • Why? Services need to know what’s flowing through them
      • A centralized repository for storing evolving schemas and their readers/writers
      • Accessible by all services
      • Allows services to match schema impedances on their own
  11. Checkpoint - Recap
      • NiFi handles intelligent routing and filtering of our dataflow – no matter the source (devices, sensors, filesystems, logs)
      • Kafka reliably streams data between our services
      • Now we choose our distributed computing system … ¯_(⊙︿⊙)_/¯
      • Typically, lots of low-level coding
      • Boilerplate, dependency issues
      • Development cycle not always fun
  12. Streaming Analytics Manager (SAM)
      • Visual stream builder
      • Open source (the only visual stream app builder on the market)
      • Unifies underlying streaming engines
      • As developers we can focus on building apps
  13. Web Application
      Tech Stack:
      • Play Framework 2.5.x (Scala)
      • Angular 2 (ScalaJS 0.6.14)
      • ReactiveKafka (Scala/Java)
      • WebSockets
      • 447KB (50MB uber jar)
  14. Apache Superset
      • Data exploration platform
      • Perform analytics and create visualizations
      • Act on both real-time and historical data
      • Integration with Druid, a high-performance distributed data store
      • Integration built-in for different data sources
  15. Architecture Recap
  16. Some Typical Challenges
      • Services with conflicting requirements
      • Necessary packages, tools, or OS environment changes
      • Continuous integration is not trivial
      • Better approaches:
          • CI service or automation scripts (can be tedious)
          • Containerized environments (Hortonworks Sandbox)
  17. A Challenger Appears! Docker on YARN
      • Bundle applications, including services and tools, into a Docker image
      • Currently in a branch of the Hadoop 3 repo
      • Containers can be self-contained or wire up to form an assembly
      • YARN runs the job as it does any other Hadoop job
      • Faster development cycles
  18. Docker on YARN Demo
  19. What’s Left and What’s Next?
      • Plug and play, different streaming engines
      • Predictive analytics using ML with Spark and real-time scoring
      • Securing an application from end-to-end
      • Tons of code/documentation to check out
  20. THANKS!
      • Code/docs/readmes: bit.ly/truckingiot --> github.com/orendain/trucking-iot
      • Big/Fast/Real-Time Data Tutorials: hortonworks.com/tutorials
      • Ready-to-deploy sandboxes: hortonworks.com/products/sandbox
      Other Talks:
      • Crash Course: Streaming Analytics
      • Thu 12:20pm, 210C: Running a Container Cloud on YARN
      • Thu 3:00pm, 211: Yahoo Moving beyond running 100% of Apache Pig jobs on Apache Tez
      Edgar Orendain, Hortonworks / UC Berkeley, @edgarorendain
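
The Schema Registry slide describes a central store of evolving schemas that lets services reconcile schema differences on their own. A minimal sketch of that idea follows. It is an in-memory toy, not the API of the Schema Registry used in the talk: the class `InMemorySchemaRegistry`, the function `read_as`, and the `truck_event` schema are all hypothetical, and filling missing fields from reader defaults is an assumption loosely modelled on Avro-style schema resolution.

```python
class InMemorySchemaRegistry:
    """Toy registry: each schema name maps to an ordered list of versions.
    A schema here is simply a dict of field name -> default value."""

    def __init__(self):
        self._versions = {}

    def register(self, name, fields):
        """Store a new version and return its 1-based version number."""
        self._versions.setdefault(name, []).append(dict(fields))
        return len(self._versions[name])

    def get(self, name, version=None):
        """Return the requested version, or the latest when version is None."""
        versions = self._versions[name]
        return versions[-1] if version is None else versions[version - 1]


def read_as(record, reader_schema):
    """Project a record onto the reader's schema: unknown fields are
    dropped and missing fields take the reader's default."""
    return {field: record.get(field, default)
            for field, default in reader_schema.items()}


registry = InMemorySchemaRegistry()
v1 = registry.register("truck_event", {"truck_id": None, "speed": 0})
v2 = registry.register("truck_event", {"truck_id": None, "speed": 0, "route_id": -1})

# A producer still writing version 1 records.
old_record = {"truck_id": 12, "speed": 64}

# A consumer reading with the latest schema fills the new field itself.
upgraded = read_as(old_record, registry.get("truck_event"))
print(v1, v2, upgraded)
```

Because every service looks schemas up by name and version, producers and consumers can upgrade independently as long as new fields carry defaults.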
