
Advanced data science algorithms applied to scalable stream processing by David Piris and Ignacio García


Published at Big Data Spain 2016:
https://www.bigdataspain.org/2016/program/fri-advanced-data-science-algorithms-applied-scalable-stream-processing.html

Video of the talk:
https://www.youtube.com/watch?v=raul7zMBmfM&index=28&list=PL6O3g23-p8Tr5eqnIIPdBD_8eE5JBDBik&t=38s



  1. Advanced data science algorithms applied to scalable stream processing
     David Piris Valenzuela (david.piris@treelogic.com, @davidpiris)
     Nacho García Fernández (Ignacio.g.Fernandez@treelogic.com, @0xNacho)
  2. About Treelogic
     • R&D-intensive company with the mission of adapting technological knowledge to improve quality standards in our daily life
     • 8 ongoing H2020 projects (coordinating 3 of them)
     • 8 ongoing FP7 projects (coordinating 5 of them)
     • Focused on providing Big Data analytics worldwide
     • Internal organization:
       Research lines: Big Data, Computer Vision, Data Science, Social Media Analysis, Security
       ICT solutions: Security & Safety, Justice, Health, Transport, Financial Services, ICT tailored solutions
  3. CONTENTS
     1. Why we need Big Data
     2. Big Data: solutions
     3. Big Data: real-time processing
     4. Incremental algorithms
     5. What we want
     6. What we need:
        1. A stream processing engine
        2. Online incremental algorithms
        3. A distributed data storage system
        4. A use case
        5. A visualization layer
  5. Why we need Big Data
  6. Why we need Big Data
     • Public- and private-sector companies store a huge amount of data
     • Countries with huge databases store data on: population, medical records, taxes, online transactions, mobile transactions, social networks
     • In a single day, tweets generate 12 TB of data!
  7. Why we need Big Data
     2.5 exabytes of data are produced every day! That is roughly:
     • 530.000.000 million songs
     • 150.000.000 iPhones
     • 5 million laptops
     • 90 years of HD video
  8. Why we need Big Data
     How can we manage all this data?
  10. Big Data: solutions
      First, we can manage the full historical repository and extract value from the stored data:
      • Batch architecture
      • MapReduce
      • Hadoop ecosystem
  11. Big Data: solutions
  12. Big Data: solutions
      Batch processing with Hadoop takes a lot of time, and the need to process ingested data and display results as quickly as possible brings new architectures and tools:
      • Lambda architecture
      • Spark (memory vs. disk)
  13. Big Data: solutions
  15. Big Data: real-time processing
      • Faster results
      • Accurate results
      • Lower cost
      • Satisfied customers
  16. Big Data: real-time processing
      As previously said, we need to extract and visualize information in near real time…
  17. Big Data: real-time processing
      • Flink as the processing engine
      • Stream processing
      • Windowing with event-time semantics
      • Both streaming and batch processing
  18. Big Data: real-time processing
      Kappa architecture:
      • Batch layer removed
      • Only one codebase needs to be maintained
  19. Big Data: real-time processing
      • No need for a batch layer
      • The processing engine avoids disk I/O (lower latency)
  21. Big Data: available tools
  22. Incremental algorithms
      • BI & BA people always want to run some common operations to extract value from and visualize data
      • We have tools for these operations in a relational or batch environment
      • But how can we obtain the average of a data stream that changes every second, every minute, or even every millisecond?
      • The usual average operation is meant for a historical repository: input data that does not change once we start computing
      • Do we have tools that make this possible in a real-time deployment?
  23. Incremental algorithms
      The answer is NO!
  24. Incremental algorithms
      Flink lets us work with a new window-processing concept: we can define and configure "small pieces of time" and run operations or manipulate data within each of those time spans.
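The slides do not show code for this, so here is a toy, count-based Python sketch of the tumbling-window idea (not Flink's actual DataStream API; all names are illustrative):

```python
def tumbling_windows(stream, size):
    """Split an event stream into fixed-size tumbling windows.

    Each window holds `size` consecutive events, so aggregates are
    computed per "small piece" of the stream instead of over the
    whole unbounded stream.
    """
    window = []
    for event in stream:
        window.append(event)
        if len(window) == size:
            yield list(window)
            window.clear()
    if window:  # emit the final, possibly partial window
        yield list(window)

# One aggregate (here: the average) per window
averages = [sum(w) / len(w) for w in tumbling_windows([1, 2, 3, 4, 5, 6, 7], 3)]
# averages == [2.0, 5.0, 7.0]
```

Flink additionally supports time-based and session windows with event-time semantics; this sketch only conveys the per-window aggregation principle.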
  25. Incremental algorithms
      With Flink and windowing…
  26. Incremental algorithms
      • These algorithms consume streams of data and can update their results in parallel without having to store the already-processed data
      • Using checkpoints in windowing lets us carry the result of a previous window into the next one
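As a minimal sketch of how a result can be updated without keeping the processed data, here is Welford's online algorithm for mean and variance in Python (an illustrative example, not the project's actual implementation):

```python
class RunningStats:
    """Welford's online algorithm: mean and variance maintained
    incrementally, one element at a time, with O(1) state and no
    storage of the already-processed stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # Population variance of everything seen so far
        return self.m2 / self.n if self.n else float("nan")

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
# stats.mean == 5.0, stats.variance() == 4.0
```

The same pattern maps onto windowed streams: the small `(n, mean, m2)` state is exactly what a window checkpoint would carry forward.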
  27. Incremental algorithms
      Our analytics & visualization solution implemented in a real-time architecture
  28. Incremental algorithms
      If you are a BI or BA professional… we care about you!
  29. Incremental algorithms
      Currently implemented:
      • Average
      • Mode
      • Variance
      • Correlation
      • Covariance
      • Min
      • Max
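Covariance and correlation from the list above can also be maintained incrementally with a pairwise extension of the same online-moments idea; a Python sketch under illustrative names (not the talk's code):

```python
import math

class RunningCovariance:
    """Online covariance and correlation over a stream of (x, y)
    pairs, updated one pair at a time with no history kept."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c_xy = 0.0  # running co-moment sum
        self.m2_x = 0.0  # running sum of squared deviations of x
        self.m2_y = 0.0  # running sum of squared deviations of y

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        self.c_xy += dx * (y - self.mean_y)  # uses the updated mean_y
        self.m2_x += dx * (x - self.mean_x)
        self.m2_y += dy * (y - self.mean_y)

    def covariance(self):
        return self.c_xy / self.n  # population covariance

    def correlation(self):
        return self.c_xy / math.sqrt(self.m2_x * self.m2_y)

pair = RunningCovariance()
for x, y in [(1, 2), (2, 4), (3, 6), (4, 8)]:
    pair.update(x, y)
# perfectly linear data: covariance 2.5, correlation 1.0
```

Min, max, and mode admit similarly small per-element updates, which is what makes all of these operations practical over unbounded streams.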
  30. Incremental algorithms
      Currently working on:
      • Median
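The median is harder to maintain incrementally than the mean or variance because an exact answer needs all seen values, not constant state; the classic two-heap approach (a hypothetical sketch, not the project's work-in-progress code) illustrates why:

```python
import heapq

class StreamingMedian:
    """Exact running median via two heaps: a max-heap for the lower
    half (values stored negated) and a min-heap for the upper half.
    O(log n) per update, but all values must be retained."""

    def __init__(self):
        self.lo = []  # max-heap of the lower half (negated values)
        self.hi = []  # min-heap of the upper half

    def update(self, x):
        # Push through the lower half, then rebalance so that
        # len(lo) is always len(hi) or len(hi) + 1.
        heapq.heappush(self.lo, -x)
        heapq.heappush(self.hi, -heapq.heappop(self.lo))
        if len(self.hi) > len(self.lo):
            heapq.heappush(self.lo, -heapq.heappop(self.hi))

    def median(self):
        if len(self.lo) > len(self.hi):
            return -self.lo[0]
        return (-self.lo[0] + self.hi[0]) / 2

m = StreamingMedian()
for x in [5, 1, 9, 3, 7]:
    m.update(x)
# median of {1, 3, 5, 7, 9} is 5
```

Approximate sketches (e.g. quantile digests) trade exactness for bounded state, which is one reason a streaming median is a research item rather than a one-liner.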
  31. Incremental algorithms
      On the roadmap:
      • Standard deviation
      • Order by
      • Discretization
      • Contains
      • Split
      • Range-value validation
      • Default value for a specific output
  33. Apache Flink vs. Apache Spark
      Flink:
      • Pure streaming for all workloads
      • Job optimizer
      • Low latency, high throughput
      • Global, session, time- and count-based window criteria
      • Automatic memory management
      Spark:
      • Micro-batches for all workloads
      • No job optimizer
      • Higher latency compared to Flink
      • Time-based window criteria only
      • Configurable memory management; Spark 1.6+ has moved towards automating it
  36. Incremental algorithms in Flink
  37. Incremental algorithms in Flink
      • Default behavior in Apache Flink (diagram)
      • With incremental algorithms (diagram)
  38. Incremental algorithms in Flink
  40. Apache Kudu
      • Provides a combination of fast inserts/updates and efficient columnar scans to enable real-time analytic workloads
      • A new storage engine that complements HDFS and HBase
      • Designed for use cases that require fast analytics on fast data
      • Low query latency
      • v1.0.1 was released on October 11, 2016
  42. PROTEUS: a steel-making scenario
      • The steel industry is a key sector for the European Community
      • PROTEUS was introduced last year at Big Data Spain by Treelogic *
      • Hot strip mills sometimes produce steel with defects
      • Goal: predict coil parameters (thickness, width, flatness) using real-time and historical data
      • Detecting defective coils at an early stage saves money: the production process can be modified or stopped
      • The proposed architecture is being validated in this project
      • 7,870 variables at a 500 ms frequency: data in motion
      • 700,000 registers per variable; 500 GB of time series and flatness maps: data at rest
      * https://www.youtube.com/watch?v=EIH7HLyqhfE
  45. WebSockets
      • WebSocket is a communication protocol providing full-duplex channels over a single TCP connection
      • Much faster than HTTP for continuous updates
      • Its API is standardized by the W3C
  46. Apache Flink & WebSockets
      • Data sinks consume DataSets and are used to store or return them
      • Flink comes with a variety of built-in output formats, encapsulated behind operations on the DataSet: writeAsText(), writeAsFormattedText(), writeAsCsv(), print(), write()
      • We developed a WebsocketSink that enables Flink to send its output to a given websocket endpoint
      • Based on the javax-websocket-client-api 1.1 spec
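The WebsocketSink itself is a Java sink built on the javax.websocket client API and is not shown in the slides. As a language-neutral illustration of the sink's role, here is a minimal Python sketch in which the websocket session is replaced by an injected `send` callable, so the forwarding logic can be shown without a running server (all names here are mine, not the project's):

```python
class WebsocketSink:
    """Sketch of a sink that forwards every record it receives to a
    websocket endpoint. `send` stands in for the websocket session's
    text-sending method, so no network connection is needed here."""

    def __init__(self, endpoint, send):
        self.endpoint = endpoint  # e.g. "ws://localhost:8080/out" (hypothetical)
        self.send = send          # callable(text) -> None

    def invoke(self, record):
        # Serialize the record and push it over the (injected) channel,
        # mirroring how a Flink sink is invoked once per record.
        self.send(str(record))

# Usage: collect what would be sent over the wire
sent = []
sink = WebsocketSink("ws://localhost:8080/out", sent.append)
for result in [1.5, 2.5]:
    sink.invoke(result)
# sent == ["1.5", "2.5"]
```

Injecting the transport keeps the sink testable; in the real Java sink the equivalent dependency is the websocket session opened against the configured endpoint.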
  47. Incremental architecture: our approach
  49. ProteicJS: https://github.com/proteus-h2020/proteic/
  50. ProteicJS: visualizations
  51. ProteicJS: researching visualization
      • Currently researching new ways of visualizing data and ML models
  52. ProteicJS & Apache Flink
  53. How to get it all: https://github.com/proteus-h2020/proteus-docker
  54. Advanced data science algorithms applied to scalable stream processing
      David Piris Valenzuela (david.piris@treelogic.com, @davidpiris)
      Nacho García Fernández (Ignacio.g.Fernandez@treelogic.com, @0xNacho)
