Apache spark y cómo lo usamos en nuestros proyectos

244 views

Published on

Evolución de Hadoop y aparición de Spark para resolver retos en Big Data.

Published in: Data & Analytics
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
244
On SlideShare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
2
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

Apache spark y cómo lo usamos en nuestros proyectos

  1. 1. OpenSistemas Open Source Innovation for your Business Apache Spark y como lo usamos en nuestros proyectos
  2. 2. Who am I Data Solutions Manager at OpenSistemas Former President of Hispalinux (Spanish LUG) Author “La Pastilla Roja” first spanish book about Free Software
  3. 3. Menú Un poco de historia: Hadoop Qué es Apache Spark Arquitectura Kappa Algunos casos reales con Spark
  4. 4. Un poco de historia En principio de los tiempos Los Papers de Google Nacimiento del proyecto Hadoop
  5. 5. La era pre Spark Google publica algunos papers claves: ● HDFS ● MapReduce ● HBASE
  6. 6. La era pre Spark Nace Hadoop ● Primera implementación libre ● Yahoo es el sponsor ● Se desarrolla uno de los entornos de trabajo más interesantes y productivos de los años 2000
  7. 7. Batch-Hadoop (MR1)
  8. 8. Batch-MapReduce
  9. 9. Batch-Cascading
  10. 10. Cosas que no terminan de gustarnos de Hadoop Algunas cosas de Hadoop no terminan de convencernos No todo los problemas se solucionan con Map Reduce Hadoop ha generado un montón de proyectos y tecnologías para resolverlo
  11. 11. Mientras tanto... Hadoop ha cumplido 10 años Ha habido muchos cambios tanto en hardware como en software
  12. 12. In memory Computing
  13. 13. Y de pronto nace: Apache Spark Es un motor de analítica paralelizado Escrito desde cero Heredando lo mejor de la experiencia “hadoop” Probablemente la obra de ingeniería más bella de los últimos años
  14. 14. Qué es Apache Spark La primera reseña importante esta pensado desde el principio para aprovechar al máximo la memoria Resiliente y escalable por definición
  15. 15. Qué es Apache Spark Escrito, pensado e influido por el lenguaje Scala
  16. 16. Spark, the elevator pitch
  17. 17. Spark streaming
  18. 18. Spark SQL Cuando cualquier información con un esquema se puede consultar con SQL
  19. 19. A little context July 2, 2014 Jay Kreps coined the term Kappa Architecture in an article for O’reilly Radar
  20. 20. Who is Jay Kreps Jay has been involved in lots of projects ● Author of the essay ● The Log: What every software engineer should know about real-time data's unifying abstraction (12/16/2013) ● https://engineering.linkedin.com/distributed-systems/log-what-every-software- engineer-should-know-about-real-time-datas-unifying
  21. 21. Jay Kreps Author of the book: I ♥ Logs
  22. 22. Jay Kreps Involved with projects as: ● Apache Kafka ● Apache Samza ● Voldemort ● Azkaban ● Ex-Linkedin ● Now co-founder and CEO of Confluent
  23. 23. Lambda Architecture Look something like this: https://www.mapr.com/developercentral/lambda-architecture
  24. 24. Lambda Architecture Batch layer that provides the following functionality ● managing the master dataset, an immutable, append-only set of raw data ● pre-computing arbitrary query functions, called batch views https://www.mapr.com/developercentral/lambda-architecture
  25. 25. Lambda Architecture Serving layer ● This layer indexes the batch views so that they can be queried in ad hoc with low latency Speed layer ● This layer accommodates all requests that are subject to low latency requirements. Using fast and incremental algorithms, the speed layer deals with recent data only
  26. 26. Lambda Architecture Batch layer datasets can be in a distributed filesystem, while MapReduce can be used to create batch views that can be fed to the serving layer The serving layer can be implemented using NoSQL technologies such as HBase,Apache Druid, etc Querying can be implemented by technologies such as Apache Drill or Impala Speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming https://www.mapr.com/developercentral/lambda-architecture
  27. 27. Pros of Lambda Architecture Retain the input data unchanged ● Think about modeling data transformations, series of data states from the original input Lambda architecture take in account the problem of reprocessing data ● This happens all the time, the code will change, and you will need to reprocess all the information. Lots of reasons and you will need to live with this.
  28. 28. Cons of Lambda Architecture Maintain the code that need to produce the same result from two complex distributed system is painful ● Very different code for MapReduce and Storm/Apache Spark Not only is about different code, is also about debugging and interaction with other products like (hive, Oozie, Cascading, etc) At the end is a problem about different and diverging programming paradigms
  29. 29. So what is Kappa Architecture The proposal of Jay Kreps is so simple ● Use kafka (or other system) that will let you retain the full log of the data you need to reprocess ● When you want to do the reprocessing, start a second instance of your stream processing job that starts processing from the beginning of the retained data, but direct this output data to a new output table
  30. 30. part II ● When the second job has caught up, switch the application to read from the new table ● Stop the old version of the job, and delete the old output table So what is Kappa Architecture
  31. 31. This architecture looks something like this So what is Kappa Architecture
  32. 32. The first benefit is that only you need to reprocessing only when you change the code You can check if the new version is working ok and if not reverse to the old output table You can mirror a Kafka topic to HDFS so you are not limited to the Kafka retention configuration You have only a code to maintain with an unique framework So what is Kappa Architecture
  33. 33. The real advantage is not about efficiency at all (You will need extra temporarily storage when reprocessing for example) is allowing your team to develop, test, debug and operate their systems on top of a single processing framework So what is Kappa Architecture
  34. 34. Is not a silver bullet to solve every problem at Big Data Is not a list of prescriptions of technologies. You can implement with your favorite frameworks Is not a rigid set of rules. But helps to maintain the complex projects simple What is not Kappa Architecture
  35. 35. We start working with projects with a complex structure like Linkedin looks at early stage That’s very usual How we use Kappa Architecture
  36. 36. How we use Kappa Architecture
  37. 37. We try to refactoring the data flows to fix in a Kappa Architecture How we use Kappa Architecture
  38. 38. How we use Kappa Architecture
  39. 39. We use Kafka as Stream Data Platform Instead of Samza we feel more comfortable with Spark Streaming At ASPGems we choose Apache Spark as our Analytics Engine and not only for Spark Streaming How we use Kappa Architecture
  40. 40. At the end, Kappa Architecture is design pattern for us We use/clone this pattern in almost our projects We have projects of every size, volume of data or speed needing and fix with the Kappa Architecture How we use Kappa Architecture
  41. 41. Telefónica - MSS We use KA to calculate near real time KPIs, SLAs related with the managed security system We simplify the data flow of the input data Kafka in the streaming data platform As MPP we use CassandraDB Use cases
  42. 42. IOT – OBD II One of our clients install On Board Devices in the cars of its customers. We implement an API to got all the information in real time and inject the information in Kafka The business rules are implemented in a CEP running into Apache Spark Streaming As MPP we use Elastic Search Use cases
  43. 43. Insurance company We implement Kappa Architecture to process click stream in real time and clustering users We show content and offers that better fix users Use cases
  44. 44. Energy Facility We implement Kappa Architecture to process and predict energy consume. Our customer include energy storage systems and we got all the information about energy storage (ultra- capacitors and batteries). We process this information to calculate the effective lifetime of the components and its degradation. Use cases
  45. 45. Questions
  46. 46. Thank you Juantomás García juantomas@opensistemas.com @juantomas
  47. 47. www.opensistemas.com comercial@opensistemas.com

×