
[OracleCode SF] In-Memory Analytics with Apache Spark and Hazelcast

Apache Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages.

Hazelcast is an open-source in-memory data grid capable of amazing feats of scale. It provides a wide range of distributed computing primitives, including the ExecutorService, MapReduce, and Aggregations frameworks.

The nature of data exploration and analysis requires that data scientists be able to ask questions that weren't planned in advance, and get an answer fast!

In this talk, Viktor will explore Spark and see how it works together with Hazelcast to provide a robust in-memory open-source big data analytics solution!



  1. @gamussa @hazelcast #oraclecode IN-MEMORY ANALYTICS with APACHE SPARK and HAZELCAST
  2. Who am I? Solutions Architect and Developer Advocate. @gamussa on the internetz. Please follow me on Twitter, I'm very interesting ©
  3. What's Apache Spark? Lightning-Fast Cluster Computing
  4. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  5. When to use Spark? Data science tasks, when questions are unknown. Data processing tasks, when you have too much data. You're tired of Hadoop.
  6. Spark Architecture
  7. (slide image)
  8. RDD
  9. Resilient Distributed Datasets (RDD) are the primary abstraction in Spark: a fault-tolerant collection of elements that can be operated on in parallel.
  10. (slide image)
  11. RDD Operations
  12. There are two kinds of operations on RDDs: transformations and actions.
  13. Transformations are lazy (not computed immediately); a transformed RDD is recomputed each time an action runs on it (by default).
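This lazy-evaluation model has a close single-JVM analogy in the JDK's own streams, where intermediate operations like map are also deferred until a terminal operation runs. A minimal plain-Java sketch of the same idea (an analogy for illustration, not the Spark API):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    // Returns {map calls before the terminal op, map calls after it}.
    static int[] run() {
        AtomicInteger calls = new AtomicInteger();

        // Like an RDD transformation: declared, but nothing computed yet.
        Stream<Integer> doubled = List.of(1, 2, 3).stream()
                .map(x -> { calls.incrementAndGet(); return x * 2; });

        int before = calls.get(); // still 0: map is lazy

        // Like an RDD action: forces the whole pipeline to run.
        List<Integer> result = doubled.collect(Collectors.toList());
        System.out.println(result); // [2, 4, 6]

        return new int[] { before, calls.get() };
    }

    public static void main(String[] args) {
        int[] counts = run();
        System.out.println("before action: " + counts[0] + ", after action: " + counts[1]);
    }
}
```

The same pattern holds in Spark: building the transformation chain is cheap; only the action pays the computation cost.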
  14. RDD Transformations
  15. (slide image)
  16. (slide image)
  17. RDD Actions
  18. (slide image)
  19. (slide image)
  20. RDD Fault Tolerance
  21. (slide image)
  22. RDD Construction
  23. Parallelized collections: take an existing Scala collection and run functions on it in parallel.
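In Spark this is JavaSparkContext.parallelize; the same divide-the-collection-and-compute idea can be sketched on a single JVM with the JDK's parallel streams (a local analogy, not the Spark API):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelizeDemo {
    // Sum of squares computed in parallel over an existing collection,
    // analogous to sc.parallelize(list).map(x -> x * x).reduce(Integer::sum).
    static int sumOfSquares(List<Integer> data) {
        return data.parallelStream()
                   .mapToInt(x -> x * x)
                   .sum();
    }

    public static void main(String[] args) {
        List<Integer> data = IntStream.rangeClosed(1, 5).boxed()
                                      .collect(Collectors.toList());
        System.out.println(sumOfSquares(data)); // 55
    }
}
```

Spark applies the same model across partitions on many machines instead of threads in one process.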
  24. Hadoop datasets: run functions on each record of a file in the Hadoop distributed file system, or any other storage system supported by Hadoop.
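In Spark this is sc.textFile(path); locally, the record-at-a-time idea can be sketched with Files.lines, which also streams records lazily (a plain-JDK analogy, not the Spark API):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TextFileDemo {
    // Run a function on each record (line) of a file,
    // analogous to sc.textFile(path).map(String::length).
    static List<Integer> lineLengths(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(String::length).collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("records", ".txt");
        Files.write(tmp, List.of("spark", "hazelcast", "rdd"));
        System.out.println(lineLengths(tmp)); // [5, 9, 3]
        Files.deleteIfExists(tmp);
    }
}
```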
  25. What's Hazelcast IMDG? The Fastest In-Memory Data Grid
  26. Hazelcast IMDG is an operational, in-memory, distributed computing platform that manages data using in-memory storage and performs parallel execution for breakthrough application speed and scale.
  27. Use cases: High-Density Caching, In-Memory Data Grid, Web Session Clustering, Microservices Infrastructure.
  28. What's Hazelcast IMDG? In-memory Data Grid, Apache v2 licensed. Distributed Caches (IMap, JCache), Java Collections (IList, ISet, IQueue), Messaging (Topic, RingBuffer), Computation (ExecutorService, M-R).
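Hazelcast's distributed executor mirrors the JDK's java.util.concurrent.ExecutorService interface, so the submission pattern is the familiar one; a single-JVM sketch of that pattern (plain JDK here, with the Hazelcast-specific call noted in a comment as an assumption):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ExecutorDemo {
    static int squareViaExecutor(int x) throws ExecutionException, InterruptedException {
        // With Hazelcast this would come from the cluster, e.g.
        // hazelcastInstance.getExecutorService("name"), and the Callable
        // would run on a cluster member instead of a local thread pool.
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            Callable<Integer> task = () -> x * x;
            Future<Integer> result = executor.submit(task);
            return result.get();
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(squareViaExecutor(7)); // 49
    }
}
```

Keeping the JDK interface means existing Callable/Runnable code can move onto the grid with minimal changes.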
  29. Green Primary, Green Backup, Green Shard
  30. (slide image)
  31. Connecting Spark to Hazelcast with the hazelcast-spark connector:

     final SparkConf sparkConf = new SparkConf()
             .set("hazelcast.server.addresses", "localhost")
             .set("hazelcast.server.groupName", "dev")
             .set("hazelcast.server.groupPass", "dev-pass")
             .set("hazelcast.spark.readBatchSize", "5000")
             .set("hazelcast.spark.writeBatchSize", "5000")
             .set("hazelcast.spark.valueBatchingEnabled", "true");

     final JavaSparkContext jsc =
             new JavaSparkContext("spark://localhost:7077", "app", sparkConf);
     final HazelcastSparkContext hsc = new HazelcastSparkContext(jsc);

     final HazelcastJavaRDD<Object, Object> mapRdd = hsc.fromHazelcastMap("movie");
     final HazelcastJavaRDD<Object, Object> cacheRdd = hsc.fromHazelcastCache("my-cache");
  35. Demo
  36. LIMITATIONS
  37. DATA SHOULD NOT BE UPDATED WHILE READING FROM SPARK
  38. WHY?
  39. MAP EXPANSION SHUFFLES THE DATA INSIDE THE BUCKET
  40. THE CURSOR DOESN'T POINT TO THE CORRECT ENTRY ANYMORE; DUPLICATE OR MISSING ENTRIES COULD OCCUR
  41. github.com/hazelcast/hazelcast-spark
  42. THANKS! Any questions? You can find me at @gamussa / viktor@hazelcast.com
