
Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift


Lightning talk from the OpenStack NYC meetup on October 8, 2014.

By Gil Vernik

Covers the integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy support.

The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

Published in: Technology

  1. © 2014 IBM Corporation. Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift. Gil Vernik, IBM Research - Haifa
  2. Topics Covered in This Talk
     • Openstack Swift
     • Apache Spark
     • Basic integration between Spark and Swift
     • Advanced integration between Spark and Swift using the Storlets technology
  3. Digital Universe
     • More than 1.8 zettabytes (1.8 trillion gigabytes)
     • Grows rapidly
     • 80% owned by enterprises
     • 75% generated by individuals
     • According to IDC iView, "Extracting Value from Chaos"
  4. Map-Reduce, databases, etc.: the data needs to be replicated, at a cost in time, money, etc.
  5. Can we do it better?
  6. Openstack Swift
     • A massively scalable object store
     • Known to work with thousands of servers and to store petabytes of data
     • Exposes a REST API
     • Features:
       – Storage policies
       – Erasure codes
       – Data replication
       – …
     • (Diagram labels: PUT, Proxy Nodes, Storage Nodes)
  7. Apache Spark
     • Apache Spark™ is a fast and general engine for large-scale data processing
       – Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
     • Combines SQL, streaming, and complex analytics
     • Can read existing Hadoop data
     • Most active project in Apache today
  8. Swift enablement for data retrieval in Spark
     • Apache Spark implements Hadoop interfaces and can use HDFS or Amazon S3 as a data source.
     • IBM Research enabled Spark to access data stored in Openstack Swift.
     • (Diagram labels: Swift, Network)
  9. What do we analyze? Stored data and the corresponding input to analytics:
     • Images: EXIF metadata
     • PDF: hidden metadata
     • Logs: only 'ERROR' records
     • …
     • (Diagram labels: Swift, Network)
  10. Yes! We can do it better.
  11. Storlets: flexibly extending Swift with advanced data processing inside Swift
     • Storlets are a way to 'extend' cloud computational capabilities
     • A storlet is compiled code that is deployed to Swift and, when triggered, is executed by the Storlet Engine directly on the storage nodes
     • The Storlet Engine is responsible for executing every storlet in a secure environment
     • A storlet is standard Java code
  12. Storlets extend an object store by moving computation to the data (filtering, transforming, analyzing) instead of bringing the data to the computation.
  13. Swift Storlets: How do they benefit Spark? (Diagram labels: Swift, Storlet, Network, Objects, Filter, Data processing)
  14. Storlets Enable Extending the Functionality of Spark. Example: analyzing EXIF metadata from photos
     • An object store is a natural repository for photos
     • Photos contain rich capture metadata
     • Analyzing this metadata for a set of photos can show how the camera is used
  15. Example: Analyzing EXIF metadata
     • Storlets can extract the metadata and return it as JSON, rather than having Spark process the binary data directly
     • (Diagram labels: 10MB, 1KB)
  16. Example: Analyzing EXIF metadata
     • Spark accesses the images via a storlet
     • No change to Spark; only the URI changes
     • The JSON file returned by the storlet defines the schema
     • SQL from Spark processes the metadata
  17. Example: Analyzing EXIF metadata.
  18. Summary
     • Openstack Swift is the most popular open source object store
     • Apache Spark is the next big thing in data analytics
     • Spark and Swift can be integrated
     • Storlets in Swift provide clear benefits for analytics use cases
     Thank you! More information: Gil Vernik, IBM Research - Haifa
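
Slide 8 says Spark can read from Swift through Hadoop interfaces, and slide 16 says only the URI changes. The slides do not show the URI itself. As a minimal sketch, assuming the `swift://container.provider/path` form used by the Hadoop OpenStack connector (the talk does not say which connector was used), such a URI could be assembled as below. The helper name `swift_uri` and the container, provider, and path values are hypothetical.

```python
def swift_uri(container: str, provider: str, path: str = "") -> str:
    """Build a URI of the assumed form swift://<container>.<provider>/<path>."""
    if not container or not provider:
        raise ValueError("container and provider must be non-empty")
    if "." in container:
        # In this sketch the first dot separates container from provider,
        # so a dot inside the container name would be ambiguous.
        raise ValueError("container name must not contain '.'")
    # Drop leading slashes so the result has exactly one slash after the host.
    return f"swift://{container}.{provider}/{path.lstrip('/')}"


# Hypothetical values, for illustration only.
print(swift_uri("logs", "myprovider", "2014/10/app.log"))
```

In a Spark job, a string like this would typically be handed to a reader such as `sc.textFile(...)`, provided the cluster is configured with a Swift-capable Hadoop filesystem.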
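
Slides 9 and 12 describe filtering at the object store so that only the 'ERROR' records of a log travel over the network. The sketch below simulates that idea in plain Python. It is not storlet code (slide 11 states that storlets are Java), and the function name `filter_error_records` and the sample log lines are made up for illustration.

```python
from typing import Iterable, Iterator


def filter_error_records(lines: Iterable[str], level: str = "ERROR") -> Iterator[str]:
    """Yield only the lines that mention the given level, as a storage-side filter might."""
    for line in lines:
        if level in line:
            yield line


def size_in_bytes(lines: Iterable[str]) -> int:
    """Total UTF-8 size of the lines, counting one newline byte per line."""
    return sum(len(line.encode("utf-8")) + 1 for line in lines)


# Made-up log content standing in for an object stored in Swift.
stored_log = [
    "2014-10-08 10:00:01 INFO service started",
    "2014-10-08 10:00:05 ERROR disk quota exceeded",
    "2014-10-08 10:00:09 INFO request served",
    "2014-10-08 10:00:12 ERROR connection reset",
]

returned = list(filter_error_records(stored_log))
print(f"stored: {size_in_bytes(stored_log)} bytes, returned: {size_in_bytes(returned)} bytes")
```

The point of the comparison is the one made on slide 12: when the filter runs next to the data, the analytics engine receives only the smaller, filtered stream.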
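
Slides 15 and 16 describe a storlet returning EXIF metadata as JSON, which Spark then queries with SQL. The following sketch shows only the downstream idea, using the Python standard library in place of Spark SQL: it groups JSON records by one field, roughly what a `SELECT Model, COUNT(*) ... GROUP BY Model` query would do. The field names (`Model`, `ISO`) and the sample records are assumptions, since the slides do not show the JSON layout.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_field(json_records: Iterable[str], field: str) -> Counter:
    """Count JSON records per value of `field`; records missing it count as 'unknown'."""
    counts: Counter = Counter()
    for raw in json_records:
        record = json.loads(raw)
        counts[record.get(field, "unknown")] += 1
    return counts


# Made-up metadata, one small JSON document per photo, standing in for storlet output.
storlet_output = [
    '{"Model": "Camera A", "ISO": 100}',
    '{"Model": "Camera B", "ISO": 400}',
    '{"Model": "Camera A", "ISO": 200}',
    '{"ISO": 800}',
]

usage = count_by_field(storlet_output, "Model")
for model, photos in usage.most_common():
    print(model, photos)
```

Because each record is a small JSON document rather than the full image, the analysis works on kilobytes per photo instead of megabytes, which is the reduction slide 15 illustrates.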