Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift

Lightning talk from the OpenStack NYC meetup on October 8, 2014.

http://bit.ly/ibm-os-meetup

By Gil Vernik

The talk covers the integration between Apache Spark and Swift, and the use of Storlets for smart data retrieval through filtering and privacy support.

The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

Published in: Technology

  1. Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift
     Gil Vernik, IBM Research - Haifa
     © 2014 IBM Corporation
  2. Topics Covered in This Talk
     • OpenStack Swift
     • Apache Spark
     • Basic integration between Spark and Swift
     • Advanced integration between Spark and Swift using the Storlets technology
  3. Digital Universe
     • More than 1.8 zettabytes (1.8 trillion gigabytes)
     • Grows rapidly
     • 80% owned by enterprises
     • 75% generated by individuals
     Source: IDC iView, "Extracting Value from Chaos"
  4. Map-Reduce, databases, etc. Data needs to be replicated: time, cost, etc.
  5. Can we do it better?
  6. OpenStack Swift
     • A massively scalable object store
     • Known to run on thousands of servers and store petabytes of data
     • Exposes a REST API
     • Features:
       – Storage policies
       – Erasure codes
       – Data replication
       – ….
     (Diagram labels: PUT, Proxy Nodes, Storage Nodes)
  7. Apache Spark
     • Apache Spark™ is a fast and general engine for large-scale data processing
       – Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
     • Combines SQL, streaming, and complex analytics
     • Can read existing Hadoop data
     • Most active project in Apache today
  8. Swift enablement for data retrieval in Spark
     • Apache Spark implements Hadoop interfaces and can use HDFS or Amazon S3 as a data source.
     • IBM Research enabled Spark to access data stored in OpenStack Swift.
     (Diagram labels: Swift, Network)
  9. What do we analyze?
     Stored data    Input to analytics
     Images         EXIF metadata
     PDF            Hidden metadata
     Logs           Only ‘ERROR’ records
     ….             ….
     (Diagram labels: Swift, Network)
  10. Yes! We can do it better.
  11. Storlets: flexibly extending Swift. Advanced data processing inside Swift
     • Storlets are a way to ‘extend’ cloud computational capabilities
     • A storlet is compiled code that is deployed to Swift and, when triggered, executed by the Storlet Engine directly on the storage nodes
     • The Storlet Engine is responsible for executing every storlet in a secure environment
     • A storlet is standard Java code
  12. Storlets extend an object store by moving computation to the data (filtering, transforming, analyzing) instead of bringing the data to the computation
  13. Swift Storlets: How do they benefit Spark?
     (Diagram labels: Swift, Storlet, Network, Objects, Filter, Data processing)
  14. Storlets Enable Extending the Functionality of Spark
     Example: analyzing EXIF metadata from photos
     • An object store is a natural repository for photos
     • Photos contain rich capture metadata
     • Analyzing this metadata for a set of photos can show how the camera is used
  15. Example: Analyzing EXIF metadata
     Storlets can extract the metadata and return it as JSON, rather than having Spark process the binary data directly.
     (Diagram labels: 10MB, 1KB)
  16. Example: Analyzing EXIF metadata
     • Spark accesses images via a storlet
     • No change to Spark; only the URI changes
     • The JSON file returned by the storlet defines the schema
     • SQL from Spark processes the metadata
  17. Example: Analyzing EXIF metadata
  18. Summary
     • OpenStack Swift is the most popular open source object store
     • Apache Spark is the next big thing in data analytics
     • Spark and Swift can be integrated
     • Storlets in Swift provide clear benefits for analytics use cases
     Thank you! More information: Gil Vernik, IBM Research - Haifa, gilv@il.ibm.com
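
The "Only ‘ERROR’ records" case from slide 9, combined with the idea on slide 12 of moving the computation to the data, can be illustrated with a small simulation. This is a minimal sketch in plain Python, not the Storlets API (the slides state that real storlets are Java code run by the Storlet Engine). The names `error_filter`, `bytes_of`, and `stored_log`, and the log lines themselves, are made up for illustration.

```python
from typing import Iterable, Iterator


def error_filter(lines: Iterable[str], keyword: str = "ERROR") -> Iterator[str]:
    """Hypothetical stand-in for a storlet: yield only lines containing `keyword`."""
    for line in lines:
        if keyword in line:
            yield line


def bytes_of(lines: Iterable[str]) -> int:
    """Size of the lines in bytes when UTF-8 encoded, plus one newline per line."""
    return sum(len(line.encode("utf-8")) + 1 for line in lines)


# Made-up log object, standing in for data stored in the object store.
stored_log = [
    "2014-10-08 10:00:01 INFO  service started",
    "2014-10-08 10:00:02 ERROR disk quota exceeded",
    "2014-10-08 10:00:03 INFO  request served",
    "2014-10-08 10:00:04 ERROR connection reset",
]

# Without a filter, every line crosses the network to the analytics engine.
full_transfer = bytes_of(stored_log)

# With the filter running next to the data, only matching lines are sent.
filtered = list(error_filter(stored_log))
filtered_transfer = bytes_of(filtered)

print(f"{len(filtered)} of {len(stored_log)} lines sent, "
      f"{filtered_transfer} of {full_transfer} bytes")
```

The point of the sketch is only the shape of the saving: the analytics side receives the filtered records, so the amount of data sent over the network depends on the filter's selectivity rather than on the size of the stored object.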
