BUILDING A CEPH-POWERED 
DATA LAKE (OR) DATA GRID 
Paul Evans 
principal architect 
daystrom technology group 
paul@daystr...

Ceph Days 2014 Paul Evans Slide Deck

Ceph Days held in October 2014 at Brocade headquarters in Silicon Valley.


  1. BUILDING A CEPH-POWERED DATA LAKE (OR) DATA GRID. Paul Evans, principal architect, daystrom technology group, paul@daystrom.com. ceph days san jose 2014
  2. Why build a data grid (or data lake)? …because we have a data FLOOD in process
  3. indeed, we love data… (we never seem to throw any of it out). we’re good at generating more and more, but… too FAST, too many VARIANTS, too MUCH
  4. IS THE ANSWER TO ALL OF THIS… “WE NEED LESS DATA!”? are you crazy? we live to store things! we just need better tools… (and more storage)
  5. DATA AUTOMATION STACK: Workflow Automation; Data Lake; Data Grid; Wildly-Scalable Storage
  6. DATA LAKE: “a storage repository that holds a vast amount of raw data in its native format until it is needed”
  7. DATA LAKE - ORIGINS: First use credited to James Dixon, CTO at Pentaho, circa 2010. “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state…” “The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
  8. DATA LAKE - EXPLAINED: While a hierarchical data warehouse stores data in files or folders, a data lake uses a flat architecture to store data. Each data element in a lake is assigned a unique identifier and tagged with a set of extended metadata tags. When a business question arises, the data lake can be queried for relevant data, and that smaller set of data can then be analyzed to help answer the question.
  9. DATA LAKE - WHY ???
  10. DATA LAKE CHARACTER: Unwashed Data: schema-on-read from RAW source. Flexible Processing: batch, interactive, online, search. MetaData Dependent: tag it or lose it. Common Access: hdfs-centric toolset. …in other words: this is not a glass-house Data Mart
  11. A REFERENCE ‘LAKE’ ARCHITECTURE: GOVERNANCE, DATA ACCESS, SECURITY, OPERATIONS, INTEGRATION, DATA MANAGEMENT
  12. A CEPHALOPOD IN THE LAKE? If this is important… Use this… Hadoop-native: HDFS. Locality-aware: HDFS. Distributed Name Svc: Ceph. Native Erasure Coding: Ceph. 20% Faster *: Ceph. (* on Terasort benchmark over IB, Mar 2014)
  13. (LAKE) DREDGERS
  14. DATA GRID: “the unifying layer to how content and data are stored, protected, located and accessed”
  15. DATA GRID - ORIGINS: The need for data grids was first recognized by the scientific community concerning climate modeling, where exchanging PB-size data sets became commonplace. Recently, large-scale instruments such as the Large Hadron Collider (LHC) at CERN are driving grid innovation.
  16. DATA GRID - EXPLAINED: Data Grids present consistent access controls, governance, and metadata extensions to diverse storage media using a common, global interface for access and transport. Additionally, they offer a ‘micro-service’ architecture for the creation of standard tasks & policies, which are enforced by a distributed “grid control-plane.”
  17. DATA GRID - WHY ???
  18. DATA GRID - ATTRIBUTES: Data Virtualization: common presentation of all content. Universe-size Namespace: for files, objects & metadata. Automation of Data Operations: distributed, scalable. Policy Mgmt/Reporting: data valuation & action triggers.
  19. CEPH MEETS GRID [diagram labels: implemented: Direct; CephFS & RBD; Ceph libRADOS; Remote; Cloud; Cold Storage; Archive; DATA GRID unified namespace; HiSpeed Tier Link; LIBRADOS; Ceph + LIBRADOS; Ceph + RBD]
  20. GRID IRON ALL-STARS (Dan Bedard: danb@renci.org)
  21. TIME 2 SUMMARIZE… We are in the midst of a Data Explosion. We need robust, expandable, yet simple solutions to store data. We also need effective, de-centralized ways to care for the data.
  22. the SMART approach: DATA AUTOMATION STACK: Workflow Automation; Ceph Wildly-Scalable Storage + Data Lake, Data Grid
  23. thank you! san jose ceph days. Paul Evans, principal architect, paul@daystrom.com, technology group
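
The "DATA LAKE - EXPLAINED" and "DATA LAKE CHARACTER" slides describe a flat store in which every element gets a unique identifier and a set of metadata tags, is kept raw, and is only interpreted when a question arises. A minimal sketch of that idea follows, assuming an in-memory dictionary in place of a real object store. The class `FlatLake`, its methods, and the sample records are hypothetical and are not taken from the deck; in a Ceph-backed lake the dictionaries would presumably be replaced by objects in a storage pool.

```python
import json
import uuid


class FlatLake:
    """In-memory stand-in for a flat data-lake namespace (hypothetical)."""

    def __init__(self):
        self._objects = {}  # object id -> raw bytes, stored as received
        self._tags = {}     # object id -> dict of metadata tags

    def put(self, raw, **tags):
        """Store raw bytes untouched and return the new unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = raw
        self._tags[object_id] = dict(tags)
        return object_id

    def query(self, **wanted):
        """Return ids of objects whose tags match every requested key/value."""
        return [
            object_id
            for object_id, tags in self._tags.items()
            if all(tags.get(key) == value for key, value in wanted.items())
        ]

    def read(self, object_id, parser):
        """Schema-on-read: interpret the raw bytes only when they are needed."""
        return parser(self._objects[object_id])


# Made-up sample data: two JSON sensor readings and one CSV log line.
lake = FlatLake()
lake.put(b'{"temp_c": 21.5}', source="sensor", fmt="json")
lake.put(b'{"temp_c": 19.0}', source="sensor", fmt="json")
lake.put(b"2014-10-01,login,alice", source="weblog", fmt="csv")

# A "business question": select the relevant subset by tag, then parse it.
sensor_ids = lake.query(source="sensor", fmt="json")
readings = [lake.read(i, json.loads)["temp_c"] for i in sensor_ids]
print(len(sensor_ids), sorted(readings))
```

The tags are the only way to find anything in this sketch, which is one reading of the "tag it or lose it" line: an object stored without tags can still be fetched by identifier, but no query will ever surface it.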
