Practical Guide to
Architecting Data Lakes
Presented By Avinash Ramineni
Agenda
• About Clairvoyant
• What is a Data Lake?
• Features of a Data Lake
• Tools
• Implementation Challenges
• Questions
Clairvoyant
Clairvoyant Services
What is a Data Lake
“A data lake is an enterprise-wide system for storing and analyzing disparate sources of data in their native formats.”
“A data lake is a central location in which to store all your data, regardless of its source or format.”
“Is a data lake a replacement for, or complementary to, an EDW?”
“Is a data lake just a storage layer?”
“Does just having a Hadoop environment amount to a data lake?”
Data Lake Attributes
• Data Democratization
• Data Discovery
• Data Lineage
• Self-Service capabilities
• Metadata Management
Data Lake
Self-Service Analytics
Data Governance
• Data Acquisition – the what, when, and where of data
• Data Organization – structure and format
• Data Catalog – what data exists in the lake
• Capturing Metadata
• Data Lineage
• Data Quality
• Data Profile
• Provenance of data at file and record levels
• Business names, descriptions
• Data Provisioning
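The governance items above (catalog, lineage, provenance, business names, data profile) can be pictured as fields on a single catalog record. The following is a minimal sketch under stated assumptions, not any product's schema: the `CatalogEntry` class, its field names, and the `profile_records` helper are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CatalogEntry:
    """One dataset in the lake's catalog (hypothetical field names)."""
    name: str                  # technical name of the dataset
    business_name: str         # human-friendly name
    description: str
    source: str                # acquisition: where the data came from
    data_format: str           # organization: e.g. "csv", "json"
    acquired_at: datetime      # acquisition: when the data arrived
    upstream: list = field(default_factory=list)   # lineage: parent datasets
    profile: dict = field(default_factory=dict)    # data profile, see below


def profile_records(records):
    """Build a small data profile: row count, plus nulls and distinct values per column."""
    columns = sorted({key for rec in records for key in rec})
    profile = {"row_count": len(records), "columns": {}}
    for col in columns:
        values = [rec.get(col) for rec in records]
        non_null = [v for v in values if v is not None]
        profile["columns"][col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return profile
```

A catalog entry would typically be populated at ingest time, with `profile_records` run over a sample of the data and the result stored in the `profile` field so that users can judge a dataset before querying it.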
Data Lineage
Data Lake Challenges
Guidelines
• Expect structured, semi-structured, and unstructured data
• Store metadata or a tag recording the location of the schema, including for unstructured data
• Store a copy of raw input
• Keep a raw, first-mile copy of the data so that the business can be recovered, or nearly so
• Replay the business from it if needed
• Data Standardization – data cleansing as a workflow after ingest
• Use a format that supports your data
• Automate metadata management
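One way to read the "store a copy of raw input" and "automate metadata management" guidelines together is to land every file unchanged and write its metadata automatically alongside it. The sketch below assumes a file-system raw zone and a JSON sidecar file; the function name `land_raw_file`, the directory layout, and the metadata keys are illustrative assumptions rather than anything prescribed by the talk.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def land_raw_file(src, raw_zone, source_system, schema_location=None):
    """Copy `src` unchanged into the raw zone and write a JSON metadata sidecar."""
    src = Path(src)
    ingested_at = datetime.now(timezone.utc)
    target_dir = Path(raw_zone) / source_system / ingested_at.strftime("%Y-%m-%d")
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / src.name
    shutil.copy2(src, target)  # byte-for-byte copy; no parsing or cleansing here

    metadata = {
        "source_system": source_system,
        "original_name": src.name,
        "ingested_at": ingested_at.isoformat(),
        "size_bytes": target.stat().st_size,
        "sha256": hashlib.sha256(target.read_bytes()).hexdigest(),
        "schema_location": schema_location,  # None when no schema is known
    }
    sidecar = target.with_name(target.name + ".meta.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return target, metadata
```

Because the raw copy is never modified, cleansing and standardization can run as a separate workflow afterwards, and can be re-run from the raw zone if the logic changes.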
Data Lake Security
Data Security
Implementation Challenges
• Change Data Capture
• MySQL – binlog readers
• Oracle – Tungsten
• Applying the deltas to the data lake
• Reusable Data movement workflows
• One workflow per table? (Generate dynamic workflows based on metadata)
• Needs to be driven off metadata
• Schema changes at the source
• Streaming Data
• Partitioning Strategies on the Data Lake
• Configure them into metadata
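Two of the challenges above, applying change-data-capture deltas and generating per-table workflows from metadata, can be sketched in a few lines. This is a simplified in-memory illustration under stated assumptions: the event shape (`op` plus `row`), the `TABLE_METADATA` entries, the workflow step names, and the `/lake/raw` path layout are all hypothetical, and a real pipeline would read events from a binlog reader or replicator rather than a Python list.

```python
def apply_deltas(snapshot, deltas, key="id"):
    """Apply ordered CDC events to a snapshot keyed by primary key.

    Each event is {"op": "insert" | "update" | "delete", "row": {...}}.
    """
    current = {row[key]: dict(row) for row in snapshot}
    for event in deltas:
        row = event["row"]
        if event["op"] == "delete":
            current.pop(row[key], None)
        else:  # insert and update are both treated as an upsert
            current[row[key]] = dict(row)
    return sorted(current.values(), key=lambda r: r[key])


# Hypothetical metadata: one entry per source table, including its partitioning.
TABLE_METADATA = [
    {"source": "mysql", "table": "orders", "key": "order_id", "partition_by": ["order_date"]},
    {"source": "oracle", "table": "customers", "key": "customer_id", "partition_by": []},
]


def build_workflows(table_metadata, lake_root="/lake/raw"):
    """Generate one workflow definition per table from metadata alone."""
    workflows = []
    for meta in table_metadata:
        # Partition columns become a path template such as "order_date={order_date}".
        partition_path = "/".join(f"{col}={{{col}}}" for col in meta["partition_by"])
        target = f"{lake_root}/{meta['source']}/{meta['table']}"
        if partition_path:
            target = f"{target}/{partition_path}"
        workflows.append({
            "name": f"ingest_{meta['source']}_{meta['table']}",
            "steps": ["extract_deltas", "apply_deltas", "write_partitions"],
            "merge_key": meta["key"],
            "target_path": target,
        })
    return workflows
```

The design point is that adding a table, or changing its partitioning, means editing a metadata entry rather than writing a new workflow by hand.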
Tools / Products
• Smart Catalogs
• Waterline Data Inventory
• Collibra Catalog
• Data Lake Management
• Zaloni Bedrock
• Informatica Intelligent Data Lake
• Data Governance and Metadata Management
• Cloudera Navigator
• Apache Atlas
• Collibra Data Governance
• Oracle BigData Catalog
Data Lake Trends
• Data Lakes on Cloud
• IoT Data Lakes
• Logical Data Lakes
• Unified view of data that exists across data stores
• Data Discovery Portals
Questions
• Principal @ Clairvoyant
• Email: avinash@clairvoyantsoft.com
• LinkedIn: https://www.linkedin.com/in/avinashramineni

Practical guide to architecting data lakes - Avinash Ramineni - Phoenix Data Conference 2016
