Achieve Data Democratization with effective
Data Integration
Saurabh K. Gupta
Manager, Data & Analytics, GE
www.amazon.com/author/saurabhgupta
@saurabhkg
Disclaimer:
“This report has been prepared by the Authors who are part of GE. The opinions expressed herein by the Authors and the
information contained herein are in good faith and Authors and GE disclaim any liability for the content in the report. The
report is the property of GE and GE is the holder of the copyright or any intellectual property over the report. No part of this
document may be reproduced in any manner without the written permission of GE. This Report also contains certain
information available in public domain, created and maintained by private and public organizations. GE does not control or
guarantee the accuracy, relevance, timeliness or completeness of such information. This Report constitutes a view as of the
date of publication and is subject to change. GE does not warrant or solicit any kind of act or omission based on this Report.”
AIOUG Sangam’17
✓ "Data lake" is a relatively new term compared to the other
fancy ones the industry has coined since it realized the
potential of data.
✓ Enterprises are bending over backwards to build a stable
data strategy and take a leap towards data
democratization.
✓ Traditional approaches to data pipelines, data
processing, and data security still hold true, but architects
need to go an extra mile while designing a big data lake.
✓ This session will focus on data integration design
considerations. We will discuss the relevance of data
democratization in the organizational data strategy.
Abstract
✓ Data Lake architectural styles
✓ Implement data lake for democratization
✓ Data Ingestion framework principles
Learning objectives
✓ Manager, Data & Analytics at General Electric
✓ 11+ years of experience in data architecture, data
engineering, analytics
✓ Books authored:
✓ Practical Enterprise Data Lake Insights | Apress | 2018
✓ Advanced Oracle PL/SQL Developer’s Guide | Packt |
2016
✓ Oracle Advanced PL/SQL Developer Professional
Guide | Packt | 2012
✓ Speaker at AIOUG, IOUG, NASSCOM
✓ Twitter @saurabhkg
✓ Blog@ sbhoracle.wordpress.com
About Me
Practical Enterprise Data Lake Insights - Published
Handle Data-Driven Challenges in an Enterprise Big Data Lake
Use this practical guide to successfully handle the challenges encountered
when designing an enterprise data lake and learn industry best practices to
resolve issues.
What You'll Learn:
• Get to know data lake architecture and design principles
• Implement data capture and streaming strategies
• Implement data processing strategies in Hadoop
• Understand the data lake security framework and availability model
Apress – https://www.apress.com/in/book/9781484235218/
Amazon - https://www.amazon.com/Practical-Enterprise-Data-Lake-
Insights/dp/1484235215/
Data Explosion
Future data trends
Data as a Service
Cybersecurity
Augmented Analytics Machine Intelligence
“Fast Data” and
“Actionable data”
Evolution of Data Lake
• James Dixon’s “time machine” vision of data
• Leads Data-as-an-Asset strategy
“If you think of a datamart as a store of bottled water –
cleansed and packaged and structured for easy
consumption – the data lake is a large body of water in a
more natural state. The contents of the data lake stream
in from a source to fill the lake, and various users of the
lake can come to examine, dive in, or take samples.”
-James Dixon
July 21, 2018
Conceptualizing a Data Lake
© 2018 Gartner, Inc.
Data Lake Conceptual Design
• Data Acquisition
• Insight Discovery and Development
• Analytics Consumption
• Optimization and Governance
© 2018 Gartner, Inc.
Data Lake Input, Processing and Output
• Data In – Acquisition
• Data Out – Consumption
• Discovery and Development
• Metadata, Modeling and Governance
• Data Lake Tooling and Infrastructure
Extend analytical capabilities to unrefined data in its
native or near-native format
Architecture of an Enterprise Data Lake depends upon
Data-In and Data-Out strategy
Data Lake Architecture Styles
© 2018 Gartner, Inc.
• Inflow Data Lake – a "Data Hub" to bridge information silos and find new insights
• Outflow Data Lake – under agile governance, delivers data to consumers at a rapid scale
• Data Science Lab – with minimal governance, enables innovation and drives analytics
Analytics as a Service
Data Democracy
Provide the data in a manner in which it can be consumed - regardless of the people,
process or end-user technology
Data Vision
• Data Empowerment – complement business with data
• Data Commercialization – data-backed decisions
• Data as a Service
Enterprise Data Lake operational pillars
• Data Ingestion
• Data Engineering
• Data Management
• Data Consumption
• Data Integration
Data Ingestion challenges
Understand the data
• Structured data is an organized piece of
information
• Aligns strongly with the relational standards
• Defined metadata
• Easy ingestion, retrieval, and processing
• Unstructured data lacks structure and
metadata
• Not so easy to ingest
• Complex retrieval
• Complex processing
Understand the data sources
• OLTP and Data warehouses – structured data from typical relational data stores.
• Data management systems – documents and text files
• Legacy systems – essential for historical and regulatory analytics
• Sensors and IoT devices – devices installed on healthcare, home, and mobile appliances and large
machines can upload logs to the data lake at periodic intervals or within a secure network region
• Web content – data from the web world (retail sites, blogs, social media)
• Geographical data – location data, maps, and geo-positioning systems.
Data Ingestion Framework
Design Considerations
• Data format – What format is the data to be ingested?
• Data change rate – Critical for CDC design and streaming data. Performance is a derivative of
throughput and latency.
• Data location and security –
• Whether data is located on-premise or public cloud infrastructure. While fetching data from
cloud instances, network bandwidth plays an important role.
• If the data source is enclosed within a security layer, the ingestion framework should be able
to establish a secure tunnel to collect data for ingestion
• Transfer data size (file compression and file splitting) – what would be the average and maximum
size of block or object in a single ingestion operation?
• Target file format – Data from a source system needs to be ingested in a Hadoop-compatible file
format.
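The considerations above can be collected into a declarative ingestion spec that each pipeline validates before it runs; a minimal sketch (all field and value names below are illustrative assumptions, not part of any framework):

```python
from dataclasses import dataclass

@dataclass
class IngestionSpec:
    """Illustrative container for the design considerations above."""
    source_format: str    # e.g. "csv", "json", "oracle-table"
    change_rate: str      # "batch", "cdc", or "streaming"
    location: str         # "on-premise" or "cloud"
    secure_tunnel: bool   # True if the source sits behind a security layer
    max_transfer_mb: int  # max block/object size in a single ingestion op
    target_format: str    # Hadoop-compatible: "parquet", "avro", "orc"

    def validate(self) -> list:
        """Return a list of problems with the spec (empty means OK)."""
        problems = []
        if self.change_rate not in ("batch", "cdc", "streaming"):
            problems.append("unknown change rate")
        if self.target_format not in ("parquet", "avro", "orc", "sequencefile"):
            problems.append("target format is not Hadoop-compatible")
        if self.max_transfer_mb <= 0:
            problems.append("transfer size must be positive")
        return problems

spec = IngestionSpec("oracle-table", "cdc", "cloud", True, 256, "parquet")
assert spec.validate() == []
```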
ETL vs ELT for Data Lake
• Heavy transformation may restrict data
surface area for data exploration
• Brings down the data agility
• Transformation on huge volumes of data may
introduce latency between the data source and the
data lake
• Curated layer to empower analytical models
Batched data ingestion principles
Structured data
• Data collector fires a SELECT query (also known as a filter query) on the source to pull incremental
records or a full extract
• Query performance and source workload determine how efficient the data collector is
• Robust and flexible
• Change Track flag – flag rows with the operation code
• Incremental extraction – pull all the changes after certain timestamp
• Full Extraction – refresh target on every ingestion run
Standard Ingestion techniques
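The incremental-extraction technique above can be sketched with an in-memory SQLite table and a stored watermark timestamp; on each run the collector pulls only rows changed after the watermark, then advances it (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2018-01-01"), (2, "2018-02-01"), (3, "2018-03-01")])

def pull_incremental(conn, watermark):
    """Filter query: pull only the rows changed after the last watermark."""
    rows = conn.execute(
        "SELECT id, updated_at FROM orders WHERE updated_at > ? ORDER BY id",
        (watermark,)).fetchall()
    # advance the watermark to the max timestamp seen in this run
    new_watermark = max([watermark] + [r[1] for r in rows])
    return rows, new_watermark

rows, wm = pull_incremental(conn, "2018-01-15")
# rows -> [(2, '2018-02-01'), (3, '2018-03-01')], wm -> '2018-03-01'
```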
Change Data Capture
• Log mining process
• Capture changed data from the source system’s transaction logs and integrate with the target
• Eliminates the need to run SQL queries on the source system, so it incurs no load overhead on a
transactional source system
• Achieves near real-time replication between source and target
Change Data Capture
Design Considerations
• Source database must be enabled for logging
• Commercial tools - Oracle GoldenGate, HVR, Talend CDC, custom replicators
• Keys are extremely important for replication
• Helps capture job in establishing uniqueness of a record in the changed data set
• Source PK ensures the changes are applied to the correct record on target
• If a PK is not available, establish uniqueness based on composite columns
• Establish uniqueness based on a unique constraint - terrible design!!
• Trigger based CDC
• An event on a table triggers the change to be captured in a change-log table
• Change-log table merged with the target
• Works when source transaction logs are not available
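Trigger-based CDC as described above can be demonstrated with SQLite, which supports row-level triggers; the table and change-log names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
-- change-log table populated by triggers, later merged into the target
CREATE TABLE customers_chglog (id INTEGER, name TEXT, op TEXT);

CREATE TRIGGER trg_ins AFTER INSERT ON customers BEGIN
  INSERT INTO customers_chglog VALUES (NEW.id, NEW.name, 'I');
END;
CREATE TRIGGER trg_upd AFTER UPDATE ON customers BEGIN
  INSERT INTO customers_chglog VALUES (NEW.id, NEW.name, 'U');
END;
CREATE TRIGGER trg_del AFTER DELETE ON customers BEGIN
  INSERT INTO customers_chglog VALUES (OLD.id, OLD.name, 'D');
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'alice')")
conn.execute("UPDATE customers SET name = 'alicia' WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 1")

changes = conn.execute("SELECT id, name, op FROM customers_chglog").fetchall()
# changes -> [(1, 'alice', 'I'), (1, 'alicia', 'U'), (1, 'alicia', 'D')]
```

A downstream merge job would then apply `customers_chglog` rows to the target in operation-code order.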
LinkedIn Databus
CDC capture pipeline
• Relay is responsible for pulling the most recent
committed transactions from the source
• Relays are implemented through Tungsten Replicator
• Relay stores the changes in logs or cache in
compressed format
• Consumer pulls the changes from relay
• Bootstrap component – a snapshot of data source on
a temporary instance. It is consistent with the changes
captured by Relay
• If any consumer falls behind and can’t find the changes
in relay, bootstrap component transforms and packages
the changes to the consumer
• A new consumer, with the help of the client library, can
apply all the changes from the bootstrap component up to a
point in time; the client library then points the consumer to
the Relay to continue pulling the most recent changes
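The relay/bootstrap hand-off above can be sketched as follows: the relay holds only a bounded window of recent changes, and a consumer that has fallen behind the window is served from the bootstrap instead (all class and method names are illustrative, not Databus APIs):

```python
from collections import deque

class Relay:
    """Holds only the most recent changes, like a bounded change buffer."""
    def __init__(self, window):
        self.buffer = deque(maxlen=window)  # (sequence, change) pairs
    def append(self, seq, change):
        self.buffer.append((seq, change))
    def changes_since(self, seq):
        held = [s for s, _ in self.buffer]
        if held and seq < held[0] - 1:
            return None  # consumer fell behind the relay window
        return [(s, c) for s, c in self.buffer if s > seq]

class Bootstrap:
    """Consistent snapshot plus consolidated changes for lagging consumers."""
    def __init__(self, all_changes):
        self.all_changes = all_changes
    def changes_since(self, seq):
        return [(s, c) for s, c in self.all_changes if s > seq]

def consume(relay, bootstrap, last_seq):
    recent = relay.changes_since(last_seq)
    if recent is None:                      # fall back to the bootstrap
        recent = bootstrap.changes_since(last_seq)
    return recent

all_changes = [(i, f"chg{i}") for i in range(1, 8)]
relay = Relay(window=3)                     # relay retains only seqs 5, 6, 7
for s, c in all_changes:
    relay.append(s, c)
bootstrap = Bootstrap(all_changes)

assert consume(relay, bootstrap, last_seq=6) == [(7, "chg7")]   # from relay
assert consume(relay, bootstrap, last_seq=1) == all_changes[1:] # from bootstrap
```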
[Databus pipeline: Relay → Log Writer → Log Storage → Log Applier → Snapshot Storage; the bootstrap component serves consolidated changes and a consistent snapshot]
Change merge techniques
Design Considerations
[Exchange-partition flow: (1) changes are captured from the data source; (2) the most recent partition (P4) of the partitioned Hive table is pulled for compare-and-merge against the #Changes set; (3) the final dataset with merged changes is prepared; (4) the partition is exchanged back into the table]
Flow
• Table partitioned on time
dimension
• Changes are captured
incrementally
• Changes tagged by table name and
the most recent partition
• Exchange partition process - recent
partition compared against the
“change” data set for merging
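The compare-and-merge step above amounts to an upsert of the change set over the most recent partition; a minimal sketch, assuming rows keyed by an `id` column (keys and structure are illustrative):

```python
def merge_partition(recent_partition, changes):
    """Compare the most recent partition against the "change" data set
    and prepare the final dataset with merged changes (upsert semantics),
    mimicking the exchange-partition flow above."""
    merged = {row["id"]: row for row in recent_partition}
    for row in changes:            # changed rows overwrite existing ones
        merged[row["id"]] = row
    # the merged result would be exchanged back in as the new partition
    return sorted(merged.values(), key=lambda r: r["id"])

p4 = [{"id": 1, "sales": 100}, {"id": 2, "sales": 200}]
chg = [{"id": 2, "sales": 250}, {"id": 3, "sales": 300}]
result = merge_partition(p4, chg)
# result -> [{'id': 1, 'sales': 100}, {'id': 2, 'sales': 250}, {'id': 3, 'sales': 300}]
```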
Apache Sqoop
• Native member of Hadoop tech stack for data ingestion
• Batched ingestion, no CDC
• Java-based utility (web interface in Sqoop2) that spawns Map jobs from the MapReduce engine to
store data in HDFS
• Provides full extract as well as incremental import mode support
• Runs on HDFS cluster and can populate tables in Hive, HBase
• Can establish a data integration layer between NoSQL and HDFS
• Can be integrated with Oozie to schedule import/export tasks
• Supports connectors to multiple relational databases like Oracle, SQL Server, MySQL
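An incremental Sqoop import typically combines the `--incremental append`, `--check-column`, and `--last-value` flags; the sketch below just assembles that argument list in Python rather than invoking Sqoop (the connection string, table, and column names are illustrative):

```python
def build_sqoop_import(jdbc_url, table, check_column, last_value,
                       num_mappers=4):
    """Assemble a 'sqoop import' command for incremental append mode."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--incremental", "append",      # pull only rows beyond last-value
        "--check-column", check_column, # column that tracks new rows
        "--last-value", str(last_value),
        "--num-mappers", str(num_mappers),
    ]

cmd = build_sqoop_import("jdbc:oracle:thin:@db:1521/orcl", "ORDERS",
                         "ORDER_ID", 1000)
```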
Sqoop architecture
• Mapper jobs of MapReduce processing
layer in Hadoop
• By default, a sqoop job has four
mappers
• Rule of Split
• Values of the --split-by column must be
equally distributed across mappers
• The --split-by column should ideally be a
primary key
Sqoop - FYI
Design considerations - I
• Mappers
• --num-mappers [n] argument
• run in parallel within Hadoop. There is no formula, but the count needs to be judiciously set
• Cannot split
• --autoreset-to-one-mapper to perform unsplit extraction
• Source has no PK
• Split based on natural or surrogate key
• Source has character keys
• Divide and conquer! Manual partitions and run one mapper per partition
• If key value is an integer, no worries
Sqoop - FYI
Design considerations - II
• If only subset of columns is required from the source table, specify column list in --columns
argument.
• For example, --columns “orderId, product, sales”
• If limited rows are required to be “sqooped”, specify --where clause with the predicate clause.
• For example, --where “sales > 1000”
• If result of a structured query needs to be imported, use --query clause.
• For example, --query ‘select orderId, product, sales from orders where sales>1000’
• Use --hive-partition-key and --hive-partition-value attributes to create partitions on a column key
from the import
• Delimiters can be handled in either of two ways –
• Specify --hive-drop-import-delims to remove delimiters during import process
• Specify --hive-delims-replacement to replace delimiters with an alternate character
Oracle copyToBDA
• Licensed under Oracle Big Data SQL
• Stack of Oracle BDA, Exadata, and InfiniBand
• Helps in loading Oracle database tables to Hadoop by –
• Dumping the table data in Data Pump format
• Copying them into HDFS
• Full extract and load
• Source data changes
• Rerun the utility to refresh Hive tables
Greenplum’s GPHDFS
• Setup on all segment nodes of a
Greenplum cluster
• All segments concurrently push
the local copies of data splits to
Hadoop cluster
• Cluster segments yield the power
of parallelism
[Diagram: writable external tables on every Greenplum cluster segment push local copies of data splits in parallel to the Hadoop cluster]
Stream unstructured data using Flume
• Distributed system to capture and load large volumes of log data from different source systems to
data lake
• Collection and aggregation of streaming data as events
[Flume agent: the client PUTs incoming events to the source; events pass through the channel; the sink TAKEs them as outgoing data, with source and sink transactions guarding each hop]
Apache Flume - FYI
Design considerations - I
• Channel type
• MEMORY - events are read from source to memory
• Good performance, but volatile. Not cost effective
• FILE – events are read from the source into the file system
• Controllable performance. Persistent and Transactional guarantee
• JDBC - events are read and stored in Derby database
• Slow performance
• Kafka – store events in Kafka topic
• Event batch size - maximum number of events that can be batched by source or sink in a single
transaction
• Fatter the better. But not for FILE channel
• Stable number ensures data consistency
Apache Flume - FYI
Design considerations - II
• Channel capacity and transaction capacity
• For MEMORY channel, channel capacity is limited
by RAM size.
• For FILE, channel capacity is limited by disk size
• Sink batch size should not exceed the transaction
capacity configured for the channel
• Channel selector
• An event can either be replicated to all channels or multiplexed to selected channels
• Replicating is the default; multiplexing routes conditionally on event headers
• Handle high throughput systems
• Tiered architecture to handle event flow
• Aggregate and push approach
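The capacity relationships above (channel capacity ≥ transaction capacity ≥ sink batch size) can be sanity-checked before an agent is deployed; a sketch, assuming a plain dict holding the standard Flume channel properties:

```python
def check_channel(props):
    """Validate capacity >= transactionCapacity >= sink batch size,
    the ordering a Flume channel needs to accept sink batches."""
    problems = []
    if props["transactionCapacity"] > props["capacity"]:
        problems.append("transactionCapacity exceeds channel capacity")
    if props["sinkBatchSize"] > props["transactionCapacity"]:
        problems.append("sink batch size exceeds transactionCapacity")
    return problems

mem_channel = {"capacity": 10000, "transactionCapacity": 1000,
               "sinkBatchSize": 500}
assert check_channel(mem_channel) == []
```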
Questions?
AIOUG Sangam’17
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 

Achieve data democracy in data lake with data integration

  • 1. Achieve Data Democratization with effective Data Integration Saurabh K. Gupta Manager, Data & Analytics, GE www.amazon.com/author/saurabhgupta @saurabhkg
  • 2. Disclaimer: “This report has been prepared by the Authors who are part of GE. The opinions expressed herein by the Authors and the information contained herein are in good faith and the Authors and GE disclaim any liability for the content in the report. The report is the property of GE and GE is the holder of the copyright or any intellectual property over the report. No part of this document may be reproduced in any manner without the written permission of GE. This Report also contains certain information available in the public domain, created and maintained by private and public organizations. GE does not control or guarantee the accuracy, relevance, timeliness or completeness of such information. This Report constitutes a view as on the date of publication and is subject to change. GE does not warrant or solicit any kind of act or omission based on this Report.”
  • 3. AIOUG Sangam’17 ✓ Data lake is a relatively new term compared to other buzzwords coined since the industry realized the potential of data. ✓ Enterprises are bending over backwards to build a stable data strategy and take a leap towards data democratization. ✓ Traditional approaches pertaining to data pipelines, data processing, and data security still hold true, but architects need to go the extra mile while designing a big data lake. ✓ This session will focus on data integration design considerations. We will discuss the relevance of data democratization in the organizational data strategy. Abstract ✓ Data Lake architectural styles ✓ Implement data lake for democratization ✓ Data Ingestion framework principles Learning objectives ✓ Manager, Data & Analytics at General Electric ✓ 11+ years of experience in data architecture, data engineering, analytics ✓ Books authored: ✓ Practical Enterprise Data Lake Insights | Apress | 2018 ✓ Advanced Oracle PL/SQL Developer’s Guide | Packt | 2016 ✓ Oracle Advanced PL/SQL Developer Professional Guide | Packt | 2012 ✓ Speaker at AIOUG, IOUG, NASSCOM ✓ Twitter @saurabhkg ✓ Blog@ sbhoracle.wordpress.com About Me
  • 4. Practical Enterprise Data Lake Insights - Published Footer Handle Data-Driven Challenges in an Enterprise Big Data Lake Use this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues. What You'll Learn: • Get to know data lake architecture and design principles • Implement data capture and streaming strategies • Implement data processing strategies in Hadoop • Understand the data lake security framework and availability model Apress – https://www.apress.com/in/book/9781484235218/ Amazon - https://www.amazon.com/Practical-Enterprise-Data-Lake- Insights/dp/1484235215/
  • 5. Data Explosion Future data trends AIOUG Sangam’17 Data as a Service Cybersecurity Augmented Analytics Machine Intelligence “Fast Data” and “Actionable data”
  • 6. Evolution of Data Lake • James Dixon’s “time machine” vision of data • Leads Data-as-an-Asset strategy “If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.” -James Dixon Footer
  • 7. July 21, 2018 Conceptualizing a Data Lake © 2018 Gartner, Inc. Data Lake Conceptual Design Insight Discovery and Development Data Acquisition Analytics Consumption Optimization and Governance © 2018 Gartner, Inc. Data Lake Input, Processing and Output Data In Acquisition Data Out Consumption Discovery and Development Metadata, Modeling and Governance Data Lake Tooling and Infrastructure Extend analytical capabilities to unrefined data in its native or near-native format Architecture of an Enterprise Data Lake depends upon the Data-In and Data-Out strategy
  • 8. Data Lake Architecture Styles © 2018 Gartner, Inc. Outflow Data Lake Data Science Lab Inflow Data Lake “Data Hub” to bridge information silos and find new insights Under agile governance, serves data to consumers at a rapid scale With minimal governance, enables innovation and drives analytics Analytics as a Service
  • 9. Data Democracy Provide the data in a manner in which it can be consumed - regardless of the people, process or end user technology Data Empowerment Complement business with data Data Commercialization Data backed decisions Data As A Service Data Vision
  • 10. Enterprise Data Lake operational pillars Footer Data Lake Data Management Data Ingestion Data Engineering Data Consumption Data Integration
  • 12. Understand the data • Structured data is an organized piece of information • Aligns strongly with the relational standards • Defined metadata • Easy ingestion, retrieval, and processing • Unstructured data lacks structure and metadata • Not so easy to ingest • Complex retrieval • Complex processing Footer
  • 13. Understand the data sources • OLTP and Data warehouses – structured data from typical relational data stores. • Data management systems – documents and text files • Legacy systems – essential for historical and regulatory analytics • Sensors and IoT devices – Devices installed on healthcare, home, and mobile appliances and large machines can upload logs to the data lake at periodic intervals or in a secure network region • Web content – data from the web world (retail sites, blogs, social media) • Geographical data – data from location services, maps, and geo-positioning systems.
  • 14. Data Ingestion Framework Design Considerations • Data format – What format is the data to be ingested in? • Data change rate – Critical for CDC design and streaming data. Performance is a derivative of throughput and latency. • Data location and security – • Whether data is located on-premise or on public cloud infrastructure. While fetching data from cloud instances, network bandwidth plays an important role. • If the data source is enclosed within a security layer, the ingestion framework should enable establishment of a secure tunnel to collect data for ingestion • Transfer data size (file compression and file splitting) – what would be the average and maximum size of a block or object in a single ingestion operation? • Target file format – Data from a source system needs to be ingested in a Hadoop-compatible file format.
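The design considerations above can be made explicit by capturing each decision as a field in a job specification. A minimal sketch (all names and allowed values are hypothetical, not from any specific framework):

```python
from dataclasses import dataclass

@dataclass
class IngestionSpec:
    source_format: str    # e.g. "csv", "json", "avro"
    change_rate: str      # "batch", "cdc", or "stream"
    location: str         # "on-premise" or "cloud"
    secure_tunnel: bool   # True if the source sits behind a security layer
    max_transfer_mb: int  # upper bound on a single ingested block/object
    target_format: str    # Hadoop-compatible target format

    def validate(self) -> bool:
        # A minimal sanity check; a real framework would enforce far more.
        return self.target_format in {"parquet", "orc", "avro", "sequencefile"}

spec = IngestionSpec("csv", "batch", "cloud", True, 512, "parquet")
```

Writing the spec down per source forces the team to answer every design question before the pipeline is built.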
  • 15. ETL vs ELT for Data Lake • Heavy transformation may restrict data surface area for data exploration • Brings down the data agility • Transformation on huge volumes of data may foster a latency between data source and data lake • Curated layer to empower analytical models Footer ETL ELT
  • 16. Batched data ingestion principles Structured data • Data collector fires a SELECT query (also known as filter query) on the source to pull incremental records or a full extract • Query performance and source workload determine how efficient the data collector is • Robust and flexible • Change Track flag – flag rows with the operation code • Incremental extraction – pull all the changes after a certain timestamp • Full Extraction – refresh target on every ingestion run Standard Ingestion techniques
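The filter query for the full versus incremental techniques above can be sketched as below (table and column names are illustrative only):

```python
def build_extract_query(table, mode, ts_column=None, last_run=None):
    """Build the SELECT (filter) query a data collector fires on the source."""
    base = f"SELECT * FROM {table}"
    if mode == "full":
        return base  # full extraction: refresh the target on every run
    if mode == "incremental":
        # pull all changes committed after the last successful run
        return f"{base} WHERE {ts_column} > '{last_run}'"
    raise ValueError(f"unknown extraction mode: {mode}")

q = build_extract_query("orders", "incremental",
                        ts_column="updated_at",
                        last_run="2018-07-01 00:00:00")
```

The incremental predicate is what keeps source workload low, so the choice of timestamp column (indexed, monotonically increasing) matters as much as the query itself.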
  • 17. Change Data Capture • Log mining process • Capture changed data from the source system’s transaction logs and integrate with the target • Eliminates the need to run SQL queries on the source system. Incurs no load overhead on a transactional source system. • Achieves near real-time replication between source and target
  • 18. Change Data Capture Design Considerations • Source database must be enabled for logging • Commercial tools - Oracle GoldenGate, HVR, Talend CDC, custom replicators • Keys are extremely important for replication • Helps the capture job in establishing uniqueness of a record in the changed data set • Source PK ensures the changes are applied to the correct record on target • PK not available; establish uniqueness based on composite columns • Establish uniqueness based on a unique constraint - terrible design!! • Trigger based CDC • An event on a table triggers the change to be captured in a change-log table • Change-log table merged with the target • Works when source transaction logs are not available
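The merge step the considerations above lead to, applying a captured change set to the target by primary key, can be sketched as follows. The operation codes "I"/"U"/"D" are illustrative, not tied to any specific CDC tool:

```python
def apply_changes(target: dict, changes: list) -> dict:
    """Apply (op, key, row) change records to a target keyed by source PK."""
    for op, key, row in changes:
        if op in ("I", "U"):
            target[key] = row        # insert or update lands on the keyed record
        elif op == "D":
            target.pop(key, None)    # delete removes the record if present
    return target

target = {1: {"status": "open"}, 2: {"status": "open"}}
changes = [("U", 1, {"status": "closed"}),
           ("D", 2, None),
           ("I", 3, {"status": "open"})]
apply_changes(target, changes)
```

Note the key is doing all the work here: without a reliable PK (or composite substitute), an update cannot be routed to the correct target record, which is why the slide flags key selection as critical.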
  • 19. LinkedIn Databus CDC capture pipeline • Relay is responsible for pulling the most recent committed transactions from the source • Relays are implemented through tungsten replicator • Relay stores the changes in logs or cache in compressed format • Consumer pulls the changes from relay • Bootstrap component – a snapshot of the data source on a temporary instance. It is consistent with the changes captured by Relay • If any consumer falls behind and can’t find the changes in relay, the bootstrap component transforms and packages the changes for the consumer • A new consumer, with the help of the client library, can apply all the changes from the bootstrap component up to a point in time. The client library will then point the consumer to Relay to continue pulling the most recent changes Relay LogWriter Log Storage LogApplier Snapshot Storage Consolidated changes Consistent Snapshot
  • 20. Change merge techniques Design Considerations Footer Exchange Partition Prepare final dataset with merged changes P1 P2 P3 P4 #Changes P4 1 Change capture 2 Pull most recent partition for Compare andMerge 3 4 Data source Hive Table with partitions Flow • Table partitioned on time dimension • Changes are captured incrementally • Changes tagged by table name and the most recent partition • Exchange partition process - recent partition compared against the “change” data set for merging
  • 21. Apache Sqoop • Native member of Hadoop tech stack for data ingestion • Batched ingestion, no CDC • Java based utility (web interface in Sqoop2) that spawns Map jobs from MapReduce engine to store data in HDFS • Provides full extract as well as incremental import mode support • Runs on HDFS cluster and can populate tables in Hive, HBase • Can establish a data integration layer between NoSQL and HDFS • Can be integrated with Oozie to schedule import/export tasks • Supports connectors to multiple relational databases like Oracle, SQL Server, MySQL Footer
  • 22. Sqoop architecture • Mapper jobs of MapReduce processing layer in Hadoop • By default, a sqoop job has four mappers • Rule of Split • Values of the --split-by column must be equally distributed across mappers • The --split-by column should ideally be the primary key
  • 23. Sqoop - FYI Design considerations - I • Mappers • --num-mappers [n] argument • run in parallel within Hadoop. No formula but needs to be judiciously set • Cannot split • --autoreset-to-one-mapper to perform unsplit extraction • Source has no PK • Split based on natural or surrogate key • Source has character keys • Divide and conquer! Manual partitions and run one mapper per partition • If key value is an integer, no worries Footer
  • 24. Sqoop - FYI Design considerations - II • If only subset of columns is required from the source table, specify column list in --columns argument. • For example, --columns “orderId, product, sales” • If limited rows are required to be “sqooped”, specify --where clause with the predicate clause. • For example, --where “sales > 1000” • If result of a structured query needs to be imported, use --query clause. • For example, --query ‘select orderId, product, sales from orders where sales>1000’ • Use --hive-partition-key and --hive-partition-value attributes to create partitions on a column key from the import • Delimiters can be handled through either of the below ways – • Specify --hive-drop-import-delims to remove delimiters during import process • Specify --hive-delims-replacement to replace delimiters with an alternate character Footer
  • 25. Oracle copyToBDA • Licensed under Oracle BigData SQL • Stack of Oracle BDA, Exadata, Infiniband • Helps in loading Oracle database tables to Hadoop by – • Dumping the table data in Data Pump format • Copying them into HDFS • Full extract and load • Source data changes • Rerun the utility to refresh Hive tables Footer
  • 26. Greenplum’s GPHDFS • Setup on all segment nodes of a Greenplum cluster • All segments concurrently push the local copies of data splits to the Hadoop cluster • Cluster segments yield the power of parallelism Writeable Ext Table (one per segment) Greenplum cluster → Hadoop cluster
  • 27. Stream unstructured data using Flume • Distributed system to capture and load large volumes of log data from different source systems to the data lake • Collection and aggregation of streaming data as events Flume Agent: Source → Channel → Sink (client PUT on the source transaction, TAKE on the sink transaction)
  • 28. Apache Flume - FYI Design considerations - I • Channel type • MEMORY - events are read from source to memory • Good performance, but volatile. Not cost effective • FILE – events are read from source into the file system • Controllable performance. Persistent and transactional guarantee • JDBC - events are read and stored in a Derby database • Slow performance • Kafka – store events in a Kafka topic • Event batch size - maximum number of events that can be batched by source or sink in a single transaction • The fatter the better, but not for the FILE channel • A stable number ensures data consistency
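The channel-type and batch-size choices above land in the Flume agent's properties file. An illustrative fragment (agent, source, channel, and sink names plus the HDFS path are placeholders) choosing a FILE channel for durability with explicit capacities:

```properties
# Agent topology: one source, one durable channel, one HDFS sink
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# FILE channel: persistent, transactional; capacity bounded by disk size
agent1.channels.ch1.type = file
agent1.channels.ch1.capacity = 100000
agent1.channels.ch1.transactionCapacity = 1000

agent1.sources.src1.channels = ch1

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /data/lake/events
agent1.sinks.sink1.hdfs.batchSize = 1000
agent1.sinks.sink1.channel = ch1
```

Keeping `transactionCapacity` at or above the sink batch size avoids the commit failures that occur when a sink tries to take more events per transaction than the channel allows.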
  • 29. Apache Flume - FYI Design considerations - II • Channel capacity and transaction capacity • For MEMORY channel, channel capacity is limited by RAM size. • For FILE, channel capacity is limited by disk size • Should not exceed batch size configured for the sinks • Channel selector • An event can either be replicated or multiplexed • Preferable vs conditional • Handle high throughput systems • Tiered architecture to handle event flow • Aggregate and push approach Footer