HBaseCon 2013: ETL for Apache HBase
Presented by: Manoj Khanwalkar (Experian) and Govind Asawa (Experian)
1.
HBaseCon 2013 Application Track – Case Study
Experian Marketing Services: ETL for HBase
© 2013 Experian Limited. All rights reserved.
2.
Who We Are
• Manoj Khanwalkar, Chief Architect, Experian Marketing Services, New York
• Govind Asawa, Big Data Architect, Experian Marketing Services, New York
3.
Agenda
  1. About Experian Marketing Services
  2. Why HBase
  3. Why custom ETL
  4. ETL solution features
  5. Performance
  6. Case Study
  7. Conclusion
4.
Experian Marketing Services
• 1 billion+ messages daily: email and social digital marketing messages, with a 100% surge in volume during peak season
• 2000+ institutional clients: across all verticals
• 9 regions, 24/7: platforms operating globally
• 500+ terabytes of data: clients need 1 to 7 years of marketing data, depending on vertical
• 200+ big queries: complicated queries on 200+ million records, with 400+ columns for segmentation
• 2000+ data export jobs: clients need daily incremental activity data
5.
Why HBase
• A traditional RDBMS-based solution is very challenging and cost-prohibitive at our scale of operations.
• In a SaaS-based multi-tenancy model, we require schema flexibility to support thousands of clients with their individual requirements.
• In the majority of cases, key-based lookups (including range scans and filters) can satisfy data extraction requirements, and HBase supports these well.
• HBase shards automatically and scales horizontally.
• HBase provides a Java API that can be integrated with Experian's other systems.
6.
Why develop an Integrator toolkit?
Considerations: Connectivity, Environment, Cost
• Ability to ingest and read data from HBase and MongoDB
• Connectors for cloud computing
• Support for REST and other industry-standard APIs
• Supports the SaaS model
• Dynamically handles data input changes (number of fields and new fields)
• Licensing
• Integrates with other systems seamlessly, improving time to market
• Resources required to develop, administer, and maintain the solution
• Major ETL vendors do not support HBase
• A vendor ETL solution needs extensive development whenever the data structure changes, which negates the advantages offered by a NoSQL solution
7.
Integrator Architecture
[Architecture diagram. Labels recovered from the slide: Source Systems (Files, RDBMS, HBase, MongoDB, SaaS, JMS), Connectors, Data Ingester, CSV Reader, Processor, Event Listener, Message Broker, File Watcher, Parser Factory, Key Generator, Parser, Loader, RDBMS Loader, HBase Loader, Container, Metadata, Analyzer, Aggregator, Extractor, Query, Output, Aggregate Aware, Stamping, Transform, Target Systems, Third Party, JMS, Database, RDBMS, HBase]
8.
Extractor Architecture
[Architecture diagram. Labels recovered from the slide: HDFS, Integrator, Send Data, Click Data, Bounce Data, TXN Data, Metadata, Detailed data, Aggregates, HBase, Web Server, Reporting, Analytics, Extractor, Query Optimizer]
9.
Integrator & Extractor
Data ingestion from multiple sources
• Flat files
• NoSQL
• RDBMS (through JDBC)
• SaaS (Salesforce etc.)
• Messaging and any system providing event streaming
Ability to de-normalize the fact table while ingesting data
• The number of lookup tables can be configured
Near-real-time generation of aggregate tables
• The number of aggregate tables can be configured
• HBase counters are used to keep the aggregated sum/count
• Aggregates can be populated concurrently in the RDBMS of choice
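The de-normalization and near-real-time aggregation described on this slide can be sketched in a few lines. This is a minimal in-memory illustration, not the Integrator's code: `AGGREGATES`, `ingest`, and the `(client_id, user_id)` lookup key are hypothetical names, and a plain `Counter` stands in for the HBase counters the slide mentions.

```python
from collections import Counter

# Hypothetical aggregate definitions: table name -> grouping columns.
# The column lists mirror aggregate tables A1 and A2 from the appendix.
AGGREGATES = {
    "A1": ("campaign_id", "date", "gender", "state", "country"),
    "A2": ("client_id", "date", "gender", "state", "country"),
}


def ingest(records, demographics, aggregates=AGGREGATES):
    """De-normalize each fact record and bump every configured aggregate.

    `demographics` is a lookup table keyed by (client_id, user_id); that key
    is an assumption made for this sketch.
    """
    counters = {name: Counter() for name in aggregates}
    for rec in records:
        # De-normalize: copy the lookup columns onto the fact record.
        lookup = demographics.get((rec["client_id"], rec["user_id"]), {})
        row = {**rec, **lookup}
        # Aggregate while ingesting: one counter increment per aggregate table.
        for name, dims in aggregates.items():
            counters[name][tuple(row.get(d) for d in dims)] += 1
    return counters
```

Because the aggregates are updated in the same pass that loads the base record, no later batch job has to re-read the fact table to build them.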
10.
Integrator & Extractor
Transformation of a column value to another value
• Add a column by transformation
• Drop columns from the input stream if no persistence is required
Data filter capability
• Drop records while ingesting the base table
• Drop records during aggregation
Aggregate-aware optimized query execution
• Query performance: the Extractor analyzes the columns requested in the user's query and, based on record counts, picks the table with the fewest records that can satisfy the request.
• Transparent: no user intervention or knowledge of the schema is required.
• Optimizer: conceptually similar to an RDBMS query plan optimizer, with the concept extended to NoSQL databases.
• Metadata management: metadata integrated with the ETL process can be used by a variety of applications.
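The transformation and filter features listed on this slide might look roughly like the following. It is a sketch under assumptions: the function name `transform` and its `derive`, `drop`, and `keep` parameters are invented for illustration and are not taken from the Integrator.

```python
def transform(records, derive=None, drop=(), keep=lambda rec: True):
    """Apply filter, derived-column, and drop-column rules to a record stream.

    derive: mapping of new column name -> function(record) returning its value
    drop:   column names that should not be persisted
    keep:   predicate; records for which it returns False are discarded
    """
    derive = derive or {}
    for rec in records:
        if not keep(rec):
            continue  # data filter: drop the record while ingesting
        out = dict(rec)
        for name, fn in derive.items():
            out[name] = fn(rec)  # add a column by transformation
        for name in drop:
            out.pop(name, None)  # drop columns that need no persistence
        yield out
```

For example, `transform(rows, derive={"rcpt_tld": lambda r: r["rcpt_domain"].rsplit(".", 1)[-1]}, drop=("ip",))` would add a top-level-domain column and discard the IP column before loading.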
11.
Integrator
Framework
• The solution uses Spring as a lightweight container. We built a framework around it to standardize the process lifecycle and to let any arbitrary functionality reside in the container by implementing a Service interface.
• The container runs in batch processing mode or daemon mode.
• In daemon mode, it uses the Java 7 File Watcher API to react to files placed in the specified directory for processing.
Metadata catalogue
• Metadata is stored for every HBase table into which data is ingested.
• For each table, the primary key, columns, and record counter are stored.
• An HBase count is a brute-force scan and an expensive API call. It can be avoided if metadata is published at the time of data ingestion.
• This avoids expensive queries that can bring the cluster to its knees.
• It also provides faster query performance.
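The idea of publishing metadata at ingestion time, so that a row count never requires a table scan, can be illustrated with a small in-memory catalogue. The class and method names below are hypothetical; the real catalogue's storage and API are not described in the slides.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    primary_key: tuple
    columns: tuple
    count: int = 0  # record counter maintained by the loader


class MetadataCatalogue:
    """In-memory stand-in for the metadata catalogue (an assumption)."""

    def __init__(self):
        self._tables = {}

    def register(self, table, primary_key, columns):
        # Registering twice keeps the existing entry and its counter.
        self._tables.setdefault(
            table, TableMetadata(tuple(primary_key), tuple(columns))
        )

    def record_ingest(self, table, rows_written):
        # Called by the loader after each write, so the count is always
        # available without scanning the table. Note: this counts writes;
        # overwrites of an existing key would need separate handling.
        self._tables[table].count += rows_written

    def describe(self, table):
        return self._tables[table]
```

Reading `describe("A1").count` is a dictionary lookup, which is the contrast the slide draws with a brute-force count over the table.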
12.
Integrator – System Performance
• We used a 20-node cluster in production; each node had 24 cores, with a 10GigE network backbone.
• We observed a throughput of 1.3 million records inserted into HBase per minute per node.
• The framework allowed us to run the ETL process on multiple machines, providing horizontal scalability.
• Most of our queries returned in at most a few seconds.
13.
Conclusion
• Our experience shows that HBase offers a cost-effective and performant solution for managing our data explosion while meeting clients' increasingly sophisticated analytical and reporting requirements.
• The ETL framework allows us to leverage HBase and its features while improving developer productivity.
• The framework lets us roll out new functionality with minimum time to market.
• The metadata catalogue optimizes queries and improves cluster performance.
• Select count() on a big HBase table takes minutes or hours and can bring the cluster to its knees. The Integrator's metadata gives counts, along with the primary key and columns, in milliseconds.
14.
Appendix
• Case Study
15.
HBase Schema & Record
Fact Table send
Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | DSN status | Bounce cat | IP | Time queued
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | success | | 192.168.6.23 | 01/01/2013
2 | 12 | 01/02/13 | 31 | xyz.com | yahoo.com | success | bad-mailbox | 112.168.6.23 | 01/01/2013
Send Record
client_id,campaign_id,time_logged,user_id,orig_domain,rcpt_domain,dsn_status,bounce_cat,ip,Time_queued
1,11,01/01/2013,21,abc.com,gmail.com,success,192.168.6.23,01/01/2013
16.
HBase Schema & Record
Fact Table activity
Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | IP city | Event type | IP | Send time
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | SFO | Open | 192.168.6.23 | 01/01/2013
2 | 12 | 01/04/13 | 31 | xyz.com | yahoo.com | LA | Click | 112.168.6.23 | 01/01/2013
Activity Record
client_id,campaign_id,event_time,user_id,event_type
1,11,01/01/2013,21,open
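Since each input file carries its own header, one way to handle a changing number of fields is to let the header drive the parse. The sketch below feeds the Activity Record from this slide through Python's standard `csv` module; the function name is hypothetical, and the Integrator's actual parser is not shown in the slides.

```python
import csv
import io


def parse_with_own_header(text):
    """Parse delimited text using the header row found in the text itself.

    Returns (field_names, rows), where each row is a dict keyed by the
    header fields, so new or reordered columns need no code change.
    """
    reader = csv.DictReader(io.StringIO(text))
    field_names = reader.fieldnames  # reads the header row
    return field_names, list(reader)


ACTIVITY_SAMPLE = (
    "client_id,campaign_id,event_time,user_id,event_type\n"
    "1,11,01/01/2013,21,open\n"
)
```

Calling `parse_with_own_header(ACTIVITY_SAMPLE)` yields the five field names and a single row whose `event_type` is `open`.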
17.
HBase Schema & Record
Dimension Table demographics
Client ID | User ID | Date | Age | Gender | State | City | Zip | Country | Flag
1 | 11 | 01/01/13 | 21 | M | CA | SFO | 94087 | USA | Y
2 | 12 | 01/02/13 | 31 | M | CA | SFO | 94087 | USA | N
Dimension Table ip
IP | Date | Domain | State | Country | City
192.168.6.23 | 01/01/2013 | gmail.com | CA | USA | SFO
112.168.6.23 | 01/02/2013 | abc.edu | NJ | USA | Newark
18.
HBase Schema & Record
Aggregate Table A1
Campaign ID | Date | Gender | State | Country | Count
11 | 01/01/13 | M | CA | USA | 5023
12 | 01/02/13 | M | CA | USA | 74890
Aggregate Table A2
Client ID | Date | Gender | State | Country | Count
1 | 01/01/13 | M | CA | USA | 742345
2 | 01/02/13 | M | CA | USA | 1023456
19.
Metadata
Metadata Table
Table Name | Primary Key | Columns | Count
demographics | Client_id,Campaign_id,Date | Client_id,Campaign_id,Date,Age,Gender,State,City,Country,Flag | 10,000,000
A1 | Campaign_id,Date | Campaign_id,Date,Gender,State,Country,Count | 1,000,000
A2 | Client_id,Date | Client_id,Date,Gender,State,Country,Count | 500,000
20.
Query Execution in Action
User query without Extractor aggregate awareness
• Select client_id,state,count from demographics
• Query execution: the query is executed on the demographics table, which has 300,000,000 rows.
User query with Extractor aggregate awareness
• Select client_id,state,count from demographics
• Query execution:
– Step 1: The Extractor parses the list of columns from the query.
– Step 2: The Extractor finds the tables that contain these columns. In this example it gets 2 tables, demographics and A1, which can satisfy the query.
– Step 3: The Extractor decides which table is best for the query, based on the number of rows in each table. In this example table A1 has fewer rows than demographics, so A1 is selected.
– Step 4: The query is executed against table A1 with the where clause specified by the user.
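Steps 2 and 3 amount to a lookup over the metadata catalogue: keep the tables whose columns cover the request, then take the one with the fewest rows. The sketch below is an assumption about how that selection could be written, not the Extractor's implementation. The catalogue contents are copied from the Metadata Table slide; `CATALOGUE` and `choose_table` are hypothetical names, and the example column list is chosen for illustration.

```python
# Columns and row counts as listed on the Metadata Table slide.
CATALOGUE = {
    "demographics": {
        "columns": {"client_id", "campaign_id", "date", "age", "gender",
                    "state", "city", "country", "flag"},
        "count": 10_000_000,
    },
    "A1": {
        "columns": {"campaign_id", "date", "gender", "state", "country", "count"},
        "count": 1_000_000,
    },
    "A2": {
        "columns": {"client_id", "date", "gender", "state", "country", "count"},
        "count": 500_000,
    },
}


def choose_table(requested_columns, catalogue=CATALOGUE):
    """Return the smallest table that contains every requested column."""
    wanted = {col.lower() for col in requested_columns}
    # Step 2: tables whose column set covers the request.
    candidates = [
        (meta["count"], name)
        for name, meta in catalogue.items()
        if wanted <= meta["columns"]
    ]
    if not candidates:
        raise LookupError(f"no table provides columns {sorted(wanted)}")
    # Step 3: prefer the table with the fewest rows.
    return min(candidates)[1]
```

With this catalogue, `choose_table(["campaign_id", "state"])` finds two candidates, demographics and A1, and returns `A1` because it has fewer rows. The selection never touches HBase itself, which is why it stays cheap.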
21.
HBase Design Considerations
• Bloom filters were enabled at the row level so that HBase can skip files efficiently.
• We used HBase filters extensively in scans to filter out as much data as possible on the server side.
• We defined aggregates judiciously so that queries can be answered without HBase resorting to large file scans.
• We used a key concatenation aligned to the expected search patterns, so that HBase can provide an exact match or do efficient key range scans when a partial key is provided.
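The key-concatenation point can be illustrated without HBase: if row keys sort lexicographically, a key built from fixed-width components supports both exact lookups and range scans on a leading prefix. The component choice (client, campaign, date), the widths, and the separator below are assumptions for the sketch; the slides do not give the actual key layout. A sorted Python list stands in for HBase's sorted row keys.

```python
import bisect


def make_key(client_id, campaign_id, day):
    """Concatenate zero-padded components so string order matches numeric order.

    `day` is expected as YYYYMMDD so that dates also sort correctly.
    """
    return f"{client_id:08d}|{campaign_id:08d}|{day}"


def prefix_scan(sorted_keys, prefix):
    """Return all keys that start with `prefix`, using two binary searches."""
    start = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" sorts after every character used in the keys, so this marks
    # the first position past the prefix range.
    stop = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[start:stop]
```

Supplying only the leading portion, for example `prefix_scan(keys, make_key(1, 11, "")[:18])`, returns every date for client 1 and campaign 11 without touching rows for other clients.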
22.
HBase Design Considerations
• We did not use MapReduce in our ETL framework, for the following reasons:
– Overhead of MapReduce-based processes
– Need for real-time access to data
– Every file had different header metadata, and in MapReduce we had difficulty passing the header metadata to each Map process
– Avoiding intermediate reads and writes to the HDFS file system
23.
HBase Tuning
• We split input and output processing into separate threads and allocated many more threads to output processing to compensate for the relative processing speeds.
• We batched writes to HBase to reduce the number of calls to the server.
• We turned off the WAL in HBase, since we could always reprocess the file in the rare case of a failure.
• We used primitives and arrays in the code where feasible instead of Java objects and collections, to reduce the memory footprint and the pressure on the garbage collector.
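The first two tuning points, one input thread feeding several output threads that write in batches, can be sketched with a bounded queue. This is an illustration only: `run_pipeline`, its parameters, and the `sink` callable (which stands in for a batched write to HBase) are hypothetical, and the original framework was written in Java rather than Python.

```python
import queue
import threading


def run_pipeline(records, sink, writers=4, batch_size=100):
    """Read records on one thread and write them in batches on several.

    `sink` is called with a list of records; it must be safe to call from
    multiple threads.
    """
    q = queue.Queue(maxsize=1000)  # bounded, so the reader cannot race ahead
    stop = object()  # sentinel telling a writer to finish

    def reader():
        for rec in records:
            q.put(rec)
        for _ in range(writers):
            q.put(stop)

    def writer():
        batch = []
        while True:
            item = q.get()
            if item is stop:
                break
            batch.append(item)
            if len(batch) >= batch_size:
                sink(batch)  # one call per batch instead of one per record
                batch = []
        if batch:
            sink(batch)  # flush the final partial batch

    threads = [threading.Thread(target=reader)]
    threads += [threading.Thread(target=writer) for _ in range(writers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The writer count and batch size are the two knobs the slide describes: more writers offset a slow sink, and larger batches reduce the number of server calls.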
24.
HBase Tuning
• We increased the client write buffer size to several megabytes.
• To avoid hotspots and get the best data retrieval, we designed a composite primary key. The key design allowed us to access data by exact key or by range scan on the leading portion of the key.
• We found that too many filters on a scan give diminishing returns and, past some point, degrade overall scan performance.
25.
Thank you
For more information, please contact:
• Manoj Khanwalkar, Chief Architect, manoj.khanwalkar@experian.com
• Govind Asawa, Big Data Architect, govind.asawa@experian.com