© 2013 Experian Limited. All rights reserved.
HBaseCon 2013
Application Track – Case Study
Experian Marketing Services
ETL for HBase

Who We Are
Manoj Khanwalkar
Chief Architect
Experian Marketing Services, New York
Govind Asawa
Big Data Architect
Experian Marketing Services, New York

Agenda
1. About Experian Marketing Services
2. Why HBase
3. Why custom ETL
4. ETL solution features
5. Performance
6. Case Study
7. Conclusion

Experian Marketing Services
• 1 billion+ messages daily: email and social digital marketing messages, with a 100% surge in volume during peak season
• 2000+ institutional clients: across all verticals
• 9 regions, 24/7: platforms operating globally
• 500+ terabytes of data: clients need 1 to 7 years of marketing data, depending on the vertical
• 200+ big queries: complicated queries on 200+ million records, with 400+ columns for segmentation
• 2000+ data export jobs: clients need daily incremental activity data

Why HBase
• A traditional RDBMS-based solution is very challenging and cost-prohibitive at our scale of operations.
• In a SaaS-based multi-tenancy model we require schema flexibility to support thousands of clients with their individual requirements.
• In the majority of cases, key-based lookups (including range scans and filters) can satisfy our data extraction requirements, and HBase supports these well.
• HBase offers automatic sharding and is horizontally scalable.
• HBase provides a Java API which can be integrated with Experian’s other systems.

Why develop an Integrator toolkit?
Connectivity
• Ability to ingest and read data from HBase and MongoDB
• Connectors for cloud computing
• Support for REST and other industry-standard APIs
Environment
• Supports the SaaS model
• Dynamically handles data input changes (number of fields and new fields)
Cost
• Licensing
• Integrates with other systems seamlessly, thus improving time to market
• Resources required to develop, administer and maintain the solution

• Major ETL vendors do not support HBase.
• An ETL solution needs extensive development whenever the data structure changes, which negates the advantages offered by a NoSQL solution.

Integrator Architecture
[Architecture diagram. Component labels: Source Systems, Data Ingester, Target Systems, Extractor, Third Party, JMS, Database, Connectors, CSV Reader, Processor, Event Listener, Message Broker, File Watcher, Parser Factory, Key Generator, Parser, Loader, RDBMS Loader, HBase Loader, Container, Metadata, Analyzer, Aggregator, Query Output, Aggregate Aware, Stamping Transform, SaaS, Files, RDBMS, HBase, MongoDB.]

Extractor Architecture
[Architecture diagram. Component labels: Integrator, Send Data, Click Data, Bounce Data, TXN Data, HDFS, HBase, Metadata, Detailed data, Aggregates, Extractor, Query Optimizer, Web Server, Reporting, Analytics.]

Integrator & Extractor
Data ingestion from multiple sources
• Flat files
• NoSQL
• RDBMS (through JDBC)
• SaaS (Salesforce etc.)
• Messaging and any system providing event streaming
Ability to de-normalize the fact table while ingesting data
• The number of lookup tables can be configured
Near-real-time generation of aggregate tables
• The number of aggregate tables can be configured
• HBase counters are used to keep aggregated sums and counts
• Aggregates can be populated concurrently in the RDBMS of choice
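
A minimal sketch of the ingestion idea on this slide, written in Python with plain dictionaries standing in for HBase tables and HBase counters. The names used here (`ingest`, `DEMOGRAPHICS`, the row-key layout) are illustrative assumptions borrowed from the appendix tables, not the Integrator's actual API.

```python
from collections import Counter

# Hypothetical lookup (dimension) table keyed by (client_id, user_id).
DEMOGRAPHICS = {
    (1, 21): {"gender": "M", "state": "CA", "country": "USA"},
    (1, 22): {"gender": "M", "state": "CA", "country": "USA"},
}

fact_table = {}            # stand-in for the de-normalized HBase fact table
aggregate_a1 = Counter()   # stand-in for HBase counters backing aggregate A1


def ingest(record):
    """De-normalize one activity record and bump its aggregate counter."""
    dims = DEMOGRAPHICS.get((record["client_id"], record["user_id"]), {})
    row = {**record, **dims}  # stamp the lookup columns onto the fact row

    row_key = "|".join(str(record[k]) for k in
                       ("client_id", "campaign_id", "event_time", "user_id"))
    fact_table[row_key] = row

    # Aggregate key mirrors table A1: campaign, date, gender, state, country.
    agg_key = (row["campaign_id"], row["event_time"],
               row.get("gender"), row.get("state"), row.get("country"))
    aggregate_a1[agg_key] += 1
    return row_key


ingest({"client_id": 1, "campaign_id": 11, "event_time": "01/01/2013",
        "user_id": 21, "event_type": "open"})
ingest({"client_id": 1, "campaign_id": 11, "event_time": "01/01/2013",
        "user_id": 22, "event_type": "click"})
```

Because the aggregate is updated in the same step as the fact row, it is available as soon as the record is ingested, which is the near-real-time property the slide describes.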

Integrator & Extractor
Transformation of a column value to another value
• Add a column by transformation
• Drop columns from the input stream if no persistence is required
Data filter capability
• Drop a record while ingesting into the base table
• Drop a record during aggregation
Aggregate-aware optimized query execution
• Query performance: the Extractor analyzes the columns requested in the user's query and, based on the record counts, picks the table with the fewest records that can satisfy the request.
• Transparent: no user intervention or knowledge of the schema is required.
• Optimizer: conceptually similar to an RDBMS query plan optimizer, with the concept extended to NoSQL databases.
• Metadata management: metadata integrated with the ETL process can be used by a variety of applications.
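
The three ingest-time operations listed above (add a column by transformation, drop columns, filter records) can be sketched as a single function. This is an illustration only: `transform`, its parameters and the `rcpt_provider` column are hypothetical names, and the slides do not show how the Integrator is actually configured.

```python
def transform(record, add_columns=None, drop_columns=(), keep_if=None):
    """Apply the three ingest-time operations to one record.

    Returns the transformed record, or None when the filter drops it.
    """
    if keep_if is not None and not keep_if(record):
        return None  # data filter: the record never reaches the base table
    out = dict(record)
    for name, fn in (add_columns or {}).items():
        out[name] = fn(record)  # add a column derived from existing values
    for name in drop_columns:
        out.pop(name, None)  # drop columns that need no persistence
    return out


rules = {
    "add_columns": {"rcpt_provider": lambda r: r["rcpt_domain"].split(".")[0]},
    "drop_columns": ("ip",),
    "keep_if": lambda r: r["dsn_status"] == "success",
}

sent = {"client_id": 1, "rcpt_domain": "gmail.com",
        "dsn_status": "success", "ip": "192.168.6.23"}
failed = {**sent, "dsn_status": "failed"}

kept = transform(sent, **rules)       # transformed copy of the record
dropped = transform(failed, **rules)  # None: filtered out
```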

Integrator
Framework
• The solution uses Spring as a lightweight container, with a framework built around it to standardize the process lifecycle and to let any arbitrary functionality reside in the container by implementing a Service interface.
• The container runs in either batch processing or daemon mode.
• In daemon mode, it uses the Java 7 File Watcher API to react to files placed in the specified directory for processing.
Metadata catalogue
• Metadata is stored for every HBase table into which data is ingested.
• For each table, the primary key, columns and record count are stored.
• Counting rows in HBase is a brute-force scan and an expensive API call. It can be avoided if metadata is published at the time of data ingestion.
• This avoids expensive queries which can bring the cluster to its knees.
• It also provides faster query performance.
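
A rough sketch of a catalogue that is updated while data is loaded, so that a row count becomes a lookup instead of a table scan. The class and method names (`MetadataCatalogue`, `register`, `publish`) are assumptions made for illustration; the slides do not describe the real interface.

```python
from dataclasses import dataclass, field


@dataclass
class TableMetadata:
    primary_key: tuple
    columns: set = field(default_factory=set)
    record_count: int = 0


class MetadataCatalogue:
    """Hypothetical catalogue kept up to date by the loader at ingest time."""

    def __init__(self):
        self._tables = {}

    def register(self, table, primary_key):
        self._tables[table] = TableMetadata(primary_key=tuple(primary_key))

    def publish(self, table, record):
        """Called once per newly inserted row: note its columns, bump the count."""
        meta = self._tables[table]
        meta.columns.update(record)  # iterating a dict yields its column names
        meta.record_count += 1

    def count(self, table):
        """Constant-time lookup that replaces a full-table row count."""
        return self._tables[table].record_count

    def columns(self, table):
        return set(self._tables[table].columns)


catalogue = MetadataCatalogue()
catalogue.register("A2", ["client_id", "date"])
for day in ("01/01/13", "01/02/13"):
    catalogue.publish("A2", {"client_id": 1, "date": day, "count": 0})
```

The sketch assumes every published record is a new row. A loader that can overwrite an existing key would need to detect that case before incrementing the counter.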

Integrator – System Performance
• We used a 20-node cluster in production; each node had 24 cores, with a 10GigE network backbone.
• We observed a throughput of 1.3 million records inserted into HBase per minute per node.
• The framework allowed us to run the ETL process on multiple machines, thus providing horizontal scalability.
• Most of our queries returned in at most a few seconds.
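
For scale, the reported per-node rate works out to roughly 21,667 records per second per node. If all 20 nodes sustained that rate at the same time, the cluster would absorb 26 million records per minute; that is an assumption, since the slide does not state a cluster-wide figure. The arithmetic:

```python
RECORDS_PER_MINUTE_PER_NODE = 1_300_000  # reported on the slide
NODES = 20                               # reported cluster size

per_node_per_second = RECORDS_PER_MINUTE_PER_NODE / 60
# Assumption: every node sustains the per-node rate simultaneously.
cluster_per_minute = RECORDS_PER_MINUTE_PER_NODE * NODES

print(f"{per_node_per_second:,.0f} records/s per node")
print(f"{cluster_per_minute:,} records/min across the cluster")
```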

Conclusion
• Our experience shows that HBase offers a cost-effective and performant solution for managing our data explosion while meeting the increasingly sophisticated analytical and reporting requirements of clients.
• The ETL framework allows us to leverage HBase and its features while improving developer productivity.
• The framework gives us the ability to roll out new functionality with minimum time to market.
• The metadata catalogue optimizes queries and improves cluster performance.
• A select count() on a big HBase table takes minutes or hours and can bring the cluster to its knees. The Integrator's metadata gives counts, along with the primary key and columns, in milliseconds.

Appendix
• Case Study

HBase Schema & Record
Fact Table: send
Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | DSN status | Bounce cat | IP | Time queued
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | success | | 192.168.6.23 | 01/01/2013
2 | 12 | 01/02/13 | 31 | xyz.com | yahoo.com | success | bad-mailbox | 112.168.6.23 | 01/01/2013
Send Record
client_id,campaign_id,time_logged,user_id,orig_domain,rcpt_domain,dsn_status,bounce_cat,ip,Time_queued
1,11,01/01/2013,21,abc.com,gmail.com,success,,192.168.6.23,01/01/2013

HBase Schema & Record
Fact Table: activity
Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | IP city | Event type | IP | Send time
1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | SFO | Open | 192.168.6.23 | 01/01/2013
2 | 12 | 01/04/13 | 31 | xyz.com | yahoo.com | LA | Click | 112.168.6.23 | 01/01/2013
Activity Record
client_id,campaign_id,event_time,user_id,event_type
1,11,01/01/2013,21,open

HBase Schema & Record
Dimension Table: demographics
Client ID | User ID | Date | Age | Gender | State | City | Zip | Country | Flag
1 | 11 | 01/01/13 | 21 | M | CA | SFO | 94087 | USA | Y
2 | 12 | 01/02/13 | 31 | M | CA | SFO | 94087 | USA | N
Dimension Table: ip
IP | Date | Domain | State | Country | City
192.168.6.23 | 01/01/2013 | gmail.com | CA | USA | SFO
112.168.6.23 | 01/02/2013 | abc.edu | NJ | USA | Newark

HBase Schema & Record
Aggregate Table: A1
Campaign ID | Date | Gender | State | Country | Count
11 | 01/01/13 | M | CA | USA | 5023
12 | 01/02/13 | M | CA | USA | 74890
Aggregate Table: A2
Client ID | Date | Gender | State | Country | Count
1 | 01/01/13 | M | CA | USA | 742345
2 | 01/02/13 | M | CA | USA | 1023456

Metadata
Metadata Table
Table Name | Primary Key | Columns | Count
demographics | Client_id,Campaign_id,Date | Client_id,Campaign_id,Date,Age,Gender,State,City,Country,Flag | 10,000,000
A1 | Campaign_id,Date | Campaign_id,Date,Gender,State,Country,Count | 1,000,000
A2 | Client_id,Date | Client_id,Date,Gender,State,Country,Count | 500,000

Query Execution in Action
User query without Extractor aggregate awareness
• Select client_id,state,count from demographics
• Query execution: the query is executed on the demographics table, which has 300,000,000 rows.
User query with Extractor aggregate awareness
• Select client_id,state,count from demographics
• Query execution:
– Step 1: The Extractor parses the list of columns from the query.
– Step 2: The Extractor finds the tables which have these columns. In this example it gets two tables, demographics and A2, which can satisfy the query.
– Step 3: The Extractor decides which table is best to satisfy the query, based on the number of rows in each table. In this example, table A2 has fewer rows than demographics, so A2 is selected.
– Step 4: The query is executed against table A2 with the appropriate where clause specified by the user.
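
Steps 1 to 3 can be sketched as a lookup against the metadata catalogue. The code below is a simplified illustration, not the Extractor's implementation: `TableInfo`, `parse_columns` and `choose_table` are hypothetical names, the parser only handles a bare `Select ... from ...` statement, and `count` is treated as a measure that does not constrain the table choice. With the catalogue contents from the Metadata slide, the only aggregate table that carries both client_id and state is A2, so that is what the sketch selects for the example query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    columns: frozenset
    row_count: int


# Catalogue contents copied from the Metadata slide (column names lower-cased).
CATALOGUE = {
    "demographics": TableInfo(
        frozenset({"client_id", "campaign_id", "date", "age", "gender",
                   "state", "city", "country", "flag"}), 10_000_000),
    "A1": TableInfo(
        frozenset({"campaign_id", "date", "gender", "state", "country",
                   "count"}), 1_000_000),
    "A2": TableInfo(
        frozenset({"client_id", "date", "gender", "state", "country",
                   "count"}), 500_000),
}

# Assumption: "count" is a measure any table can supply (by counting rows or
# by reading a stored counter), so it is ignored when matching tables.
MEASURES = {"count"}


def parse_columns(query):
    """Step 1: pull the select list out of a 'Select a,b from t' query."""
    select_list = query.lower().split("select", 1)[1].split("from", 1)[0]
    return [c.strip() for c in select_list.split(",")]


def choose_table(query, catalogue=CATALOGUE):
    needed = set(parse_columns(query)) - MEASURES
    # Step 2: tables whose columns cover the request.
    candidates = [n for n, t in catalogue.items() if needed <= t.columns]
    if not candidates:
        raise LookupError(f"no table provides columns {sorted(needed)}")
    # Step 3: the candidate with the fewest rows wins.
    return min(candidates, key=lambda n: catalogue[n].row_count)


best = choose_table("Select client_id,state,count from demographics")
```

Step 4 would then rewrite the query to run against `best`, carrying over the user's where clause.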

HBase Design Considerations
• Bloom filters were enabled at the row level to enable HBase to skip files efficiently.
• We used HBase filters extensively in the Scans to filter out as much data as possible on the server side.
• We defined aggregates judiciously so that queries could be answered without requiring HBase to resort to large file scans.
• We used a key concatenation that aligned with expected search patterns, so that HBase could provide an exact match, or do efficient key range scans when a partial key was provided.
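
The key-concatenation point can be illustrated with a toy sorted store. HBase keeps rows ordered by row key, so a key built from fields in the order queries narrow them supports both exact gets and prefix range scans. Everything below is an assumption made for illustration: the field order, the `|` delimiter, the zero-padding widths, the yyyymmdd date format and the `SortedStore` class are not taken from the slides.

```python
import bisect

SEP = "|"  # hypothetical delimiter; fixed-width fields would work as well


def make_key(client_id, campaign_id, date, user_id):
    """Concatenate fields in the order queries are expected to narrow them."""
    # Zero-padding keeps lexicographic order equal to numeric order.
    return SEP.join((f"{client_id:06d}", f"{campaign_id:06d}",
                     date, f"{user_id:09d}"))


class SortedStore:
    """Toy stand-in for an HBase table: rows kept sorted by row key."""

    def __init__(self):
        self._keys, self._rows = [], {}

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def get(self, key):
        """Exact match on the full key."""
        return self._rows.get(key)

    def prefix_scan(self, prefix):
        """Range scan that visits only the keys starting with the prefix."""
        lo = bisect.bisect_left(self._keys, prefix)
        hi = bisect.bisect_left(self._keys, prefix + "\uffff")
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


store = SortedStore()
store.put(make_key(1, 11, "20130101", 21), {"event": "open"})
store.put(make_key(1, 11, "20130102", 22), {"event": "click"})
store.put(make_key(1, 12, "20130101", 23), {"event": "open"})
store.put(make_key(2, 11, "20130101", 24), {"event": "open"})

# Partial key: all rows for client 1, campaign 11.
hits = store.prefix_scan(SEP.join(("000001", "000011")) + SEP)
```

A query that supplies only a trailing field (for example the user ID) cannot use the prefix and falls back to a full scan, which is why the field order has to follow the expected search patterns.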

HBase Design Considerations
• We did not use MapReduce in our ETL framework, for the following reasons:
– The overhead of MapReduce-based processes
– The need for real-time access to data
– Every file had different header metadata, and in MapReduce we had difficulty passing the header metadata to each Map process
– The wish to avoid intermediate reads and writes to the HDFS file system

HBase Tuning
• We broke the input and output processing into separate threads and allocated many more threads to output processing to compensate for the relative processing speeds.
• We batched the writes to HBase to reduce the number of calls to the server.
• We turned off the WAL in HBase, since we could always reprocess the file in case of a rare failure.
• We used primitives and arrays in the code where feasible, instead of Java objects and collections, to reduce the memory footprint and the pressure on the garbage collector.
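
The first two points (separate input and output threads, batched writes) follow a producer and consumer pattern, sketched here in Python rather than the framework's Java. The batch size, thread count and the `flush` callback are hypothetical; `flush` stands in for one batched call to the HBase server.

```python
import queue
import threading

BATCH_SIZE = 500      # hypothetical batch size
WRITER_THREADS = 4    # more output threads than input threads, as on the slide
_STOP = object()      # sentinel telling a writer to finish


def reader(lines, out_queue):
    """Input side: parse lines and hand records to the writers."""
    for line in lines:
        out_queue.put(line.rstrip("\n").split(","))
    for _ in range(WRITER_THREADS):
        out_queue.put(_STOP)  # one sentinel per writer


def writer(in_queue, flush):
    """Output side: collect records and send them in batches."""
    batch = []
    while True:
        item = in_queue.get()
        if item is _STOP:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:
        flush(batch)  # final partial batch


def run(lines, flush):
    q = queue.Queue(maxsize=10_000)  # bounded, so a fast reader cannot exhaust memory
    writers = [threading.Thread(target=writer, args=(q, flush))
               for _ in range(WRITER_THREADS)]
    for t in writers:
        t.start()
    reader(lines, q)  # the calling thread plays the input role
    for t in writers:
        t.join()


calls = []  # one entry per simulated server call
lock = threading.Lock()


def fake_flush(batch):
    with lock:
        calls.append(len(batch))


run((f"{i},11,open\n" for i in range(2_000)), fake_flush)
```

With 2,000 records the writers make at most eight simulated calls instead of 2,000 single-row calls, which is the saving the batching bullet refers to.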

HBase Tuning
• We increased the client write buffer size to several megabytes.
• To avoid hotspots and get the best data retrieval, we designed a composite primary key. The key design allowed us to access data by providing the exact key, or by range scan on the leading portion of the key.
• We found that too many filters on a scan give diminishing returns, and beyond some point they degrade overall scan performance.
Thank you
For more information, please contact
Manoj Khanwalkar
Chief Architect
manoj.khanwalkar@experian.com
Govind Asawa
Big Data Architect
govind.asawa@experian.com

More Related Content

What's hot

HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
Cloudera, Inc.
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
HBaseCon
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Suman Srinivasan
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBaseCon
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
Michael Stack
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
Cloudera, Inc.
 
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and SparkHBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
Michael Stack
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
HBaseCon
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
HBaseCon
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
DataWorks Summit/Hadoop Summit
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
Cloudera, Inc.
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
HBaseCon
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
markgrover
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Ingesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsarIngesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsar
Timothy Spann
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
DataWorks Summit/Hadoop Summit
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
Gwen (Chen) Shapira
 

What's hot (20)

HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index StructuresHBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
HBaseCon 2013: HBase SEP - Reliable Maintenance of Auxiliary Index Structures
 
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
A Graph Service for Global Web Entities Traversal and Reputation Evaluation B...
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
HBaseConAsia2018 Track2-1: Kerberos-based Big Data Security Solution and Prac...
 
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
HBaseCon 2013: Using Coprocessors to Index Columns in an Elasticsearch Cluster
 
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and SparkHBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
HBaseConAsia2018 Track2-4: HTAP DB-System: AsparaDB HBase, Phoenix, and Spark
 
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload DiversityHarmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
Harmonizing Multi-tenant HBase Clusters for Managing Workload Diversity
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightOptimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
Architecting Applications with Hadoop
Architecting Applications with HadoopArchitecting Applications with Hadoop
Architecting Applications with Hadoop
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Ingesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsarIngesting data at scale into elasticsearch with apache pulsar
Ingesting data at scale into elasticsearch with apache pulsar
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
 
Data Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for HadoopData Wrangling and Oracle Connectors for Hadoop
Data Wrangling and Oracle Connectors for Hadoop
 

Viewers also liked

Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
Cloudera, Inc.
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
Cloudera, Inc.
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
Cloudera, Inc.
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
Cloudera, Inc.
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
Cloudera, Inc.
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
Cloudera, Inc.
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
Cloudera, Inc.
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
HBaseCon
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
Cloudera, Inc.
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
HBaseCon
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
Cloudera, Inc.
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
Cloudera, Inc.
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon
 
HBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on FlashHBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on Flash
Cloudera, Inc.
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
Cloudera, Inc.
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
Cloudera, Inc.
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
Cloudera, Inc.
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
Cloudera, Inc.
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
Cloudera, Inc.
 

Viewers also liked (20)

Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
HBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBaseHBaseCon 2012 | Building Mobile Infrastructure with HBase
HBaseCon 2012 | Building Mobile Infrastructure with HBase
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
HBaseCon 2013: Apache Hadoop and Apache HBase for Real-Time Video Analytics
 
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBaseHBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
HBaseCon 2015: Trafodion - Integrating Operational SQL into HBase
 
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUponHBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon
 
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...HBaseCon 2013:  Evolving a First-Generation Apache HBase Deployment to Second...
HBaseCon 2013: Evolving a First-Generation Apache HBase Deployment to Second...
 
Cross-Site BigTable using HBase
Cross-Site BigTable using HBaseCross-Site BigTable using HBase
Cross-Site BigTable using HBase
 
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
HBaseCon 2012 | Relaxed Transactions for HBase - Francis Liu, Yahoo!
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three ActsHBaseCon 2012 | Scaling GIS In Three Acts
HBaseCon 2012 | Scaling GIS In Three Acts
 
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
HBaseCon 2012 | Leveraging HBase for the World’s Largest Curated Genomic Data...
 
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARNHBaseCon 2015: DeathStar - Easy, Dynamic,  Multi-tenant HBase via YARN
HBaseCon 2015: DeathStar - Easy, Dynamic, Multi-tenant HBase via YARN
 
HBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on FlashHBaseCon 2013: Apache HBase on Flash
HBaseCon 2013: Apache HBase on Flash
 
HBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 MinutesHBaseCon 2013: 1500 JIRAs in 20 Minutes
HBaseCon 2013: 1500 JIRAs in 20 Minutes
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
HBaseCon 2012 | Living Data: Applying Adaptable Schemas to HBase - Aaron Kimb...
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
 

HBaseCon 2013: ETL for Apache HBase

  • 1. © 2013 Experian Limited. All rights reserved. HBaseCon 2013 Application Track – Case Study Experian Marketing Services ETL for HBase
  • 2. Who We Are: Manoj Khanwalkar, Chief Architect, Experian Marketing Services, New York; Govind Asawa, Big Data Architect, Experian Marketing Services, New York
  • 3. Agenda: 1. About Experian Marketing Services; 2. Why HBase; 3. Why custom ETL; 4. ETL solution features; 5. Performance; 6. Case Study; 7. Conclusion
  • 4. Experian Marketing Services
      • 1 billion+ messages daily: email and social digital marketing messages, with a 100% surge in volume during peak season
      • 2000+ institutional clients: across all verticals
      • 9 regions, 24/7: platforms operating globally
      • 500+ terabytes of data: clients need 1 to 7 years of marketing data, depending on vertical
      • 200+ big queries: complicated queries on 200+ million records, with 400+ columns for segmentation
      • 2000+ data export jobs: clients need daily incremental activity data
  • 5. Why HBase
      • A traditional RDBMS-based solution is very challenging and cost-prohibitive at our scale of operations
      • In a SaaS-based multi-tenancy model we require schema flexibility to support thousands of clients with their individual requirements
      • In the majority of cases, key-based lookups (including range scans and filters), which HBase supports well, can satisfy data extraction requirements
      • Automatic sharding and horizontal scalability
      • HBase provides a Java API that can be integrated with Experian’s other systems
  • 6. Why develop an Integrator toolkit? (slide columns: Connectivity, Environment, Cost)
      • Ability to ingest and read data from HBase and MongoDB
      • Connectors for cloud computing
      • Support for REST and other industry-standard APIs
      • Supports SaaS model
      • Dynamically handles data input changes (number of fields and new fields)
      • Licensing
      • Integrates with other systems seamlessly, thus improving time to market
      • Resources required to develop, administer and maintain the solution
      • Major ETL vendors do not support HBase
      • An ETL solution needs extensive development if the data structure changes, which negates the advantages offered by a NoSQL solution
  • 7. Integrator Architecture (diagram; component labels as extracted): DataIngester TargetSystems Third Party JMS Database SourceSystems Connectors CSV Reader Processor Event Listener Message Broker File Watcher Parser Factory Key Generator Parser Loader RDBMS Loader HBase Loader Container Metadata Analyzer Loader Aggregator RDBMS HBase Extractor Query Output Aggregate Aware Stamping Transform SaaS JMS Files RDBMS HBase RDBMS HBase MongoDB
  • 8. Extractor Architecture (diagram; component labels as extracted): HDFS Integrator Send Data Click Data Bounce Data TXN Data Metadata Detailed data Aggregates HBase Web Server Reporting Analytics Extractor Query Optimizer
  • 9. Integrator & Extractor
      • Data ingestion from multiple sources
          • Flat files
          • NoSQL
          • RDBMS (through JDBC)
          • SaaS (Salesforce etc.)
          • Messaging and any system providing event streaming
      • Ability to de-normalize the fact table while ingesting data
          • Number of lookup tables can be configured
      • Near-real-time generation of aggregate tables
          • Number of aggregate tables can be configured
          • HBase counters are used to keep aggregated sums/counts
          • Aggregates can be populated concurrently in an RDBMS of choice
  • 10. Integrator & Extractor
      • Transformation of a column value to another value
          • Add a column by transformation
          • Drop columns from the input stream if no persistence is required
      • Data filter capability
          • Drop records while ingesting the base table
          • Drop records during aggregation
      • Aggregate-aware optimized query execution
          • Query performance: analyze the columns requested in the user's query and, based on record counts, pick the table with the fewest records that can satisfy the request
          • Transparent: no user intervention or knowledge of the schema is required
          • Optimizer: conceptually similar to an RDBMS query plan optimizer, with the concept extended to NoSQL databases
          • Metadata management: metadata integrated with the ETL process can be used by a variety of applications
  • 11. Integrator
      Framework:
        • The solution uses Spring as a lightweight container, with a framework built around it to standardize the process lifecycle and to let arbitrary functionality reside in the container by implementing a Service interface
        • The container runs in batch-processing or daemon mode
        • In daemon mode, it uses the Java 7 File Watcher API to react to files placed in the specified directory for processing
      Metadata catalogue:
        • Metadata is stored for every HBase table into which data is ingested
        • For each table, the primary key, columns and record count are stored
        • An HBase count is a brute-force scan and an expensive API call; it can be avoided if metadata is published at the time of data ingestion
        • Avoids expensive queries that can bring the cluster to its knees
        • Provides faster query performance
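The daemon mode described on this slide relies on the Java 7 `java.nio.file.WatchService` API. The sketch below shows one way such a directory watcher could look; `IncomingFileWatcher` is a hypothetical class name, and the real framework's wiring into Spring and its parsers is not shown in the slides.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Watches one directory for newly created files (hypothetical sketch). */
final class IncomingFileWatcher implements AutoCloseable {
    private final Path dir;
    private final WatchService watcher;

    IncomingFileWatcher(Path dir) throws IOException {
        this.dir = dir;
        this.watcher = dir.getFileSystem().newWatchService();
        // Only file creation is of interest for ingestion.
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
    }

    /** Waits up to the timeout and returns the full paths of files created since the last call. */
    List<Path> poll(long timeout, TimeUnit unit) throws InterruptedException {
        List<Path> created = new ArrayList<>();
        WatchKey key = watcher.poll(timeout, unit);
        if (key == null) {
            return created; // nothing arrived within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            // OVERFLOW events carry no path, so only ENTRY_CREATE is handled.
            if (event.kind() == StandardWatchEventKinds.ENTRY_CREATE) {
                created.add(dir.resolve((Path) event.context()));
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return created;
    }

    @Override
    public void close() throws IOException {
        watcher.close();
    }
}
```

A daemon loop would call `poll` repeatedly and hand each returned path to a parser. Note that a create event can fire before the producer has finished writing the file, so a real deployment would need some completion convention (for example, a rename into the watched directory).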
  • 12. Integrator – System Performance
        • We used a 20-node cluster in production; each node had 24 cores, with a 10GigE network backbone
        • We observed a throughput of 1.3 million records inserted into HBase per minute per node
        • The framework allowed us to run the ETL process on multiple machines, providing horizontal scalability
        • Most of our queries returned in at most a few seconds
  • 13. Conclusion
        • Our experience shows that HBase offers a cost-effective and performant solution for managing our data explosion while meeting the increasingly sophisticated analytical and reporting requirements of clients
        • The ETL framework allows us to leverage HBase and its features while improving developer productivity
        • The framework lets us roll out new functionality with minimum time to market
        • The metadata catalogue optimizes queries and improves cluster performance
          – A select count() on a big HBase table takes minutes to hours and can bring the cluster to its knees; the Integrator's metadata returns counts, along with the primary key and columns, in milliseconds
  • 14. Appendix: Case Study
  • 15. HBase Schema & Record
      Fact table: send
      Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | DSN status | Bounce cat | IP | Time queued
      1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | success |  | 192.168.6.23 | 01/01/2013
      2 | 12 | 01/02/13 | 31 | xyz.com | yahoo.com | success | bad-mailbox | 112.168.6.23 | 01/01/2013
      Send record:
      client_id,campaign_id,time_logged,user_id,orig_domain,rcpt_domain,dsn_status,bounce_cat,ip,time_queued
      1,11,01/01/2013,21,abc.com,gmail.com,success,,192.168.6.23,01/01/2013
  • 16. HBase Schema & Record
      Fact table: activity
      Activity record:
      client_id,campaign_id,event_time,user_id,event_type
      1,11,01/01/2013,21,open
      Client ID | Campaign ID | Time logged | User ID | Orig domain | Rcpt domain | IP city | Event type | IP | Send time
      1 | 11 | 01/01/13 | 21 | abc.com | gmail.com | SFO | Open | 192.168.6.23 | 01/01/2013
      2 | 12 | 01/04/13 | 31 | xyz.com | yahoo.com | LA | Click | 112.168.6.23 | 01/01/2013
  • 17. HBase Schema & Record
      Dimension table: demographics
      Client ID | User ID | Date | Age | Gender | State | City | Zip | Country | Flag
      1 | 11 | 01/01/13 | 21 | M | CA | SFO | 94087 | USA | Y
      2 | 12 | 01/02/13 | 31 | M | CA | SFO | 94087 | USA | N
      Dimension table: ip
      IP | Date | Domain | State | Country | City
      192.168.6.23 | 01/01/2013 | gmail.com | CA | USA | SFO
      112.168.6.23 | 01/02/2013 | abc.edu | NJ | USA | Newark
  • 18. HBase Schema & Record
      Aggregate table: A1
      Campaign ID | Date | Gender | State | Country | Count
      11 | 01/01/13 | M | CA | USA | 5023
      12 | 01/02/13 | M | CA | USA | 74890
      Aggregate table: A2
      Client ID | Date | Gender | State | Country | Count
      1 | 01/01/13 | M | CA | USA | 742345
      2 | 01/02/13 | M | CA | USA | 1023456
  • 19. Metadata
      Metadata table
      Table Name | Primary Key | Columns | Count
      demographics | Client_id,Campaign_id,Date | Client_id,Campaign_id,Date,Age,Gender,State,City,Country,Flag | 10,000,000
      A1 | Campaign_id,Date | Campaign_id,Date,Gender,State,Country,Count | 1,000,000
      A2 | Client_id,Date | Client_id,Date,Gender,State,Country,Count | 500,000
  • 20. Query Execution in Action
      User query without Extractor aggregate awareness:
        • Select client_id,state,count from demographics
        • Query execution: the query is executed on the demographics table, which has 300,000,000 rows
      User query with Extractor aggregate awareness:
        • Select client_id,state,count from demographics
        • Query execution:
          – Step 1: The Extractor parses the list of columns from the query
          – Step 2: The Extractor finds the tables that contain these columns. In this example it finds two tables, demographics and A2, that can satisfy the query
          – Step 3: The Extractor decides which table is best for the query, based on the number of rows in each table. Here A2 has fewer rows than demographics, so A2 is selected
          – Step 4: The query is executed against A2 with the where clause specified by the user
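The four steps on this slide amount to a lookup over the metadata catalogue: keep the tables whose column list covers the requested columns, then take the one with the smallest row count. The sketch below is a minimal illustration of that idea, not the Extractor's actual optimizer. `AggregateAwareSelector`, `TableMeta` and `sampleCatalogue` are hypothetical names, and the sample rows are copied from the Metadata slide, where the aggregate that carries Client_id is A2.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

/** Picks the smallest table that can answer a query (hypothetical sketch). */
final class AggregateAwareSelector {

    /** One catalogue row: table name, row count, and the columns the table holds. */
    static final class TableMeta {
        final String name;
        final long rowCount;
        final Set<String> columns;

        TableMeta(String name, long rowCount, String... columns) {
            this.name = name;
            this.rowCount = rowCount;
            this.columns = normalize(Arrays.asList(columns));
        }
    }

    /** Lower-cases and trims column names so that matching is case-insensitive. */
    private static Set<String> normalize(Collection<String> names) {
        Set<String> out = new HashSet<>();
        for (String n : names) {
            out.add(n.trim().toLowerCase(Locale.ROOT));
        }
        return out;
    }

    /** Returns the name of the table with the fewest rows that contains every requested column. */
    static Optional<String> chooseTable(List<TableMeta> catalogue, Collection<String> requested) {
        Set<String> wanted = normalize(requested);
        TableMeta best = null;
        for (TableMeta t : catalogue) {
            boolean covers = t.columns.containsAll(wanted);
            if (covers && (best == null || t.rowCount < best.rowCount)) {
                best = t;
            }
        }
        return best == null ? Optional.<String>empty() : Optional.of(best.name);
    }

    /** Catalogue rows as listed on the Metadata slide. */
    static List<TableMeta> sampleCatalogue() {
        return Arrays.asList(
            new TableMeta("demographics", 10_000_000L, "Client_id", "Campaign_id", "Date",
                "Age", "Gender", "State", "City", "Country", "Flag"),
            new TableMeta("A1", 1_000_000L, "Campaign_id", "Date", "Gender", "State", "Country", "Count"),
            new TableMeta("A2", 500_000L, "Client_id", "Date", "Gender", "State", "Country", "Count"));
    }
}
```

With this catalogue, a request for `client_id` and `state` resolves to A2, a request for `age` falls back to demographics, and a request for a column that no table lists yields an empty result.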
  • 21. HBase Design Considerations
        • Bloom filters were enabled at the row level so that HBase can skip files efficiently
        • We used HBase filters extensively in scans to filter out as much data as possible on the server side
        • We defined aggregates judiciously so that queries can be answered without HBase resorting to large file scans
        • We used a key concatenation aligned with the expected search patterns, so that HBase can provide an exact match, or do an efficient key range scan when only a partial key is provided
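The key concatenation point can be made concrete with a small sketch. HBase keeps rows sorted by key, so a `TreeMap` is used here as an in-memory stand-in for a table. Everything specific is an assumption made for illustration: the class name `CompositeKeys`, the field order (client, campaign, date), the `|` separator, the zero-padded widths and the `yyyyMMdd` date form are not taken from the slides.

```java
import java.util.Locale;
import java.util.NavigableMap;
import java.util.SortedMap;

/** Composite row keys whose sort order supports exact gets and leading-prefix scans (sketch). */
final class CompositeKeys {

    /** Zero-padded fields make lexicographic key order agree with numeric order. */
    static String rowKey(int clientId, int campaignId, String yyyyMMdd) {
        return String.format(Locale.ROOT, "%08d|%08d|%s", clientId, campaignId, yyyyMMdd);
    }

    /** Partial key covering every row of one client. */
    static String prefix(int clientId) {
        return String.format(Locale.ROOT, "%08d|", clientId);
    }

    /** Range scan: all entries whose key starts with the given prefix. */
    static <V> SortedMap<String, V> prefixScan(NavigableMap<String, V> table, String prefix) {
        // The upper bound is the prefix followed by the largest char, so every key
        // that extends the prefix sorts before it.
        return table.subMap(prefix, true, prefix + Character.MAX_VALUE, false);
    }
}
```

A full key gives an exact lookup (`table.get(rowKey(1, 11, "20130101"))`), while `prefixScan(table, prefix(1))` returns every row for client 1 without touching other clients' rows. In HBase the same pattern would be expressed as a scan bounded by start and stop rows.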
  • 22. HBase Design Considerations
        • We did not use MapReduce in our ETL framework, for the following reasons:
          – The overhead of MapReduce-based processes
          – The need for real-time access to data
          – Every file had different header metadata, and in MapReduce we had difficulty passing the header metadata to each Map process
          – We wanted to avoid intermediate reads and writes to the HDFS file system
  • 23. HBase Tuning
        • We split input and output processing into separate threads and allocated many more threads to output processing to compensate for the difference in processing speeds
        • We batched writes to HBase to reduce the number of calls to the server
        • We turned off the WAL in HBase, since we could always reprocess the file in the rare case of a failure
        • We used primitives and arrays in the code where feasible, instead of Java objects and collections, to reduce the memory footprint and the pressure on the garbage collector
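The first two tuning points, separate input and output threads plus batched writes, can be sketched with a bounded queue. This is an illustration under stated assumptions rather than the framework's code: `BatchedLoader` is a hypothetical name, records are plain strings, and the `sink` callback stands in for the HBase write, which in a real loader would presumably be a single `put` of a list of `Put` objects per batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

/** One reader feeding several writer threads that flush records in batches (sketch). */
final class BatchedLoader {
    // Unique end-of-input marker, compared by identity so it cannot collide with real data.
    private static final String POISON = new String("__END__");

    /** Reads records on the calling thread and writes them in batches on `writers` threads. */
    static void load(Iterable<String> records, int writers, int batchSize,
                     Consumer<List<String>> sink) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(batchSize * writers * 2);
        Thread[] threads = new Thread[writers];
        for (int i = 0; i < writers; i++) {
            threads[i] = new Thread(() -> drain(queue, batchSize, sink));
            threads[i].start();
        }
        for (String r : records) {
            queue.put(r); // blocks when writers fall behind, which bounds memory use
        }
        for (int i = 0; i < writers; i++) {
            queue.put(POISON); // one marker per writer, since each writer stops at its first marker
        }
        for (Thread t : threads) {
            t.join();
        }
    }

    /** Collects records into batches and hands each full batch to the sink in one call. */
    private static void drain(BlockingQueue<String> queue, int batchSize,
                              Consumer<List<String>> sink) {
        List<String> batch = new ArrayList<>(batchSize);
        try {
            while (true) {
                String r = queue.take();
                if (r == POISON) {
                    break;
                }
                batch.add(r);
                if (batch.size() == batchSize) {
                    sink.accept(new ArrayList<>(batch));
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final partial batch
        }
    }
}
```

The sink is called from several threads at once, so it must be thread-safe. Raising `writers` relative to the single reader mirrors the slide's point about giving output processing more threads than input.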
  • 24. HBase Tuning
        • We increased the client write buffer size to several megabytes
        • To avoid hotspots and get the best data retrieval, we designed a composite primary key. The key design allowed us to access data by exact key, or by a range scan on the leading portion of the key
        • We found that adding too many filters to a scan gives diminishing returns, and beyond some point degrades overall scan performance
  • 25. Thank you
      For more information, please contact:
        • Manoj Khanwalkar, Chief Architect, manoj.khanwalkar@experian.com
        • Govind Asawa, Big Data Architect, govind.asawa@experian.com