Innovation in the Data Warehouse
Kit Menke, Software Architect
StampedeCon 2016
July 27, 2016
Agenda
▪ Use Case
▪ Architectures
▪ Decision Points
Enterprise Holdings, Inc.
▪ Our Business
• 9 thousand locations
• 80 countries
• 93 thousand employees
• 1.7 million vehicles
▪ Data Warehouse
• Near capacity: using about 75 of 80 terabytes
• Streaming and batch data feeds from over 50 internal systems &
external sources
• 100+ databases and 22+ thousand tables
• Around 1 billion queries executed per month
• Over 45,000 reporting users with 5+ million report executions
every month.
• Statistical Modeling & Advanced Analytics - 40+ Projects
Implemented for Predictive & Diagnostic Analytics
Data Warehouse - Present
Data Warehouse Growth
Challenges – Current Platform
▪ System Capacity Constraints
• Overall Current System Utilization is High
• Space & CPU Constraints
• Most of these challenges can be overcome by adding
more Teradata capacity or doing augmentation
▪ Use Cases not good fit for Teradata EDW
• Unstructured data
• Source structures changing frequently
• Data for exploration, discovery, & analytics
• Staging, transient, & history data
• These challenges can be overcome by augmentation
▪ Bottom-line: Improved agility & greater value
Augmentation Recommendation: Hadoop
▪ Leverage Hadoop to complement Teradata
EDW
• Hybrid Approach
▪ The Hortonworks distribution of Hadoop
• Compatibility/integration with Teradata EDW to
achieve high degree of interoperability
▪ Intent is not to have a centralized Hadoop
service
• EDW Augmentation Only
Data Warehouse - Future
Architectures
▪ Data warehouse augmentation contains
streaming and batch use cases
▪ Three Big Data architectures to explore:
1. Batch
2. Lambda
3. Kappa
Batch
Batch
▪ Land data into Hadoop first
▪ ETL in Hadoop to build reporting tables and
publish to Teradata
▪ Archive old data from Teradata DB
▪ Data available for analysis in Hive
▪ Great for semi-structured data files
▪ But… too slow for streaming data
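The batch flow above can be sketched in miniature: land semi-structured records, aggregate them into a reporting table, and hand the result off for publishing to the EDW. This is a minimal pure-Python stand-in for what would be a Hive or Pig job; the event fields (`date`, `branch`, `amount`) and the revenue-aggregation logic are hypothetical, not the presenter's actual pipeline.

```python
import json
from collections import defaultdict

def build_reporting_table(raw_lines):
    """Aggregate semi-structured rental events into a daily revenue table.

    Stands in for an ETL job running in Hadoop; the publish step would
    then load the result into Teradata.
    """
    daily_revenue = defaultdict(float)
    for line in raw_lines:
        event = json.loads(line)  # semi-structured JSON landed in Hadoop first
        key = (event["date"], event["branch"])
        daily_revenue[key] += event["amount"]
    # Flatten into rows ready to publish to the reporting schema
    return [
        {"date": d, "branch": b, "revenue": round(total, 2)}
        for (d, b), total in sorted(daily_revenue.items())
    ]

raw = [
    '{"date": "2016-07-01", "branch": "STL", "amount": 59.99}',
    '{"date": "2016-07-01", "branch": "STL", "amount": 40.01}',
    '{"date": "2016-07-01", "branch": "ORD", "amount": 75.00}',
]
rows = build_reporting_table(raw)
```

Because everything is processed in scheduled bulk passes, the output is complete and easy to reason about, which is exactly why this style breaks down once data must be visible seconds after it arrives.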
Lambda
Lambda
▪ Attempts to combine batch and streaming
to get benefits from both
▪ Batch layer is comprehensive and accurate
▪ Streaming layer is fast but might only be
able to keep recent data
▪ Potentially have to maintain two codebases
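The split described above shows up at query time: a serving layer merges the accurate-but-stale batch view with the fast-but-recent speed view. A minimal sketch, with hypothetical branch-count views standing in for real batch and streaming outputs:

```python
def serve_query(batch_view, speed_view, key):
    """Merge precomputed batch totals with recent streaming increments.

    batch_view: accurate counts up to the last batch run
    speed_view: counts for data that arrived since that run
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"STL": 1200, "ORD": 800}  # rebuilt nightly by the batch layer
speed_view = {"STL": 7}                 # maintained by the streaming layer
total = serve_query(batch_view, speed_view, "STL")  # 1200 + 7 = 1207
```

The merge itself is trivial; the maintenance burden is that the batch job and the streaming job each compute the same business logic in their own codebase, which is the drawback noted above.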
Kappa
Kappa
▪ Everything is a stream (no batch!)
▪ Depends largely on your log data store
(usually Kafka)
▪ All raw data is stored in Kafka
▪ Much simpler architecture than lambda
• New version? Redeploy the app, reprocess
from the start of the log, and generate a new
output table
• Once complete, point the app to the new
output table
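The redeploy-and-replay steps above can be modeled with an in-memory log standing in for a Kafka topic with full retention; the job versions and event shapes here are hypothetical illustrations, not a real Kafka client:

```python
class Log:
    """In-memory stand-in for a Kafka topic that retains all raw data."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self):
        # Reprocessing always starts from the beginning of the log
        return iter(self.events)

def process_v1(log):
    """Version 1 of the streaming job: sum revenue per branch."""
    table = {}
    for branch, amount in log.replay():
        table[branch] = table.get(branch, 0) + amount
    return table

def process_v2(log):
    """New version: count rentals instead. It runs side by side with v1
    and writes to a *new* output table; readers switch once it catches up."""
    table = {}
    for branch, _amount in log.replay():
        table[branch] = table.get(branch, 0) + 1
    return table

log = Log()
for event in [("STL", 60.0), ("STL", 40.0), ("ORD", 75.0)]:
    log.append(event)

serving_table = process_v1(log)  # current output table
new_table = process_v2(log)      # reprocess from the start of the log
serving_table = new_table        # once complete, point the app at the new table
```

Because the raw log is the source of truth, there is only one codebase per job version; reprocessing is just running the new code over the same log.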
Choosing an Architecture
▪ Batch – process data in batches
• All data processed in batches to create an
output
▪ Lambda – split streaming data into batch
and real-time
• Stream processing for the data you need fast
and the rest is batch processed
▪ Kappa – everything is a stream
• All data is processed as a stream even when it
needs to be reprocessed
Implementing an Architecture
▪ Requirements for the use case drive the
architecture
▪ Walk through decision points
1. Cloud or on premises
2. Physical or virtual machines
3. Cluster workload
▪ Plus others!
Cloud vs on premises
▪ Scalability
• Much easier to scale a Cloud solution
• Physical hardware requires an infrastructure team to manage
▪ Data source location (data gravity) / integration points
• Cluster should be as close as possible to your data source
• Cloud is good option for internet data sources
▪ Cloud offerings
• Hadoop: Azure HDInsight, Amazon EMR, Google Cloud
• Integration with other PaaS services
▪ Network
• Bandwidth to/from cloud implementation
Physical vs virtual
▪ Performance
• Physical hardware will perform better;
Hadoop is designed with physical hardware in mind
▪ Maintenance
• No hardware to maintain for virtual servers
▪ Time to market
• Virtual machines much faster to provision
• For physical hardware, if the infrastructure team
is a roadblock, an appliance is a good option
instead of commodity hardware
▪ Development and test environments make more
sense to virtualize
Workload
▪ Streaming
• Running 24/7
• Need dedicated resources
▪ Batch
• Scheduled
• Periods of high utilization (scalability)
▪ Multi-Tenancy
• Blended workloads
• YARN (queues, node labels)
• Think about isolating nodes for real-time
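As a sketch of the multi-tenancy point, YARN's Capacity Scheduler can carve the cluster into queues so batch bursts cannot starve the always-on streaming jobs. The queue names (`streaming`, `batch`) and percentages here are hypothetical; the property names are standard Capacity Scheduler configuration:

```xml
<!-- capacity-scheduler.xml: split cluster capacity between queues -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>streaming,batch</value>
</property>
<property>
  <!-- Guarantee 40% of the cluster to the 24/7 streaming workload -->
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- Cap batch so utilization spikes cannot take the whole cluster -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>80</value>
</property>
```

Node labels go a step further, pinning a queue to specific machines so real-time processing gets physically isolated nodes.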
Other considerations
▪ Disaster recovery
• Data is locally redundant
• Backups not usually required unless you need geo-redundancy
▪ Security - Many different things to secure!
• Kerberos for user, service, and host authentication
• Authorization: Apache Ranger (Hortonworks) or Apache Sentry
(Cloudera) or MapR Control System
• Network isolation for Hadoop services
• Data at rest (HDFS encryption)
▪ Hadoop Distribution - Race to include the most Apache projects
• Top 3: Hortonworks, Cloudera, MapR
• Big companies with Hadoop offering:
– Teradata Hadoop aka TDH (Hortonworks, Cloudera, MapR)
– Oracle Big Data Appliance (Cloudera)
Spectrum of Options
▪ Cloud PaaS
• No hardware or software to manage
• Amazon S3, Azure Data Lake
▪ Cloud
• Weird space between IaaS and PaaS
• Amazon EMR
• HDInsight is more PaaS
▪ Cloud IaaS
• All virtual, no hardware to manage
• You manage all software
▪ Third party hosted
• Rackspace
• Software managed by you
▪ Appliance
• Infrastructure handled for you
• Dell, HP, Cisco, Teradata, Oracle
• Software (varies depending on vendor)
▪ Commodity
• DIY
Lessons Learned
▪ Workload isolation is hard
• Multi-tenancy is possible
• Takes work to make sure batch jobs don’t impact
the real-time streaming processes
▪ Things we like: Hive, HBase
▪ Things we don’t like: Solr, debugging
▪ Debugging / development is hard
• Lots of moving pieces
• Logs spread out across many machines
• Development environments require a lot of software
• Distributed systems just work differently
Questions?
▪ Hortonworks Community
• https://community.hortonworks.com/answers/
index.html
▪ Kit Menke
• @kitmenke on Twitter
Resources
▪ Lambda Architecture
• http://lambda-architecture.net/
▪ Kappa Architecture
• http://kappa-architecture.com/
▪ Kappa Architecture - Our Experience by ASPgems
• http://events.linuxfoundation.org/sites/events/files/slides/
ASPgems%20-%20Kappa%20Architecture.pdf
▪ Apache Hadoop YARN – Multi-Tenancy, Capacity
Scheduler & Preemption - StampedeCon 2015
• http://www.slideshare.net/StampedeCon/apache-hadoop-
yarn-multitenancy-capacity-scheduler-preemption-
stampedecon-2015


Editor's Notes

• Slide 7: Explain our use case: expanding reporting windows and shrinking ETL windows