Innovation in the Data Warehouse
Kit Menke, Software Architect
StampedeCon 2016
July 27, 2016
Agenda
▪ Use Case
▪ Architectures
▪ Decision Points
Enterprise Holdings, Inc.
▪ Our Business
• 9 thousand locations
• 80 countries
• 93 thousand employees
• 1.7 million vehicles
▪ Data Warehouse
• Near capacity: using about 75 of 80 terabytes
• Streaming and batch data feeds from over 50 internal systems &
external sources
• 100+ databases and 22+ thousand tables
• Around 1 billion queries executed per month
• Over 45,000 reporting users with 5+ million report executions
every month.
• Statistical Modeling & Advanced Analytics - 40+ Projects
Implemented for Predictive & Diagnostic Analytics
Data Warehouse - Present
Data Warehouse Growth
Challenges – Current Platform
▪ System Capacity Constraints
• Overall Current System Utilization is High
• Space & CPU Constraints
• Most of these challenges can be overcome by adding
more Teradata capacity or doing augmentation
▪ Use Cases not good fit for Teradata EDW
• Unstructured data
• Source structures changing frequently
• Data for exploration, discovery, & analytics
• Staging, transient, & history data
• These challenges can be overcome by augmentation
▪ Bottom-line: Improved agility & greater value
Augmentation Recommendation: Hadoop
▪ Leverage Hadoop to complement Teradata
EDW
• Hybrid Approach
▪ The Hortonworks distribution of Hadoop
• Compatibility/integration with Teradata EDW to
achieve high degree of interoperability
▪ Intent is not to have a centralized Hadoop
service
• EDW Augmentation Only
Data Warehouse - Future
Architectures
▪ Data warehouse augmentation contains
streaming and batch use cases
▪ Three Big Data architectures to explore:
1. Batch
2. Lambda
3. Kappa
Batch
Batch
▪ Land data into Hadoop first
▪ ETL in Hadoop to build reporting tables and
publish to Teradata
▪ Archive old data from Teradata DB
▪ Data available for analysis in Hive
▪ Great for semi-structured data files
▪ But… too slow for streaming data
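The batch flow above can be sketched in miniature: land semi-structured records, aggregate them into a reporting table, and hand the result off for publishing to the EDW. This is a minimal pure-Python stand-in for what would be a Hive or Pig job; the event fields (`date`, `branch`, `amount`) and the revenue-aggregation logic are hypothetical, not the presenter's actual pipeline.

```python
import json
from collections import defaultdict

def build_reporting_table(raw_lines):
    """Aggregate semi-structured rental events into a daily revenue table.

    Stands in for an ETL job running in Hadoop; the publish step would
    then load the result into Teradata.
    """
    daily_revenue = defaultdict(float)
    for line in raw_lines:
        event = json.loads(line)  # semi-structured JSON landed in Hadoop first
        key = (event["date"], event["branch"])
        daily_revenue[key] += event["amount"]
    # Flatten into rows ready to publish to the reporting schema
    return [
        {"date": d, "branch": b, "revenue": round(total, 2)}
        for (d, b), total in sorted(daily_revenue.items())
    ]

raw = [
    '{"date": "2016-07-01", "branch": "STL", "amount": 59.99}',
    '{"date": "2016-07-01", "branch": "STL", "amount": 40.01}',
    '{"date": "2016-07-01", "branch": "ORD", "amount": 75.00}',
]
rows = build_reporting_table(raw)
```

Because everything is processed in scheduled bulk passes, the output is complete and easy to reason about, which is exactly why this style breaks down once data must be visible seconds after it arrives.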
Lambda
Lambda
▪ Attempts to combine batch and streaming
to get benefits from both
▪ Batch layer is comprehensive and accurate
▪ Streaming layer is fast but might only be
able to keep recent data
▪ Potentially have to maintain two codebases
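The split described above shows up at query time: a serving layer merges the accurate-but-stale batch view with the fast-but-recent speed view. A minimal sketch, with hypothetical branch-count views standing in for real batch and streaming outputs:

```python
def serve_query(batch_view, speed_view, key):
    """Merge precomputed batch totals with recent streaming increments.

    batch_view: accurate counts up to the last batch run
    speed_view: counts for data that arrived since that run
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"STL": 1200, "ORD": 800}  # rebuilt nightly by the batch layer
speed_view = {"STL": 7}                 # maintained by the streaming layer
total = serve_query(batch_view, speed_view, "STL")  # 1200 + 7 = 1207
```

The merge itself is trivial; the maintenance burden is that the batch job and the streaming job each compute the same business logic in their own codebase, which is the drawback noted above.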
Kappa
Kappa
▪ Everything is a stream (no batch!)
▪ Depends largely on your log data store
(usually Kafka)
▪ All raw data is stored in Kafka
▪ Much simpler architecture than lambda
• New version? Redeploy the app, reprocess
from the start of the log, and generate a new
output table
• Once complete, point the app to the new
output table
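The redeploy-and-replay steps above can be modeled with an in-memory log standing in for a Kafka topic with full retention; the job versions and event shapes here are hypothetical illustrations, not a real Kafka client:

```python
class Log:
    """In-memory stand-in for a Kafka topic that retains all raw data."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self):
        # Reprocessing always starts from the beginning of the log
        return iter(self.events)

def process_v1(log):
    """Version 1 of the streaming job: sum revenue per branch."""
    table = {}
    for branch, amount in log.replay():
        table[branch] = table.get(branch, 0) + amount
    return table

def process_v2(log):
    """New version: count rentals instead. It runs side by side with v1
    and writes to a *new* output table; readers switch once it catches up."""
    table = {}
    for branch, _amount in log.replay():
        table[branch] = table.get(branch, 0) + 1
    return table

log = Log()
for event in [("STL", 60.0), ("STL", 40.0), ("ORD", 75.0)]:
    log.append(event)

serving_table = process_v1(log)  # current output table
new_table = process_v2(log)      # reprocess from the start of the log
serving_table = new_table        # once complete, point the app at the new table
```

Because the raw log is the source of truth, there is only one codebase per job version; reprocessing is just running the new code over the same log.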
Choosing an Architecture
▪ Batch – process data in batches
• All data processed in batches to create an
output
▪ Lambda – split streaming data into batch
and real-time
• Stream processing for the data you need fast
and the rest is batch processed
▪ Kappa – everything is a stream
• All data is processed as a stream even when it
needs to be reprocessed
Implementing an Architecture
▪ Requirements for the use case drive the
architecture
▪ Walk through decision points
1. Cloud or on premises
2. Physical or virtual machines
3. Cluster workload
▪ Plus others!
Cloud vs on premises
▪ Scalability
• Much easier to scale a Cloud solution
• Physical hardware requires an infrastructure team to manage
▪ Data source location (data gravity) / integration points
• Cluster should be as close as possible to your data source
• Cloud is good option for internet data sources
▪ Cloud offerings
• Hadoop: Azure HDInsight, Amazon EMR, Google Cloud
• Integration with other PaaS services
▪ Network
• Bandwidth to/from cloud implementation
Physical vs virtual
▪ Performance
• Physical hardware will perform better;
Hadoop is designed with physical hardware in mind
▪ Maintenance
• No hardware to maintain for virtual servers
▪ Time to market
• Virtual machines much faster to provision
• For physical hardware, if the infrastructure team
is a roadblock, an appliance is a good option
instead of commodity hardware
▪ Development and test environments make more
sense to virtualize
Workload
▪ Streaming
• Running 24/7
• Need dedicated resources
▪ Batch
• Scheduled
• Periods of high utilization (scalability)
▪ Multi-Tenancy
• Blended workloads
• YARN (queues, node labels)
• Think about isolating nodes for real-time
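As a sketch of the multi-tenancy point, YARN's Capacity Scheduler can carve the cluster into queues so batch bursts cannot starve the always-on streaming jobs. The queue names (`streaming`, `batch`) and percentages here are hypothetical; the property names are standard Capacity Scheduler configuration:

```xml
<!-- capacity-scheduler.xml: split cluster capacity between queues -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>streaming,batch</value>
</property>
<property>
  <!-- Guarantee 40% of the cluster to the 24/7 streaming workload -->
  <name>yarn.scheduler.capacity.root.streaming.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>60</value>
</property>
<property>
  <!-- Cap batch so utilization spikes cannot take the whole cluster -->
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>80</value>
</property>
```

Node labels go a step further, pinning a queue to specific machines so real-time processing gets physically isolated nodes.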
Other considerations
▪ Disaster recovery
• Data is locally redundant
• Backups not usually required unless you need geo-redundancy
▪ Security - Many different things to secure!
• Kerberos for user, service, and host authentication
• Authorization: Apache Ranger (Hortonworks) or Apache Sentry
(Cloudera) or MapR Control System
• Network isolation for Hadoop services
• Data at rest (HDFS encryption)
▪ Hadoop Distribution - Race to include the most Apache projects
• Top 3: Hortonworks, Cloudera, MapR
• Big companies with Hadoop offering:
– Teradata Hadoop aka TDH (Hortonworks, Cloudera, MapR)
– Oracle Big Data Appliance (Cloudera)
Spectrum of Options
▪ Cloud PaaS
• No hardware or software to manage
• Amazon S3, Azure Data Lake
▪ Cloud
• Weird space between IaaS and PaaS
• Amazon EMR
• HDInsight is more PaaS
▪ Cloud IaaS
• All virtual, no hardware to manage
• You manage all software
▪ Third party hosted
• Rackspace
• Software managed by you
▪ Appliance
• Infrastructure handled for you
• Dell, HP, Cisco, Teradata, Oracle
• Software (varies depending on vendor)
▪ Commodity
• DIY
Lessons Learned
▪ Workload isolation is hard
• Multi-tenancy is possible
• Takes work to make sure batch jobs don’t impact
the real-time streaming processes
▪ Things we like: Hive, HBase
▪ Things we don’t like: Solr, debugging
▪ Debugging / development is hard
• Lots of moving pieces
• Logs spread out across many machines
• Development environments require a lot of software
• Distributed systems just work differently
Questions?
▪ Hortonworks Community
• https://community.hortonworks.com/answers/
index.html
▪ Kit Menke
• @kitmenke on Twitter
Resources
▪ Lambda Architecture
• http://lambda-architecture.net/
▪ Kappa Architecture
• http://kappa-architecture.com/
▪ Kappa Architecture - Our Experience by ASPgems
• http://events.linuxfoundation.org/sites/events/files/slides/
ASPgems%20-%20Kappa%20Architecture.pdf
▪ Apache Hadoop YARN – Multi-Tenancy, Capacity
Scheduler & Preemption - StampedeCon 2015
• http://www.slideshare.net/StampedeCon/apache-hadoop-
yarn-multitenancy-capacity-scheduler-preemption-
stampedecon-2015


Editor's Notes

• Slide 7: Explain our use case: expanding reporting windows and shrinking ETL windows