
Innovation in the Data Warehouse - StampedeCon 2016

Aug. 9, 2016

  1. Innovation in the Data Warehouse • Kit Menke, Software Architect • StampedeCon 2016 • July 27, 2016
  2. Agenda ▪ Use Case ▪ Architectures ▪ Decision Points
  3. Enterprise Holdings, Inc. ▪ Our Business • 9 thousand locations • 80 countries • 93 thousand employees • 1.7 million vehicles ▪ Data Warehouse • Near capacity: using roughly 75 of 80 terabytes • Streaming and batch data feeds from over 50 internal systems & external sources • 100+ databases and 22+ thousand tables • Around 1 billion queries executed per month • Over 45,000 reporting users with 5+ million report executions every month • Statistical Modeling & Advanced Analytics: 40+ projects implemented for predictive & diagnostic analytics
  4. Data Warehouse - Present
  5. Data Warehouse Growth
  6. Challenges – Current Platform ▪ System Capacity Constraints • Overall current system utilization is high • Space & CPU constraints • Most of these challenges can be overcome by adding more Teradata capacity or by augmentation ▪ Use cases not a good fit for the Teradata EDW • Unstructured data • Source structures changing frequently • Data for exploration, discovery, & analytics • Staging, transient, & history data • These challenges can be overcome by augmentation ▪ Bottom line: improved agility & greater value
  7. Augmentation Recommendation: Hadoop ▪ Leverage Hadoop to complement the Teradata EDW • Hybrid approach ▪ The Hortonworks distribution of Hadoop • Compatibility/integration with the Teradata EDW to achieve a high degree of interoperability ▪ Intent is not to have a centralized Hadoop service • EDW augmentation only
  8. Data Warehouse - Future
  9. Architectures ▪ Data warehouse augmentation contains streaming and batch use cases ▪ Three Big Data architectures to explore: 1. Batch 2. Lambda 3. Kappa
  10. Batch
  11. Batch ▪ Land data into Hadoop first ▪ ETL in Hadoop to build reporting tables and publish to Teradata ▪ Archive old data from Teradata DB ▪ Data available for analysis in Hive ▪ Great for semi-structured data files ▪ But… too slow for streaming data
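The batch flow above can be sketched in miniature: every landed record is processed in one pass to build a reporting table, the way an ETL job in Hadoop would before publishing to Teradata. This is plain Python standing in for Hive/ETL; the record fields and the rental counts are hypothetical.

```python
# Batch sketch: process ALL landed records in one pass -> output table.
from collections import defaultdict

landed_records = [
    {"branch": "STL", "rentals": 3},
    {"branch": "STL", "rentals": 2},
    {"branch": "ORD", "rentals": 5},
]

def build_reporting_table(records):
    """One batch pass over the full data set, aggregating per branch."""
    table = defaultdict(int)
    for rec in records:
        table[rec["branch"]] += rec["rentals"]
    return dict(table)

print(build_reporting_table(landed_records))
# {'STL': 5, 'ORD': 5}
```

The whole data set is re-read on every run, which is why the slide calls this out as too slow for streaming data.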
  12. Lambda
  13. Lambda ▪ Attempts to combine batch and streaming to get benefits from both ▪ Batch layer is comprehensive and accurate ▪ Streaming layer is fast but might only be able to keep recent data ▪ Potentially have to maintain two codebases
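A minimal sketch of the lambda merge step, assuming a comprehensive batch view plus a small real-time view that are combined at query time (the view contents and numbers are hypothetical):

```python
# Lambda sketch: accurate batch view + fast speed view, merged per query.
batch_view = {"STL": 100, "ORD": 80}   # comprehensive, recomputed in batch
speed_view = {"STL": 4, "DEN": 2}      # recent events only, kept small

def query(branch):
    # Query-time merge: batch result corrected by the streaming delta.
    return batch_view.get(branch, 0) + speed_view.get(branch, 0)

print(query("STL"))  # 104
```

The two views are built by separate pipelines, which is the "two codebases" maintenance cost the slide warns about.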
  14. Kappa
  15. Kappa ▪ Everything is a stream (no batch!) ▪ Depends largely on your log data store, usually Kafka ▪ All raw data is stored in Kafka ▪ Much simpler architecture than lambda • New version? Re-deploy the app, reprocess from the start, and generate a new output table • Once complete, point the app to the new output table
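The kappa reprocessing idea can be sketched with an in-memory list standing in for the Kafka log: a new version of the app replays every event from the start into a new output table, and readers switch over once the replay completes (the event shapes and the transforms are hypothetical):

```python
# Kappa sketch: an append-only log is the source of truth; reprocessing
# means replaying it from offset 0 into a fresh output table.
log = [("STL", 3), ("STL", 2), ("ORD", 5)]  # stand-in for a Kafka topic

def reprocess(log, transform):
    """Replay the full log from the start, building a new output table."""
    new_table = {}
    for branch, n in log:
        new_table[branch] = new_table.get(branch, 0) + transform(n)
    return new_table

table_v1 = reprocess(log, lambda n: n)       # current app version
table_v2 = reprocess(log, lambda n: n * 2)   # re-deployed "new version"
current = table_v2                           # point readers at the new table
```

There is one codebase and one processing path; upgrading is a replay plus a table swap, not a second pipeline.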
  16. Choosing an Architecture ▪ Batch – process data in batches • All data processed in batches to create an output ▪ Lambda – split streaming data into batch and real-time • Stream processing for the data you need fast and the rest is batch processed ▪ Kappa – everything is a stream • All data is processed as a stream even when it needs to be reprocessed
  17. Implementing an Architecture ▪ Requirements for the use case drive the architecture ▪ Walk through decision points 1. Cloud or on premises 2. Physical or virtual machines 3. Cluster workload ▪ Plus others!
  18. Cloud vs on premises ▪ Scalability • Much easier to scale a cloud solution • Physical hardware requires an infrastructure team to manage ▪ Data source location (data gravity) / integration points • Cluster should be as close as possible to your data source • Cloud is a good option for internet data sources ▪ Cloud offerings • Hadoop: Azure HDInsight, Amazon EMR, Google Cloud • Integration with other PaaS services ▪ Network • Bandwidth to/from the cloud implementation
  19. Physical vs virtual ▪ Performance • Physical hardware will perform better; Hadoop is designed with physical hardware in mind ▪ Maintenance • No hardware to maintain for virtual servers ▪ Time to market • Virtual machines are much faster to provision • For physical hardware, if the infrastructure team is a roadblock, an appliance is a good option instead of commodity hardware ▪ Development and test environments make more sense to virtualize
  20. Workload ▪ Streaming • Running 24/7 • Needs dedicated resources ▪ Batch • Scheduled • Periods of high utilization (scalability) ▪ Multi-tenancy • Blended workloads • YARN (queues, node labels) • Think about isolating nodes for real-time workloads
  21. Other considerations ▪ Disaster recovery • Data is locally redundant • Backups not usually required unless you need geo-redundancy ▪ Security - many different things to secure! • Kerberos for user, service, and host authentication • Authorization: Apache Ranger (Hortonworks), Apache Sentry (Cloudera), or MapR Control System • Network isolation for Hadoop services • Data at rest (HDFS encryption) ▪ Hadoop distribution - race to include the most Apache projects • Top 3: Hortonworks, Cloudera, MapR • Big companies with Hadoop offerings: – Teradata Hadoop aka TDH (Hortonworks, Cloudera, MapR) – Oracle Big Data Appliance (Cloudera)
  22. Spectrum of Options ▪ Cloud PaaS • No hardware or software to manage • Amazon S3, Azure Data Lake ▪ Cloud • Weird space between IaaS and PaaS • Amazon EMR • HDInsight is more PaaS ▪ Cloud IaaS • All virtual, no hardware to manage • You manage all software ▪ Third party hosted • Rackspace • Software managed by you ▪ Appliance • Infrastructure handled for you • Dell, HP, Cisco, Teradata, Oracle • Software (varies depending on vendor) ▪ Commodity • DIY
  23. Lessons Learned ▪ Workload isolation is hard • Multi-tenancy is possible • Takes work to make sure batch jobs don’t impact the real-time streaming processes ▪ Things we like: Hive, HBase ▪ Things we don’t like: Solr, debugging ▪ Debugging / development is hard • Lots of moving pieces • Logs spread out across many machines • Development environments require a lot of software • Distributed systems just work differently
  24. Questions? ▪ Hortonworks Community • https://community.hortonworks.com/answers/index.html ▪ Kit Menke • @kitmenke on Twitter
  25. Resources ▪ Lambda Architecture • http://lambda-architecture.net/ ▪ Kappa Architecture • http://kappa-architecture.com/ ▪ Kappa Architecture - Our Experience by ASPgems • http://events.linuxfoundation.org/sites/events/files/slides/ASPgems%20-%20Kappa%20Architecture.pdf ▪ Apache Hadoop YARN – Multi-Tenancy, Capacity Scheduler & Preemption - StampedeCon 2015 • http://www.slideshare.net/StampedeCon/apache-hadoop-yarn-multitenancy-capacity-scheduler-preemption-stampedecon-2015

Editor's Notes

  1. Explain our use case: expanding reporting windows and shrinking ETL windows