
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle DBAs



Most DBAs are aware that something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but are less clear about what each component in the stack does, what problem each part solves, and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.

Published in: Data & Analytics


  2. 2. About The Presenter
    • Oracle ACE Director, Independent Analyst
    • Past ODTUG Exec Board Member + Oracle Scene Editor
    • Author of two books on Oracle BI
    • Co-founder & CTO of Rittman Mead
    • 15+ Years in Oracle BI, DW, ETL + now Big Data
    • Host of the Drill to Detail Podcast
    • Based in Brighton & work in London, UK
  4. 4. “Hi Mark, In things I have seen and read quite often people start with a high-level overview of a product (e.g. Hadoop, Kafka), then describe the technical concepts (using all the appropriate terminology) …” “but I am usually left missing something. I think it's around the area of what problems these technologies are solving and how they are doing it? Without that context I'm finding it all very academic” “Many people say traditional systems will still be needed. Are these new technologies solving completely different problems to those handled by traditional IT? Is there an overlap?”
  5. 5. 20 Years in Old-school BI & Data Warehousing
    • Started back in 1996 on a bank Oracle DW project
    • Our tools were Oracle 7.3.4, SQL*Plus, PL/SQL and shell scripts
    • Data warehouses provided a unified view of the business
    • Single place to store key data and metrics
    • Joined-up view of the business
    • Aggregates and conformed dimensions
    • ETL routines to load, cleanse and conform data
    • BI tools for simple, guided access to information
    • Tabular data access using SQL-generating tools
    • Drill paths, hierarchies, facts, attributes
    • Fast access to pre-computed aggregates
    • Packaged BI for fast-start ERP analytics
  6. 6. Data Warehousing and BI at “Peak Oracle”
  7. 7. Oracle Data Management Platform as of Today
  8. 8. What Happened?
  9. 9. Let’s Go Back to 2003…
  10. 10. Google File System and MapReduce
    • Google needed to store and query their vast amount of server log files
    • And wanted to do so using cheap, commodity hardware
    • Google File System and MapReduce designed together for this use
  11. 11. Google File System + MapReduce Key Innovations
    • GFS optimised for particular task at hand - computing PageRank for sites
    • Streaming reads for PageRank calcs, block writes for crawler whole-site dumps
    • Master node only holds metadata
    • Stops client/master I/O being bottleneck, also acts as traffic controller for clients
    • Simple design, optimised for specific Google need
    • MapReduce focused on simple computations on abstraction framework
    • Select & filter (MAP) and reduce (aggregate) functions, easy to distribute on cluster
    • MapReduce abstracted cluster compute, HDFS abstracted cluster storage
    • Projects that inspired Apache Hadoop + HDFS
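The MAP (select and filter) and REDUCE (aggregate) roles on this slide can be illustrated with a minimal single-process Python sketch. This is not Google's or Hadoop's implementation: the log-line format, the sample data and the names `map_fn`, `reduce_fn` and `run_mapreduce` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: "<ip> <status> <url>" log lines (format assumed).
LOG_LINES = [
    "10.0.0.1 200 /index.html",
    "10.0.0.2 404 /missing",
    "10.0.0.1 200 /about.html",
    "10.0.0.3 200 /index.html",
]

def map_fn(line):
    """MAP: select and filter. Emit (url, 1) for successful requests only."""
    ip, status, url = line.split()
    if status == "200":
        yield url, 1

def reduce_fn(key, values):
    """REDUCE: aggregate every value emitted for one key."""
    return key, sum(values)

def run_mapreduce(lines, mapper, reducer):
    # Shuffle step: group mapper output by key. On a real cluster this
    # grouping is where data moves between nodes; here it is one dict.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

hits = run_mapreduce(LOG_LINES, map_fn, reduce_fn)
print(hits)
```

Because each `map_fn` call sees only one line and each `reduce_fn` call sees only one key, both steps can be spread across many machines without the functions themselves changing.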
  12. 12. How Traditional RDBMS Data Warehousing Scaled-Up
    • Shared-Everything Architectures (e.g. Oracle RAC, Exadata)
    • Shared-Nothing Architectures (e.g. Teradata, Netezza)
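As a rough illustration of the shared-nothing idea, the sketch below hash-partitions rows across nodes so that each node aggregates only the data it owns. It is a toy model, not how Teradata or Netezza are implemented; the node count, the `node_for` helper and the sample sales rows are assumptions.

```python
import zlib

NODES = 3  # assumed cluster size

def node_for(key: str) -> int:
    """Pick the owning node with a stable hash of the distribution key."""
    return zlib.crc32(key.encode("utf-8")) % NODES

# Hypothetical (region, amount) fact rows.
sales = [("uk", 10), ("us", 20), ("uk", 5), ("fr", 7), ("us", 1)]

# Each node holds only its own partition: no shared disk, no shared memory.
partitions = [[] for _ in range(NODES)]
for region, amount in sales:
    partitions[node_for(region)].append((region, amount))

def local_totals(rows):
    """Work a single node can do without talking to any other node."""
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals

# The coordinator merges per-node results. Keys never collide because a
# given region always hashes to the same node.
merged = {}
for part in partitions:
    merged.update(local_totals(part))
print(merged)
```

In a shared-everything design, by contrast, every node could read every row, which simplifies placement but makes the shared storage and interconnect the component that has to scale.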
  13. 13. Problem #1 That Hadoop / NoSQL Solved : Scaling Affordably
  14. 14. “Oracle scales infinitely and is free. Period”
  15. 15. Cost and Complexity around Scaling DW Clusters
    • Enterprise High-End RDBMSs such as Oracle can scale
    • Clustering for single-instance DBs can scale to >PB
    • Exadata scales further by offloading queries to storage
    • Sharded databases (e.g. Netezza) can scale further
    • But cost (and complexity) become limiting factors
    • $1m/node is not uncommon
  16. 16. Hadoop’s Original Appeal to Data Warehouse Owners
    • A way of storing (non-relational) data that was cheap and easily expandable
    • Gave us a way of scaling beyond TB-size without paying $$$
    • First use-cases were offline storage, active archive of data
  17. 17. Hadoop Ecosystem Expanded Beyond MapReduce
    • Core Hadoop, MapReduce and HDFS
    • HBase and other NoSQL Databases
    • Apache Hive and SQL-on-Hadoop
    • Storm, Spark and Stream Processing
    • Apache YARN and Hadoop 2.0
  18. 18. Google BigTable, HBase and NoSQL Databases
    • Solution to the problem of storing semi-structured data at-scale
    • Built on Google File System
    • Scale for capacity e.g., webtable
      • 100,000,000,000 pages
      • 10 versions per page
      • 20 KB / version = 20 PB of data
    • Scale for throughput
      • Hundreds of millions of users
      • Tens of thousands to millions of queries/sec
      • At low-latency with high-reliability
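The webtable capacity figure can be checked with a few lines of Python. The only assumption is decimal units (1 PB = 1000^4 KB); the variable names are illustrative.

```python
pages = 100_000_000_000      # pages in the webtable example
versions_per_page = 10       # versions kept per page
kb_per_version = 20          # KB per stored version

total_kb = pages * versions_per_page * kb_per_version

# KB -> MB -> GB -> TB -> PB is four steps of 1000 (decimal units assumed).
total_pb = total_kb / 1000**4
print(total_pb)
```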
  19. 19. How BigTable Scaled Beyond Traditional RDBMSs
    • Optimised for a particular task - fast lookups of timestamp-versioned web data
    • Data stored in multidimensional map keyed on row, column + timestamp
    • Master + data tablets stored on GFS cluster nodes
    • Simple key/value lookup with client doing interpretation
    • Innovation - focus on single job with different needs to OLTP
    • Formed inspiration for Apache HBase
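The "multidimensional map keyed on row, column + timestamp" can be sketched in a few lines of Python. This is an in-memory toy, not BigTable's or HBase's API: the `VersionedMap` class, its `put`/`get` methods and the example row and column names are hypothetical.

```python
class VersionedMap:
    """Toy map keyed on (row, column, timestamp); values are opaque bytes."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [ts for ts in versions
                      if timestamp is None or ts <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]

table = VersionedMap()
table.put("com.example.www", "contents:html", 1, b"<html>v1</html>")
table.put("com.example.www", "contents:html", 5, b"<html>v2</html>")

latest = table.get("com.example.www", "contents:html")
as_of_3 = table.get("com.example.www", "contents:html", timestamp=3)
print(latest, as_of_3)
```

The store never parses the bytes it holds, which mirrors the slide's point that the lookup is a simple key/value fetch and interpretation is left to the client.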
  20. 20. Hive - Hadoop Discovers Set-Based Processing
    • Originally developed at Facebook, now foundational within Hadoop
    • SQL-like language that compiles to MapReduce, Spark, HBase
    • Solved the problem of enabling non-programmers to access big data
    • And made Hadoop data transformation and aggregation code more productive
    • JDBC and ODBC drivers for tool integration
  21. 21. Apache Hive as SQL Access Engine For Everything
    • Hive is extensible to help with accessing and integrating new data sets
    • SerDes : Serializer-Deserializers that interpret semi-structured sources
    • UDFs + Hive Streaming : User-defined functions and streaming input
    • File Formats : make use of compressed and/or optimised file storage
    • Storage Handlers : use storage other than HDFS (e.g. MongoDB)
  22. 22. Common Hadoop/NoSQL Use-Case
    • Hadoop as low-cost ETL pre-processing engine - “ETL-offload”
    • NoSQL database for landing real-time data at high speed/low latency
    • Incoming data then aggregated and stored in RDBMS DW
    [Diagram labels: Hadoop (Online, Scalable, Flexible, Cost Effective), Data Warehouse, Marts, Business Intelligence]
  23. 23. Jump Ahead to 2012…
  24. 24. Data Warehousing and ETL Needed Some Agility
    • Driven by pace of business, and user demands for more agility and control
    • Traditional IT-governed data loading not always appropriate
    • Not all data needed to be modelled right-away
    • Not all data suited storing in tabular form
    • New ways of analyzing data beyond SQL
      • Graph analysis
      • Machine learning
  25. 25. Problem #2 That Hadoop / NoSQL Solved : Making Data Warehousing Agile
  26. 26. Advent of Schema-on-Read, and Data Lakes
    • Storing data in format it arrived in, and then applying schema at query time
    • Suits data that may be analysed in different ways by different tools
    • In addition, some datatypes may have schema embedded in file format
    • Key benefit - fast arriving data of unknown value can get to users earlier
    • Made possible by tools such as Apache Hive + SerDes, Apache Drill and self-describing file formats, HDFS storage
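Schema-on-read can be sketched in plain Python: the raw records are stored exactly as they arrived, and each consumer applies its own schema only when it reads. This is a conceptual illustration rather than how Hive SerDes or Apache Drill work internally; the sample events, the `read_with_schema` helper and both schemas are assumptions.

```python
import json

# Raw events kept exactly as they arrived; fields vary from record to record.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2016-01-01"}',
    '{"user": "bob", "amount": "3", "device": "mobile"}',
    '{"user": "alice", "amount": "7.25"}',
]

def read_with_schema(raw_lines, schema):
    """Apply a {field: cast} schema at read time; missing fields become None."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: cast(record[field]) if field in record else None
               for field, cast in schema.items()}

# Two consumers read the same raw data through different schemas.
spend_schema = {"user": str, "amount": float}
device_schema = {"user": str, "device": str}

total_by_user = {}
for row in read_with_schema(RAW_EVENTS, spend_schema):
    total_by_user[row["user"]] = total_by_user.get(row["user"], 0.0) + row["amount"]

devices = [row["device"] for row in read_with_schema(RAW_EVENTS, device_schema)]
print(total_by_user, devices)
```

Nothing had to be modelled before the data landed, which is the agility point: a new field such as `device` becomes queryable as soon as someone writes a schema that mentions it.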
  27. 27. Meet the New Data Warehouse : The “Data Lake”
    • Data now landed in Hadoop clusters, NoSQL databases and Cloud Storage
    • Flexible data storage platform with cheap storage, flexible schema support + compute
    • Solves the problem of how to store new types of data + choose best time/way to process it
    • Hadoop/NoSQL increasingly used for all store/transform/query tasks
    [Diagram labels: Hadoop Platform with Data Transfer, Data Access, Data Factory and Data Reservoir; File Based, Stream Based and ETL Based Integration; Business Intelligence Tools; Discovery & Development Labs (safe & secure discovery and development environment; data sets and samples; models and programs); Marketing / Sales Applications; Models; Machine Learning; Segments; Operational Data; Transactions; Customer Master Data; Unstructured Data; Voice + Chat Transcripts; Raw Customer Data (data stored in the original format, usually files, such as SS7, ASN.1, JSON etc.); Mapped Customer Data (data sets produced by mapping and transforming raw data)]
  28. 28. Hadoop 2.0 and YARN (“Yet Another Resource Negotiator”). Key Innovation : Separating how data is stored from how it is processed
  29. 29. Hadoop 2.0 - Enabling Multiple Query Engines
    • Hadoop started by being synonymous with MapReduce, and Java coding
    • But YARN (Yet Another Resource Negotiator) broke this dependency
    • Hadoop now just handles resource management
    • Multiple different query engines can run against data in-place
      • General-purpose (e.g. MapReduce)
      • Graph processing
      • Machine Learning
      • Real-Time Processing
  30. 30. Technologies Emerged to Bridge Old/New World
  31. 31. FAST FORWARD TO NOW…
  32. 32. Elastically-Scalable Data Warehouse-as-a-Service
    • New generation of big data platform services from Google, Amazon, Oracle
    • Combines three key innovations from earlier technologies:
      • Organising of data into tables and columns (from RDBMS DWs)
      • Massively-scalable and distributed storage and query (from Big Data)
      • Elastically-scalable Platform-as-a-Service (from Cloud)
  33. 33. … Which Is What I’m Working On Right Now
  34. 34. Example Architecture : Google BigQuery
  36. 36. What Problem Did Analytics-as-a-Service Solve?
    • On-premise Hadoop, even with simple resilient clustering, will hit limits
    • Clusters can reach 5000+ nodes, need to scale-up for demand peaks etc
    • Scale limits are encountered way beyond those for DWs…
    • … but future is elastically-scaled, query and compute-as-a-service
    Oracle Big Data Cloud Compute Edition Free $300 developer credit at:
  37. 37. BigQuery : Big Data Meets Data Warehousing
    • And things come full-circle … analytics typically requires tabular data
    • Google BigQuery based on DremelX massively-parallel query engine
    • But stores data in columnar form and provides SQL interface
    • Solves the problem of providing DW-like functionality at scale, as-a-service
    • This is the future … ;-)
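To show why columnar storage suits analytic queries, the sketch below pivots a handful of rows into per-column lists and runs aggregates that touch only the columns they need. It illustrates the general idea only and says nothing about BigQuery's actual storage format; the sample orders and variable names are assumptions.

```python
# Hypothetical row-oriented records, as an OLTP system might hold them.
rows = [
    {"order_id": 1, "region": "uk", "amount": 10.0},
    {"order_id": 2, "region": "us", "amount": 20.0},
    {"order_id": 3, "region": "uk", "amount": 5.0},
]

# Columnar layout: one list per column, with values kept in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# SELECT SUM(amount): reads a single column and ignores the rest.
total = sum(columns["amount"])

# SELECT SUM(amount) WHERE region = 'uk': reads two columns, still not order_id.
uk_total = sum(amount
               for region, amount in zip(columns["region"], columns["amount"])
               if region == "uk")
print(total, uk_total)
```

On wide tables the saving grows with the number of columns a query does not mention, and values of one type stored together also tend to compress well.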