Big Data Management System: Smart SQL Processing Across Hadoop and your Data Warehouse
  • InputFormat. Hadoop relies on the job's input format to do three things: 1. Validate the input configuration for the job (i.e., check that the data is there). 2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing. 3. Create the RecordReader implementation used to generate key/value pairs from the raw InputSplit; these pairs are sent one by one to their mapper. A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the "start" is the byte position in the file where the RecordReader should start generating key/value pairs, and the "end" is where it should stop reading records. These are not hard boundaries as far as the API is concerned; nothing stops a developer from reading the entire file for each map task. While reading the entire file is not advised, reading past the boundaries is often necessary to ensure that a complete record is generated.
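The "soft boundary" behavior described above can be sketched in plain Java. This is a hypothetical reader over newline-delimited bytes, not Hadoop's actual LineRecordReader: a split skips any partial record at its start (the previous split owns it) and reads its last record to completion even when that record crosses the split's end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a record reader honoring "soft" split boundaries.
// It starts at the first record after `start` and keeps reading until a
// record *begins* past `end`, so a record straddling the boundary is
// emitted exactly once, by exactly one split.
public class SoftBoundaryReader {
    public static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Skip the partial record at the front; the previous split owns it.
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        // Emit every record whose first byte lies inside [start, end).
        while (pos < end && pos < data.length) {
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++; // may read past `end`
            records.add(new String(data, recStart, pos - recStart));
            pos++; // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbravo\ncharlie\n".getBytes();
        // A split boundary at byte 8 falls inside "bravo": split 1 reads it
        // whole (past its end), split 2 skips the fragment and starts at "charlie".
        System.out.println(readSplit(data, 0, 8));   // [alpha, bravo]
        System.out.println(readSplit(data, 8, 20));  // [charlie]
    }
}
```

Each record lands in exactly one map task, which is the property the speaker note is describing.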

Presentation Transcript

  • Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted. Smart SQL Processing for Databases, Hadoop, and Beyond. Dan McClary, Ph.D., Big Data Product Management, Oracle. June 2014.
  • Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
  • Databases, Hadoop, and Beyond: 1. How and why companies are using big data. 2. Making Hadoop a first-class citizen. 3. Smarter SQL processing.
  • Big Data Customer Snapshot. Big Data Analytic Services: R&D, cross-property analytics, massive ingestion; a consolidated data science platform (BDA + Exadata). Business Transformation: a leading Spanish bank with more than 13M customers; collect & unify all relevant information (BDA + Exadata). Innovative Network Defense: Hadoop and NoSQL DB for data of different speeds; detect 0-days, uncover intrusions (BDA + Exadata).
  • Exploit the Strengths of Both Systems. (Chart: Hadoop vs. RDBMS rated on tooling maturity, stringent functionals, ACID transactions, security, variety of data formats, release pace, ETL simplicity, cost-effective data storage, ingestion rate, and business interoperability.) Hadoop is good at some things; databases are good at others. Don't reinvent wheels.
  • BDMS: Big Data Management System. Run the Business (Relational): integrate existing systems; support mission-critical tasks; protect existing expenditures; ensure skills relevance. Change the Business (Hadoop): disrupt competitors; disintermediate supply chains; leverage new paradigms; exploit new analyses. Scale the Business (NoSQL): serve faster; meet mobile challenges; scale out economically.
  • Remarkable Innovation: the Hadoop ecosystem.
  • Innovation Breeds Challenge. The Hadoop ecosystem also brings: operations, languages, custom assembly, HW/SW optimization, security, redundancy, integration, support complexity, APIs in flux, constant upgrades, and new skill sets.
  • Building for Database Operations at Scale: an engineered system for Oracle Database. Intelligent storage, Smart Scan, storage indexing, advanced compression, optimized network protocols, easy upgrades, easy consolidation.
  • Building for Hadoop Operations at Scale: an engineered system for Hadoop & NoSQL. Integrated enterprise management, OOB authentication, auditing, role-based access control, encryption, high availability, easy upgrades, rapid provisioning.
  • Real Barriers to Adopting Big Data: the platform is not the problem. Skills: Hadoop requires new expertise; let experts be experts, and ensure experts can work together. Integration: prevent Hadoop from becoming a silo. Security: need clear routes to governance and enforcement.
  • How do we make Hadoop a first-class citizen?
  • SQL
  • Why?
  • 40 Years of SQL (1974 to 2014): SELECT dept, sum(salary) FROM emp, dept WHERE dept.empid = emp.empid GROUP BY dept. Still works, faster and in more places.
  • SQL on Hadoop is Obvious: Stinger.
  • Data Lives in Many Places: profit and loss in relational databases, application logs in Hadoop, customer profiles in NoSQL. SQL spans them all.
  • The Challenge is ON. Create a system that: gives you the full power of SQL; requires no changes to application code; gives you a single view of all data stored in RDBMS and in Hadoop (and beyond); requires no changes to Hadoop or your data; delivers the best possible performance on your Hadoop data.
  • Smart SQL Processing on Hadoop (and more) data.
  • 100% of you are wondering how we do this!
  • BDMS Requirements: full power of SQL and advanced analytics; no changes to application code; single view of all data; fastest performance; no changes to Hadoop. Plus: unified metadata across RDBMS & Hadoop; SQL access to NoSQL.
  • How did we do this? 1. Give database queries the ability to act as a Hadoop client. 2. Expand the database metadata to understand Hadoop objects. 3. Add services to Hadoop to execute and optimize data requests.
  • Teaching Oracle About Hadoop.
  • How does MapReduce process data? Scan and row creation need to work on "any" data format, so user-defined Java classes are used to scan the data and create the rows. RecordReader => scans data (keys and values). InputFormat => defines parallelism. (Diagram: data node disk, scan, create rows, consumer.)
  • How does Hive help? Definitions are represented as tables in the Hive Metastore, and Hive leverages a SerDe (a Java class) to define columns on the rows generated. SerDe => creates columns. RecordReader => scans data (keys and values). InputFormat => defines parallelism. (Diagram: data node disk, scan, create rows & columns, consumer.)
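The SerDe's role of turning raw rows into named columns can be sketched with a hypothetical delimited-text deserializer. This is illustrative only, not Hive's actual SerDe interface; the class and column names are assumptions.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the SerDe's job: the RecordReader yields a raw
// row; a deserializer maps it onto the declared columns so a SQL engine
// can project ca_state, ca_zip, etc. by name instead of by byte offset.
public class DelimitedDeserializer {
    private final List<String> columnNames;
    private final String delimiter;

    public DelimitedDeserializer(List<String> columnNames, String delimiter) {
        this.columnNames = columnNames;
        this.delimiter = delimiter;
    }

    // Resolve one column from one raw row, mirroring (very loosely)
    // what a Hive SerDe's deserialize() produces for downstream operators.
    public String get(String rawRow, String column) {
        String[] fields = rawRow.split(delimiter, -1);
        int idx = columnNames.indexOf(column);
        return idx >= 0 && idx < fields.length ? fields[idx] : null;
    }

    public static void main(String[] args) {
        DelimitedDeserializer serde = new DelimitedDeserializer(
            Arrays.asList("ca_customer_id", "ca_street_number", "ca_state", "ca_zip"), ",");
        String row = "42,1500,CA,94065";           // raw bytes from the RecordReader
        System.out.println(serde.get(row, "ca_state")); // CA
    }
}
```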
  • Publish Hadoop Metadata to Oracle Catalog (Big Data Appliance + Hadoop on one side; Exadata + Oracle Database with its catalog on the other). The Hive metadata held by the HDFS NameNode is published as an external table definition:

    create table customer_address
    ( ca_customer_id   number(10,0)
    , ca_street_number char(10)
    , ca_state         char(2)
    , ca_zip           char(10)
    )
    organization external
    ( TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
      ACCESS PARAMETERS (com.oracle.bigdata.cluster hadoop_cl_1)
      LOCATION ('hive://customer_address')
    )
  • Publish Hadoop Metadata to Oracle Catalog (continued). The same external table definition carries the Hive artifacts needed to read the data: SerDe, RecordReader, InputFormat, and StorageHandlers!
  • Executing Queries on Hadoop. Example: Select c_customer_id, c_customer_last_name, ca_county From customers, customer_address where c_customer_id = ca_customer_id and ca_state = 'CA'. The database determines data locations, data structure, and parallelism, then sends the data request and context to specific data nodes. There's a bottleneck here!
  • Making SQL Processing Smarter.
  • What Can Big Data Learn from Exadata? Minimized data movement means performance: Smart Scan filters data as it streams from disk; storage indexing ensures only relevant data is read; caching makes frequently accessed data faster to read.
  • Executing Queries on Hadoop, smarter. For the same query, "tables" on the data nodes do the I/O and Smart Scan: filter rows and project columns. Only relevant data moves: relevant rows, relevant columns. The join with database data is applied afterward. Note: this also works without Hive definitions, as the underlying HDFS access concepts apply.
  • Storage Indexes: Optimizing Scans on Hadoop. Automatically collect and store the minimum and maximum value within a storage unit. Before scanning a storage unit, verify whether the data required falls within the min-max range; if not, skip scanning the block and reduce scan time. Note: this also works without Hive definitions; simply leverage the SerDe.
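The min/max skipping idea can be sketched as follows. This is a toy model of the storage-index concept described on the slide, not the actual implementation: each storage unit records the min and max of an indexed column, and an equality scan skips any unit whose range cannot contain the wanted value.

```java
import java.util.List;

// Toy sketch of a storage index: per-unit min/max lets a scan skip
// whole storage units without reading their data.
public class StorageIndexDemo {
    static final class Unit {
        final int[] values;
        final int min, max; // collected automatically when the unit is written
        Unit(int... values) {
            this.values = values;
            int lo = values[0], hi = values[0];
            for (int v : values) { lo = Math.min(lo, v); hi = Math.max(hi, v); }
            this.min = lo;
            this.max = hi;
        }
    }

    // Return how many units an equality scan actually has to read.
    static int unitsScanned(List<Unit> units, int wanted) {
        int scanned = 0;
        for (Unit u : units) {
            if (wanted < u.min || wanted > u.max) continue; // skip: value can't be here
            scanned++; // real I/O and row filtering would happen here
        }
        return scanned;
    }

    public static void main(String[] args) {
        List<Unit> units = List.of(new Unit(1, 5, 9), new Unit(10, 14), new Unit(20, 30));
        System.out.println(unitsScanned(units, 12)); // only the middle unit is read
    }
}
```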
  • What Does This Mean for Me?
  • What if You Could Query All Data? Store JSON data unconverted in Hadoop (Oracle Big Data Appliance), store business-critical data in Oracle (Oracle Database 12c), and analyze all of it via SQL: select customers_document.address.state, revenue from customers, sales where customers_document.id=sales.custID group by customers_document.address.state; Pushed down to Hadoop: JSON parsing, column projection, and a Bloom filter for a faster join.
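The Bloom-filter join pushdown mentioned above can be illustrated with a toy filter. The bit size and two-hash scheme are assumptions for the sketch, not the engine's actual implementation: the database side builds a filter from its join keys, ships it to the Hadoop side, and rows that cannot possibly join are dropped before they ever cross the network.

```java
import java.util.BitSet;
import java.util.List;
import java.util.stream.Collectors;

// Toy Bloom filter for join pre-filtering: no false negatives, so every
// genuinely matching row survives; occasional false positives only cost
// a little extra network traffic, never correctness.
public class BloomJoinSketch {
    static final int BITS = 1 << 16;

    // Database side: build the filter from the join keys it holds.
    static BitSet build(List<String> keys) {
        BitSet bits = new BitSet(BITS);
        for (String k : keys) {
            bits.set(Math.floorMod(k.hashCode(), BITS));
            bits.set(Math.floorMod(k.hashCode() * 31 + 17, BITS));
        }
        return bits;
    }

    // Hadoop side: true means "might join", false means "definitely not".
    static boolean mightMatch(BitSet bits, String key) {
        return bits.get(Math.floorMod(key.hashCode(), BITS))
            && bits.get(Math.floorMod(key.hashCode() * 31 + 17, BITS));
    }

    public static void main(String[] args) {
        BitSet filter = build(List.of("cust-1", "cust-7"));   // keys seen in sales
        List<String> hadoopRows = List.of("cust-1", "cust-2", "cust-7", "cust-9");
        List<String> shipped = hadoopRows.stream()
            .filter(k -> mightMatch(filter, k))
            .collect(Collectors.toList());
        // cust-1 and cust-7 are always shipped; others only on a false positive.
        System.out.println(shipped);
    }
}
```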
  • What if You Could Govern All Data? Apply advanced security on Hadoop: masking/redaction, Virtual Private Database, fine-grained access control. For example, a redaction policy on the external table:

    DBMS_REDACT.ADD_POLICY(
      object_schema => 'txadp_hive_01',
      object_name   => 'customer_address_ext',
      column_name   => 'ca_street_name',
      policy_name   => 'customer_address_redaction',
      function_type => DBMS_REDACT.RANDOM,
      expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', ''REDACTION_TESTER'')=''TRUE'''
    );
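The visible effect of a RANDOM redaction policy can be approximated with a small sketch. This is not Oracle's DBMS_REDACT internals; preserving length and character class is an assumption made for illustration, so query shapes survive while the content does not.

```java
import java.util.Random;

// Hypothetical sketch of random redaction: when the policy expression is
// true for the session, each character of the column value is replaced
// with a random one of the same class (digit stays digit, letter stays
// letter), so the value's length and shape survive but the data does not.
public class RedactionSketch {
    static String redactRandom(String value, Random rng) {
        StringBuilder out = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (Character.isDigit(c)) out.append((char) ('0' + rng.nextInt(10)));
            else if (Character.isLetter(c)) out.append((char) ('a' + rng.nextInt(26))); // case not preserved in this sketch
            else out.append(c); // keep punctuation and spacing
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String street = "1500 Oracle Pkwy"; // value of ca_street_name
        System.out.println(redactRandom(street, new Random()));
    }
}
```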
  • Oracle's Big Data Management System: one fast SQL query, on all your data. Oracle SQL on Hadoop and beyond: with a Smart Scan service as in Exadata; with native SQL operators; with the security and certainty of Oracle Database. Happy 40th Birthday, SQL.
  • http://www.oracle.com/bigdatabreakthrough @dan_mcclary