Big Data Management System: Smart SQL Processing Across Hadoop and your Data Warehouse

Published in: Technology, Business
  • InputFormat

    Hadoop relies on the input format of the job to do three things:

    1. Validate the input configuration for the job (i.e., check that the data is there).
    2. Split the input blocks and files into logical chunks of type InputSplit, each of which is assigned to a map task for processing.
    3. Create the RecordReader implementation used to generate key/value pairs from the raw InputSplit. These pairs are sent one by one to their mapper.

    A RecordReader uses the data within the boundaries created by the input split to generate key/value pairs. In the context of file-based input, the “start” is the byte position in the file where the RecordReader should start generating key/value pairs, and the “end” is where it should stop reading records. These are not hard boundaries as far as the API is concerned: nothing stops a developer from reading the entire file for each map task. While reading the entire file is not advised, reading outside the boundaries is often necessary to ensure that a complete record is generated.
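The boundary behaviour described in this note can be illustrated with a small Python sketch. It is not Hadoop code: `read_split` and `make_splits` are hypothetical helpers, records are assumed to be newline-delimited, and the ownership convention (a reader skips the line it lands in and finishes the last line it starts) is one common way to guarantee that every record is produced exactly once.

```python
from typing import Iterator, List, Tuple


def read_split(data: bytes, start: int, end: int) -> Iterator[Tuple[int, str]]:
    """Yield (byte_offset, line) for every line that begins in (start, end].

    Hypothetical helper, not a Hadoop API. The first split (start == 0) also
    owns the line at offset 0.
    """
    pos = start
    if start != 0:
        # The line containing (or beginning at) `start` belongs to the
        # previous split, so skip ahead to the byte after the next newline.
        newline = data.find(b"\n", start)
        if newline == -1:
            return
        pos = newline + 1
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        stop = len(data) if newline == -1 else newline
        # `stop` may lie beyond `end`: the reader finishes the record it began.
        yield pos, data[pos:stop].decode("utf-8")
        pos = stop + 1


def make_splits(size: int, split_size: int) -> List[Tuple[int, int]]:
    """Cut [0, size) into fixed-width byte ranges, ignoring record boundaries."""
    return [(s, min(s + split_size, size)) for s in range(0, size, split_size)]


DATA = b"alpha\nbravo\ncharlie\ndelta\n"

for start, end in make_splits(len(DATA), 10):
    print((start, end), [line for _, line in read_split(DATA, start, end)])
```

With 10-byte splits, the first reader returns "alpha" and "bravo" even though "bravo" ends at byte 11, past its nominal end, and the last reader returns nothing because the line that starts at byte 20 was already claimed by its predecessor.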
  • Big Data Management System: Smart SQL Processing Across Hadoop and your Data Warehouse

    1. Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Smart SQL Processing for Databases, Hadoop, and Beyond | Dan McClary, Ph.D., Big Data Product Management, Oracle | June 2014 | Oracle Confidential – Internal/Restricted/Highly Restricted
    2. Safe Harbor Statement: The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
    3. Databases, Hadoop, and Beyond
       1. How and Why Companies are Using Big Data
       2. Making Hadoop a first-class citizen
       3. Smarter SQL Processing
    4. Big Data Customer Snapshot
       • Big Data Analytic Services
         – R&D, Cross-property analytics, massive ingestion
         – Consolidated data science platform
       • Business Transformation
         – Leading Spanish Bank > 13M customers
         – Collect & unify all relevant information
       • Innovative Network Defense
         – Hadoop and NoSQL DB for data of different speeds
         – Detect 0-days, uncover intrusions
       [Diagram labels, repeated three times: BDA, Exadata]
    5. Exploit the Strengths of Both Systems
       [Radar chart, scale 0 to 5, series Hadoop and RDBMS; axes: Tooling maturity, Stringent Functionals, ACID transactions, Security, Variety of data formats, Release Pace, ETL simplicity, Cost effectively store data, Ingestion rate, Business Interoperability]
       • Hadoop is good at some things
       • Databases are good at others
       • Don’t reinvent wheels
    6. BDMS: Big Data Management System
       • Run the Business
         – Integrate existing systems
         – Support mission-critical tasks
         – Protect existing expenditures
         – Ensure skills relevance
       • Change the Business
         – Disrupt competitors
         – Disintermediate supply chains
         – Leverage new paradigms
         – Exploit new analyses
       • Scale the Business
         – Serve faster
         – Meet mobile challenges
         – Scale-out economically
       [Diagram labels: Relational, Hadoop, NoSQL]
    7. Remarkable Innovation [Graphic label: Hadoop Ecosystem]
    8. Innovation Breeds Challenge: Operations, Languages, Custom assembly, HW/SW optimization, Security, Redundancy, Integration, Support, Complexity, APIs in flux, Constant upgrade, Skill sets [Graphic label: Hadoop Ecosystem]
    9. Building for Database Operations at Scale: Intelligent Storage, Smart Scan, Storage Indexing, Advanced Compression, Optimized Network Protocols, Easy Upgrades, Easy Consolidation. Engineered System for Oracle Database
    10. Building for Hadoop Operations at Scale: Integrated Enterprise Management, OOB Authentication, Auditing, Role-based Access Control, Encryption, High Availability, Easy Upgrades, Rapid Provisioning. Engineered System for Hadoop & NoSQL
    11. Real Barriers to Adopting Big Data: The Platform is not the Problem
        • Skills
          – Hadoop requires new expertise
          – Let experts be experts!
          – Ensure experts can work together
        • Integration
          – Prevent Hadoop from becoming a silo
        • Security
          – Need clear routes to governance or enforcement
    12. How do we make Hadoop a first-class citizen?
    13. SQL
    14. Why?
    15. 40 Years of SQL (YEAR 1974, YEAR 2014)

            SELECT dept, sum(salary)
            FROM emp, dept
            WHERE dept.empid = emp.empid
            GROUP BY dept

        Still works. Faster and in more places.
    16. SQL on Hadoop is Obvious [Graphic label: Stinger]
    17. Data Lives in Many Places [Diagram labels: Profit and Loss, Relational, Hadoop, Application Logs, NoSQL, Customer Profiles, SQL]
    18. The Challenge is ON. Create a system that:
        • Gives you the full power of SQL
        • Requires no changes to application code
        • Gives you a single view of All Data stored in RDBMS and in Hadoop (++)
        • No changes (required) to Hadoop or my data
        • Best possible performance on my Hadoop data
    19. Smart SQL Processing on Hadoop (and more) data
    20. 100% of you are wondering how we do this!
    21. BDMS Requirements
        • Full Power of SQL and Advanced Analytics
        • No Changes to Application Code
        • Single View of All Data
        • Fastest Performance
        • No Changes to Hadoop
        +
        • Unified Metadata Across RDBMS & Hadoop
        • SQL Access to NoSQL
    22. How did we do this?
        1. Give database queries the ability to be a Hadoop client
        2. Expand the database metadata to understand Hadoop objects
        3. Add services to Hadoop to execute and optimize data requests
    23. Teaching Oracle About Hadoop
    24. How does MapReduce process data?
        • Scan and row creation needs to be able to work on “any” data format
        • User defined Java Classes are used to scan and create the rows
        RecordReader => Scans data (keys and values)
        InputFormat => Defines parallelism
        [Diagram labels: Data Node disk, SCAN, Create ROWS, Consumer]
    25. How does Hive help?
        • Definitions are represented as tables in the Hive Metastore
        • Hive leverages a SerDe (Java class) to define columns on rows generated
        SerDe => Creates columns
        RecordReader => Scans data (keys and values)
        InputFormat => Defines parallelism
        [Diagram labels: Data Node disk, SCAN, Create ROWS & COLUMNS, Consumer]
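The division of labour on the two slides above (InputFormat defines parallelism, RecordReader produces keys and values, SerDe creates columns) can be mimicked in a few lines of Python. This is only an analogy under stated assumptions: the real components are Java classes, the function names below are hypothetical, and the pipe-delimited sample rows are made up.

```python
from typing import Dict, Iterator, List, Tuple

# Made-up raw file contents; the column names are borrowed from the slides.
RAW_FILE = "101|12|CA\n102|7|NY\n103|88|CA\n104|3|TX\n"
COLUMNS = ["ca_customer_id", "ca_street_number", "ca_state"]


def get_splits(text: str, lines_per_split: int) -> List[List[str]]:
    """InputFormat role: carve the input into chunks that can be scanned in parallel."""
    lines = text.splitlines()
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]


def record_reader(split: List[str], first_key: int) -> Iterator[Tuple[int, str]]:
    """RecordReader role: turn one split into (key, value) pairs."""
    for offset, line in enumerate(split):
        yield first_key + offset, line


def deserialize(value: str) -> Dict[str, str]:
    """SerDe role: impose named columns on an otherwise opaque value."""
    return dict(zip(COLUMNS, value.split("|")))


def scan(text: str, lines_per_split: int = 2) -> List[Dict[str, str]]:
    """Run the three roles in order; a real engine would process splits concurrently."""
    rows: List[Dict[str, str]] = []
    first_key = 0
    for split in get_splits(text, lines_per_split):
        for _key, value in record_reader(split, first_key):
            rows.append(deserialize(value))
        first_key += len(split)
    return rows


for row in scan(RAW_FILE):
    print(row)
```

Only `deserialize` knows anything about columns, which is why swapping the SerDe is enough to read a different format through the same scan path.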
    26. Publish Hadoop Metadata to Oracle Catalog
        [Diagram labels: Big Data Appliance + Hadoop, HDFS NameNode, HDFS DataNode, Hive metadata; Exadata + Oracle Database, Oracle Catalog, External Table]

            create table customer_address
            ( ca_customer_id    number(10,0)
            , ca_street_number  char(10)
            , ca_state          char(2)
            , ca_zip            char(10)
            )
            organization external
            ( TYPE ORACLE_HIVE
              DEFAULT DIRECTORY DEFAULT_DIR
              ACCESS PARAMETERS (com.oracle.bigdata.cluster hadoop_cl_1)
              LOCATION ('hive://customer_address')
            )
    27. Publish Hadoop Metadata to Oracle Catalog
        [Diagram labels: Big Data Appliance + Hadoop, HDFS NameNode, HDFS DataNode, Hive metadata; Exadata + Oracle Database, Oracle Catalog, External Table]

            create table customer_address
            ( ca_customer_id    number(10,0)
            , ca_street_number  char(10)
            , ca_state          char(2)
            , ca_zip            char(10)
            )
            organization external
            ( TYPE ORACLE_HIVE
              DEFAULT DIRECTORY DEFAULT_DIR
              ACCESS PARAMETERS (com.oracle.bigdata.cluster hadoop_cl_1)
              LOCATION ('hive://customer_address')
            )

        • SerDe
        • RecordReader
        • InputFormat
        • StorageHandlers!
    28. Executing Queries on Hadoop
        [Diagram labels: Oracle Catalog, External Table, HDFS NameNode, Hive metadata, HDFS DataNode (several)]

            Select c_customer_id
                 , c_customer_last_name
                 , ca_county
            From customers
               , customer_address
            where c_customer_id = ca_customer_id
            and ca_state = 'CA'

        Determine:
        • Data locations
        • Data structure
        • Parallelism
        Send to specific data nodes:
        • Data request
        • Context
        There’s a bottleneck here!
    29. Making SQL Processing Smarter
    30. What Can Big Data Learn from Exadata?
        Minimized data movement
        • Performance
        • Smart Scan
          – Filters data as it streams from disk
        • Storage Indexing
          – Ensures only relevant data is read
        • Caching
          – Frequently accessed data takes less time to read
    31. Executing Queries on Hadoop
        [Diagram labels: Oracle Catalog, External Table, HDFS NameNode, Hive metadata, HDFS DataNode (several), “Tables”]

            Select c_customer_id
                 , c_customer_last_name
                 , ca_county
            From customers
               , customer_address
            where c_customer_id = ca_customer_id
            and ca_state = 'CA'

        Do I/O and Smart Scan:
        • Filter rows
        • Project columns
        Move only relevant data:
        • Relevant rows
        • Relevant columns
        Apply join with database data
        Note: This also works without Hive definitions, as the underlying HDFS access concepts apply…
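The flow on this slide (filter rows and project columns next to the data, ship only what survives, then join with database data) can be sketched in Python. Everything below is illustrative: `smart_scan` and `run_query` are hypothetical names, the rows are invented, and nothing here reflects how the Oracle service is actually implemented.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]

# Database side: c_customer_id -> c_customer_last_name (made-up rows).
CUSTOMERS: Dict[int, str] = {1: "Nguyen", 2: "Smith", 3: "Garcia"}

# Hadoop side: each inner list stands in for the rows held by one data node.
DATA_NODES: List[List[Row]] = [
    [
        {"ca_customer_id": 1, "ca_county": "Alameda", "ca_state": "CA", "ca_zip": "94601"},
        {"ca_customer_id": 2, "ca_county": "Kings", "ca_state": "NY", "ca_zip": "11201"},
    ],
    [
        {"ca_customer_id": 3, "ca_county": "Marin", "ca_state": "CA", "ca_zip": "94901"},
        {"ca_customer_id": 4, "ca_county": "Travis", "ca_state": "TX", "ca_zip": "73301"},
    ],
]


def smart_scan(rows: List[Row], predicate: Callable[[Row], bool], columns: List[str]) -> List[Row]:
    """Runs where the data lives: keep matching rows, then keep only needed columns."""
    return [{name: row[name] for name in columns} for row in rows if predicate(row)]


def run_query() -> Tuple[List[Row], List[Tuple[int, str, str]]]:
    """Mimic the slide's query: ca_state = 'CA', joined to customers on customer id."""
    shipped: List[Row] = []
    for node_rows in DATA_NODES:
        shipped.extend(
            smart_scan(node_rows, lambda r: r["ca_state"] == "CA", ["ca_customer_id", "ca_county"])
        )
    # The join happens on the database side, over the reduced row set only.
    joined = [
        (r["ca_customer_id"], CUSTOMERS[r["ca_customer_id"]], r["ca_county"])
        for r in shipped
        if r["ca_customer_id"] in CUSTOMERS
    ]
    return shipped, joined


shipped_rows, result = run_query()
print(len(shipped_rows), "rows shipped:", result)
```

Of the four stored rows with four columns each, only two rows with two columns each leave the data nodes, which is the "move only relevant data" point of the slide.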
    32. Storage Indexes: Optimizing Scans on Hadoop
        • Automatically collect and store the minimum and maximum value within a storage unit
        • Before scanning a storage unit, verify whether the data required falls within the Min-Max range
        • If not, skip scanning the block and reduce scan time
        [Diagram labels: HDFS NameNode, Hive metadata, HDFS DataNode (several), “Blocks”, Min/Max per block]
        Note: This also works without Hive definitions, simply leverage the SerDe
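The min/max check described on this slide can be sketched as follows. This is a minimal Python illustration, not the product's implementation: the function names are hypothetical, the blocks are tiny in-memory lists, every block is assumed to be non-empty, and `ca_zip` is stored as an integer here purely to keep comparisons simple.

```python
from typing import Dict, List, Tuple

Row = Dict[str, int]

# Three made-up storage units ("blocks") of rows.
BLOCKS: List[List[Row]] = [
    [{"ca_customer_id": 1, "ca_zip": 10001}, {"ca_customer_id": 2, "ca_zip": 10010}],
    [{"ca_customer_id": 3, "ca_zip": 30301}, {"ca_customer_id": 4, "ca_zip": 30310}],
    [{"ca_customer_id": 5, "ca_zip": 94016}, {"ca_customer_id": 6, "ca_zip": 94105}],
]


def build_storage_index(blocks: List[List[Row]], column: str) -> List[Tuple[int, int]]:
    """Record (min, max) of `column` for each block; assumes no block is empty."""
    return [
        (min(row[column] for row in block), max(row[column] for row in block))
        for block in blocks
    ]


def scan_for_value(
    blocks: List[List[Row]], index: List[Tuple[int, int]], column: str, wanted: int
) -> Tuple[List[Row], int]:
    """Return matching rows and the number of blocks that actually had to be read."""
    hits: List[Row] = []
    blocks_read = 0
    for block, (low, high) in zip(blocks, index):
        if wanted < low or wanted > high:
            continue  # the value cannot be in this block, so skip the I/O
        blocks_read += 1
        hits.extend(row for row in block if row[column] == wanted)
    return hits, blocks_read


INDEX = build_storage_index(BLOCKS, "ca_zip")
print(scan_for_value(BLOCKS, INDEX, "ca_zip", 94105))
```

The index can only rule blocks out. A value such as 10005 falls inside the first block's range, so that block is still read even though it holds no matching row.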
    33. What Does This Mean for Me?
    34. What if You Could Query All Data?
        [Diagram labels: Store JSON data unconverted in Hadoop; JSON; Oracle Big Data Appliance; Oracle Database 12c; SQL; Data analyzed via SQL; Store business-critical data in Oracle]

            select customers_document.address.state, revenue
            from customers, sales
            where customers_document.id=sales.custID
            group by customers_document.address.state;

        • Push down to Hadoop
          – JSON parsing
          – Column projection
          – Bloom filter for faster join
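The three push-down steps named on this slide (JSON parsing, column projection, Bloom filter for the join) can be illustrated with a short Python sketch. It rests on several assumptions: the `BloomFilter` class is a textbook construction rather than Oracle's, the documents and sales rows are invented, and the aggregate is written as a sum of revenue per state, which is one reading of the slide's query.

```python
import hashlib
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


class BloomFilter:
    """Compact set summary: may report false positives, never false negatives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # an int used as a bit array

    def _positions(self, key: object) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: object) -> None:
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key: object) -> bool:
        return all((self.bits >> position) & 1 for position in self._positions(key))


# Database side: (custID, revenue) rows. Hadoop side: raw JSON documents.
SALES: List[Tuple[int, float]] = [(1, 100.0), (1, 50.0), (3, 75.0)]
CUSTOMER_DOCS: List[str] = [
    '{"id": 1, "name": "A", "address": {"state": "CA", "city": "Oakland"}}',
    '{"id": 2, "name": "B", "address": {"state": "NY", "city": "Brooklyn"}}',
    '{"id": 3, "name": "C", "address": {"state": "CA", "city": "Novato"}}',
    '{"id": 4, "name": "D", "address": {"state": "TX", "city": "Austin"}}',
]


def hadoop_side_scan(docs: Iterable[str], join_filter: BloomFilter) -> List[Tuple[int, str]]:
    """Parse JSON, project (id, state), and drop ids that cannot join."""
    shipped = []
    for raw in docs:
        doc = json.loads(raw)
        if join_filter.might_contain(doc["id"]):
            shipped.append((doc["id"], doc["address"]["state"]))
    return shipped


def revenue_by_state() -> Tuple[List[Tuple[int, str]], Dict[str, float]]:
    join_filter = BloomFilter()
    for cust_id, _revenue in SALES:
        join_filter.add(cust_id)
    shipped = hadoop_side_scan(CUSTOMER_DOCS, join_filter)
    state_of = dict(shipped)
    totals: Dict[str, float] = defaultdict(float)
    for cust_id, revenue in SALES:
        if cust_id in state_of:  # the exact join removes any false positives
            totals[state_of[cust_id]] += revenue
    return shipped, dict(totals)


print(revenue_by_state())
```

Because a Bloom filter can let a non-matching id through, the database still performs the exact join; the filter only reduces how many rows have to travel.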
    35. What if You Could Govern All Data?
        [Diagram labels: Store JSON data unconverted in Hadoop; JSON; Oracle Big Data Appliance; Oracle Database 12c; SQL; Data analyzed via SQL; Store business-critical data in Oracle]

            DBMS_REDACT.ADD_POLICY(
              object_schema => 'txadp_hive_01',
              object_name   => 'customer_address_ext',
              column_name   => 'ca_street_name',
              policy_name   => 'customer_address_redaction',
              function_type => DBMS_REDACT.RANDOM,
              expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', ''REDACTION_TESTER'')=''TRUE'''
            );

        • Apply advanced security on Hadoop
          – Masking/Redaction
          – Virtual Private Database
          – Fine-grained Access Control
    36. Oracle’s Big Data Management System
        One fast SQL query, on all your data.
        Oracle SQL on Hadoop and beyond
        • With a Smart Scan service as in Exadata
        • With native SQL operators
        • With the security and certainty of Oracle Database
        Happy 40th Birthday SQL
    37. http://www.oracle.com/bigdatabreakthrough | @dan_mcclary
