
Hadoop as Storage for Aging Data


Hadoop as Storage for Aging Data
Simplicity with Big SQL and DB2



  1. 1. © 2015 IBM Corporation DDB-1560: Hadoop as Storage for Aging Data - Simplicity with Big SQL and DB2 Eugen Stoianovici (eugensto@ie.ibm.com) Software Developer, IBM 02:00 pm - 03:00 pm Thursday, October 29th, 2015
  2. 2. • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here. Please Note: 2
  3. 3. • Problem description – data aging • Big SQL – SQL on Hadoop for an online queryable archive • Data movement tools with BigInsights • Big SQL – Tips for maximizing usability Agenda 2
  4. 4. Modern Warehouse – architecture diagram: all data sources feed information ingestion and operational information; a landing area, analytics zone and archive; a real-time analytic zone; and exploration, integrated warehouse, and mart zones. These serve decision management, BI and predictive analytics, and analytic applications (discovery, deep reflection, operational and predictive analysis over raw data, structured data, text, entities, video/audio and network/sensor data), with information governance, security and business continuity spanning all zones.
  5. 5. IBM BigInsights for Hadoop – the same architecture diagram, with IBM BigInsights for Hadoop covering the landing area, analytics zone and archive (information ingestion and operational information).
  6. 6. • Create a large area of storage for cold data • Immediate cost savings, freeing EDW capacity  Placing lowest detailed schemas in BigInsights / Hadoop  Offloading ETL into a cheaper, more elastic MPP environment  Removing cost of HA/DR from data • Data remains online, instead of nearline or on tape  More detailed and more variety of data  Retain longer history of detailed data Queryable Archive 5
  7. 7. • Hadoop is not a single piece of software; you can't install “Hadoop” • It is an ecosystem of software that works together  Hadoop Core (APIs)  HDFS (File system)  MapReduce (Data processing)  Hive (SQL access)  HBase (NoSQL database)  Spark (compute engine)  Sqoop (Data movement)  Oozie (Job workflow)  …. There is a LOT of “Hadoop” software • Open Data Platform Initiative – facilitates adoption of Hadoop What is Hadoop 6
  8. 8. Text Analytics POSIX Distributed Filesystem Multi-workload, multi-tenant scheduling IBM BigInsights Enterprise Management Machine Learning on Big R Big R (R support) IBM Open Platform with Apache Hadoop (HDFS, YARN, MapReduce, Ambari, Flume, HBase, Hive, Kafka, Knox, Oozie, Pig, Slider, Solr, Spark, Sqoop, Zookeeper) IBM BigInsights Data Scientist IBM BigInsights Analyst Big SQL BigSheets Industry standard SQL (Big SQL) Spreadsheet-style tool (BigSheets) Overview of BigInsights Free Quick Start (non production): • IBM Open Platform • BigInsights Analyst, Data Scientist features • Community support . . . IBM BigInsights for Apache Hadoop IBM InfoSphere BigInsights is now IBM BigInsights for Apache Hadoop!
  9. 9. Part II - Hadoop and SQL
  10. 10. • Data in Hadoop is truly “schema-less” • Hadoop is API based at lower levels  Requires strong programming expertise  Steep learning curve  Simple operations can be tedious • In most cases, data is structured  e.g. aging old data • SQL when possible  Already lots of expertise available  Separation of what you want vs. how to get it  Robust ecosystem of tools SQL on Hadoop 9
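For illustration, a minimal sketch of the kind of question that would otherwise require a custom MapReduce job; the sales_archive table and its columns are hypothetical:
     -- Total archived sales value per year over structured, aged data
     SELECT YEAR(sale_date) AS sale_year,
            SUM(quantity * cost) AS total_value
     FROM sales_archive
     GROUP BY YEAR(sale_date)
     ORDER BY sale_year;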
  11. 11. • Part of the value-add packages for BigInsights • IBM’s SQL for Hadoop  Opens Hadoop data to a wider audience • Complements the Data Warehouse  Exploratory analytics  Sandbox  Data Lake • Use familiar SQL tools  Cognos  MicroStrategy  DSM What is Big SQL? Real-time Analytics InfoSphere Streams Data Governance and Security Data Click, LDAP and Secured Cluster Data Exploration BigSheets “schema-on-read” tooling Predictive Modeling Big R (“scalable data mining” on R) Text Analytics Text processing with AQL Application Tooling Toolkits and accelerators
  12. 12. • Key advantages of Big SQL  Performance  SQL “Richness”  Enterprise features & security • How we’re doing it  Design principles  Architecture Big SQL - topics 11
  13. 13. Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS* 12 *Not an official TPC-DS Benchmark.
  14. 14. • In 99 / 99 Queries, Big SQL was faster • On Average, Big SQL was 21X faster • Excluding the Top 5 and Bottom 5 results, Big SQL was 19X faster Big SQL V4.1 vs Hive 1.2.1 @ 1TB TPC-DS* 13 *Not an official TPC-DS Benchmark.
  15. 15. 14 Big SQL runs more SQL out-of-box. Porting effort: Big SQL 4.1: 1 hour vs. Spark SQL 1.5.0: 3-4 weeks; Big SQL can execute all 99 queries with minimal porting effort. Single stream results: Big SQL was faster than Spark SQL on 76 / 99 queries; when Big SQL was slower, it was only slower by 1.6X on average; query vs. query, Big SQL is on average 5.5X faster; removing the top 5 / bottom 5, Big SQL is 2.5X faster.
  16. 16. • Leverage IBM’s rich SQL history and expertise  Modern SQL:2011 capabilities  DB2-compatible SQL PL support • SQL-bodied functions • SQL-bodied stored procedures • Robust error handling • Application logic/security encapsulation • Flow of control logic • Same SQL syntax used in the EDW works on Big SQL • Grouping sets with CUBE and ROLLUP • Windowing and OLAP aggregates Big SQL - SQL Capabilities 15
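A small sketch of the grouping-set and OLAP syntax listed above, assuming a hypothetical sales table with region, part_name, quantity, cost and sale_date columns:
     -- Subtotals per region plus a grand total in one pass
     SELECT region, part_name, SUM(quantity * cost) AS revenue
     FROM sales
     GROUP BY ROLLUP (region, part_name);
     -- Running total per part using an OLAP window aggregate
     SELECT part_name, sale_date, cost,
            SUM(cost) OVER (PARTITION BY part_name ORDER BY sale_date) AS running_cost
     FROM sales;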
  18. 18. Hadoop-DS: Performance Test update: Big SQL V4.1 vs. Spark SQL 1.5.1 @ 1 TB, single stream* 17 *Not an official TPC-DS Benchmark.
  19. 19. But, … what happens when you scale it? (more data, more users)
     1 TB, single stream: Big SQL was faster on 76 / 99 queries; Big SQL averaged 5.5X faster; removing top / bottom 5, Big SQL averaged 2.5X faster.
     1 TB, 4 concurrent streams: Spark SQL FAILED on 3 queries; Big SQL was 4.4X faster*.
     10 TB, single stream: Big SQL was faster on 80 / 99 queries; Spark SQL FAILED on 7 queries; Big SQL averaged 6.2X faster*; removing top / bottom 5, Big SQL averaged 4.6X faster.
     10 TB, 4 concurrent streams: Big SQL elapsed time for the workload was better than linear; Spark SQL could not complete the workload (numerous issues) – partial results possible with only 2 concurrent streams.
     *Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)
  20. 20. What is the verdict? Use the right tool for the right job.
     Big SQL – ideal tool for BI, data analysts and production workloads: migrating existing workloads to Hadoop, security, many concurrent users, best-in-class performance.
     Spark SQL – ideal tool for data scientists and discovery: machine learning, simpler SQL, good performance.
  21. 21. • Modern security model  GRANT/REVOKE support  GROUP permissions  ROW level permissions  COLUMN level permissions • Data auditing tools  e.g. IBM InfoSphere Guardium Big SQL enterprise grade features 20
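A hedged sketch of what this can look like, assuming Big SQL accepts DB2-style row access control DDL; the table, group and permission names below are illustrative:
     -- Object privileges
     GRANT SELECT ON bigsql.sales TO GROUP analysts;
     REVOKE INSERT ON bigsql.sales FROM USER batch_loader;
     -- Row permission: rows are visible only to members of the US_ANALYSTS group
     CREATE PERMISSION sales_region_filter ON bigsql.sales
        FOR ROWS WHERE VERIFY_GROUP_FOR_USER(SESSION_USER, 'US_ANALYSTS') = 1
        ENFORCED FOR ALL ACCESS ENABLE;
     ALTER TABLE bigsql.sales ACTIVATE ROW ACCESS CONTROL;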
  22. 22. Big SQL Architecture 21 – diagram: a management node hosts the Big SQL master node, Big SQL scheduler, DDL and UDF FMPs, plus the database service, Hive metastore and Hive server; each compute node runs a Big SQL worker node (Java I/O FMP, native I/O FMP, UDF FMP, temp data) alongside the HDFS DataNode, MapReduce TaskTracker and other services, with HDFS data stored locally. *FMP = fenced mode process
  23. 23. • Head node HA Support • Easy to extend  Add data nodes – more storage  Worker nodes – processing power • Fault tolerance  Only use online workers for computations MPP – Flexible architecture 22
  24. 24. Big SQL – data 23
     CREATE HADOOP TABLE bigsql.sales (
        part_id INTEGER, part_name VARCHAR(32), quantity INTEGER, cost DECIMAL(10,4)
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     LOCATION '/user/bigsql/sales';
     Sample file contents:
     12321,apple,10,10.20
     32881,banana,5,11.91
     43321,salt,1,0.90
     44552,sugar,1,0.80
     Big SQL does not actually “store” the data – it interprets the files in place.
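A quick illustration of schema-on-read once the table is defined; the query below is a sketch over the sample files above:
     -- The SELECT is evaluated directly over the delimited files in /user/bigsql/sales
     SELECT part_name, SUM(quantity * cost) AS total_cost
     FROM bigsql.sales
     GROUP BY part_name
     ORDER BY total_cost DESC;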
  25. 25. • Support for all major file formats  ORC  Parquet  Avro  Sequence  etc. • Does not lock users in  Files can be used with external tools (Spark, MR, etc) Big SQL – open storage 24
  26. 26. Part III – Tools to offload data to Online Archive
  27. 27. • Moves data from EDWs and external stores into HDFS • Bulk import  Import individual tables or entire databases • SQL import  Import result of SQL execution • Data interaction  Support for interacting with data through Java classes • Data export  Offload data from HDFS into relational databases Simple with BigInsights - Sqoop 26 (diagram: Sqoop moves data between the EDW and Hadoop HDFS)
  28. 28. • SQL statement executable in Big SQL • Extracts data from external sources  RDBMS  EDW  HDFS • Loads data into Big SQL tables  Understands Big SQL tables (e.g. columns, partitioning, etc)  Understands Big SQL table properties (e.g. delimiters, etc) Simple with Big SQL - LOAD 27
     CREATE HADOOP TABLE staff (
        id INT, name VARCHAR(64), job VARCHAR(10)
     ) PARTITIONED BY (years INT);
     LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://edw:5000/SAMPLE'
        WITH PARAMETERS ('credential.file' = '/user/bigsql/mycreds')
        FROM TABLE STAFF COLUMNS (ID, NAME, JOB, YEARS)
        WHERE 'YEARS is not null'
        WITH TARGET TABLE PROPERTIES ('delimited.drop.loadedlims' = 'true')
        WITH LOAD PROPERTIES ('max.rejected.records' = 50, 'num.map.tasks' = 2);
  29. 29. • Transparent  acts as a single source for all data sources • Extensible  Bring heterogeneous sources together • High functionality  use the same rich SQL against all sources • High performance  understands capabilities of sources and pushes down computation Simple with Big SQL – Federation 28
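A hedged sketch of how a federated source might be defined, assuming Big SQL uses DB2's federation DDL (wrapper, server, nickname); all names and credentials are illustrative and the remote database is assumed to be catalogued locally:
     -- Register a remote DB2 EDW and expose one of its tables as a nickname
     CREATE WRAPPER DRDA;
     CREATE SERVER edw_server TYPE DB2/UDB VERSION 10.5 WRAPPER DRDA
        AUTHORIZATION "bigsql" PASSWORD "********"
        OPTIONS (DBNAME 'SAMPLE');
     CREATE USER MAPPING FOR USER SERVER edw_server
        OPTIONS (REMOTE_AUTHID 'bigsql', REMOTE_PASSWORD '********');
     CREATE NICKNAME edw_staff FOR edw_server.DB2INST1.STAFF;
     -- The nickname can now be queried (and joined with local Hadoop tables) in ordinary SQL
     SELECT * FROM edw_staff WHERE years IS NOT NULL;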
  30. 30. Part IV – Tips for better SQL usability of your Online Archive
  31. 31. • SQL is standard but internals are not • Some things require more attention  Point queries  Normalization  DML (UPDATE/DELETE) Tips for better performance 30
  32. 32. • Table partitioning for data elimination • Special file formats  ORC  Parquet • HBase with secondary indexes Point query optimization 31
  33. 33. • Partitioning is defined on one or more columns • Each unique value becomes a partition • Query predicates are used to eliminate partitions Big SQL table partitioning 32
     CREATE TABLE demo.sales (
        part_id int, part_name string, qty int, cost double
     )
     PARTITIONED BY (state char(2))
     ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
     Resulting layout on HDFS: biginsights/hive/warehouse/demo.db/sales/ with one subdirectory per partition (state=NJ, state=AR, state=CA, state=NY), each holding its data files (data1.csv, data2.csv, …).
     select * from demo.sales where state in ('NJ', 'CA');
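As a sketch of how partitions can be filled when data is offloaded, assuming a hypothetical staging table and archive path, and assuming Big SQL accepts the Hive-style ADD PARTITION clause:
     -- Dynamic partitioning: the partitioning column (state) is the last column selected
     INSERT INTO demo.sales
     SELECT part_id, part_name, qty, cost, state
     FROM demo.sales_staging;
     -- Attach files that already sit in HDFS as a new partition
     ALTER TABLE demo.sales
        ADD PARTITION (state = 'TX') LOCATION '/archive/sales/state=TX';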
  34. 34. • What they are:  Columnar file formats  Store statistics • Pros  Efficient compression  Predicates are pushed down and evaluated at file level!  We skip files for which predicates don’t match • Cons  Not easy to integrate outside of Hadoop Big SQL special file formats (ORC, Parquet) 33
     CREATE HADOOP TABLE orc_table (col1 int, col2 string, col3 bigint) STORED AS ORC;
     CREATE HADOOP TABLE parquet_table (col1 int, col2 string, col3 bigint) STORED AS PARQUETFILE;
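One possible way, shown as a sketch, to convert an existing delimited archive table into a columnar format so that predicate push-down applies; demo.sales_text is a hypothetical source table:
     CREATE HADOOP TABLE demo.sales_parquet (
        part_id INT, part_name VARCHAR(32), qty INT, cost DOUBLE
     ) STORED AS PARQUETFILE;
     -- Rewrite the delimited data as Parquet files
     INSERT INTO demo.sales_parquet
     SELECT part_id, part_name, qty, cost
     FROM demo.sales_text;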
  35. 35. • Clustered Environment  Master and Region Servers  Automatic split (sharding) • Key-value store  Key and value are byte arrays  Efficient access using row key • HBase supports secondary indexes! HBase 34
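A minimal sketch of an HBase-backed table with a secondary index, assuming Big SQL's CREATE HBASE TABLE column-mapping form; the table, column family and index names are illustrative:
     CREATE HBASE TABLE demo.orders (
        order_id VARCHAR(16) NOT NULL,
        customer VARCHAR(64),
        total DECIMAL(10,2)
     )
     COLUMN MAPPING (
        KEY MAPPED BY (order_id),
        cf:customer MAPPED BY (customer),
        cf:total MAPPED BY (total)
     );
     -- Secondary index on a non-key column to support point queries on customer
     CREATE INDEX orders_customer_ix ON demo.orders (customer);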
  36. 36. • [De]normalization – complex types 35
     CREATE HADOOP TABLE people (
        id INT, name VARCHAR(64), phoneNo ARRAY<VARCHAR(32)>
     )
     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
     COLLECTION ITEMS TERMINATED BY '|';
     SELECT p.id, p.name, tel.no
     FROM people p UNNEST (p.phoneNo) tel(no);
     Result:
     ID  NAME                Number
     1   Eugen Stoianovici   +353 8733 -----
     2   Eugen Stoianovici   +353 8700 -----
     3   John Smith          +353 8732 -----
     …   …                   …
  37. 37. • Fault tolerance to data (errors in semi-structured data do not cause queries to fail) • Fault tolerance to hardware  Hardware maintenance performed by threshold  Hardware failure does not impact system • Elastic clusters  Add more nodes when more capacity is required  Reassign nodes when batch job is done Big SQL – elastic and fault tolerant 36
  38. 38. Notices and Disclaimers 37 Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM. Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided. Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice. Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary. References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business. Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation. It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will ensure that the customer is in compliance with any law.
  39. 39. Notices and Disclaimers (con’t) 38 Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right. • IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.
  40. 40. Questions? 39
  41. 41. • BigInsights – Hadoop open platform integration in the warehouse • Big SQL architecture and compatibility with enterprise data warehouses • Data movement tools • Tips for Big SQL performance Wrap up 40
  42. 42. 41 Get started with Big SQL: External resources  Hadoop Dev: links to videos, white paper, lab, . . . . https://developer.ibm.com/hadoop/
  43. 43. We Value Your Feedback! Don’t forget to submit your Insight session and speaker feedback! Your feedback is very important to us – we use it to continually improve the conference. Access your surveys at insight2015survey.com to quickly submit your surveys from your smartphone, laptop or conference kiosk. 42
  44. 44. © 2015 IBM Corporation Thank You
