© 2014 IBM Corporation 
Big SQL 3.0 
Fast and easy SQL on Hadoop 
Wilfried Hoge 
IT Architect Big Data hoge@de.ibm.com @wilfriedhoge 
z/OS and LUW
Hadoop Observations 
Technology
• Rapid innovation
• Two sources of innovation
– Open source community
– Integration of existing technologies
• Tools and application vendors selecting partners and integrating
Customers
• High degree of interest
• Many experimental workstreams
• ROI establishment varies by use case
• Many customers want to offload data from EDW
Vendors
• Multiple business models
• OSS support vendors have mindshare lead
• OSS support vendors' business model viability unclear
• SW portfolio vendors integrating/adding
© 2014 International Business Machines Corporation 2
InfoSphere BigInsights 
provides Enterprise Grade Hadoop analytics 
• Manages a wide variety and huge volume 
of data 
• Augments open source Hadoop with 
enterprise capabilities 
– Visualization & Exploration 
– Development tools 
– Advanced Engines 
– Connectors 
– Workload Optimization 
– Enterprise integration 
– Analytic Accelerators 
– Application and industry accelerators 
– Administration & Security 
Figure: Big Data Platform. Labels: Systems Management, Application Development, Discovery, Accelerators; Hadoop System, Stream Computing, Data Warehouse; Information Integration & Governance. Data types: Data, Media, Content, Machine, Social.
Key Differentiators for BigInsights 
Enterprise Performance & Integration
• Workload / performance optimization
• GPFS
• Security
• Key integrations & connectors with enterprise ecosystem
Analytics
• Text analytics
• Social Data Analytics Accelerators
• Machine Data Analytics Accelerators
• Execute R in an integrated application
Usability & Productivity
• Big SQL
• BigSheets
• Development Tools
• Web Console
Integrated Web Console 
• Manage BigInsights 
– Inspect / monitor system health
– Add / drop nodes 
– Start / stop services 
– Run / monitor jobs (applications) 
– Explore / modify file system 
– Create custom dashboards 
• Launch applications 
– Spreadsheet-like analysis tool 
– Pre-built applications (IBM supplied or 
user developed) 
• Publish applications 
• Monitor cluster, applications, data 
– Create / view event alerts. 
Distributed Filesystem 
Figure: software stack, top to bottom: Applications; High level languages (SQL, Jaql, Pig, …); Map/Reduce API; Hadoop DFS API; GPFS or HDFS.
Distributed filesystem GPFS FPO 
gives additional flexibility, security 
and high availability 
• Optional file system alternative to HDFS
• More than 10 years of experience with HPC
• Key features
– No single point of failure
– Built-in High Availability
– POSIX compliance
   • Standard applications cannot use HDFS but they can use GPFS-FPO
– Enhanced Security
– Higher performance
   • Allows concurrent read and write by multiple programs
– Recovery capabilities
   • Journaling filesystem
– Support for Storage Pools
– Snapshot capability
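The POSIX compliance point above can be illustrated with a minimal sketch. It assumes that a GPFS-FPO file system is mounted like any local directory, so standard file calls work without a Hadoop client library. The function name `append_and_read` and the file name are hypothetical, and a temporary directory stands in for the mount point.

```python
import os
import tempfile
from typing import List


def append_and_read(directory: str, name: str, lines: List[str]) -> List[str]:
    """Append lines to a file with ordinary file calls, then read all lines back."""
    path = os.path.join(directory, name)
    # Plain open() in append mode: no Hadoop-specific API is involved.
    with open(path, "a", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    with open(path, "r", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for a GPFS-FPO mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        append_and_read(mount_point, "events.log", ["first"])
        print(append_and_read(mount_point, "events.log", ["second"]))
```

An unmodified application written against the local file system would follow the same pattern, which is the sense in which standard applications can use GPFS-FPO.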
BigInsights has a simple but 
effective security system based 
on a gateway to Hadoop 
Users Sources 
• All Hadoop servers are connected over a 
private network 
• Unrestricted communication between cluster 
servers on the private network 
• BigInsights Web Console acts as a 
gateway into the cluster 
• Authentication through PAM or LDAP 
• Role based authorization 
• Authorization will be enforced at 3 levels: 
– UI level 
– Data level 
– Map-Reduce level 
• Authorization also respected by services (e.g. SQL) 
• Kerberos support 
Figure: security architecture. Labels: Authentication Authority (external), Gateway / Web Console, Services, Data Nodes, Infrastructure Nodes, Distributed Filesystem.
BigSheets to analyze and visualize 
• Model “big data” collected 
from various sources in 
spreadsheet-like structures 
• Filter and enrich content with 
built-in functions 
• Combine data in different 
workbooks 
• Visualize results through 
spreadsheets, charts 
• Export data into common 
formats (if desired) 
No programming knowledge needed! 
Centralized dashboard & data flows 
A centralized dashboard to 
visualize analytic results: 
• BigSheets collections 
• Analytic application results 
• Monitoring metrics 
• Ability to view BigSheets data flows between 
and across data sets to quickly navigate and 
relate analysis and charts 
• Visualize inner and outer joins, enhanced filters
for BigSheets columns, column data-type
mapping for collections, and application of
analytics to BigSheets columns, etc.
Tools for Developers 
1. Sample your data
2. Develop your application using BigInsights tools
3. Test your application
4. Package and publish your application
5. Deploy your application on the cluster
Editors
• A workflow editor that greatly simplifies the
creation of complex Oozie workflows with a
consumable interface
• A Pig/Jaql Editor with content assist and syntax
highlighting that enables users to create and
execute new applications using Pig or Jaql in
local or cluster mode from the Eclipse IDE
Application development & deployment
• Enablement of BigSheets macro
and BigSheets reader development
• Text Analytics development,
including support for modular
rule sets
• Publish new application: BigSheets
Macro, BigSheets Reader, AQL
module, Jaql module
Running Applications on Big Data 
• Browse available applications 
• Deploy published applications 
(administrators only) 
• Launch (or schedule for launch) a 
deployed application 
• Monitor job (application) execution 
status 
• Predefined applications 
• Import & Export Data 
• Database & Files 
• Web and Social 
• Analyze and Query 
• Predictive Analytics 
• Text Analytics 
• SQL/Hive, Jaql, Pig, HBase
• Accelerators 
Application linking and interfaces to build new apps 
• Compose new 
applications from 
existing applications 
and BigSheets 
• Invoke analytics 
applications from the 
web console, including 
integration within 
BigSheets 
• REST data source App 
that enables users to 
load data from any data source supporting REST APIs into BigInsights, 
including popular social media services 
• Sampling App that enables users to sample data for analysis 
• Subsetting App that enables users to subset data for data analysis 
Collaborative Big Data for many roles 
• Business Users can get their hands on big 
data and use big data applications and 
BigSheets to get insights into their data 
• Data scientists can perform deeper
analysis and get richer insights
• Administrators are empowered to be
more agile through better controls and
views into key performance indicators
• Developers can leverage unified tooling in a Big Data
Application Development Lifecycle and are able to
create and deploy new types of applications, with
enhancements that simplify even complex workflows
Big SQL 3.0 – Architected for Performance 
• Leverage IBM's rich SQL heritage, expertise, and technology 
– Modern SQL:2011 capabilities 
– DB2 compatible SQL PL support 
• SQL bodied functions and stored procedures 
• Application logic/security encapsulation 
• Architected from the ground up for performance 
– low latency and high throughput 
• MapReduce replaced with a modern MPP 
architecture 
– Compiler and runtime are native code (not Java)
– Big SQL worker daemons live directly on the cluster
– Continuously running (no startup latency) 
– Processing happens locally at the data 
• Operations occur in memory with the ability 
to spill to disk 
– Supports aggregations and sorts larger than available RAM 
• Integration with BigSheets (source & target) 
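The "spill to disk" bullet above describes sorting data larger than available RAM. A common textbook technique for that is an external merge sort, sketched below in Python. This is an illustration of the general idea only, not Big SQL's native implementation; the names `external_sort` and `memory_limit` are hypothetical.

```python
import heapq
import tempfile
from typing import Iterable, Iterator, List


def _spill_run(sorted_rows: List[int]):
    """Write one sorted run to a temporary file and rewind it for reading."""
    run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    run.writelines(f"{row}\n" for row in sorted_rows)
    run.seek(0)
    return run


def external_sort(rows: Iterable[int], memory_limit: int) -> Iterator[int]:
    """Sort integers while buffering at most `memory_limit` rows before spilling."""
    runs = []
    buffer: List[int] = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= memory_limit:
            # Buffer is full: sort it in memory and spill it to disk as one run.
            runs.append(_spill_run(sorted(buffer)))
            buffer = []
    if buffer:
        runs.append(_spill_run(sorted(buffer)))
    # Each run is read back lazily; heapq.merge keeps one row per run in memory.
    streams = [(int(line) for line in run) for run in runs]
    try:
        yield from heapq.merge(*streams)
    finally:
        for run in runs:
            run.close()
```

For example, `list(external_sort([5, 3, 9, 1, 7, 2, 8], memory_limit=3))` sorts seven values while never buffering more than three of them before spilling.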
Figure: a SQL-based application connects through the IBM Data Server Client to Big SQL, whose SQL MPP runtime reads the data sources (Parquet, CSV, Seq, RC, Avro, ORC, JSON, Custom) inside InfoSphere BigInsights.
Big SQL 3.0 – Architecture cont. 
• Head (coordinator / management) node 
– Listens to the JDBC/ODBC connections and compiles / optimizes the query 
– Coordinates the execution of the query 
– Optionally stores user data in a traditional RDBMS table (single node only)
• Big SQL worker processes reside on compute nodes (some or all) 
• Worker nodes stream data between each other as needed 
• Workers can spill large data sets to local disk if needed 
– Allows Big SQL to work with data sets 
larger than available memory 
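The division of labor between the head node and the workers can be sketched as a scatter-gather aggregation: each worker aggregates the rows stored on its own node, and the head node merges the partial results. The sketch below simulates this in a single Python process. The function names and the row layout are hypothetical, and the real Big SQL runtime is native code with its own execution plans.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (group key, value), a hypothetical row layout


def worker_partial_sum(local_rows: Iterable[Row]) -> Counter:
    """Runs on a worker: aggregate only the rows stored on that node."""
    partial: Counter = Counter()
    for key, value in local_rows:
        partial[key] += value
    return partial


def head_node_sum(partitions: List[List[Row]]) -> Dict[str, int]:
    """Runs on the head node: merge the partial results returned by the workers."""
    total: Counter = Counter()
    # Simulated sequentially here; on a cluster the workers would run in parallel.
    for local_rows in partitions:
        total.update(worker_partial_sum(local_rows))
    return dict(total)
```

Only the small per-group partial sums travel to the head node, which reflects the "processing happens locally at the data" idea.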
Figure: cluster layout. Management nodes host Big SQL, the Hive Metastore, the Name Node, and the Job Tracker. Each compute node hosts a Task Tracker, a Data Node, and a Big SQL worker. All nodes sit on GPFS/HDFS.
Big SQL 3.0 – Features 
Application Portability & Integration
• Data shared with Hadoop ecosystem
• Comprehensive file format support
• Superior enablement of IBM software
• Enhanced by third-party software
Performance
• Modern MPP runtime
• Powerful SQL query rewriter
• Cost-based optimizer
• Optimized for concurrent user throughput
• Results not constrained by memory
Rich SQL
• Comprehensive SQL support
• IBM SQL PL compatibility
Federation
• Distributed requests to multiple data sources within a single SQL statement
• Main data sources supported: DB2 LUW, DB2/z, Teradata, Oracle, Netezza
Enterprise Features
• Advanced security/auditing
• Resource and workload management
• Self-tuning memory management
• Comprehensive monitoring
Big SQL Demo
Comparing Big SQL 3.0 and Hive 0.12 for Ad-Hoc Queries 
Big SQL 3.0 (Parquet) vs. Hive 0.12 (ORC), 1TB Classic BI Workload
Big SQL is up to 41x faster than Hive 0.12
Figure: bar chart of elapsed time (sec) by query number (1 to 22) for Hive 0.12 and Big SQL 3.0; the vertical axis runs from 0 to 3500.
*Based on IBM internal tests comparing IBM InfoSphere BigInsights 3.0 Big SQL with Hive 0.12 executing the "1TB Classic BI Workload"
in a controlled laboratory environment. The 1TB Classic BI Workload is a workload derived from the TPC-H Benchmark Standard,
running at 1TB scale factor. It is materially equivalent with the exception that no update functions are performed. TPC Benchmark and
TPC-H are trademarks of the Transaction Processing Performance Council (TPC). Configuration: Cluster of 9 System x3650HD servers,
each with 64GB RAM and 9x2TB HDDs running Red Hat Linux 6.3. Results may not be typical and will vary based on actual workload,
configuration, applications, queries and other variables in a production environment. Results as of April 22, 2014.
IBM BigInsights brings efficient integration of R with Big R 
• R as a big data query language 
– Outside-in execution 
• R as a statistical language for 
deep computing 
– Inside-out execution 
– Partitioning of large data (“divide”) 
– Parallel cluster execution of pushed 
down R code (“conquer”) 
– Almost any R package can run in 
this environment 
• R as the gateway to scalable 
machine learning 
– A scalable ML engine that provides 
canned algorithms, and an ability to 
author new ones, all via R 
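The "divide" and "conquer" steps above can be sketched as follows. This is a Python analogue of the pattern under stated assumptions, not the Big R API: the function names `divide` and `conquer` are hypothetical, and a thread pool stands in for parallel execution on cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple


def divide(rows: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Partition (key, value) rows into one list of values per key."""
    groups: Dict[str, List[float]] = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return groups


def conquer(
    groups: Dict[str, List[float]],
    func: Callable[[List[float]], float],
    max_workers: int = 4,
) -> Dict[str, float]:
    """Apply the pushed-down function to every partition in parallel."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with the keys.
        results = list(pool.map(func, groups.values()))
    return dict(zip(groups.keys(), results))
```

A call such as `conquer(divide(rows), mean)` computes one mean per group; any function that takes a list of values and returns a number could be pushed down the same way.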
Figure: R clients with R packages either pull data (summaries) to the R client or push R functions right on the data. On the cluster side, Embedded R Execution (with R packages) and the Scalable ML Engine sit on top of the data sources.
Text Analytics in BigInsights 
Distill structured information from 
unstructured data 
– Rich annotator library supports multiple 
languages 
– Declarative Information Extraction (IE) system 
based on an algebraic framework 
– Richer, cleaner rule semantics 
– Better performance through optimization 
How it works 
• Parses text and detects meaning with annotators 
• Understands the context in which the text is 
analyzed 
• Hundreds of pre-built annotators for names, 
addresses, phone numbers, among others
Accuracy 
• Highly accurate in deriving meaning from 
complex text 
Performance 
• AQL language optimized for MapReduce 
Unstructured text (document, email, etc) 
Football World Cup 2010, one team 
distinguished themselves well, losing to 
the eventual champions 1-0 in the Final. 
Early in the second half, Netherlands’ 
striker, Arjen Robben, had a breakaway, 
but the keeper for Spain, Iker Casillas 
made the save. Winger Andres Iniesta 
scored for Spain for the win. 
Classification and Insight 
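The idea of an annotator that turns the sample text above into structured records can be sketched with a few regular expressions. This is a simplified Python stand-in, not AQL and not the BigInsights annotator library. The rule names and the hard-coded dictionaries are hypothetical and cover only the names in the sample text.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    kind: str   # annotator that produced the match
    text: str   # matched span
    start: int  # character offset in the input


# Hypothetical rules: small dictionaries and a pattern for match scores.
RULES = {
    "Score": re.compile(r"\b\d+-\d+\b"),
    "Team": re.compile(r"\b(?:Netherlands|Spain)\b"),
    "Player": re.compile(r"\b(?:Arjen Robben|Iker Casillas|Andres Iniesta)\b"),
}


def annotate(text: str) -> List[Annotation]:
    """Run every rule over the text and return the matches in document order."""
    found = [
        Annotation(kind, match.group(), match.start())
        for kind, pattern in RULES.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda annotation: annotation.start)
```

Run over the sentence "Winger Andres Iniesta scored for Spain for the win.", `annotate` returns one Player annotation followed by one Team annotation. A production annotator would rely on much larger dictionaries and on context rules rather than on fixed name lists.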
BigInsights offers value beyond Open Source 
Enterprise Capabilities 
Visualization & Exploration 
Development Tools 
Advanced Engines 
Connectors 
Workload Optimization 
Administration & Security 
Key differentiators 
• Built-in analytics 
• Enterprise software integration 
• Spreadsheet-style analysis 
• Integrated installation of supported open
source and other components
• Web Console for admin and application
access
• Platform enrichment: additional security,
performance features, . . .
• World-class support
• Full open source compatibility
Figure labels: Open source components; IBM-certified Apache Hadoop
Business benefits 
• Quicker time-to-value due to IBM 
technology and support 
• Reduced operational risk 
• Enhanced business knowledge with flexible 
analytical platform 
• Leverages and complements existing 
software 
InfoSphere BigInsights for Hadoop includes the latest Open 
Source components, enhanced by enterprise components 
Figure: component stack of IBM InfoSphere BigInsights for Hadoop; a legend marks each component as Open Source or IBM.
Layers: Visualization & Ad Hoc Analytics; Applications & Development; Advanced Analytics; Data Access; Runtime; Data Store; File System; Resource Management & Administration; Security; Governance; Stream Computing; Enterprise Search; ETL.
Components: BigSheets, Charting, Dashboard, R, Big R, Text Analytics, Jaql, Hive, Pig, Big SQL, Eclipse Tooling (MapReduce, Hive, Jaql, Pig, Big SQL, AQL), BigSheets Reader and Macro, Text Analytics Extractors, Flume, Sqoop, HCatalog, MapReduce, Adaptive MapReduce, YARN*, Flexible Scheduler, HBase, HDFS, GPFS FPO, Streams, Solr/Lucene, Search, Oozie, ZooKeeper, Console Monitoring, Audit & History, Kerberos, LDAP, Data Security for Hadoop, Data Privacy for Hadoop, Data Masking, Data Matching.
* In Beta
From Getting Started to Enterprise Deployment:
Different BigInsights Editions For Varying Needs 
Editions: Quick Start, Standard Edition, Enterprise Edition (built on Apache Hadoop)
Chart axes: Breadth of capabilities; Enterprise class
Quick Start
- Free. Non-production
- Same features as Standard Edition plus text analytics and Big R
Capabilities across editions
- Spreadsheet-style tool
- Web console
- Dashboards
- Pre-built applications
- Eclipse tooling
- RDBMS connectivity
- Big SQL
- Monitoring and alerts
- Platform enhancements
- Accelerators
- GPFS – FPO
- Adaptive MapReduce
- Text analytics
- Enterprise Integration
- Big R
- InfoSphere Streams*
- Watson Explorer*
- Cognos BI*
- Data Click*
- . . .
* Limited use license
