GAIN BETTER INSIGHT FROM BIG DATA
USING JBOSS DATA VIRTUALIZATION
Syed Rasheed
Solution Manager
Red Hat Corp.
Kenny Peeples
Technical Manager
Red Hat Corp.
Kimberly Palko
Product Manager
Red Hat Corp.
AGENDA
Demystifying Big Data
Data Virtualization: Making Big Data Available to Everyone
Red Hat Big Data Strategy and Platform
Real World Customer Example using Red Hat Big Data Platform
Demo
Roadmap
Q&A
DO WE AGREE ON WHAT BIG DATA IS?
Source: http://blogs.ifsworld.com/2013/02/how-will-big-data-influence-your-finance-team/
IT’S ALL ABOUT GAINING BUSINESS INSIGHTS
Improve product development
Optimize business processes
Improve customer care
Improve customer lifetime value
Personalize products
Competitive intelligence
…
INFORMATION AND AGILITY GAP
Over 70% of BI project effort lies in
data integration: finding and
identifying source data
Only 28% of users have any meaningful data
access
DATA CHALLENGES GETTING BIGGER FOR USERS
NoSQL
Hive
MapReduce
HDFS
Pig
Jaql
Flume
Storm
HBase
RED HAT’S BIG DATA STRATEGY
Reduce the information gap by cost-effectively making ALL
data easily consumable for analytics
Data
Analytics
Data to Actionable Information Cycle
BIG DATA FOR EVERYONE
EASY ACCESS TO BIG DATA
BI Reports & Analytics
Hive
MapReduce
HDFS
Analytical Reporting Tool
Data Virtualization Server
Hadoop
Big Data
1. Reporting tool accesses the data
virtualization server via rich SQL
dialect
2. The data virtualization server
translates rich SQL dialect to HiveQL
3. Hive translates SQL to MapReduce
4. MapReduce runs MR job on big data
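The last two steps above can be illustrated with a toy example. The following Python sketch is not Hive or Hadoop code; it only mimics, in memory, how an aggregate query such as `SELECT region, SUM(amount) FROM sales GROUP BY region` could break down into map, shuffle and reduce phases. The `sales` rows and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical input rows standing in for a table stored in HDFS.
sales = [
    {"region": "US", "amount": 120},
    {"region": "EU", "amount": 80},
    {"region": "US", "amount": 30},
]


def map_phase(rows):
    """Emit one (key, value) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Collect all values that share a key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Apply the aggregate (SUM) to each group of values."""
    return {key: sum(values) for key, values in grouped.items()}


totals = reduce_phase(shuffle_phase(map_phase(sales)))
print(totals)
```

In a real cluster the map and reduce phases run in parallel on many nodes; the reporting tool never sees this and only issues SQL against the data virtualization server.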
TURN FRAGMENTED DATA INTO ACTIONABLE INFORMATION
Connect
Compose
Consume
BI Reports & Analytics
Mobile Applications
SOA Applications & Portals
ESB, ETL
Native Data Connectivity
Standards-based Data Provisioning
JDBC, ODBC, REST, SOAP, OData
Design Tools
Dashboard
Optimization
Caching
Security
Metadata
Hadoop
NoSQL
Cloud Apps
Data Warehouse & Databases
Mainframe
XML, CSV & Excel Files
Enterprise Apps
Siloed & Complex
Virtualize
Transform
Federate
Easy, Real-time Information Access
Unified Virtual Database / Common Data Model
Data Transformations
BENEFITS OF DATA VIRTUALIZATION ON BIG DATA
Enterprise democratization of big data
Any reporting or analytical tool can be used
Easy access to big data
Seamless integration of big data and existing data assets
Sharing of integration specifications
Collaborative development on big data
Fine-grained security of big data
Faster time-to-market for reports on big data
CONVERGENCE OF FOUR DATA TRENDS
COMPREHENSIVE MIDDLEWARE PLATFORM
CAPTURE, PROCESS AND INTEGRATE BIG DATA VOLUME, VELOCITY, VARIETY
Hadoop
Data Integration: JBoss Data Virtualization
In-memory Cache: JBoss Data Grid
BI Analytics (historical, operational, predictive)
SOA Composite Applications
Messaging and Event Processing: JBoss A-MQ and JBoss BRMS
Structured Data
Streaming Data
Semi-Structured Data
Red Hat Storage
Red Hat Enterprise Linux & Virtualization
Capture & Process
Integrate & Analyze
RED HAT BIG DATA PLATFORM
EXAMPLES:
RED HAT BIG DATA PLATFORM
IN THE REAL WORLD
BIG DATA IN THE UTILITIES
Objective:
Combine data from smart meters on homes with data from electricity generation and transmission and
make it available to power providers
Problem:
The original smart grid project looked only at reading information from the meters on houses and now this
data needs to be combined with generation and transmission data in a cost-effective way
The data points are all over the place: sensors on the lines, in the field, homes, etc.
The information must be accessible to multiple power providers through a common interface
Solution:
Use messaging to collect data from a variety of sources and route it to a complex event processing (CEP)
engine for initial filtering. Process it with Hadoop MapReduce and BRMS, then distribute the data to Data
Virtualization, where it is combined with other sources and consumed with BI tools, and/or to JBoss Data
Grid (JDG) for in-memory caching, and/or to archive storage.
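As a rough illustration of this flow, the Python sketch below pushes a few readings through a queue, a filter, a normalization step and two sinks. It uses none of the products named above: the queue stands in for the messaging layer, `initial_filter` for the CEP pass, the dictionary for the in-memory data grid and the list for the archive. All names and readings are hypothetical.

```python
from queue import Queue

# Hypothetical readings from meters, line sensors and SCADA collectors.
readings = [
    {"source": "meter", "id": "m1", "kwh": 1.2},
    {"source": "sensor", "id": "s7", "kwh": None},  # incomplete reading
    {"source": "scada", "id": "g3", "kwh": 950.0},
]

inbox = Queue()  # stands in for the messaging layer
for reading in readings:
    inbox.put(reading)


def initial_filter(msg):
    """CEP-style first pass: keep only readings that carry a measurement."""
    return msg["kwh"] is not None


def normalize(msg):
    """Bring every reading into one common shape."""
    return {"key": f'{msg["source"]}:{msg["id"]}', "kwh": float(msg["kwh"])}


cache = {}    # stands in for the in-memory data grid
archive = []  # stands in for offline storage

while not inbox.empty():
    msg = inbox.get()
    if not initial_filter(msg):
        continue
    record = normalize(msg)
    cache[record["key"]] = record["kwh"]  # latest value per element
    archive.append(record)                # full history

print(cache)
```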
SMART GRID
Transmission
Generation
Consumer
Regulatory
Users
Collector (Sensors)
Collector (SCADA)
Collector (Meter)
Local Data Store
Adaptor Rules
Sensor Adaptor
Routing Function
Normalization / MapReduce
PM Regional Translator / Scheduler
Offline Storage
Data Virtualization
Cache
Authentication
Presentation
REST Exposure
Element Connection Tier
Data Adaptation & Routing Tier
Normalized Data Tier
Data Tier
API Exposure & Portal Tier
Compose
PM Data Schedule
PM Data Reports
Rules Creation / Updates
PM Admin
NoSQL-Cassandra
RETAIL CUSTOMER USE CASE
GAIN BETTER INSIGHT FOR INTELLIGENT INVENTORY MANAGEMENT
Objective:
The right merchandise at the right time and price
Problem:
Cannot combine social data and sentiment analysis
with the inventory and purchase management
systems
Solution:
Leverage JBoss Data Virtualization to mash up
sentiment analysis data with inventory and
purchasing system data. Leverage BRMS to
optimize pricing and stocking decisions.
Consume
Compose
Connect
Analytical Apps
JBoss Data Virtualization
Hive
Inventory Databases
Purchase Mgmt Application
Sentiment Analysis
JBoss BRMS
Data Driven Decision Management
DEMOS
LUCIDWORKS, JBOSS DATA
VIRTUALIZATION AND RED HAT
STORAGE
ABOUT LUCIDWORKS
Employs 40% of the “committers” for Lucene/Solr
Makes 50% - 70% of the enhancements to each release of
Lucene/Solr
Only company to offer Open Source and Open Core Search Solutions
LUCENE/SOLR: ENABLING BETTER, DATA-DRIVEN DECISIONS
LUCIDWORKS DEMONSTRATION
• LucidWorks/Solr to provide full
text search and statistics
• Data Virtualization provides
the data through Teiid JDBC
driver and pulls the data from
Hive/Hadoop, CSV File, XML
File
• Red Hat Storage provides the
Enterprise Data Repository
DEMONSTRATION ARCHITECTURE
DEMOS
HORTONWORKS AND JBOSS DATA
VIRTUALIZATION
ABOUT HORTONWORKS
Founded in 2011 by 24 engineers from the original Yahoo! Hadoop
development and operations team
Hortonworks drives innovation in the open exclusively via the Apache
Software Foundation process
Hortonworks is responsible for around 50% of core code base
advances to Apache Hadoop
HORTONWORKS DATA PLATFORM 2 SANDBOX
Enterprise Ready YARN, the Hadoop Operating System
Stinger Phase 2; Interactive SQL Queries at Petabyte Scale
Reliable NoSQL in Hadoop with HBase
Technical Specs
Component           Version
Apache Hadoop       2.2.0
Apache Hive         0.12.0
Apache HCatalog     0.12.0
Apache HBase        0.96.0
Apache ZooKeeper    3.4.5
Apache Pig          0.12.0
Apache Sqoop        1.4.4
Apache Flume        1.4.0
Apache Oozie        4.0.0
Apache Ambari       1.4.1
Apache Mahout       0.8.0
Hue                 2.3.0
HORTONWORKS
DEMONSTRATION
Objective:
Secure data by role using row-level
security and column masking
Problem:
Cannot hide regional data, such as patient
data, from users who should see only their own region
Solution:
Leverage JBoss Data Virtualization to
provide row-level security and column
masking
Consume
Compose
Connect
DV Dashboard to analyze the aggregated data by User
Role
JBoss Data Virtualization
Hive
SOURCE 1: Hive/Hadoop in the HDP
contains US Region Data
SOURCE 2: Hive/Hadoop in the HDP
contains EU Region Data
Hive
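The effect of row-level security and column masking can be sketched in a few lines of Python. This is an illustration only and does not use the JBoss Data Virtualization configuration or API; the role names, the `patients` rows and the `secure_view` function are all hypothetical.

```python
# Hypothetical rows, as if combined from the US and EU Hive sources.
patients = [
    {"region": "US", "name": "Alice", "ssn": "123-45-6789"},
    {"region": "EU", "name": "Bruno", "ssn": "987-65-4321"},
]

# Hypothetical policies: the regions a role may see and the columns hidden from it.
ROLES = {
    "us_analyst": {"regions": {"US"}, "masked": {"ssn"}},
    "eu_analyst": {"regions": {"EU"}, "masked": {"ssn"}},
    "auditor": {"regions": {"US", "EU"}, "masked": set()},
}


def secure_view(rows, role):
    """Return only the rows the role may see, with masked columns blanked out."""
    policy = ROLES[role]
    visible = []
    for row in rows:
        if row["region"] not in policy["regions"]:
            continue  # row-level security: drop rows from other regions
        visible.append({
            column: ("****" if column in policy["masked"] else value)
            for column, value in row.items()  # column masking
        })
    return visible


print(secure_view(patients, "us_analyst"))
print(secure_view(patients, "auditor"))
```

Applying the policy in the virtualization layer, rather than in each report, means every consuming tool sees the same restricted view for a given role.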
HORTONWORKS
DEMONSTRATION
Objective:
Determine if sentiment data from the first
week of the Iron Man 3 movie is a
predictor of sales
Problem:
Cannot combine social data and sentiment
analysis with the sales management system
Solution:
Leverage JBoss Data Virtualization to
mash up sentiment analysis data with
ticket and merchandise sales data on
MySQL into a single view of the data.
Consume
Compose
Connect
Excel Power View and DV Dashboard to
analyze the aggregated data
JBoss Data Virtualization
Hive
SOURCE 1: Hive/Hadoop contains Twitter
data including sentiment
SOURCE 2: MySQL data that includes
ticket and merchandise sales
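The idea of a single view over two sources can be tried out without Hive or MySQL. The Python sketch below is a stand-in under stated assumptions: two separate in-memory SQLite databases play the roles of the sentiment source and the sales source, and a temporary view joins them. Table names, column names and all values are hypothetical, and this is not how JBoss Data Virtualization itself is configured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two separate databases stand in for the two physical sources.
conn.execute("ATTACH DATABASE ':memory:' AS sentiment_src")  # plays the Hive role
conn.execute("ATTACH DATABASE ':memory:' AS sales_src")      # plays the MySQL role

conn.execute("CREATE TABLE sentiment_src.tweets (day TEXT, sentiment INTEGER)")
conn.execute("CREATE TABLE sales_src.sales (day TEXT, tickets INTEGER)")

# Hypothetical data: +1 is a positive tweet, -1 a negative one.
conn.executemany(
    "INSERT INTO sentiment_src.tweets VALUES (?, ?)",
    [("day-1", 1), ("day-1", 1), ("day-1", -1), ("day-2", 1)],
)
conn.executemany(
    "INSERT INTO sales_src.sales VALUES (?, ?)",
    [("day-1", 500), ("day-2", 650)],
)

# One view that hides the fact that the data lives in two places.
conn.execute("""
    CREATE TEMP VIEW sentiment_vs_sales AS
    SELECT s.day AS day, s.tickets AS tickets, SUM(t.sentiment) AS net_sentiment
    FROM sales_src.sales AS s
    JOIN sentiment_src.tweets AS t ON t.day = s.day
    GROUP BY s.day, s.tickets
""")

rows = conn.execute(
    "SELECT day, tickets, net_sentiment FROM sentiment_vs_sales ORDER BY day"
).fetchall()
print(rows)
```

A dashboard or spreadsheet would query only `sentiment_vs_sales`; where each column physically comes from stays hidden behind the view.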
DEMONSTRATION SYSTEM REQUIREMENTS
• JDK
– Oracle JDK 1.6 or 1.7, or OpenJDK 1.6 or 1.7
• JBoss Data Virtualization v6 Beta
– http://jboss.org/products/datavirt.html
• JBoss Developer Studio
– http://jboss.org/products
• JBoss Integration Stack Tools (Teiid)
– https://devstudio.jboss.com/updates/7.0-development/integration-stack/
• Slides, Code and References for demo
– https://github.com/DataVirtualizationByExample/Mashup-with-Hive-and-MySQL
• Hortonworks Data Platform (A VM for testing Hive/Hadoop)
– http://hortonworks.com/products/hdp-2/#install
• Red Hat Storage
– http://www.redhat.com/products/storage-server/
JBOSS DATA VIRTUALIZATION
PRODUCT ROADMAP AND BIG DATA
WHAT’S COMING: JBOSS DATA VIRTUALIZATION 6.1
BENEFITS OF DATA VIRTUALIZATION ON BIG
DATA
Enterprise democratization of big data
Any reporting or analytical tool can be used
Easy access to big data
Seamless integration of big data and existing data assets
Sharing of integration specifications
Collaborative development on big data
Fine-grained security of big data
Faster time-to-market for reports on big data
WHY RED HAT FOR BIG DATA?
Transform ALL data into actionable information
Cost Effective, Comprehensive Platform
Community based Innovation
Enterprise Class Software and Support
Data
Analytics
Data to Actionable Information Cycle
THANK YOU
Q & A

Big data insights with Red Hat JBoss Data Virtualization


Editor's Notes

  • #7 Reduce costs for finding and accessing highly fragmented data Improve time to market for new products and services by simplifying data access and integration Deliver IT solution agility necessary to capitalize on constantly changing market conditions Transform fragmented data into actionable information that delivers competitive advantage
  • #9 To remember the pragmatic definition of big data, think SPA — the three questions of big data: Store. Can you capture and store the data? Process. Can you cleanse, enrich, and analyze the data?  Access. Can you retrieve, search, integrate, and visualize the data?
  • #12 The data virtualization software provides a three-step process to connect data sources and data consumers: Connect: Fast access to data from disparate systems (databases, files, services, applications, etc.) with disparate access methods and storage models. Compose: Easily create a reusable, unified common data model and virtual data views by combining and transforming data from multiple sources. Consume: Seamlessly expose the unified virtual data model and views in real time through a variety of open-standard data access methods to support different tools and applications. JBoss Data Virtualization implements all three steps internally while hiding the complexity of data access methods, transformations and data merge logic from information consumers. This enables organizations to acquire actionable, unified information when they want it and the way they want it, i.e. at business speed.
  • #41 To remember the pragmatic definition of big data, think SPA — the three questions of big data: Store. Can you capture and store the data? Process. Can you cleanse, enrich, and analyze the data?  Access. Can you retrieve, search, integrate, and visualize the data?