Big Data Integration
Marcelo Litovsky
National Solutions Architect – Information Builders
Why are people buying Apache Hadoop?
• Load, Transform, Syndicate – use the power of Apache Hadoop to pre-process large amounts of data at low cost, then transform it into what is needed in the warehouse
• Archive/Offload – do not discard any data; use Apache Hadoop to archive or offload useful data. Whether driven by government regulations or by business value, the information remains readily available in Apache Hadoop

Data Warehousing – Paradigm Shift from ETL to ELT
• Load data from external sources (social media, machine data, …)
• Conform datasets to enterprise standards
• Integrate the disparate data sources to extract value from the incoming data
• Relate streaming and unstructured data, and social data, with transactional and traditional operational data sources
• External data integration
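The ETL-to-ELT shift above can be illustrated with a small SQL sketch: raw data is landed first, and transformation happens afterward inside the data store. Here SQLite stands in for Hive, and the table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# E + L: land the raw records as-is, with no upfront transformation
con.execute("CREATE TABLE raw_events (user_id, source, payload)")
con.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, "social", "like"), (1, "machine", "login"), (2, "social", "share")],
)
# T: transform inside the store, after loading -- the ELT step
con.execute("""
    CREATE TABLE user_activity AS
    SELECT user_id, COUNT(*) AS events
    FROM raw_events
    GROUP BY user_id
""")
rows = con.execute(
    "SELECT user_id, events FROM user_activity ORDER BY user_id"
).fetchall()
# rows → [(1, 2), (2, 1)]
```

The point of the pattern is that the heavy transformation runs where the data already lives, rather than in a separate ETL engine before loading.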
The Evolution of Integration
Hand-coded integration → ETL → Messaging bus → EAI → ESB → Apache Hadoop-based integration
Traditional in Transition to Modern
Traditional approaches (OLTP, OLAP, data warehouses, data marts, point-to-point integration, EII) cover fewer use cases; modern Apache Hadoop-based integration covers more use cases.
We Have Some Pretty Simple Problems…
According to a May 2015 Gartner survey:
• 26% are deploying Apache Hadoop; 11% plan to within 12 months, and 7% within 24 months
• 49% cite finding value as their biggest problem
• 57% cite the Apache Hadoop skills gap as their biggest problem
To summarize:
• Companies are investing in Apache Hadoop, but are not sure why
• Companies are investing in Apache Hadoop, but don’t know how to use it
Information Builders Big Data Architecture
Use Case for Apache Hadoop
• Traditional applications and data stores feed the cluster via Sqoop, Flume, and similar tools, with data represented as Avro, JSON, and other formats
• iWay Big Data Integrator provides simplified, modern, native Apache Hadoop integration – any distribution, any data
• The WebFOCUS BI and analytics platform (access, ETL, metadata) provides self-service for everyone
• The cluster serves as an enterprise data hub: data ingestion, ETL/ELT, predictive analytics (RStat), business intelligence (WebFOCUS), and low-cost storage of large data volumes
iWay Big Data Integrator
100% run-"in"-Apache Hadoop architecture: simplified interface, native Apache Hadoop script generation, process management and governance.
• Simplified, easy-to-use interface to integrate in Apache Hadoop
• Marshals Apache Hadoop resources and standards
• Takes advantage of native performance and resource negotiation
• Includes sophisticated process management and governance
iWay Big Data Integrator
Key Features
• Eclipse-based, user-friendly interface
• Data ingestion through an abstraction above Sqoop®, Flume®, Spark®, and proprietary streaming channel content
• Transformation and mapping
• Publishing to non-Apache Hadoop data sources
• Auto-generated scripts and jobs based on configuration
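The auto-generation idea can be sketched as a small config-to-script translator. The config keys and the builder function below are invented for illustration; iWay BDI's actual configuration format and generated scripts are not shown in this deck:

```python
def build_sqoop_import(cfg):
    """Render a Sqoop import command line from a simple config dict.

    The config schema here is hypothetical; the Sqoop flags themselves
    (--connect, --table, --target-dir, --as-avrodatafile, --num-mappers)
    are standard Sqoop import options.
    """
    parts = [
        "sqoop", "import",
        "--connect", cfg["jdbc_url"],
        "--table", cfg["table"],
        "--target-dir", cfg["target_dir"],
        "--as-avrodatafile",            # standardize structured data on Avro
        "--num-mappers", str(cfg.get("mappers", 4)),
    ]
    return " ".join(parts)

cmd = build_sqoop_import({
    "jdbc_url": "jdbc:oracle:thin:@db:1521/orcl",
    "table": "ORDERS",
    "target_dir": "/data/raw/orders",
})
```

Generating the script from declarative configuration, rather than hand-writing it, is what keeps the jobs portable across cluster distributions.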
iWay Big Data Integrator
Notable Features in 2016: Data Governance
• Data profiling, data preparation, master data management
• Analysis of patterns, data types, sparsity, and cardinality of Apache Hadoop datasets
• Generation of data cleansing rules based on pattern analysis
• Auto-generation of remediation tickets for non-cleansable records
• Ability to transpose data (wide to deep, deep to wide) in parallel
• Missing-value imputation, data scaling, data categorization
• Streaming and in-process predictive model scoring (PMML and native code)
• “Native” match and merge
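The wide-to-deep transpose mentioned above (often called a melt or unpivot) can be sketched in plain Python. This is a serial toy version of what BDI performs in parallel on the cluster; the column names are illustrative:

```python
def wide_to_deep(rows, id_col, value_cols):
    """Unpivot wide rows into one (id, attribute, value) record per column."""
    deep = []
    for row in rows:
        for col in value_cols:
            deep.append({"id": row[id_col], "attribute": col, "value": row[col]})
    return deep

wide = [{"customer": "A", "q1": 10, "q2": 12},
        {"customer": "B", "q1": 7,  "q2": 9}]
deep = wide_to_deep(wide, "customer", ["q1", "q2"])
# each wide row becomes one deep row per value column → 4 rows total
```

The deep-to-wide direction is the inverse: group the triples by id and pivot each attribute back into its own column.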
iWay Big Data Integrator
Notable Features in 2016: Data Lineage
• Full capture of data lineage for BDI ingestion, transformation, data preparation, and cleansing
• Integration with Cloudera Navigator, giving a holistic data-lineage view across non-BDI sources
• User interface to display lineage information interactively
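Lineage capture can be modeled as each dataset recording the operation and inputs that produced it, so the full upstream trail is recoverable from any output. The structure below is a toy assumption for illustration, not BDI's internal model:

```python
class Dataset:
    """A dataset that remembers how it was produced (toy lineage model)."""

    def __init__(self, name, operation=None, inputs=()):
        self.name = name
        self.operation = operation   # e.g. "ingest", "cleanse", "transform"
        self.inputs = list(inputs)   # upstream Dataset objects

    def lineage(self):
        """Walk upstream and list every step that produced this dataset."""
        steps = []
        for parent in self.inputs:
            steps.extend(parent.lineage())
        if self.operation:
            steps.append((self.operation, self.name))
        return steps

raw = Dataset("raw_orders", "ingest")
clean = Dataset("clean_orders", "cleanse", [raw])
report = Dataset("orders_report", "transform", [clean])
trail = report.lineage()
# trail → [("ingest", "raw_orders"), ("cleanse", "clean_orders"),
#          ("transform", "orders_report")]
```

A lineage UI like the one described is essentially a renderer over this kind of graph.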
iWay Big Data Integrator
Data Ingestion
Graphical Sqoop and Flume configuration:
• Sqoop – replace, change data capture, and native “roll your own” modes
• Flume – editor with validation, templates, and a graphical wizard in the works
• Proprietary “channel” ingestion via iWay Service Manager – legacy formats (streaming channel, MUMPS, etc.)
Structured data is standardized on the Avro format; a late-binding data “wrangler” handles unstructured content.
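Late binding means raw records keep their original form and a schema is applied only at read time, rather than at load time. The sketch below applies an invented read-time schema to raw JSON strings:

```python
import json

# Raw records land untyped, exactly as ingested
RAW = ['{"id": "1", "amt": "19.90"}', '{"id": "2", "amt": "5.00"}']

# Hypothetical read-time schema: field name → type coercion
SCHEMA = {"id": int, "amt": float}

def read_with_schema(raw_records, schema):
    """Parse raw JSON and coerce each field per the schema, at read time."""
    for rec in raw_records:
        parsed = json.loads(rec)
        yield {field: cast(parsed[field]) for field, cast in schema.items()}

typed = list(read_with_schema(RAW, SCHEMA))
# typed[0] → {"id": 1, "amt": 19.9}
```

Because the schema lives with the reader, a different consumer can interpret the same raw content with a different schema without re-ingesting anything.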
iWay Big Data Integrator
Data Ingestion
Graphical Sqoop and Flume configuration
iWay Big Data Integrator
Transformation
Drag-and-drop data transformation designer:
• Join (inner, left, right, full outer)
• Group by
• Aggregate functions as defined by the cluster
Any data on the cluster can be transformed, provided it is described in the Hive metastore.
Logic preview; transformations performed 100% in Apache Hadoop; Kerberos-compliant.
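A designer-built join plus aggregate compiles down to SQL of roughly the shape below. SQLite stands in for Hive here, and the tables are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id, customer_id, amount)")
con.execute("CREATE TABLE customers (customer_id, region)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 10, 50.0), (3, 11, 75.0)])
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "east"), (11, "west")])

# Inner join + group by: the shape a drag-and-drop design generates
result = con.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
# result → [("east", 150.0), ("west", 75.0)]
```

On the cluster the same statement would run against Hive metastore tables, so the work executes entirely inside Apache Hadoop.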
iWay Big Data Integrator
Mapping
• Relational targets on a remote RDBMS
• XML definitions
• Custom definitions on the design canvas
Publish
• Publish to any JDBC-compliant MPP database or RDBMS
• Staging-table or direct-to-target load
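The staging-table option loads into an intermediate table and only touches the target once the load has fully succeeded, unlike a direct-to-target load. A minimal sketch, with invented table names and SQLite standing in for the JDBC target:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id, amount)")

def publish_via_staging(con, rows):
    """Load into a staging table, then swap rows into the target atomically."""
    con.execute("CREATE TABLE IF NOT EXISTS sales_staging (id, amount)")
    con.execute("DELETE FROM sales_staging")
    con.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
    # Target is only modified after staging succeeds, inside one transaction
    with con:
        con.execute("DELETE FROM sales")
        con.execute("INSERT INTO sales SELECT * FROM sales_staging")

publish_via_staging(con, [(1, 9.5), (2, 3.25)])
count = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
# count → 2
```

The trade-off: staging doubles the write volume but keeps the target consistent if the load fails partway; direct-to-target is faster but exposes partial loads.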
iWay Big Data Integrator
Transformation
Drag and drop data transformation designer
iWay Big Data Integrator
Transformation – underlying script
Underlying script generation view
iWay Big Data Integrator
Job Execution
Multiple job executions in a defined order
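Ordered execution can be sketched as a runner that executes jobs in the declared sequence and halts at the first failure, since downstream jobs depend on upstream output. The job names below are invented:

```python
def run_pipeline(jobs):
    """Run (name, fn) jobs in declared order; stop at the first failure."""
    completed = []
    for name, fn in jobs:
        try:
            fn()
        except Exception:
            break  # downstream jobs depend on this one, so halt the chain
        completed.append(name)
    return completed

log = []
pipeline = [
    ("ingest",    lambda: log.append("ingest")),
    ("transform", lambda: log.append("transform")),
    ("publish",   lambda: log.append("publish")),
]
done = run_pipeline(pipeline)
# done → ["ingest", "transform", "publish"]
```

A production scheduler adds retries, logging, and parallel branches, but the defined-order contract is the same.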
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Edge Node Deployment of DQ Services
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Native Spark Interface to DQ
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Full integration of the Apache® Spark stack:
• Spark Streaming
• Spark SQL
• SparkR
• MLlib
Fully automated project setup, dependency management, and Scala version detection.
Code, build, test, and deploy – all from within Big Data Integrator.
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Predictive Model Development and Deployment
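In-process model scoring (as with PMML) amounts to evaluating an exported model against each record as it flows through the pipeline. The linear model below is a toy stand-in to show the shape of the step, not real PMML handling:

```python
# Toy stand-in for an exported regression model (coefficients from training)
MODEL = {"intercept": 1.0, "coefficients": {"age": 0.5, "visits": 2.0}}

def score(record, model=MODEL):
    """Apply the exported linear model to one record, in-process."""
    total = model["intercept"]
    for field, coef in model["coefficients"].items():
        total += coef * record.get(field, 0.0)
    return total

# Score records as they stream past, no round-trip to a model server
stream = [{"age": 30, "visits": 2}, {"age": 40, "visits": 1}]
scores = [score(rec) for rec in stream]
# scores → [20.0, 23.0]
```

Real PMML carries the model as portable XML, so the same exported model can be scored in a stream, a batch job, or an application without retraining.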
iWay Big Data Integrator
Cloudera Certified
• Easy-to-use interface for deploying and integrating data on Apache Hadoop distributions of all flavors, ensuring portability
• Ingests, transforms, and cleanses traditional RDBMS, mobile, social media, sensor, and other data, in batch or streams, using native Apache Hadoop facilities
• 100% YARN-compliant, taking advantage of native Apache Hadoop performance and resource negotiation
• Simplifies the use of Apache Hadoop ecosystem technologies such as MapReduce, Sqoop, Flume, Hive®, and Spark®
iWay Big Data Integrator is Cloudera certified!

Summer Shorts: Big Data Integration


Editor's Notes

  • #6 http://www.informationweek.com/big-data/software-platforms/hadoop-adoption-remains-steady-but-slow-gartner-finds