Big Data Integration
Marcelo Litovsky
National Solutions Architect – Information Builders
Why are people buying Apache Hadoop?
• Load, Transform, Syndicate – use the power of Apache Hadoop to pre-process large amounts of data at low cost, then transform it into what is needed in the warehouse
• Archive/Offload – do not discard any data; use Apache Hadoop to archive or offload useful data. Whether driven by government regulations or by business value, the information remains readily available in Apache Hadoop

Data Warehousing – Paradigm Shift from ETL to ELT
• Load data from external sources (social media, machine data, …)
• Conform datasets to enterprise standards
• Integrate the disparate data sources to extract value from the incoming data
• Relate streaming and unstructured data, and social data, with transactional and traditional operational data sources
• External data integration
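The ETL-to-ELT shift above can be illustrated with a small SQL sketch: raw data is landed first, and transformation happens afterward inside the data store. Here SQLite stands in for Hive, and the table and column names are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# E + L: land the raw records as-is, with no upfront transformation
con.execute("CREATE TABLE raw_events (user_id, source, payload)")
con.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, "social", "like"), (1, "machine", "login"), (2, "social", "share")],
)
# T: transform inside the store, after loading -- the ELT step
con.execute("""
    CREATE TABLE user_activity AS
    SELECT user_id, COUNT(*) AS events
    FROM raw_events
    GROUP BY user_id
""")
rows = con.execute(
    "SELECT user_id, events FROM user_activity ORDER BY user_id"
).fetchall()
# rows → [(1, 2), (2, 1)]
```

The point of the pattern is that the heavy transformation runs where the data already lives, rather than in a separate ETL engine before loading.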
The Evolution of Integration
Hand-coded integration → ETL → Messaging bus → EAI → ESB → Apache Hadoop-based integration
Traditional in Transition to Modern
Traditional approaches (OLTP, OLAP, data warehouses, data marts, point-to-point integration, EII) cover fewer use cases; modern Apache Hadoop-based integration covers more use cases.
We Have Some Pretty Simple Problems…
According to a May 2015 Gartner survey:
• 26% are deploying Apache Hadoop; 11% plan to within 12 months, and 7% within 24 months
• 49% cite finding value as their biggest problem
• 57% cite the Apache Hadoop skills gap as their biggest problem
To summarize:
• Companies are investing in Apache Hadoop, but are not sure why
• Companies are investing in Apache Hadoop, but don’t know how to use it
Information Builders Big Data Architecture
Use Case for Apache Hadoop
• Traditional applications and data stores feed the cluster via Sqoop, Flume, and similar tools, with data represented as Avro, JSON, and other formats
• iWay Big Data Integrator provides simplified, modern, native Apache Hadoop integration – any distribution, any data
• The WebFOCUS BI and analytics platform (access, ETL, metadata) provides self-service for everyone
• The cluster serves as an enterprise data hub: data ingestion, ETL/ELT, predictive analytics (RStat), business intelligence (WebFOCUS), and low-cost storage of large data volumes
iWay Big Data Integrator
100% run-"in"-Apache Hadoop architecture: simplified interface, native Apache Hadoop script generation, process management and governance.
• Simplified, easy-to-use interface to integrate in Apache Hadoop
• Marshals Apache Hadoop resources and standards
• Takes advantage of native performance and resource negotiation
• Includes sophisticated process management and governance
iWay Big Data Integrator
Key Features
• Eclipse-based, user-friendly interface
• Data ingestion through an abstraction above Sqoop®, Flume®, Spark®, and proprietary streaming channel content
• Transformation and mapping
• Publishing to non-Apache Hadoop data sources
• Auto-generated scripts and jobs based on configuration
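The auto-generation idea can be sketched as a small config-to-script translator. The config keys and the builder function below are invented for illustration; iWay BDI's actual configuration format and generated scripts are not shown in this deck:

```python
def build_sqoop_import(cfg):
    """Render a Sqoop import command line from a simple config dict.

    The config schema here is hypothetical; the Sqoop flags themselves
    (--connect, --table, --target-dir, --as-avrodatafile, --num-mappers)
    are standard Sqoop import options.
    """
    parts = [
        "sqoop", "import",
        "--connect", cfg["jdbc_url"],
        "--table", cfg["table"],
        "--target-dir", cfg["target_dir"],
        "--as-avrodatafile",            # standardize structured data on Avro
        "--num-mappers", str(cfg.get("mappers", 4)),
    ]
    return " ".join(parts)

cmd = build_sqoop_import({
    "jdbc_url": "jdbc:oracle:thin:@db:1521/orcl",
    "table": "ORDERS",
    "target_dir": "/data/raw/orders",
})
```

Generating the script from declarative configuration, rather than hand-writing it, is what keeps the jobs portable across cluster distributions.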
iWay Big Data Integrator
Notable Features in 2016: Data Governance
• Data profiling, data preparation, master data management
• Analysis of patterns, data types, sparsity, and cardinality of Apache Hadoop datasets
• Generation of data cleansing rules based on pattern analysis
• Auto-generation of remediation tickets for non-cleansable records
• Ability to transpose data (wide to deep, deep to wide) in parallel
• Missing-value imputation, data scaling, data categorization
• Streaming and in-process predictive model scoring (PMML and native code)
• “Native” match and merge
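The wide-to-deep transpose mentioned above (often called a melt or unpivot) can be sketched in plain Python. This is a serial toy version of what BDI performs in parallel on the cluster; the column names are illustrative:

```python
def wide_to_deep(rows, id_col, value_cols):
    """Unpivot wide rows into one (id, attribute, value) record per column."""
    deep = []
    for row in rows:
        for col in value_cols:
            deep.append({"id": row[id_col], "attribute": col, "value": row[col]})
    return deep

wide = [{"customer": "A", "q1": 10, "q2": 12},
        {"customer": "B", "q1": 7,  "q2": 9}]
deep = wide_to_deep(wide, "customer", ["q1", "q2"])
# each wide row becomes one deep row per value column → 4 rows total
```

The deep-to-wide direction is the inverse: group the triples by id and pivot each attribute back into its own column.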
iWay Big Data Integrator
Notable Features in 2016: Data Lineage
• Full capture of data lineage for BDI ingestion, transformation, data preparation, and cleansing
• Integration with Cloudera Navigator, giving a holistic data-lineage view across non-BDI sources
• User interface to display lineage information interactively
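Lineage capture can be modeled as each dataset recording the operation and inputs that produced it, so the full upstream trail is recoverable from any output. The structure below is a toy assumption for illustration, not BDI's internal model:

```python
class Dataset:
    """A dataset that remembers how it was produced (toy lineage model)."""

    def __init__(self, name, operation=None, inputs=()):
        self.name = name
        self.operation = operation   # e.g. "ingest", "cleanse", "transform"
        self.inputs = list(inputs)   # upstream Dataset objects

    def lineage(self):
        """Walk upstream and list every step that produced this dataset."""
        steps = []
        for parent in self.inputs:
            steps.extend(parent.lineage())
        if self.operation:
            steps.append((self.operation, self.name))
        return steps

raw = Dataset("raw_orders", "ingest")
clean = Dataset("clean_orders", "cleanse", [raw])
report = Dataset("orders_report", "transform", [clean])
trail = report.lineage()
# trail → [("ingest", "raw_orders"), ("cleanse", "clean_orders"),
#          ("transform", "orders_report")]
```

A lineage UI like the one described is essentially a renderer over this kind of graph.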
iWay Big Data Integrator
Data Ingestion
Graphical Sqoop and Flume configuration:
• Sqoop – replace, change data capture, and native “roll your own” modes
• Flume – editor with validation, templates, and a graphical wizard in the works
• Proprietary “channel” ingestion via iWay Service Manager – legacy formats (streaming channel, MUMPS, etc.)
Structured data is standardized on the Avro format; a late-binding data “wrangler” handles unstructured content.
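Late binding means raw records keep their original form and a schema is applied only at read time, rather than at load time. The sketch below applies an invented read-time schema to raw JSON strings:

```python
import json

# Raw records land untyped, exactly as ingested
RAW = ['{"id": "1", "amt": "19.90"}', '{"id": "2", "amt": "5.00"}']

# Hypothetical read-time schema: field name → type coercion
SCHEMA = {"id": int, "amt": float}

def read_with_schema(raw_records, schema):
    """Parse raw JSON and coerce each field per the schema, at read time."""
    for rec in raw_records:
        parsed = json.loads(rec)
        yield {field: cast(parsed[field]) for field, cast in schema.items()}

typed = list(read_with_schema(RAW, SCHEMA))
# typed[0] → {"id": 1, "amt": 19.9}
```

Because the schema lives with the reader, a different consumer can interpret the same raw content with a different schema without re-ingesting anything.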
iWay Big Data Integrator
Data Ingestion
Graphical Sqoop and Flume configuration
iWay Big Data Integrator
Transformation
Drag-and-drop data transformation designer:
• Join (inner, left, right, full outer)
• Group by
• Aggregate functions as defined by the cluster
Any data on the cluster can be transformed, provided it is described in the Hive metastore.
Logic preview; transformations performed 100% in Apache Hadoop; Kerberos-compliant.
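A designer-built join plus aggregate compiles down to SQL of roughly the shape below. SQLite stands in for Hive here, and the tables are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id, customer_id, amount)")
con.execute("CREATE TABLE customers (customer_id, region)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, 10, 100.0), (2, 10, 50.0), (3, 11, 75.0)])
con.executemany("INSERT INTO customers VALUES (?, ?)",
                [(10, "east"), (11, "west")])

# Inner join + group by: the shape a drag-and-drop design generates
result = con.execute("""
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
# result → [("east", 150.0), ("west", 75.0)]
```

On the cluster the same statement would run against Hive metastore tables, so the work executes entirely inside Apache Hadoop.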
iWay Big Data Integrator
Mapping
• Relational targets on a remote RDBMS
• XML definitions
• Custom definitions on the design canvas
Publish
• Publish to any JDBC-compliant MPP database or RDBMS
• Staging-table or direct-to-target load
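The staging-table option loads into an intermediate table and only touches the target once the load has fully succeeded, unlike a direct-to-target load. A minimal sketch, with invented table names and SQLite standing in for the JDBC target:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id, amount)")

def publish_via_staging(con, rows):
    """Load into a staging table, then swap rows into the target atomically."""
    con.execute("CREATE TABLE IF NOT EXISTS sales_staging (id, amount)")
    con.execute("DELETE FROM sales_staging")
    con.executemany("INSERT INTO sales_staging VALUES (?, ?)", rows)
    # Target is only modified after staging succeeds, inside one transaction
    with con:
        con.execute("DELETE FROM sales")
        con.execute("INSERT INTO sales SELECT * FROM sales_staging")

publish_via_staging(con, [(1, 9.5), (2, 3.25)])
count = con.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
# count → 2
```

The trade-off: staging doubles the write volume but keeps the target consistent if the load fails partway; direct-to-target is faster but exposes partial loads.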
iWay Big Data Integrator
Transformation
Drag and drop data transformation designer
iWay Big Data Integrator
Transformation – underlying script
Underlying script generation view
iWay Big Data Integrator
Job Execution
Multiple job executions in a defined order
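Ordered execution can be sketched as a runner that executes jobs in the declared sequence and halts at the first failure, since downstream jobs depend on upstream output. The job names below are invented:

```python
def run_pipeline(jobs):
    """Run (name, fn) jobs in declared order; stop at the first failure."""
    completed = []
    for name, fn in jobs:
        try:
            fn()
        except Exception:
            break  # downstream jobs depend on this one, so halt the chain
        completed.append(name)
    return completed

log = []
pipeline = [
    ("ingest",    lambda: log.append("ingest")),
    ("transform", lambda: log.append("transform")),
    ("publish",   lambda: log.append("publish")),
]
done = run_pipeline(pipeline)
# done → ["ingest", "transform", "publish"]
```

A production scheduler adds retries, logging, and parallel branches, but the defined-order contract is the same.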
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Edge Node Deployment of DQ Services
Real-World Strategies for Deploying Big Data
Data Quality and MDM – iWay Big Data Integrator
Native Spark Interface to DQ
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Full integration of the Apache® Spark stack:
• Spark Streaming
• Spark SQL
• SparkR
• MLlib
Fully automated project setup, dependency management, and Scala version detection.
Code, build, test, and deploy – all from within Big Data Integrator.
Real-World Strategies for Deploying Big Data
Spark Integration – iWay Big Data Integrator
Predictive Model Development and Deployment
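In-process model scoring (as with PMML) amounts to evaluating an exported model against each record as it flows through the pipeline. The linear model below is a toy stand-in to show the shape of the step, not real PMML handling:

```python
# Toy stand-in for an exported regression model (coefficients from training)
MODEL = {"intercept": 1.0, "coefficients": {"age": 0.5, "visits": 2.0}}

def score(record, model=MODEL):
    """Apply the exported linear model to one record, in-process."""
    total = model["intercept"]
    for field, coef in model["coefficients"].items():
        total += coef * record.get(field, 0.0)
    return total

# Score records as they stream past, no round-trip to a model server
stream = [{"age": 30, "visits": 2}, {"age": 40, "visits": 1}]
scores = [score(rec) for rec in stream]
# scores → [20.0, 23.0]
```

Real PMML carries the model as portable XML, so the same exported model can be scored in a stream, a batch job, or an application without retraining.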
iWay Big Data Integrator
Cloudera Certified
• Easy-to-use interface for deploying and integrating data on Apache Hadoop distributions of all flavors, ensuring portability
• Ingests, transforms, and cleanses traditional RDBMS, mobile, social media, sensor, and other data, in batch or streams, using native Apache Hadoop facilities
• 100% YARN-compliant, taking advantage of native Apache Hadoop performance and resource negotiation
• Simplifies the use of Apache Hadoop ecosystem technologies such as MapReduce, Sqoop, Flume, Hive®, and Spark®
iWay Big Data Integrator is Cloudera certified!

Summer Shorts: Big Data Integration


Editor's Notes

  • #6 http://www.informationweek.com/big-data/software-platforms/hadoop-adoption-remains-steady-but-slow-gartner-finds