Integrating Big Data Technologies

©2012 Sixth Sense Advisors, Inc. All Rights Reserved

INTEGRATING BIG DATA
Dataversity Webinar
Feb 7 2012


State of Data Today


A Growing Trend
Expectations for BI are changing without anyone telling us

Requirement     Expectations                    Reality
Speed           Speed of the Internet           Speed = Infra + Arch + Design
Accessibility   Accessibility of a Smartphone   BI Tool licenses & security
Usability       IPAD - Mobility                 Web Enabled BI Tool
Availability    Google Search                   Data & Report Metadata
Delivery        Speed of questions              Methodology & Signoff
Data            Access to everything            Structured Data
Scalability     Cloud (Amazon)                  Existing Infrastructure
Cost            Cell phone or Free WIFI         Millions


The Wisdom of Crowds


Data Deluge = Business Insights


BIG Data

[Figure: data sources arranged from Structured to Unstructured, across Current and New]
•  ERP, CRM, SCM
•  Content Management Systems
•  Email, Call Center
•  Documents, Contracts


What’s so Big about Big Data?

Velocity
Volume
Variety
Complexity
Ambiguity


So you are about to start the Big Data Project

[Figure: Tools, Data, instructions, Output]


The Normal Way Results In ……..

Image Source: Web


Why Big Data can Fail on the RDBMS?

Current Data Management Platform (RDBMS + ETL + BI)
    New Data Types
    New volume
    New analytics
    New workload
    New metadata

•  POOR Performance
•  Failed Programs

Scalability; Sharding; ACID;


BIG Data
•  Workload Demands
    •  Process dynamic data content
    •  Process unstructured data
    •  Systems that can scale up and scale out with high volume data
    •  Perform complex operations within reasonable response time
•  Infrastructure Requirements
    •  Scalable platform
    •  Database independence
    •  Fault tolerant architectures
    •  Low cost of acquisition and store
    •  Supported by standard toolsets


Hadoop

Design Goals
•  System Shall Manage and Heal Itself
•  Performance Shall Scale Linearly
•  Compute Shall Move to Data
•  Simple Core, Modular and Extensible


Hadoop Differentiators

Schema-on-Write: RDBMS
•  Schema must be created before data is loaded.
•  An explicit load operation has to take place which transforms the data to the internal structure of the database.
•  New columns must be added explicitly before data for such columns can be loaded into the database.
•  Read is Fast.
•  Standards/Governance.

Schema-on-Read: Hadoop
•  Data is simply copied to the file store, no special transformation is needed.
•  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
•  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them.
•  Load is Fast
•  Evolving Schemas/Agility
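
The schema-on-read idea can be illustrated with a small sketch. This is not Hive or any real SerDe API; `RAW_STORE`, `serde_v1`, `serde_v2` and `read_table` are hypothetical names, and the records are made-up sample data. It only shows that raw lines are stored untouched and that updating the read-time parser makes a new column appear for data that was already loaded.

```python
import csv
import io

# Raw records are stored exactly as they arrived; nothing is validated on load.
# (Made-up sample data for illustration.)
RAW_STORE = [
    "1001,alice,2012-02-07",
    "1002,bob,2012-02-08,mobile",  # a later record carrying an extra field
]

def serde_v1(line):
    """Hypothetical read-time parser that knows only the first three columns."""
    fields = next(csv.reader(io.StringIO(line)))
    return {"id": int(fields[0]), "user": fields[1], "date": fields[2]}

def serde_v2(line):
    """Updated parser: also exposes the optional fourth column as 'channel'."""
    fields = next(csv.reader(io.StringIO(line)))
    row = {"id": int(fields[0]), "user": fields[1], "date": fields[2]}
    row["channel"] = fields[3] if len(fields) > 3 else None
    return row

def read_table(store, serde):
    """Apply the parser at read time; the stored lines are never rewritten."""
    return [serde(line) for line in store]

rows_v1 = read_table(RAW_STORE, serde_v1)
rows_v2 = read_table(RAW_STORE, serde_v2)
```

With `serde_v1` the extra field is simply ignored; switching to `serde_v2` exposes it for the record that was stored earlier, without reloading anything.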


Hadoop Known Limitations
•  Write-once model
•  A namespace with an extremely large number of files exceeds Namenode’s capacity to maintain
•  Cannot be mounted by existing OS
    •  Getting data in and out is tedious
    •  Virtual File System can solve problem
•  HDFS does not implement / support
    •  User quotas
    •  Access permissions
    •  Hard or soft links
    •  Data balancing schemes
•  No periodic checkpoints
•  Namenode is single point of failure
    •  Automatic restart and failover to another machine not yet supported


Hadoop Tips
•  Hadoop is useful
    •  When you must process lots of unstructured data
    •  When running batch jobs is acceptable
    •  When you have access to lots of cheap hardware
•  Hadoop is not useful
    •  For intense calculations with little or no data
    •  When your data is not self-contained
    •  When you need interactive results
•  Implementation
    •  Think big, start small
    •  Build on agile cycles
    •  Focus on the data, as you will always develop schema on write.
•  Available Optimizations
    •  Input to Maps
    •  Map only jobs
    •  Combiner
    •  Compression
    •  Speculation
    •  Fault Tolerance
    •  Buffer Size
    •  Parallelism (threads)
    •  Partitioner
    •  Reporter
    •  DistributedCache
    •  Task child environment settings
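
Of the optimizations listed, the combiner is the easiest to show in a few lines. The sketch below is plain Python, not the Hadoop API; `map_with_combiner` and `reduce_counts` are illustrative names and the input splits are made-up. It assumes a word-count style job, where counts can be summed locally on the map side before anything is sent to the reducers.

```python
from collections import Counter, defaultdict

def map_with_combiner(split):
    """Map one input split and pre-aggregate counts locally (the combiner step)."""
    local = Counter()
    for line in split:
        for word in line.lower().split():
            local[word] += 1
    # One (word, partial_count) pair per distinct word, not one pair per occurrence.
    return list(local.items())

def reduce_counts(mapped_outputs):
    """Sum the partial counts coming from every map task."""
    totals = defaultdict(int)
    for pairs in mapped_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Two made-up input splits, one per map task.
splits = [
    ["big data big insights", "big data"],
    ["data deluge", "big data"],
]
mapped = [map_with_combiner(s) for s in splits]
totals = reduce_counts(mapped)
```

The first split contains six word occurrences but emits only three pairs, which is the reduction in intermediate data a combiner is meant to achieve.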


Hadoop Tips
•  Troubleshooting
    •  Are your partitions uniform?
    •  Can you combine records at the map side?
    •  Are maps reading off a DFS block worth of data?
    •  Are you running a single reduce wave (unless the data size per reducer is too big)?
    •  Have you tried compressing intermediate data & final data?
    •  Are there buffer size issues?
    •  Do you see unexplained “long tails”?
    •  Are your CPU cores busy?
    •  Is at least one system resource being loaded?
•  Performance Tuning
    •  Increase the memory/buffer allocated to the tasks
    •  Increase the number of tasks that can be run in parallel
    •  Increase the number of threads that serve the map outputs
    •  Disable unnecessary logging
    •  Turn on speculation
    •  Run reducers in one wave as they tend to get expensive
    •  Tune the usage of DistributedCache, it can increase efficiency
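
The first troubleshooting question, whether partitions are uniform, can be checked offline on a sample of keys. The sketch below is an assumption-laden illustration rather than Hadoop's own partitioner: `partition` and `skew_ratio` are hypothetical helpers, the hash is MD5 chosen only for stable spreading, and the key sets are synthetic.

```python
import hashlib
from collections import Counter

def partition(key, num_reducers):
    """Stable hash partitioner: the same key always lands on the same reducer."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

def skew_ratio(keys, num_reducers):
    """Largest partition size divided by the ideal (perfectly even) size."""
    loads = Counter(partition(k, num_reducers) for k in keys)
    ideal = len(keys) / num_reducers
    return max(loads.values()) / ideal

# Synthetic samples: evenly spread keys versus one dominant "hot" key.
uniform_keys = [f"user-{i}" for i in range(10_000)]
hot_keys = ["user-0"] * 9_000 + [f"user-{i}" for i in range(1_000)]

uniform_skew = skew_ratio(uniform_keys, 8)
hot_skew = skew_ratio(hot_keys, 8)
```

A ratio near 1 means the reducers share the work evenly; a ratio far above 1, as with the hot-key sample, is one likely source of the "long tails" mentioned above.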


NoSQL
•  Stands for Not Only SQL
•  Based on CAP Theorem
•  Usually do not require a fixed table schema nor do they use the concept of joins
•  All NoSQL offerings relax one or more of the ACID properties
•  NoSQL databases come in a variety of flavors
    •  XML (myXMLDB, Tamino, Sedna)
    •  Wide Column (Cassandra, HBase, Big Table)
    •  Key/Value (Redis, Memcached with BerkeleyDB)
    •  Graph (Neo4j, InfoGrid)
    •  Document store (CouchDB, MongoDB)


NoSQL Footprint

[Figure: NoSQL families plotted by Size (vertical axis) against Complexity (horizontal axis)]
•  Key Value: Amazon Dynamo, Voldemort
•  Big Table: Google Big Table, HBase, Cassandra
•  Doc Database: Lotus Notes
•  Graph: Graph Theory


NoSQL
•  Access and Query
    •  RESTful interfaces (HTTP as an access API)
    •  Query languages other than SQL
        •  SPARQL - Query language for the Semantic Web
        •  Gremlin - the graph traversal language
        •  Sones Graph Query Language
    •  Data Manipulation / Query API
        •  The Google BigTable DataStore API
        •  The Neo4j Traversal API
    •  Serialization Formats
        •  JSON
        •  Thrift
        •  ProtoBuffers
        •  RDF
•  Best Practices
    •  Design for data collection
    •  Plan the data store
    •  Organize by type and semantics
    •  Partition for performance
    •  Access and Query is run time dependent
    •  Horizontal scaling
    •  Memory Caching
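
Three of the best practices (partition for performance, horizontal scaling, memory caching) can be sketched together in a toy key/value store. This is not the API of any NoSQL product; `ShardedStore` and all of its methods are hypothetical, the shards are plain in-memory dictionaries, and the cache is a minimal LRU.

```python
import hashlib
from collections import OrderedDict

class ShardedStore:
    """Toy key/value store: keys are hash-partitioned across in-memory shards,
    with a small LRU cache in front of reads. All names are illustrative."""

    def __init__(self, num_shards=4, cache_size=2):
        self.shards = [{} for _ in range(num_shards)]
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.shard_reads = 0  # counts reads that had to go to a shard

    def _shard_for(self, key):
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value
        self.cache.pop(key, None)  # invalidate any stale cached copy

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as most recently used
            return self.cache[key]
        self.shard_reads += 1
        value = self._shard_for(key).get(key)
        self.cache[key] = value
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict the least recently used entry
        return value

store = ShardedStore()
store.put("order:1", {"total": 40})
store.put("order:2", {"total": 15})
first = store.get("order:1")
second = store.get("order:1")  # served from the cache, no shard read
```

Adding shards spreads keys over more machines in a real deployment; here it only changes which dictionary a key lands in, which is enough to show the routing idea.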


Textual ETL Engine
Forest Rim Technology's Textual ETL Engine (TETLE) is an integration tool for turning text into a structure of data that can be analyzed by standard analytical tools

•  Textual ETL Engine provides a robust user interface to define rules (or patterns / keywords) to process unstructured or semi-structured data.
•  The rules engine encapsulates all the complexity and lets the user define simple phrases and keywords
•  Easy to implement and easy to realize ROI

•  Advantages
    •  Simple to use
    •  No MR or Coding required for text analysis and mining
    •  Extensible by Taxonomy integration
    •  Works on standard and new databases
    •  Produces a highly columnar key-value store, ready for metadata integration
•  Disadvantages
    •  Not integrated with Hadoop as a rules interface
    •  Currently uses Sqoop for metadata interchange with Hadoop or NoSQL interfaces
    •  Current GA does not handle distributed processing outside Windows platform
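
The general idea of keyword rules turning free text into analyzable rows can be sketched as follows. This is not how the Textual ETL Engine itself works internally, which the slides do not describe; `RULES`, `extract_rows`, the categories and the sample email are all hypothetical.

```python
import re

# Hypothetical rules: a category name mapped to the keywords that signal it.
RULES = {
    "billing": ["invoice", "refund", "charge"],
    "outage": ["down", "outage", "unavailable"],
}

def extract_rows(doc_id, text, rules=RULES):
    """Return one (doc_id, category, keyword, position) row per keyword hit."""
    rows = []
    for category, keywords in rules.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                rows.append((doc_id, category, keyword, match.start()))
    # Order rows by where the keyword appears in the text.
    return sorted(rows, key=lambda row: row[3])

# Made-up sample document.
email = "Service was down all morning. Please refund the last invoice."
rows = extract_rows("email-17", email)
```

Each row is a flat key-value style record that a standard database or BI tool could load, which is the kind of output the slide attributes to a textual ETL step.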


Integration
•  All RDBMS vendors today are supporting Hadoop or NoSQL as an integration or extension
    •  Oracle Exalytics / Big Data Appliance
    •  Teradata Aster Appliance
    •  EMC Greenplum Appliance
    •  IBM BigInsights
    •  Microsoft Windows Azure Integration
•  There are multiple providers of Hadoop distribution
    •  Cloudera
    •  Hortonworks
    •  Zettaset
•  Adapters from vendors to interface with Cloudera or Hortonworks distributions of Hadoop are available today. There are integration efforts to release Hadoop as an integral engine across the RDBMS vendor platforms


Conceptual Solution Architecture

[Figure: conceptual solution architecture. Components shown:]
•  Metadata, MDM
•  OLTP
•  ETL, ELT, CDC
•  Data Warehouse, DataMarts, OLAP
•  Reporting, Analytics, Search, Text Mining, Content Analytics, Knowledge Analytics
•  Big Data: Content, Email, Docs
•  Textual ETL and / or MR / Ruby / Java (Hadoop)
•  Big Data DW, Taxonomy


Integration Tips
•  The key to the castle in integrating Big Data is metadata
•  Whatever the tool, technology and technique, if you do not know your metadata, your integration will fail
•  Semantic technologies and architectures will be the way to process and integrate the Big Data, much akin to Web 2.0 models
•  Data quality for Big Data is a very questionable goal. To get some semblance of quality, taxonomies and ontologies can be of help
•  Third-party data providers also provide keywords, trending tags and scores; these can provide a lot of integration support
•  Writing business rules for Big Data can be very cumbersome and not all programs can be written in MapReduce
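
One way a taxonomy can lend "some semblance of quality" is by mapping the many spellings found in raw data onto a small set of canonical terms. The sketch below assumes a hand-built taxonomy; `TAXONOMY`, `VARIANT_TO_TERM`, `normalize_tags` and the sample tags are hypothetical and not taken from any vendor tool.

```python
# Hypothetical taxonomy: canonical term -> known variants seen in source data.
TAXONOMY = {
    "customer": {"customer", "client", "cust", "account holder"},
    "product": {"product", "item", "sku"},
}

# Invert once so each variant resolves to its canonical term with one lookup.
VARIANT_TO_TERM = {
    variant: term for term, variants in TAXONOMY.items() for variant in variants
}

def normalize_tags(tags):
    """Map raw tags to canonical terms; return (matched terms, unmatched tags)."""
    matched, unmatched = set(), []
    for tag in tags:
        key = tag.strip().lower()
        if key in VARIANT_TO_TERM:
            matched.add(VARIANT_TO_TERM[key])
        else:
            unmatched.append(tag)
    return matched, unmatched

# Made-up tags as they might arrive from different sources.
matched, unmatched = normalize_tags(["Client", " SKU ", "churn", "cust"])
```

The unmatched list is as useful as the matches: it shows which terms the taxonomy does not yet cover, which is metadata the integration needs before it can join this data to anything else.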


Which Tool

Application Hadoop NoSQL Textual ETL
Machine Learning x x
Sentiments x x x
Text Processing x x x
Image Processing x x
Video Analytics x x
Log Parsing x x x
Collaborative Filtering x x x
Context Search x
Email & Content x


Success Stories

•  Machine learning & Recommendation Engines – Amazon, Orbitz
•  CRM – Consumer Analytics, Metrics, Social Network Analytics, Churn, Sentiment, Influencer, Proximity
•  Finance – Fraud, Compliance
•  Telco – CDR, Fraud
•  Healthcare – Provider / Patient analytics, fraud, proactive care
•  Life Sciences – clinical analytics, physician outreach
•  Pharma – Pharmacovigilance, clinical trials
•  Insurance – fraud, geo-spatial
•  Manufacturing – warranty analytics, supplier quality metrics


Data Science

[Figure labels: Data Analytics; Art & Science; APPLIED SCIENCE]
•  Content, Customer, Product, Behaviors, Optimization
•  User Interest Prediction, inventory prediction, Machine learning, Pattern Mining, Advanced Regression Analysis
•  Big Data Processing & ETL
•  Business Intelligence
•  Advanced Analytics


Challenges

•  Resources Availability
•  MR is hard to implement
•  Speech to text
    •  Conversation context is often missing
    •  Quality of recording
    •  Accent issues
•  Visual data tagging
    •  Images
    •  Text embedded within images
•  Metadata is not available
•  Data is not trusted
•  Content management platform capabilities
•  Ontologies Ambiguity
•  Taxonomy Integration

Contact
•  Krish Krishnan
rkrish1124@yahoo.com
Twitter: @datagenius
