Big Data Introduction
Agenda
• Current Scenario/Trends in IT
• Big Data
– Batch eco system
– NoSQL eco system
– Visualization
• Case Studies for Big Data
– Enterprise Data Warehouse
– Customer Analytics
Current Scenario
Enterprise
applications
Operational
Decision
Support
Enterprise applications can be broadly categorized into
Operational and Decision support systems.
Current Scenario – Architecture
(Typical Enterprise Application)
Client
(Browser)
Client
(Browser)
Client
(Browser)
App Server
App Server
Database
Current Scenario - Architecture
• Recent trends
– Standardization and consolidation of hardware
(servers, storage, network) etc., to cut down the
costs
– Storage is physically separated from servers and
connected with high speed fiber optics
Current Scenario - Architecture
Database
Server
Database
Server
Database
Server
Network Switch Network Switch Storage Cluster
*Typical database architecture in an enterprise
Current Scenario - Architecture
• Databases
– Databases are clustered (Oracle – RAC)
• High availability
• Fault tolerance
• Load balancing
• Scalable (though not linearly)
– Common network storage
• File abstraction – file can be of any size
• Fault tolerance (using RAID)
Current Scenario - Architecture
• Almost all these applications follow a similar n-tier architecture
– Core applications (operational)
– EAI (Enterprise Application Integration)
– CRM
– ERP
– DW/BI tools like Informatica, Cognos, Business Objects etc.
• However, there are exceptions – legacy (mainframe based)
applications which use a closed architecture
Current Scenario - Architecture
Application
Servers
Database Servers
Storage
Servers
*Bird's-eye view – after standardization and
consolidation using cloud architecture
Current Scenario - Challenges
• Almost all operational systems use relational databases (RDBMS such as
Oracle).
– RDBMS were originally designed for operational and transactional workloads.
• Not linearly scalable.
– Transactions
– Data integrity
• Expensive
• Predefined schema
• Data processing does not happen where data is stored (storage layer)
– Some processing happens at the database server level (SQL)
– Some processing happens at the application server level (Java/.NET)
– Some processing happens at the client/browser level (JavaScript)
Current Scenario – Use case
(E-Mail Campaigning)
App
Server(s)
Mail
Server(s)
Database
Client
Client
Client
Current Scenario – Use case
(E-Mail Campaigning)
• Customer (E-Mail recipient) data needs to be
stored in real time
• Customer data can be in hundreds of millions
(if not billions)
• For every campaign, e-mails have to be pushed
to all the customers (batch and ad-hoc)
• Customers have to be uniquely identified to
avoid sending multiple coupons to the same
recipient (batch and periodic)
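The deduplication requirement above can be sketched in a few lines. This is an illustrative simplification, not the deck's implementation: real systems match customers on more than a normalized e-mail address.

```python
# Minimal sketch (illustrative only): deduplicate campaign recipients by
# normalizing e-mail addresses so each customer gets exactly one coupon.

def normalize(email: str) -> str:
    """Lowercase and strip whitespace so trivially different strings match."""
    return email.strip().lower()

def unique_recipients(emails):
    """Return recipients in first-seen order, one entry per normalized address."""
    seen = set()
    result = []
    for e in emails:
        key = normalize(e)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

emails = ["Alice@Example.com", "alice@example.com ", "bob@example.com"]
print(unique_recipients(emails))  # ['alice@example.com', 'bob@example.com']
```

At hundreds of millions of customers this set no longer fits on one machine, which is exactly the scalability problem the later slides address with distributed batch processing.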
Current Scenario – Use case
(E-Mail Campaigning)
• Challenges
– Small client vs. Big client
• Scalability issues can be significant
– Standard client vs. Premium client
– Infrastructure
• Databases, application servers, or e-mail servers can be the
bottleneck
– Code development and deployment
– Standardization
*Keep these in mind – I will explain how they can be
resolved using the Big Data eco system
Big Data
• Evolution of Big Data
• Understanding characteristics of Big Data
• Batch, operational and analytics in Big Data
eco system
• Types, Technologies or tools, Techniques and
Talent
Evolution of Big Data
• GFS (Google File System)
• Google Map Reduce
• Google Big Table
Understanding characteristics of Big
Data
• Volume
• Variety
• Velocity
Batch, operational and analytics in Big
Data eco system
• Batch – Hadoop eco system
– Map reduce
– Hive/Pig
– Sqoop
• Operational (but not transactional) – NoSQL eco system
– Cassandra
– HBase
– MongoDB
• Analytics and visualization
– Sentiment analysis
– Statistical analysis
– Machine Learning and Natural Language Processing
Big Data eco system – Advantages
• Distributed storage
– Fault tolerance (RAID is replaced by replication)
• Distributed computing/processing
– Data locality (code goes to data)
• Scalability (almost linear)
• Low cost hardware (commodity)
• Low licensing costs
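The first advantage above – replication replacing RAID – can be illustrated with a toy placement function. Everything here (node names, the round-robin policy) is a made-up simplification; real systems such as HDFS use rack-aware placement.

```python
# Illustrative sketch: fault tolerance via replication instead of RAID.
# Each block is written to `replication` distinct nodes, so the loss of
# any single node never loses data.

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)]
                            for r in range(replication)]
    return placement

placement = place_replicas(["b0", "b1"], ["n1", "n2", "n3", "n4"])
print(placement)  # {'b0': ['n1', 'n2', 'n3'], 'b1': ['n2', 'n3', 'n4']}
```

Data locality is the mirror image of this table: because the scheduler knows which nodes hold a block, it can send the processing code to one of those nodes instead of moving the data.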
Hadoop eco system
• Evolution of Hadoop eco system
• Use cases that can be addressed using Hadoop
eco system
• Hadoop eco system tools/landscape
Evolution of Hadoop eco system
• GFS to HDFS
• Google Map Reduce to Hadoop Map Reduce
• Big Table to HBase
Use cases that can be addressed using
Hadoop eco system
• ETL
• Real time reporting
• Batch reporting
• Operational but not transactional
Hadoop eco system tools/landscape
• Operational and real time data integration
– HBase
• ETL
– Map reduce, Hive/Pig, Sqoop etc
• Reporting
– Hive (Batch)
– Impala/Presto (Real time)
• Analytics API
– Map reduce
– Other frameworks
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for masters)
– Oozie (workflow/scheduler)
– Chef/Puppet (automation for administrators)
– Vendor specific management tools (Cloudera Manager, Hortonworks Ambari etc.)
NoSQL eco system
• Evolution of NoSQL eco system
• Use cases that can be addressed using NoSQL
eco system
• NoSQL eco system tools/landscape
Evolution of NoSQL eco system
• Google Big Table
• Amazon DynamoDB
• Apache HBase
• Apache Cassandra
• MongoDB
Use cases that can be addressed using
NoSQL eco system
• Operational but not transactional
• Complements conventional RDBMS systems
• NoSQL is generally not a substitute for
transactional systems.
• Facebook Messenger is implemented using
HBase
NoSQL eco system tools/landscape
• NoSQL Tools
– Apache HBase
– Apache Cassandra
– MongoDB
• Miscellaneous/complementary tools
– ZooKeeper (co-ordination service for high
availability of masters)
– Vendor specific DevOps tools
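What distinguishes these stores from an RDBMS is the data model. The toy sketch below imitates the wide-column model of HBase/Cassandra; it is not a client API, and the row key and column names are invented for the example.

```python
# Toy sketch of the wide-column data model used by HBase/Cassandra
# (illustrative only): rows are addressed by a key, and each row holds
# column -> value pairs that need not follow a fixed schema.

table = {}  # row_key -> {column: value}

def put(row_key, column, value):
    """Write one cell; new columns can appear at any time."""
    table.setdefault(row_key, {})[column] = value

def get(row_key, column=None):
    """Read a whole row, or a single cell if a column is given."""
    row = table.get(row_key, {})
    return row if column is None else row.get(column)

put("cust#42", "profile:email", "alice@example.com")
put("cust#42", "campaign:last_sent", "2014-01-15")
print(get("cust#42", "profile:email"))  # alice@example.com
```

Key-addressed reads and writes like these are fast and scale out, which is why the deck calls these systems operational; what they deliberately give up are the multi-row transactions of an RDBMS.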
Analytics and Visualization
• Evolution of analytics and visualization tools
• Use cases that can be addressed
– Statistical analysis
– Machine learning and Natural language processing
– Conventional Reporting
• Eco system tools/landscape
– Datameer
– Tableau or any BI tool
– R (In memory statistical analysis tool)
Use Case – E-Mail Campaigning
• Role of NoSQL
– Operational
• Role of Hadoop
– Decision support
*Both NoSQL and Hadoop can be installed on the same
servers.
Current Scenario – Use case
(E-Mail Campaigning)
App
Server(s)
Mail
Server(s)
Database
Client
Client
Client
Current Scenario – Use case
(E-Mail Campaigning)
• Customer (E-Mail recipient) data needs to be
stored in real time
• Customer data can be in hundreds of millions
(if not billions)
• For every campaign, e-mails have to be pushed
to all the customers (batch and ad-hoc)
• Customers have to be uniquely identified to
avoid sending multiple coupons to the same
recipient (batch and periodic)
Current Scenario – Use case
(E-Mail Campaigning)
• Challenges
– Small client vs. Big client
• Scalability issues can be significant
– Standard client vs. Premium client
– Infrastructure
• Databases, application servers, or e-mail servers can be the
bottleneck
– Code development and deployment
– Standardization
*Keep these in mind – I will explain how they can be
resolved using the Big Data eco system
Use Case (E-Mail Campaigning)
Big Data eco system
Client
Client
Client
Node 1: Storage + Processing
Node 2: Storage + Processing
Use Case (E-Mail Campaigning)
Big Data eco system
• Storage
– Distributed storage (e.g. HDFS, CFS, GFS)
• Processing
– Operational (HBase, Cassandra)
• Data storage is operational – for example, customer data might have to be
stored in real time
– Batch (Map Reduce, Hive/Pig)
• E-Mail campaigning is batch
• Map Reduce can be integrated with e-mail notification to push the
campaign.
• Customer validation can be done in batch
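The batch side of the slide above can be sketched as a map-style job: each node applies the same function to the customer records it holds locally. This is a hypothetical in-process simulation; record fields and the campaign text are made up.

```python
# Hypothetical sketch of the batch campaign: a map-style function turns
# each stored customer record into an outgoing message. On a real
# cluster each partition would be processed on the node that stores it.

def build_message(customer, campaign):
    """The 'mapper': one customer record -> one campaign message."""
    return {"to": customer["email"],
            "body": f"Hi {customer['name']}, {campaign}"}

def run_campaign(partitions, campaign):
    """Apply the mapper to every record in every partition (batch)."""
    return [build_message(c, campaign)
            for part in partitions
            for c in part]

partitions = [[{"name": "Alice", "email": "a@x.com"}],
              [{"name": "Bob", "email": "b@x.com"}]]
messages = run_campaign(partitions, "20% off this week!")
print(len(messages))  # 2
```

Because the mapper sees every record exactly once, the same pass can also do the batch customer validation the slide mentions.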
Use Case (LinkedIn)
• Most of the frames (page modules) on linkedin.com are
implemented using Big Data eco system tools
• Advantages
– Low cost to implement an idea (endorsements)
– No impact on existing applications
– Both operational (actual endorsement) and batch
(consolidated e-mail) are done on same servers
– Distributed and scalable
Use Case – EDW
(Current Architecture)
OLTP
Closed
Main Frames
XML
External apps
Data
Warehouse
Data Integration
(ETL/Real Time)
ODS
Source(s)
EDW/ODS
Visualization/
Reporting
Reporting
Decision
Support
Use Case – EDW
(Current Architecture)
• The Enterprise Data Warehouse is built for enterprise reporting for a select
audience in executive management; hence the user base viewing the reports
is typically in the tens or hundreds
• Data Integration
– ODS (Operational Data Store)
• Sources – Disparate
• Real time – Tools/custom (GoldenGate, SharePlex etc.)
• Batch – Tools/custom
• Uses – Compliance, data lineage, reports etc
– Enterprise Data Warehouse
• Sources – ODS or other sources
• ETL – Tools/custom (Informatica, Ab Initio, Talend)
• Reporting/Visualization
– ODS (Compliance related reporting)
– Enterprise Data Warehouse
– Tools (Cognos, Business Objects, Microstrategy, Tableau etc)
Use Case – EDW
(Big Data eco system)
OLTP
Closed
Main Frames
XML
External apps
Source(s)
Visualization/
Reporting
Reporting
Decision
Support
Node
Node
Node
Hadoop Cluster
(EDW/ODS)
ETL
Real
Time/Batch
(No ETL)
Reporting
Database
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Use Case – EDW
(Big Data eco system)
• ODS and EDW can be shared on the same Hadoop cluster
• Real time/batch data integration
– Flume (to get data from web logs)
– Use HBase layer
• ETL
– Should leverage Hadoop Map Reduce capabilities
– Sqoop – to get data from relational databases
– Hive/Pig – To process/transform data as per reporting requirements
• Reporting/Visualization
– Reporting can be done either directly from Hadoop or separate
reporting database
Use Case – EDW
(Big Data eco system)
• Pros over traditional EDW
– Low cost and consolidated hardware
– Low licensing costs
– Open source tools
– Facilitate advanced analytics
• Cons over traditional EDW
– Still evolving
– Learning curve
Use Case – Customer Analytics
• A company can often have thousands to millions
of customers (e.g. eBay, Amazon, YouTube,
LinkedIn etc.)
• Analytics at the customer level can add significant
value to both the customer and the enterprise
• Traditional EDW appliances will not be able to
support customer analytics/reporting for large
enterprises
• Big Data eco system of tools can handle customer
analytics for an enterprise of any size
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Use Case – Customer Analytics
• Capture data from web logs and load into
Hadoop – Flume/custom solution
• Load customer profile data from traditional MDM
or EDW or other source to Hadoop –
Sqoop/Hive/HBase
• Perform ETL to compute analytics at customer
level – Hive/Pig
• Database to store the pre-computed analytics for
all customers – HBase
• Visualization – is often custom as per company's
requirements
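The "compute analytics at customer level" step above is essentially a group-by over web-log events, the kind of aggregation Hive/Pig would express over data in HDFS. The sketch below simulates it in-process; event field names are invented for the example.

```python
# Illustrative sketch of per-customer analytics: count events per
# (customer, action) pair from web-log records. The result is the kind
# of pre-computed analytic that would then be stored in HBase.

from collections import defaultdict

def per_customer_counts(events):
    """Group-by (customer, action) and count - one analytic per pair."""
    counts = defaultdict(int)
    for e in events:
        counts[(e["customer"], e["action"])] += 1
    return dict(counts)

events = [{"customer": "c1", "action": "view"},
          {"customer": "c1", "action": "view"},
          {"customer": "c2", "action": "click"}]
print(per_customer_counts(events))
# {('c1', 'view'): 2, ('c2', 'click'): 1}
```

Keyed by customer, these pre-computed results can be served with single-row reads, which is why an operational store like HBase is the natural destination.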
Jobs in Big Data
• Generalized
– Data Scientists
– Solutions Architects
– Infrastructure Architects
– And many more
• Specialized
– ETL developers/architects
– Advanced analytics developers/architects
– Data Analysts/Business Analysts
– Hadoop Admins
– NoSQL Admins/DBAs
– DevOps Engineers
– And many more
Industry reaction
• Oracle – Big Data Appliance
• IBM – BigInsights
• EMC – Pivotal HD
• ETL tools – Informatica, Syncsort etc. are
adding or re-architecting big data capabilities
Hadoop eco system
*Diagram – Hadoop eco system:
– Hadoop Core Components: Distributed File System (HDFS), Map Reduce
– Hadoop Components (Map Reduce based): Hive, Pig, Flume, Sqoop, Oozie, Mahout
– Non Map Reduce: Impala, Presto, HBase
Thank You
Hadoop eco system - Setup
Environment
• https://www.youtube.com/watch?v=p7uCyFfWL-c&index=13&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
HDFS
• Hadoop Distributed File System
– It is a distributed storage
– HDFS files are logical (physically they are stored as
blocks)
– Replication factor is used for fault tolerance.
https://www.youtube.com/watch?v=-Rc-jisdyKI&index=14&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
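The point that an HDFS file is logical while storage is physical blocks can be made concrete with a tiny splitter. The 4-byte block size below is purely illustrative; HDFS defaults are far larger (e.g. 64 or 128 MB).

```python
# Sketch of "HDFS files are logical, storage is blocks": split a byte
# string into fixed-size blocks. The file is the concatenation of its
# blocks; the name node only needs to remember which blocks make it up.

def split_into_blocks(data: bytes, block_size: int):
    """Chop data into block_size-byte pieces; the last may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"abcdefghij", block_size=4)
print(blocks)  # [b'abcd', b'efgh', b'ij']

# With a replication factor of 3, each of these blocks would be stored
# on three different data nodes for fault tolerance.
```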
Map Reduce
• https://www.youtube.com/watch?v=IRxgew6ytq8&index=16&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
• https://www.youtube.com/watch?v=he8vt835cf8&index=17&list=PLf0swTFhTI8o6LURHy7u3YIBC3FnuZIdP
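The programming model the videos cover can be summarized by the classic word-count example, simulated here in a single process: a map phase emits (key, value) pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
# Classic word count expressed as map / shuffle / reduce phases,
# simulated in-process to illustrate the Map Reduce programming model.

from collections import defaultdict

def map_phase(lines):
    """Mapper: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: aggregate each key's values (here, sum the counts)."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data big", "data tools"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 2, 'tools': 1}
```

In a real job the mappers run on the nodes holding the input blocks (data locality) and the shuffle moves data across the network between the map and reduce phases.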
Map Reduce based tools
• Hive
• Pig
• Sqoop
• Oozie
• Flume
• Many more
Non Map reduce based tools
• Impala
• HBase
• Many more
Introduction to Hive
• Defines a logical structure on top of data in HDFS
• Provides commands to load/insert data into HDFS
• Provides a SQL interface to process data in HDFS
• It typically uses map reduce to process the data
• Stores metadata (the logical structure) in a traditional
RDBMS such as MySQL
• A small demo on Hive
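Hive runs SQL over files in HDFS by compiling it to map reduce; as a stand-in for the demo, the sketch below runs the same style of query with Python's built-in sqlite3 so the SQL shape is visible. The table and column names are invented, and sqlite3 is only a local substitute, not part of the Hive stack.

```python
# Stand-in for a Hive demo query: the same group-by SQL, executed with
# sqlite3 instead of Hive. In Hive, this SELECT would typically compile
# into a map reduce job over files in HDFS.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (customer TEXT, page TEXT)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("c1", "home"), ("c1", "deals"), ("c2", "home")])

rows = conn.execute(
    "SELECT customer, COUNT(*) FROM page_views "
    "GROUP BY customer ORDER BY customer").fetchall()
print(rows)  # [('c1', 2), ('c2', 1)]
```

The group-by here is the SQL spelling of the map/shuffle/reduce pattern: map emits (customer, 1), shuffle groups by customer, reduce sums the counts.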
Introduction to Pig
• Another map reduce based interface to process
data in HDFS
• Provides commands to load and read data from
HDFS
• No need for a pre-defined structure on the data
• No need for rigid schemas
• Handy for processing unstructured or semi-structured data
• A small demo on Pig
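A Pig script is a dataflow of steps like LOAD, FILTER, and GROUP over data with no rigid schema. The sketch below mirrors that style in plain Python on tab-separated lines where some records have fewer fields; the field positions and values are assumptions for the example, not Pig syntax.

```python
# Pig-style dataflow mirrored in Python: load semi-structured lines,
# filter, then group - no pre-defined schema required, and records may
# have different numbers of fields.

lines = ["c1\tview\thome", "c2\tclick", "c1\tview\tdeals"]

records = [line.split("\t") for line in lines]   # LOAD
views = [r for r in records if r[1] == "view"]   # FILTER by action
grouped = {}                                     # GROUP BY customer
for r in views:
    grouped.setdefault(r[0], []).append(r)

print({k: len(v) for k, v in grouped.items()})  # {'c1': 2}
```

Note that the second record has only two fields and is simply filtered out; tolerating ragged records like this is what makes the approach handy for semi-structured data.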
Introduction to Sqoop
• Map reduce based utility for copying data to and
from HDFS
• It can understand the HCatalog/Hive structure
• It can copy data from almost all traditional
RDBMS and EDW appliances, as well as NoSQL
data stores, to HDFS and vice versa
• A small demo on Sqoop
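Sqoop parallelizes a table copy by splitting it on a key column and giving each mapper a range. The sketch below imitates that split with an in-memory sqlite3 table; the table, key ranges, and two-mapper split are all illustrative assumptions, not Sqoop itself.

```python
# Imitation of Sqoop's parallel import: split the key space of a source
# table into ranges and let each 'mapper' copy one range.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"user{i}@example.com") for i in range(1, 7)])

def import_range(lo, hi):
    """One 'mapper': copy rows whose key falls in [lo, hi)."""
    return conn.execute(
        "SELECT id, email FROM customers WHERE id >= ? AND id < ?",
        (lo, hi)).fetchall()

# Two parallel 'mappers' splitting the key space 1..7.
part1, part2 = import_range(1, 4), import_range(4, 7)
print(len(part1), len(part2))  # 3 3
```

Because each range is an independent query, the copies can run concurrently on different nodes, which is how Sqoop turns a single table export into a map reduce job.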
