Big data hadoop ecosystem and nosql

Overview of Big Data, Hadoop Ecosystem and NoSQL Databases
Khanderao Kand
CTO GloMantra Inc.
Entrepreneur and Technologist
Twitter @khanderao

Big Data

The dominant trend for 2013 will, once again, be Big Data

Gartner reports it as a must-have technology for “competitive
advantage by 2015”

In its report, Worldwide Big Data Technology and Services 2012-2015,
IDC forecasts that the market for Big Data will grow from
$3.2 billion in 2010 to $16.9 billion in 2015.

By 2016, revenue from the big data sector will approach $24
billion, reaching $48.3 billion by 2018.

The image was taken from the Atacama desert in western South America by Yuri
Beletsky (Las Campanas Observatory, Carnegie Institution for Science) on July 11, 2012.
Copyright Yuri Beletsky

Alignment…

Explosion of data from site logs, search engines, social
media…

Google published papers on MapReduce and the Google File
System, which inspired Doug Cutting, then working on Apache
Lucene and Nutch; Hadoop was born

Yahoo took it further with 1000 nodes in 2008

Possible to process very large data sets on commodity
hardware

Apache Open source

Big Data Stack

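[Figure: two charts plotted on Speed and Scale axes.
Analytics tools: Matlab, SAS, SPSS, R, SciPy, Mahout.
Data systems: kdb, Esper, S4, MySQL, MongoDB, HBase, Hadoop.
Other label in the figure: Patents.]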

Big Data Architecture
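[Figure: Big Data architecture diagram. Components shown:
Top: Analytics, Products, Apps; BI Tools, Dev, Visualization.
Search: Lucene, Nutch, SOLR.
Processing and storage: Hadoop MapReduce, Hadoop-based NoSQL, NoSQL, RDBMS.
Supporting services: Data ETL & Integration, Workflow Scheduler, System Admin & Monitoring.
Data sources: Unstructured Data; Structured Data (RDBMS, Datalogs, Streams).]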

HDFS
Large Data Set
Write Once – Read Many
Fault Tolerant
Distributed File System

Name Node – Data Node
Fixed Size Data Blocks
Checksum
Files – Sequence of blocks

Replicated over Balanced Cluster
Heartbeat Report from Nodes

[Figure: HDFS architecture diagram with labels Client 1, Client 2, NameNode, Read, Write, Rack 1 ... Rack N, Replication.]
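
The block, checksum and replication ideas on this slide can be illustrated with a small Python sketch. It is a toy model, not HDFS code: the block size, the replication factor, the node names and the round-robin placement are all assumptions chosen for readability, and real HDFS placement is rack-aware.

```python
import zlib

BLOCK_SIZE = 8    # bytes; assumed for the sketch, real HDFS blocks are far larger
REPLICATION = 3   # assumed replication factor
DATA_NODES = ["rack1-n1", "rack1-n2", "rack2-n1", "rack2-n2"]  # hypothetical names


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(block_index: int, nodes=DATA_NODES, replication=REPLICATION):
    """Round-robin placement, a simplification of rack-aware placement."""
    return [nodes[(block_index + r) % len(nodes)] for r in range(replication)]


def build_block_map(data: bytes):
    """Toy stand-in for NameNode metadata: one entry per block of the file."""
    block_map = []
    for index, block in enumerate(split_into_blocks(data)):
        block_map.append({
            "index": index,
            "length": len(block),
            "checksum": zlib.crc32(block),
            "replicas": place_replicas(index),
        })
    return block_map


def verify_block(block: bytes, expected_checksum: int) -> bool:
    """A reader recomputes the checksum to detect a corrupted replica."""
    return zlib.crc32(block) == expected_checksum
```

A file is therefore just a sequence of block entries, and losing one node still leaves other replicas of each block it held.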

Map Reduce

• Two-step approach, Map and Reduce, to solving a problem
• Move the code to the data
• Map step processes data on the nodes
• Reduce step aggregates results from all Map nodes with a reduce algorithm
• JobTracker distributes and tracks tasks
• TaskTrackers on processing nodes communicate task status to the JobTracker
• Inspired by functional programming
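
The two steps can be sketched in plain Python with the classic word count. This is only a single-process illustration of the programming model; the function names are made up for the sketch, and the JobTracker and TaskTracker scheduling described above is not modeled.

```python
from collections import defaultdict


def map_step(line: str):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key: str, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)


def word_count(lines):
    intermediate = []
    for line in lines:  # on a cluster, each line could be mapped on a different node
        intermediate.extend(map_step(line))
    grouped = shuffle(intermediate)
    return dict(reduce_step(key, values) for key, values in grouped.items())
```

Because map_step looks at one line at a time and reduce_step at one key at a time, both can run in parallel on the nodes that hold the data.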

Hadoop Ecosystem

[Figure: Hadoop ecosystem layers]
Top: BI, Analytics, Apps, RDBMS
Workflow Orchestration: Chukwa, Oozie, Flume
Data Access: Avro, Pig, Hive, Sqoop, HBase
Processing: MapReduce
Storage: HDFS
Also shown: HCatalog; ZooKeeper; Security, Recovery, Infra; Network; Nagios, Ganglia

Apache Hive

SQL-like HiveQL

Warehousing Apps

Compiles to MapReduce Tasks

Used by Facebook, Netflix, etc.
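
To give a feel for what "compiles to MapReduce tasks" means, the sketch below runs a SQL GROUP BY query and then computes the same answer with a hand-written map and reduce. It does not use Hive: Python's built-in sqlite3 stands in for a SQL-like engine, and the table name, columns and rows are hypothetical.

```python
import sqlite3
from collections import defaultdict

ROWS = [("home", 3), ("search", 5), ("home", 4)]  # hypothetical (page, hits) rows

QUERY = "SELECT page, SUM(hits) FROM page_views GROUP BY page"


def run_sql(rows):
    """Declarative version: the engine decides how to execute the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    result = dict(conn.execute(QUERY).fetchall())
    conn.close()
    return result


def run_as_map_reduce(rows):
    """The same query expressed as a map step and a reduce step."""
    grouped = defaultdict(list)
    for page, hits in rows:            # map: emit (page, hits)
        grouped[page].append(hits)     # shuffle: group by the GROUP BY key
    return {page: sum(values) for page, values in grouped.items()}  # reduce: SUM
```

Both functions return the same totals per page, which is the translation a SQL-on-Hadoop layer performs on the user's behalf.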

Apache Pig Latin
Higher Level scripting above Map Reduce

Procedural (unlike SQL) but easy like SQL

Constructs like FOREACH, GROUP

Supports User Defined Functions

From Yahoo

Good for integrating and writing Hadoop jobs
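
The procedural, step-by-step style can be sketched in Python. This is not Pig Latin and uses no Pig API; each statement loosely mirrors a FOREACH or GROUP step, and the normalize function plays the role of a user defined function. The record layout (domain, size) is hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def normalize(domain: str) -> str:
    """Plays the role of a user defined function applied to each record."""
    return domain.strip().lower()


def run_pipeline(records):
    # Step 1, like FOREACH ... GENERATE: transform every record with the UDF.
    cleaned = [(normalize(domain), size) for domain, size in records]
    # Step 2, like GROUP ... BY: bring records with the same key together.
    cleaned.sort(key=itemgetter(0))
    grouped = groupby(cleaned, key=itemgetter(0))
    # Step 3, like FOREACH over the groups: aggregate each group.
    return [(domain, sum(size for _, size in rows)) for domain, rows in grouped]
```

Each step names its intermediate result, which is the main difference from a single declarative SQL statement.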

Sqoop
Data Bulk Load

Data Import Export

RDBMS and NoSQL

HDFS, HBase

Data sliced

Slices transferred via map-only jobs
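
The slicing idea can be sketched as follows: a key range is divided into contiguous slices, and each slice is copied by an independent map-only task with no reduce step. This is an illustration in Python, not Sqoop code; the function names and the assumption of an integer id column are hypothetical.

```python
def split_ranges(min_id: int, max_id: int, num_slices: int):
    """Divide the inclusive range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_slices)
    ranges, start = [], min_id
    for i in range(num_slices):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def import_slice(table, low: int, high: int):
    """One map-only task: copy only the rows that fall in its own slice."""
    return [row for row in table if low <= row["id"] <= high]


def bulk_import(table, num_slices: int):
    """Run every slice (sequentially here) and collect the copied rows."""
    ids = [row["id"] for row in table]
    copied = []
    for low, high in split_ranges(min(ids), max(ids), num_slices):
        copied.extend(import_slice(table, low, high))
    return copied
```

Since the slices do not overlap, the tasks need no coordination and nothing has to be merged afterwards.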

Chukwa & Flume

Hadoop Subproject

Large scale log processing

On MapReduce

Collection and analysis

Batch Oriented

Components:
Agents
Collectors
MR Jobs for Parsing & Archiving
HICC: Hadoop Infrastructure Care Center web app

Big "Fast" Data
Real-time ad hoc query:

Once again inspired by Google: Percolator and Dremel

Cloudera Impala
SQL-like queries on HDFS
Lower latency
Bypasses MapReduce

Apache Drill

NoSQL Databases
Document Databases: MongoDB, CouchDB

Column Databases: Cassandra, HBase

KV Pair:

Graph DB: Neo4j

MongoDB
Document Oriented

Flexible: no fixed schema

Distributed – sharding based on different policies

Fault tolerant via replication

Easy to install and use

JSON / BSON format storage

JavaScript-based queries

Drivers for Java, Python, and other languages

Open source, supported by 10gen

Fast Read
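
Two of the points above, schema flexibility and policy-based sharding, can be sketched in plain Python. No MongoDB driver or server is used; the documents, the shard count, the range boundaries and both routing functions are hypothetical and do not reproduce MongoDB's actual partitioning algorithm.

```python
import hashlib

NUM_SHARDS = 3  # hypothetical cluster size

# Documents in one collection need not share the same fields.
DOCUMENTS = [
    {"_id": "alice", "email": "alice@example.com"},
    {"_id": "mallory", "tags": ["admin"], "logins": 42},
    {"_id": "zed", "address": {"city": "Pune"}},
]


def hashed_shard(shard_key_value, num_shards: int = NUM_SHARDS) -> int:
    """Hash policy: spreads keys evenly, but neighbouring keys are scattered."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def ranged_shard(shard_key_value: str, boundaries=("h", "q")) -> int:
    """Range policy: keys below 'h' go to shard 0, below 'q' to 1, the rest to 2."""
    for shard, upper in enumerate(boundaries):
        if shard_key_value < upper:
            return shard
    return len(boundaries)


def route(documents, policy):
    """Assign each document to a shard using the chosen policy on its _id."""
    return {doc["_id"]: policy(doc["_id"]) for doc in documents}
```

A range policy keeps neighbouring keys together, which helps range queries, while a hash policy trades that for a more even spread of writes.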

CouchDB
Document Oriented
JSON format
HTTP/REST interface
MapReduce views in JavaScript
Replication support
Multi-version concurrency control (MVCC)
Written in Erlang
Fast Write – Read
Good Availability
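
Multi-version concurrency control can be illustrated with a tiny in-memory store: a writer must present the revision it last read, and a stale revision is rejected instead of silently overwriting newer data. This is a sketch, not CouchDB's implementation; the class and the integer revisions are simplifications made up for the example.

```python
class ConflictError(Exception):
    """Raised when an update is based on an outdated revision."""


class TinyMvccStore:
    def __init__(self):
        self._docs = {}  # doc_id -> (revision number, body)

    def get(self, doc_id):
        """Return the current revision and a copy of the document body."""
        revision, body = self._docs[doc_id]
        return revision, dict(body)

    def put(self, doc_id, body, rev=None):
        """Store a new version; rev must match the current revision, if any."""
        current = self._docs.get(doc_id)
        current_rev = current[0] if current else None
        if rev != current_rev:
            raise ConflictError(
                "document %r is at revision %r, not %r" % (doc_id, current_rev, rev)
            )
        new_rev = (current_rev or 0) + 1
        self._docs[doc_id] = (new_rev, dict(body))
        return new_rev
```

Readers never block writers in this scheme; a writer that loses the race simply re-reads the document and retries.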

Apache Cassandra
Based on Amazon Dynamo

Column oriented

Theoretically infinite columns

Columns as tuples (name, value, timestamp)

Organized as column families

Not Hadoop based (unlike HBase)

Equal nodes, easier to configure and manage

Parallel writes

Used by Netflix, etc.
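
The role of the timestamp in each column tuple can be sketched in Python: when two writes to the same column meet, the one with the newer timestamp wins. This is a simplified model with hypothetical names and uses no Cassandra driver.

```python
from typing import Dict, NamedTuple


class Column(NamedTuple):
    name: str
    value: str
    timestamp: int


Row = Dict[str, Column]  # a row holds any number of columns, keyed by name


def write_column(row: Row, column: Column) -> None:
    """Keep the incoming column only if it is newer than what the row holds."""
    existing = row.get(column.name)
    if existing is None or column.timestamp > existing.timestamp:
        row[column.name] = column


def merge_replicas(first: Row, second: Row) -> Row:
    """Reconcile two copies of a row written on different nodes."""
    merged: Row = dict(first)
    for column in second.values():
        write_column(merged, column)
    return merged
```

Because every node applies the same rule, writes can be accepted in parallel on equal nodes and reconciled later.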

Apache HBase
Modeled after Google Bigtable

Column Oriented

Columns of a column family are stored together, as opposed to all columns of a row

Table schema predefined with column families

However, columns can be added at runtime

Fault Tolerant

Runs on HDFS

Integrates with MapReduce

Interface via REST, Avro, Thrift

Used by Facebook's messaging platform
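
The column-family layout can be sketched as one separate store per family, with the families fixed when the table is created and column qualifiers added freely at write time. This is a toy in-memory model in Python, not the HBase client API; the class and method names are hypothetical.

```python
class TinyColumnFamilyTable:
    def __init__(self, families):
        # Families are declared up front; each one gets its own store.
        self._stores = {family: {} for family in families}

    def put(self, row_key, family, qualifier, value):
        """Write one cell; the qualifier does not need to exist beforehand."""
        if family not in self._stores:
            raise KeyError("unknown column family: %s" % family)
        self._stores[family][(row_key, qualifier)] = value

    def get_row(self, row_key):
        """Assemble a row by visiting every family store."""
        row = {}
        for family, store in self._stores.items():
            for (key, qualifier), value in store.items():
                if key == row_key:
                    row["%s:%s" % (family, qualifier)] = value
        return row

    def scan_family(self, family):
        """Read one family for all rows without touching the other families."""
        return sorted(self._stores[family].items())
```

A query that needs only one family reads only that store, which is the benefit of keeping a family's columns together.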
