Harnessing Hadoop for Big Data Analytics v0.1

Transforming Mobile Marketing & Advertising™

Harnessing Hadoop for Big Data Analytics

Jobin Wilson
jobin.wilson@flytxt.com

Confidential
Copyright © 2010 Flytxt B.V. All rights reserved.

Who am I?

• Architect @ Flytxt (Big Data Analytics & Automation)

• Passionate about data, distributed computing, and machine learning

• Previously

• Virtualization & Cloud Lifecycle Management (BMC)

• Designed and implemented the Cloud Life Cycle Management Interface for BMC

• Large Scale Data Centre Automation (AOL)

• Implemented a Centralized Data Center Management Framework for AOL

• Workflow Systems & Automation (Accenture)

• Implemented a Service Management Suite for various customers


Session Agenda!

• Data – What's the big deal?

• What is Hadoop (& what it is not)

• Map-Reduce Model & HDFS

• Hadoop Ecosystem & Tools

• Let's get started!

• Q&A


Five computers & a 640k ;-)

"I think there is a world market for about five computers"
Thomas Watson, 1943, Chairman of the board of IBM

"640k ought to be enough for anybody"
Attributed to Bill Gates in 1981.

Moore's Law


Data Explosion!


Do I also know what you might do next summer?

• Does your travel company know you visited Goa &
Cochin twice in the last two years?

• Collaborative Filtering

• Lots of Data + Statistics = WOW!!!

• BTW, don't worry about the equation
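
As a rough illustration of the collaborative filtering idea above, here is a minimal sketch in Python. It assumes a simple item co-occurrence approach (one of several ways to do collaborative filtering). The traveller names, destinations, and the helper functions `co_occurrence` and `recommend` are all hypothetical and exist only for this example.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical travel histories (illustrative data, not from the talk).
trips = {
    "asha": {"Goa", "Cochin", "Munnar"},
    "ravi": {"Goa", "Cochin", "Ooty"},
    "meera": {"Goa", "Munnar"},
    "you": {"Goa", "Cochin"},
}


def co_occurrence(histories):
    """Count how often two destinations appear in the same traveller's history."""
    counts = defaultdict(Counter)
    for places in histories.values():
        for a, b in permutations(sorted(places), 2):
            counts[a][b] += 1
    return counts


def recommend(user, histories, top_n=2):
    """Score destinations the user has not visited by co-occurrence with visited ones."""
    counts = co_occurrence(histories)
    seen = histories[user]
    scores = Counter()
    for place in seen:
        for other, n in counts[place].items():
            if other not in seen:
                scores[other] += n
    # Sort by score (descending), then by name, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [place for place, _ in ranked[:top_n]]


print(recommend("you", trips))  # ['Munnar', 'Ooty']
```

With only four travellers the "statistics" are trivial; the point of the slide is that the same counting becomes useful, and expensive, once the histories number in the millions.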


Don't throw away data just because it doesn't 'fit'

• Relational tuples, log files, semi-structured textual data (e.g., e-mail), pictures, videos

• User generated data & System generated data

• Applications need more than structured data

• My application is not “Dumb” any more!!

• “I keep saying that the sexy job in the next 10 years will be
statisticians, and I’m not kidding.” - Hal Varian (Google’s chief economist)


Let's get to business!!

What is Apache Hadoop?

• Apache Hadoop is an open-source system to
reliably store and process extremely large data sets
across many commodity computers.

• Originally developed to support the Nutch search engine
project

• Scales linearly with data size or analysis complexity

• Scale-out, shared-nothing architecture

• Inspired by Google's MapReduce and Google File
System (GFS) papers


Basics of Hadoop

• Two Core Components – HDFS & Map-Reduce

• Machines are unreliable

• Separates distributed fault-tolerant computing code from application
logic.

• No need to worry about identity of a machine

• Lets you interact with a cluster, not a bunch of machines

• Analysis workloads span multiple machines

• Runs as a cloud (cluster) & possibly on a cloud (EC2)


Lead Actors

• Name Node – Bookkeeping metadata server

• Secondary Name Node – Assistant to the Name Node (periodically checkpoints its metadata; not a hot standby)

• Job Tracker – Scheduler

• Task Tracker - Task execution

• Data Node - Block storage


HDFS Write Model
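
A minimal sketch of the idea behind the write path, in Python. On a write, the client asks the Name Node where to put each block, then streams the block to the first Data Node, which forwards it to the next replica, and so on. The sketch below only models the bookkeeping side: splitting a file into blocks and choosing replica targets. It assumes a 64 MB block size and a replication factor of 3 (common defaults in early Hadoop releases), and it uses plain round-robin placement, whereas real HDFS placement is rack-aware. The function names and Data Node names are hypothetical.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size: 64 MB
REPLICATION = 3                # assumed replication factor


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block a file of file_size is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Pick `replication` distinct Data Nodes per block (round-robin, NOT rack-aware)."""
    if replication > len(data_nodes):
        raise ValueError("not enough Data Nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            data_nodes[(block_id + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


if __name__ == "__main__":
    blocks = split_into_blocks(200 * 1024 * 1024)  # a 200 MB file
    nodes = ["dn1", "dn2", "dn3", "dn4"]           # hypothetical Data Nodes
    for block_id, targets in place_replicas(len(blocks), nodes).items():
        # The first target receives the block from the client and
        # forwards it down the pipeline to the remaining targets.
        print(block_id, blocks[block_id], " -> ".join(targets))
```

The last block of a file is usually shorter than the block size; in this example the 200 MB file becomes three 64 MB blocks and one 8 MB block.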


Map-Reduce Model
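
The model can be sketched with the classic word count, here as a single-process Python simulation rather than a real Hadoop job. The map step emits (key, value) pairs, the shuffle step groups values by key, and the reduce step folds each group into a result. The function names are hypothetical, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


print(word_count(["big data", "Big deal"]))  # {'big': 2, 'data': 1, 'deal': 1}
```

The application author writes only the map and reduce functions; in Hadoop the framework takes care of the grouping, distribution, and retries.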


Map-Reduce Execution Flow
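
A rough sketch of the flow, again as a single-process Python simulation: the input is divided into splits, each split is processed by a map task, every emitted key is routed to exactly one reducer by a partition function, and each reducer processes its keys in sorted order. Hadoop's default partitioner hashes the key modulo the number of reduce tasks; this sketch assumes CRC32 as a stand-in hash so the routing is repeatable across runs. `partition` and `run_job` are hypothetical names.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Route a key to one reducer (CRC32 is a stand-in for the real hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    """Simulate map -> partition/shuffle -> sort -> reduce for a word count."""
    # One bucket per reducer; each bucket maps key -> list of values.
    buckets = [defaultdict(list) for _ in range(num_reducers)]

    # Map side: each split would be handled by an independent map task.
    for split in splits:
        for line in split:
            for word in line.split():
                buckets[partition(word, num_reducers)][word].append(1)

    # Reduce side: each reducer sees its keys in sorted order.
    outputs = []
    for bucket in buckets:
        outputs.append([(key, sum(bucket[key])) for key in sorted(bucket)])
    return outputs


if __name__ == "__main__":
    splits = [["a b a"], ["b c"]]  # two input splits, one line each
    for reducer_id, output in enumerate(run_job(splits)):
        print("reducer", reducer_id, output)
```

Because all values for a given key land on the same reducer, each reducer can produce its part of the output without talking to the others.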


Hadoop Ecosystem
• Oozie – Open-source workflow/coordination
service to manage data processing jobs for Apache
Hadoop™ - Developed at Yahoo!

• HBase – Column-oriented store modeled on
Google's Bigtable. Holds extremely large data sets
(petabytes)

• Hive – SQL based data warehousing app with
features for analyzing very large data sets -
Developed at Facebook

• ZooKeeper – Distributed consensus engine
providing leader election, service
discovery, distributed locking / mutual exclusion

• Pig – Platform for analyzing large data sets that
consists of a high-level language for expressing
data analysis steps

• Ganglia – A scalable distributed monitoring system
for high-performance computing systems such as
clusters and grids

Hadoop is not a “Holy Grail”

• Not a substitute for a database

• MapReduce is not always the best programming model for a problem

• HDFS is not a substitute for a
high-availability SAN-hosted file system

• HDFS is not a POSIX file system

• Not a place to learn Java programming

• Not a place to learn Unix/Linux system administration

• Not a place to learn basics of networking


Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)

• A9.com
• AOL
• eHarmony
• eBay
• Facebook
• Fox Interactive Media
• IBM
• Last.fm
• LinkedIn
• Meebo
• Metaweb
• The New York Times
• Rackspace
• StumbleUpon
• Twitter
• Yahoo
• Amazon


Q&A

www.flytxt.com

THANK YOU
Contact us: dev2dev@flytxt.com / jobin.wilson@flytxt.com
