Leveraging open source for big data stack

Leveraging Open Source Big Data Stack
Prasanth M Sasidharan

Copyright © 2011 Flytxt B.V. All rights reserved 1/16/2012

What is data?
 Data is information in raw or unorganized form, such as letters,
numbers, or symbols

What is Big data?
 Big Data refers to large datasets which are difficult to store, manage and
analyze

 Every day, we create 2.5 quintillion bytes of data, so much that 90% of the
data in the world today has been created in the last two years alone.


Global Data Trends


Big Data & Distributed Computing
 Multiple servers, each working on part of the job and each performing the same task.
 Key Challenges:
• Work distribution and orchestration
• Error recovery
• Scalability and management
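The idea on this slide can be sketched on a single machine. This is a minimal illustration, not how Hadoop implements it: threads stand in for servers, and the names `split`, `count_words`, `run_with_retry`, and `distribute` are hypothetical helpers invented for this example. Each worker runs the same task on its own part of the data (work distribution), and a failed part is re-run a few times (error recovery).

```python
from concurrent.futures import ThreadPoolExecutor


def split(items, parts):
    """Split items into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def count_words(lines):
    """The task every worker runs on its own chunk."""
    return sum(len(line.split()) for line in lines)


def run_with_retry(task, chunk, attempts=3):
    """Error recovery: re-run a failed chunk up to `attempts` times."""
    for attempt in range(attempts):
        try:
            return task(chunk)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt


def distribute(task, items, workers=4):
    """Work distribution: one chunk per worker, then combine partial results."""
    chunks = split(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda c: run_with_retry(task, c), chunks))
    return sum(partials)


lines = ["big data is big", "open source stack", "hadoop hive pig", "", "one"]
total = distribute(count_words, lines, workers=3)
print(total)
```

On a real cluster the chunks would live on different machines and the orchestration, retries, and scaling would be handled by the framework rather than by hand-written helpers like these.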


FOSS in Aadhaar
 Aadhaar is a 12-digit unique number that the Unique Identification Authority
of India (UIDAI) will issue to all residents of India

 The number will be stored in a centralized database and linked to the basic
demographics and biometric information – photograph, ten fingerprints and iris
– of each individual.

 It is unique and robust enough to eliminate the large number of duplicate and
fake identities in government and private databases


Let's Meet a Stack!

Application Layer

Infrastructure Layer


Infrastructure for Big Data Analysis
 What’s Virtualization?
Virtualization allows multiple operating system instances to
run concurrently on a single computer; it is a means of separating
hardware from a single operating system.


 What’s a Hypervisor?
◦ A hypervisor, also called a virtual machine manager (VMM), is one of many hardware
virtualization techniques that allow multiple operating systems, termed guests, to
run concurrently on a host computer

◦ Originally developed in the 1960s for the IBM System/360

Xen® hypervisor


Advantages of FOSS

 Flexibility and Freedom

 Reliability

 Auditability

 Fast Deployment

 Cost


Cost for Reproducing YouTube

                               Capital Expenditures ($M)      Annual Expenses, ex. HW Support ($M)
System                         Hardware  Software  Total      Staff   Support  Total
Oracle Exadata                 $147.4    $442.0    $589.4     $1.6    $97.4    $99.0
Alternative: open source,      $104.2    $0.0      $104.2     $2.2    $12.9    $15.1
commodity hardware
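The arithmetic behind the cost comparison above can be checked with a short sketch. The figures are copied from the slide. The three-year horizon and the helper names `capital`, `annual`, and `cost_over_years` are assumptions made for illustration, not part of the original comparison.

```python
# All figures in $M, copied from the cost comparison above.
systems = {
    "Oracle Exadata": {
        "hardware": 147.4, "software": 442.0, "staff": 1.6, "support": 97.4,
    },
    "Open source, commodity hardware": {
        "hardware": 104.2, "software": 0.0, "staff": 2.2, "support": 12.9,
    },
}


def capital(s):
    """One-time capital expenditure: hardware plus software."""
    return round(s["hardware"] + s["software"], 1)


def annual(s):
    """Recurring yearly expense: staff plus support."""
    return round(s["staff"] + s["support"], 1)


def cost_over_years(s, years):
    """Capital plus `years` of annual expenses (the horizon is an assumption)."""
    return round(capital(s) + years * annual(s), 1)


for name, s in systems.items():
    print(f"{name}: capital {capital(s)}, annual {annual(s)}, "
          f"3-year total {cost_over_years(s, 3)}")
```

The totals reproduce the slide's own Total columns. The gap comes almost entirely from software licences and vendor support, while the open-source option carries a somewhat higher staff cost.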


Get Involved!
 Find out about Apache projects (http://projects.apache.org/)
 Join mailing lists
 Pick up a bug
 Suggest ideas or fixes
 Check out the latest code or download a release
 Change the source files to incorporate your change or addition
 Provide appropriate source code documentation and follow the project's
coding conventions
 Check whether the software still compiles and runs correctly
 Run any unit or regression tests the software may have
 Send the patch for review and committing
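The last step in the list above, producing a patch, can be sketched as follows. Projects normally generate patches with their version control tool. This example uses Python's standard `difflib` module only to show what a unified diff contains, and the file name `greet.py` and its contents are hypothetical.

```python
import difflib

# Hypothetical source file before and after a contributor's change.
original = [
    "def greet(name):\n",
    "    print('Hello ' + name)\n",
]
modified = [
    "def greet(name):\n",
    '    """Print a greeting for name."""\n',
    "    print('Hello, ' + name)\n",
]

# unified_diff yields the patch line by line; join the lines into one string.
patch = "".join(
    difflib.unified_diff(
        original, modified, fromfile="a/greet.py", tofile="b/greet.py"
    )
)
print(patch)
```

Lines starting with `-` are removed and lines starting with `+` are added, which is the form reviewers on a project mailing list or issue tracker expect to see.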


Notable Users of Hadoop
(Source: http://en.wikipedia.org/wiki/Hadoop)

• Adobe
• Amazon
• AOL
• eBay
• Facebook
• Fox Interactive Media
• IBM
• Last.fm
• LinkedIn
• Meebo
• The New York Times
• Rackspace
• StumbleUpon
• Twitter
• Yahoo

References
• Hadoop: The Definitive Guide-MapReduce for the Cloud

• HBase: The Definitive Guide

• Hive Wiki (http://wiki.apache.org/hadoop/Hive)

• Pig Wiki (http://wiki.apache.org/pig/)


Open Source Initiatives @ Flytxt
 Customization specific to our business lines

 Mahout Enhancements for additional Machine Learning Algorithms

 Hive Customization

 Oozie Enhancements

 Hadoop Enhancements

 We won the IEEE cloud computing challenge


THANK YOU


Extra Slides


Major Contributors to Hadoop


Quantity of Global Data (Exabytes)
• 2005: 130
• 2012: 2,720
• 2015*: 7,910


Numbers behind the News!!

Twitter produces over 230 million tweets per day

Wal-Mart is logging one million transactions per hour

Facebook creates over 30 billion pieces of content, including
web links, news, blogs, and photos

India's mobile subscription base stands at 873.61 million users

India has a population of 1.21 billion

Let's Meet the Big Data Stack
• Oozie – Open-source workflow/coordination service to
manage data processing jobs for Apache Hadoop™.
Developed at Yahoo!

• HBase – Column-store database based on Google’s
BigTable. Holds extremely large data sets (petabytes)

• Hive – SQL-based data warehousing application with features for
analyzing very large data sets. Developed at Facebook

• ZooKeeper – Distributed consensus engine providing
leader election, service discovery, and distributed locking /
mutual exclusion

• Pig – Platform for analyzing large data sets that consists of a
high-level language for expressing data analysis steps

• Ganglia – Scalable distributed monitoring system for
high-performance computing systems such as clusters and grids

• Apache Mahout – Free implementations of distributed or
otherwise scalable machine learning algorithms on
the Hadoop platform

