Leveraging open source for big data stack

Harnessing Hadoop for Big Data - Series III - Leveraging Open Source for Big Data Stack

Published in: Technology
  • An exabyte is 1 billion gigabytes. At 7,910 exabytes, the digital universe holds 3 times more bits of information than there are stars in the physical universe.
  • Indian telecom added 7.9 million new subscribers in September. The Indian population figure can be related to the Aadhaar project.
  • A mahout is a person who drives an elephant. Example algorithm: catching a taxi from the airport.
  • Transcript

    • 1. Leveraging Open Source Big Data Stack Prasanth M Sasidharan Copyright © 2011 Flytxt B.V. All rights reserved 1/16/2012
    • 2. What is data? Data is information in raw or unorganized form, such as letters, numbers, or symbols. What is Big Data?
      ◦ Big Data refers to large datasets that are difficult to store, manage, and analyze.
      ◦ Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
    • 3. Data Explosion!
    • 4. Global Data Trends
    • 5. Big Data & Distributed Computing
      ◦ Multiple servers, each working on part of the job, each doing the same task.
      ◦ Key challenges: work distribution and orchestration; error recovery; scalability and management.
    • 6. FOSS in Aadhaar
      ◦ Aadhaar is a 12-digit unique number that the Unique Identification Authority of India (UIDAI) will issue to all residents of India.
      ◦ The number will be stored in a centralized database and linked to each individual's basic demographic and biometric information: photograph, ten fingerprints, and iris.
      ◦ It is unique and robust enough to eliminate the large number of duplicate and fake identities in government and private databases.
    • 7. Let's Meet a Stack! Application Layer, Infrastructure Layer
    • 8. Infrastructure for Big Data Analysis
      ◦ What's virtualization? Virtualization allows multiple operating system instances to run concurrently on a single computer; it is a means of separating hardware from a single operating system.
    • 9. What's a Hypervisor?
      ◦ A hypervisor, also called a virtual machine manager (VMM), is one of many hardware virtualization techniques that allow multiple operating systems, termed guests, to run concurrently on a host computer.
      ◦ Originally developed in the 1970s as part of the IBM S/360.
      ◦ Xen® hypervisor
    • 10. Advantages of FOSS
      ◦ Flexibility and freedom
      ◦ Reliability
      ◦ Auditability
      ◦ Fast deployment
      ◦ Cost
    • 11. Cost for Reproducing YouTube
                                                      Capital expenditures ($M)      Annual expenses, ex HW support ($M)
      System                                          Hardware  Software  Total      Staff   Support  Total
      Oracle Exadata                                  $147.4    $442.0    $589.4     $1.6    $97.4    $99.0
      Alternative: open source, commodity hardware    $104.2    $0.0      $104.2     $2.2    $12.9    $15.1
    • 12. Get Involved!
      ◦ Find out about Apache projects (http://projects.apache.org/)
      ◦ Join mailing lists
      ◦ Pick up a bug
      ◦ Suggest ideas or fixes
      ◦ Check out the latest code or download releases
      ◦ Change the source files to incorporate your change or addition
      ◦ Provide appropriate source code documentation and follow the project's coding conventions
      ◦ Check whether the software still compiles and runs correctly
      ◦ Run any unit or regression tests the software may have
      ◦ Send the patch for review and committing
    • 13. Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)
      ◦ Adobe, Amazon, AOL, eBay, Facebook, Fox Interactive Media, IBM, Last.fm, LinkedIn, Meebo, The New York Times, Rackspace, StumbleUpon, Twitter, Yahoo
      References
      ◦ Hadoop: The Definitive Guide
      ◦ HBase: The Definitive Guide
      ◦ Hive Wiki (http://wiki.apache.org/hadoop/Hive)
      ◦ Pig Wiki (http://wiki.apache.org/pig/)
    • 14. Open Source Initiatives @ FlyTXT
      ◦ Customization specific to our business lines
      ◦ Mahout enhancements for additional machine learning algorithms
      ◦ Hive customization
      ◦ Oozie enhancements
      ◦ Hadoop enhancements
      ◦ We won the IEEE cloud computing challenge
    • 15. THANK YOU
    • 16. Extra Slides
    • 17. Major Contributors to Hadoop…
    • 19. Quantity of Global Data (exabytes): 130 in 2005; 2,720 in 2012; 7,910 in 2015*
    • 20. Numbers Behind the News!
      ◦ Twitter produces over 230 million tweets per day
      ◦ Wal-Mart logs one million transactions per hour
      ◦ Facebook creates over 30 billion pieces of content, ranging from web links, news, and blogs to photos
      ◦ India's mobile subscription base stands at 873.61 million users
      ◦ India has a population of 1.21 billion
    • 21. Let's Meet the Big Data Stack
      ◦ Oozie: open-source workflow/coordination service to manage data processing jobs for Apache Hadoop™. Developed at Yahoo!
      ◦ HBase: column-store database based on Google's BigTable. Holds extremely large data sets (petabytes).
      ◦ Hive: SQL-based data warehousing app with features for analyzing very large data sets. Developed at Facebook.
      ◦ ZooKeeper: distributed consensus engine providing leader election, service discovery, and distributed locking / mutual exclusion.
      ◦ Pig: platform for analyzing large data sets that consists of a high-level language for expressing data analysis steps.
      ◦ Ganglia: a scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
      ◦ Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
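Slide 5 describes distributed computing as multiple servers each doing the same task on part of a job, with work distribution and error recovery as key challenges. A minimal single-machine sketch of that idea follows, using Python threads as stand-ins for servers. It is not taken from the deck and is not how Hadoop is implemented; the names `count_words`, `split`, and `run_job` and the retry policy are assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed


def count_words(chunk):
    """The 'same task' that every worker runs on its own part of the job."""
    return Counter(word.lower() for line in chunk for word in line.split())


def split(lines, parts):
    """Work distribution: deal the lines into `parts` roughly equal chunks."""
    return [lines[i::parts] for i in range(parts)]


def run_job(lines, workers=4, max_retries=2, task=count_words):
    """Run `task` on each chunk in parallel and merge the partial counts.

    Error recovery (hypothetical policy): a chunk whose task raises is
    resubmitted up to `max_retries` times before the whole job fails.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each in-flight future to its chunk and number of attempts so far.
        pending = {pool.submit(task, c): (c, 0) for c in split(lines, workers)}
        while pending:
            for future in as_completed(list(pending)):
                chunk, attempts = pending.pop(future)
                try:
                    total.update(future.result())  # merge a partial result
                except Exception:
                    if attempts >= max_retries:
                        raise
                    retry = pool.submit(task, chunk)
                    pending[retry] = (chunk, attempts + 1)
    return total


if __name__ == "__main__":
    lines = ["big data big", "open source", "data stack"]
    print(run_job(lines, workers=2)["big"])  # 2
```

In a real cluster the chunks would live on different machines and the scheduler would also have to handle lost nodes and data locality, which is the orchestration work the stack on slide 21 takes over.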