Leveraging open source for big data stack

Harnessing Hadoop for Big Data - Series III - Leveraging Open Source for Big Data Stack

Published in: Technology

  • An exabyte is 1 billion gigabytes. At 7,910 exabytes, the digital universe holds about 3 times as many bits of information as there are stars in the physical universe.
  • India's telecom sector added 7.9 million new subscribers in September. The Indian population figure can be related to the Aadhaar project.
  • A mahout is a person who drives an elephant. Example algorithm: catching a taxi from the airport.

    1. 1. Leveraging Open Source Big Data Stack Prasanth M Sasidharan Copyright © 2011 Flytxt B.V. All rights reserved 1/16/2012
    2. 2. What is data?
        • Data is information in raw or unorganized form, such as letters, numbers, or symbols.
       What is Big Data?
        • Big Data refers to large datasets that are difficult to store, manage, and analyze.
        • Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone.
    3. 3. Data Explosion!
    4. 4. Global Data Trends
    5. 5. Big Data & Distributed Computing
        • Multiple servers each work on one part of a job, and each performs the same task.
        • Key challenges:
            ◦ Work distribution and orchestration
            ◦ Error recovery
            ◦ Scalability and management
    6. 6. FOSS in Aadhaar
        • Aadhaar is a 12-digit unique number that the Unique Identification Authority of India (UIDAI) will issue to all residents of India.
        • The number will be stored in a centralized database and linked to each individual's basic demographic and biometric information: photograph, ten fingerprints, and iris.
        • It is unique and robust enough to eliminate the large number of duplicate and fake identities in government and private databases.
    7. 7. Let's Meet a Stack!
        • Application Layer
        • Infrastructure Layer
    8. 8. Infrastructure for Big Data Analysis
        • What's virtualization? Virtualization allows multiple operating system instances to run concurrently on a single computer; it is a means of separating hardware from a single operating system.
    9. 9. What's a Hypervisor?
        ◦ A hypervisor, also called a virtual machine manager (VMM), is one of many hardware virtualization techniques that allow multiple operating systems, termed guests, to run concurrently on a host computer.
        ◦ Originally developed in the 1970s as part of the IBM S/360.
        ◦ Example: the Xen® hypervisor.
    10. 10. Advantages of FOSS
        • Flexibility and freedom
        • Reliability
        • Auditability
        • Fast deployment
        • Cost
    11. 11. Cost for Reproducing YouTube
        Capital expenditures ($M):
          System                                        | Hardware | Software | Total
          Oracle Exadata                                | $147.4   | $442.0   | $589.4
          Alternative: open source, commodity hardware  | $104.2   | $0.0     | $104.2
        Annual expenses, excluding hardware support ($M):
          System                                        | Staff    | Support  | Total
          Oracle Exadata                                | $1.6     | $97.4    | $99.0
          Alternative: open source, commodity hardware  | $2.2     | $12.9    | $15.1
    12. 12. Get Involved!
        • Find out about Apache projects (http://projects.apache.org/)
        • Join mailing lists
        • Pick up a bug
        • Suggest ideas or fixes
        • Check out the latest code or download releases
        • Change the source files to incorporate your change or addition
        • Provide appropriate source code documentation and follow the project's coding conventions
        • Check whether the software still compiles and runs correctly
        • Run any unit or regression tests the software may have
        • Send the patch for review and committing
    13. 13. Notable Users of Hadoop (Source: http://en.wikipedia.org/wiki/Hadoop)
        • Adobe • Meebo • Amazon • The New York Times • AOL • Rackspace • eBay • StumbleUpon • Facebook • Twitter • Fox Interactive Media • Yahoo • IBM • Last.fm • LinkedIn
        References
        • Hadoop: The Definitive Guide-MapReduce for the Cloud
        • HBase: The Definitive Guide
        • Hive Wiki (http://wiki.apache.org/hadoop/Hive)
        • Pig Wiki (http://wiki.apache.org/pig/)
    14. 14. Open Source Initiatives @ FlyTXT
        • Customization specific to our business lines
        • Mahout enhancements for additional machine learning algorithms
        • Hive customization
        • Oozie enhancements
        • Hadoop enhancements
        • We won the IEEE cloud computing challenge
    15. 15. THANK YOU
    16. 16. Extra Slides
    17. 17. Major Contributors to Hadoop
    19. 19. Quantity of Global Data (exabytes)
          Year   | Exabytes
          2005   | 130
          2012   | 2,720
          2015*  | 7,910
    20. 20. Numbers Behind the News!
        • Twitter produces over 230 million tweets per day.
        • Wal-Mart logs one million transactions per hour.
        • Facebook creates over 30 billion pieces of content, ranging from web links and news to blogs and photos.
        • India's mobile subscription base stands at 873.61 million users.
        • India has a population of 1.21 billion.
    21. 21. Let's Meet the Big Data Stack
        • Oozie: open-source workflow/coordination service for managing data processing jobs on Apache Hadoop™. Developed at Yahoo!
        • HBase: column-store database based on Google's BigTable. Holds extremely large data sets (petabytes).
        • Hive: SQL-based data warehousing application with features for analyzing very large data sets. Developed at Facebook.
        • ZooKeeper: distributed consensus engine providing leader election, service discovery, and distributed locking / mutual exclusion.
        • Pig: platform for analyzing large data sets, built around a high-level language for expressing data analysis steps.
        • Ganglia: scalable distributed monitoring system for high-performance computing systems such as clusters and grids.
        • Apache Mahout: free implementations of distributed or otherwise scalable machine learning algorithms on the Hadoop platform.
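
Slide 5 describes multiple servers each working on one part of a job, and names work distribution, orchestration, and error recovery as the key challenges. The pattern can be sketched on a single machine as follows. This is only an illustration, not how Hadoop itself is implemented: threads stand in for servers, word counting stands in for the task, and every function name here (split_into_chunks, count_words, run_with_retry, run_job) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Work distribution: split the job into roughly equal parts."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_words(lines):
    """The task that every worker runs on its own part of the job."""
    return sum(len(line.split()) for line in lines)


def run_with_retry(task, chunk, max_attempts=3):
    """Error recovery: re-run a failed part instead of failing the whole job."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(chunk)
        except Exception as error:  # a real system would be more selective
            last_error = error
    raise RuntimeError("chunk failed after retries") from last_error


def run_job(items, task=count_words, n_workers=4):
    """Orchestration: fan the parts out to workers, then combine the results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda chunk: run_with_retry(task, chunk), chunks))
    return sum(partials)


lines = [
    "big data needs many servers",
    "each server does the same task",
    "on its own part of the job",
]
total_words = run_job(lines, n_workers=2)
print(total_words)
```

Scalability, the third challenge on the slide, corresponds here to raising n_workers; in a real cluster the workers would be separate machines and the retry logic would also have to handle lost nodes.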
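
The cost figures on slide 11 can be checked and extended with a few lines of arithmetic. The sketch below recomputes the totals from the slide's own numbers. The multi-year total is an assumption that is not on the slide: it supposes annual expenses stay flat, and the three-year horizon is chosen only for illustration. The function names are hypothetical.

```python
# Figures in $M, copied from slide 11.
COSTS = {
    "Oracle Exadata": {
        "hardware": 147.4, "software": 442.0, "staff": 1.6, "support": 97.4,
    },
    "Open source, commodity hardware": {
        "hardware": 104.2, "software": 0.0, "staff": 2.2, "support": 12.9,
    },
}


def capital_expenditure(system):
    """One-time cost: hardware plus software."""
    entry = COSTS[system]
    return entry["hardware"] + entry["software"]


def annual_expense(system):
    """Recurring cost per year: staff plus support."""
    entry = COSTS[system]
    return entry["staff"] + entry["support"]


def total_cost(system, years):
    """Assumption (not from the slide): annual expenses stay flat each year."""
    return capital_expenditure(system) + years * annual_expense(system)


for name in COSTS:
    print(
        f"{name}: capex {capital_expenditure(name):.1f}, "
        f"annual {annual_expense(name):.1f}, "
        f"3-year total {total_cost(name, 3):.1f} ($M)"
    )
```

On the slide's numbers, the open-source stack needs slightly more staff spending but far less software and support spending, which is where most of the gap comes from.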
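
The data volumes on slide 19 are easier to compare as growth rates. The sketch below derives the growth factor and the implied compound annual growth rate between any two of the slide's years, and converts exabytes to gigabytes using the speaker note that one exabyte is 1 billion gigabytes. The slide marks 2015 with an asterisk; treating that value as a projection is an assumption, as are the function names.

```python
GIGABYTES_PER_EXABYTE = 10**9  # from the speaker notes: 1 exabyte = 1 billion gigabytes

# Exabytes per year, copied from slide 19.
# The 2015 value carries an asterisk on the slide (assumed here to be a projection).
GLOBAL_DATA_EB = {2005: 130, 2012: 2720, 2015: 7910}


def exabytes_to_gigabytes(exabytes):
    """Convert exabytes to gigabytes."""
    return exabytes * GIGABYTES_PER_EXABYTE


def growth_factor(start_year, end_year):
    """How many times larger the data volume is at end_year than at start_year."""
    return GLOBAL_DATA_EB[end_year] / GLOBAL_DATA_EB[start_year]


def compound_annual_growth(start_year, end_year):
    """Constant yearly growth rate that would produce the same overall growth."""
    years = end_year - start_year
    return growth_factor(start_year, end_year) ** (1 / years) - 1


for start, end in [(2005, 2012), (2012, 2015)]:
    print(
        f"{start} to {end}: x{growth_factor(start, end):.1f} overall, "
        f"{compound_annual_growth(start, end):.0%} per year"
    )
```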