Apache Hadoop
 

    Apache Hadoop: Presentation Transcript

    • Apache Hadoop: Large-Scale Data Processing
      Sharath Bandaru & Sai Dinesh Koppuravuri
      Advanced Topics Presentation
      ISYE 582: Engineering Information Systems
    • Overview: Understanding Big Data; Structured/Unstructured Data; Limitations of Existing Data Analytics Architecture; Apache Hadoop; Hadoop Architecture; HDFS; MapReduce; Conclusions; References.
    • Understanding Big Data: Big Data is creating large and growing files, measured in terabytes (10^12 bytes) and petabytes (10^15 bytes), and it is largely unstructured.
    • Structured/Unstructured Data
    • Why now? Data growth from 1980 to 2013: structured data is about 20% of the total, while unstructured data is about 80% (Source: Cloudera, 2013).
    • Challenges posed by Big Data: Volume, Velocity, and Variety.
      Volume/Velocity: 400 million tweets per day on Twitter; 1 million transactions by Wal-Mart every hour, creating 2.5 petabytes of data per hour.
      Variety: videos, photos, text messages, images, audio, documents, emails, etc.
    • Limitations of Existing Data Analytics Architecture
      The typical stack: BI reports + interactive apps on an RDBMS (aggregated data), fed by an ETL compute grid from a storage-only grid holding the original raw data, collected via instrumentation.
      Problems: moving data to compute doesn't scale; the original high-fidelity raw data can't be explored; archiving = premature data death.
    • So what is Apache Hadoop?
      - A set of tools that supports running applications on big data.
      - Core Hadoop has two main systems:
        HDFS: self-healing, high-bandwidth clustered storage.
        MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction.
    • History (Source: Cloudera, 2013)
    • The Key Benefit: Agility/Flexibility
      Schema-on-Write (RDBMS):
      - The schema must be created before any data can be loaded.
      - An explicit load operation transforms the data into the database's internal structure.
      - New columns must be added explicitly before new data for those columns can be loaded.
      - Pros: reads are fast; standards/governance.
      Schema-on-Read (Hadoop):
      - Data is simply copied to the file store; no transformation is needed.
      - A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
      - New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
      - Pros: loads are fast; flexibility/agility.
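The schema-on-read idea above can be sketched in a few lines of Python. This is a toy illustration, not Hadoop code: the raw lines, the column names, and the `serde` function are all invented for the example.

```python
# Schema-on-read sketch: raw lines are stored untouched; a toy "SerDe"
# parses columns only when the data is read (late binding).
raw_store = [
    "2013-04-01,click,42",
    "2013-04-02,view,17",
]

def serde(line):
    """Toy deserializer: split a CSV line into named columns at read time."""
    date, event, count = line.split(",")
    return {"date": date, "event": event, "count": int(count)}

# No load/transform step was needed when the data was written;
# the schema is applied here, at query time.
rows = [serde(line) for line in raw_store]
total = sum(r["count"] for r in rows)
print(total)  # 59
```

If the stored format changes, only `serde` needs updating, and old and new data can be read with the new logic, which is the "appears retroactively" point from the slide.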
    • Use the Right Tool for the Right Job
      Relational databases, use when: interactive OLAP analytics (< 1 sec); multistep ACID transactions; 100% SQL compliance.
      Hadoop, use when: data is structured or not (flexibility); scalability of storage/compute; complex data processing.
    • Traditional Approach (enterprise): push all the big data through one powerful computer, which eventually hits its processing limit.
    • Hadoop Architecture: a master node running the JobTracker (MapReduce) and NameNode (HDFS), plus slave nodes each running a TaskTracker and a DataNode.
    • Job Tracker: an application submits its job to the JobTracker on the master, which schedules tasks onto the TaskTrackers on the slave nodes.
    • HDFS: Hadoop Distributed File System
      - A given file is broken into blocks (default = 64 MB), then the blocks are replicated across the cluster (default = 3 replicas).
      - Optimized for: throughput; put/get/delete; appends.
      - Block replication provides: durability; availability; throughput.
      - Block replicas are distributed across servers and racks.
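The block-and-replica arithmetic above can be sketched as follows. This is a minimal illustration of the placement idea only: the round-robin policy, node names, and `place_blocks` helper are invented for the example, while real HDFS placement is rack-aware.

```python
# Sketch of HDFS-style block placement: a file is split into fixed-size
# blocks (64 MB here, the old default) and each block is replicated
# across several DataNodes (default replication factor 3).
import itertools
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB
REPLICATION = 3

def place_blocks(file_size, datanodes):
    """Toy round-robin placement: {block_index: [nodes holding a replica]}."""
    n_blocks = math.ceil(file_size / BLOCK_SIZE)
    nodes = itertools.cycle(datanodes)
    return {b: [next(nodes) for _ in range(REPLICATION)]
            for b in range(n_blocks)}

# A 200 MB file needs ceil(200 / 64) = 4 blocks.
layout = place_blocks(200 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
print(len(layout))   # 4
print(layout[0])     # ['dn1', 'dn2', 'dn3']
```

With 4 nodes and 3 replicas, even this toy policy puts each block's replicas on distinct nodes, which is what makes the loss of any single DataNode survivable.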
    • Fault Tolerance for Data (HDFS): if a DataNode fails, the NameNode re-replicates its blocks from the surviving replicas on other slaves.
    • Fault Tolerance for Processing (MapReduce): if a TaskTracker fails, the JobTracker reschedules its tasks on another node; the JobTracker's tables are backed up.
    • MapReduce: input data flows through parallel Map tasks, then a Shuffle phase, then Reduce tasks that produce the results.
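The Map, Shuffle, Reduce flow above can be simulated in a single process with the classic word-count example. This is a sketch of the data flow only, not Hadoop API code; the input lines are invented for the example.

```python
# Minimal single-process sketch of Map -> Shuffle -> Reduce (word count).
from collections import defaultdict

def map_fn(line):
    # Map: emit a <word, 1> pair for every word in the input split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: collapse the value-list for one key into a single result.
    return word, sum(counts)

input_splits = ["big data big compute", "data scale"]

# Map phase: run the mapper over every input split.
pairs = [kv for split in input_splits for kv in map_fn(split)]

# Shuffle phase: group the intermediate pairs by key.
groups = defaultdict(list)
for word, count in pairs:
    groups[word].append(count)

# Reduce phase: one reducer call per key.
results = dict(reduce_fn(w, c) for w, c in groups.items())
print(results)  # {'big': 2, 'data': 2, 'compute': 1, 'scale': 1}
```

In real Hadoop, the map and reduce calls run on different machines and the shuffle moves data over the network, but the key-grouping logic is exactly this.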
    • Understanding the concept of MapReduce: The Story of Sam. Sam's mother gave him an apple; she believed "an apple a day keeps the doctor away."
    • Sam thought of "drinking" the apple: he used a knife to cut the apple and a blender to make juice.
    • The next day, Sam applied his invention to all the fruits he could find in the fruit basket: (map '( )') then (reduce '( )'). This is the classical notion of Map/Reduce in functional programming: a list of values is mapped into another list of values, which gets reduced into a single value.
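The classical functional-programming notion from the slide maps directly onto Python's built-in `map` and `functools.reduce`. The fruit names and the "juice" transformation are invented to mirror the story.

```python
# Classical notion: a list of values is mapped into another list,
# which is then reduced to a single value.
from functools import reduce

fruits = ["apple", "orange", "peach"]

# map: each fruit -> its "juice" (a toy transformation)
juices = list(map(lambda f: f + " juice", fruits))

# reduce: combine the list of juices into one mixed result
mix = reduce(lambda a, b: a + " + " + b, juices)
print(mix)  # apple juice + orange juice + peach juice
```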
    • 18 years later, Sam got his first job at "Tropicana" for his expertise in making juices. Now it's not just one basket but a whole container of fruits, and they produce a list of juice types separately: large data, and a list of values for output. But Sam had just ONE knife and ONE blender. NOT ENOUGH!
    • Brave Sam implemented a parallel version of his innovation:
      - Each input to a map is a list of <key, value> pairs, e.g. (<a, > , <o, > , <p, > , ...).
      - Each output of a map is a list of <key, value> pairs, e.g. (<a', > , <o', > , <p', > , ...), grouped by key.
      - Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism), e.g. <a', ( ... )>, which is reduced into a list of values.
    • Sam realized:
      - To create his favorite mixed-fruit juice, he can use a combiner after the reducers.
      - If several <key, value-list> pairs fall into the same group (based on the grouping/hashing algorithm), use the blender (reducer) separately on each of them.
      - The knife (mapper) and blender (reducer) should not contain residue after use: they must be side-effect free.
      (Source: Map Reduce, 2010)
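In Hadoop itself, a combiner is typically run on each mapper's local output before the shuffle, so less data crosses the network; because the functions are side-effect free and counting is associative, this does not change the final result. A sketch of that map-side pre-aggregation (the input text and helper names are invented for the example):

```python
# Combiner sketch: locally sum counts per key on one mapper's output
# before anything is shuffled to the reducers.
from collections import Counter

def map_fn(line):
    # Map: emit a <word, 1> pair per word.
    return [(w, 1) for w in line.split()]

def combine(pairs):
    # Combiner: same shape as a reducer, but run on a single mapper's
    # output. Side-effect free, so it can be applied zero or more times.
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = map_fn("apple orange apple apple")
print(len(mapper_output))   # 4 pairs before combining
combined = combine(mapper_output)
print(sorted(combined))     # [('apple', 3), ('orange', 1)]
```

Four intermediate pairs shrink to two before the shuffle; at scale, that reduction is the combiner's whole purpose.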
    • Conclusions
      The key benefits of Apache Hadoop:
      1) Agility/flexibility (quickest time to insight)
      2) Complex data processing (any language, any problem)
      3) Scalability of storage/compute (freedom to grow)
      4) Economical storage (keep all your data alive forever)
      The key systems of Apache Hadoop:
      1) Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
      2) MapReduce: distributed, fault-tolerant resource management coupled with scalable data processing.
    • References
      - Ekanayake, S. (2010, March). MapReduce: The Story of Sam. Retrieved April 13, 2013, from http://esaliya.blogspot.com/2010/03/mapreduce-explained-simply-as-story-of.html
      - Dean, J., & Ghemawat, S. (2004, December). MapReduce: Simplified Data Processing on Large Clusters.
      - The Apache Software Foundation. (2013, April). Hadoop. Retrieved April 19, 2013, from http://hadoop.apache.org/
      - Drost, I. (2010, February). Apache Hadoop: Large Scale Data Analysis Made Easy. Retrieved April 13, 2013, from http://www.youtube.com/watch?v=VFHqquABHB8
      - Awadallah, A. (2011, November). Introducing Apache Hadoop: The Modern Data Operating System. Retrieved April 15, 2013, from http://www.youtube.com/watch?v=d2xeNpfzsYI