
Intro to Big Data using Hadoop

Introduction to Big Data and the Apache Hadoop project, with a MapReduce visualization.


  1. Intro to Big Data using Hadoop – Sergejus Barinovas, sergejus.blogas.lt, fb.com/ITishnikai, @sergejusb
  2. Information is powerful… but it is how we use it that will define us
  3. Data Explosion: text, audio, video, images, relational (picture from Big Data Integration)
  4. Big Data (globally) – creates over 30 billion pieces of content per day – stores 30 petabytes of data – produces over 90 million tweets per day
  5. Big Data (our example) – logs over 300 gigabytes of transactions per day – stores more than 1.5 terabytes of aggregated data
  6. 4 Vs of Big Data: volume, velocity, variety, variability
  7. Big Data Challenges: sorting 10 TB takes 2.5 days on 1 node vs. 35 minutes on a 100-node cluster
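(For scale: 2.5 days is 3,600 minutes, and 3,600 / 100 = 36, so the quoted 35 minutes on 100 nodes corresponds to near-linear speedup.)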
  8. Big Data Challenges: “fat” servers imply high cost – use cheap commodity nodes instead; a large number of cheap nodes implies frequent failures – leverage automatic fault-tolerance
  9. Big Data Challenges: we need a new data-parallel programming model for clusters of commodity machines
  10. MapReduce to the rescue!
  11. MapReduce: published in 2004 by Google – “MapReduce: Simplified Data Processing on Large Clusters”; popularized by the Apache Hadoop project – used by Yahoo!, Facebook, Twitter, Amazon, …
  12. MapReduce
  13. Word Count Example (Input → Map → Shuffle & Sort → Reduce → Output): the input lines “the quick brown fox”, “the fox ate the mouse”, “how now brown cow” produce the counts the 3, brown 2, fox 2, how 1, now 1, quick 1, ate 1, mouse 1, cow 1
  14. Word Count Example (Map detail): each Map task emits a (word, 1) pair for every word in its line, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1), and Shuffle & Sort routes the pairs to the Reduce tasks
  15. Word Count Example (Reduce detail): after Shuffle & Sort each word reaches a Reduce task with its values grouped, e.g. (the, [1,1,1]), (brown, [1,1]), (fox, [1,1]), and the reducer sums them into the final counts
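To make the three slides above concrete, here is a minimal sketch of the same word count written against Hadoop's standard Java MapReduce API (org.apache.hadoop.mapreduce). The class names and the lower-casing of tokens are illustrative choices, not something shown in the deck:

      // Word count as a Hadoop Mapper/Reducer pair (sketch).
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;

      public class WordCount {

          // Map: emit (word, 1) for every word in the input line.
          public static class TokenizerMapper
                  extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable offset, Text line, Context context)
                      throws IOException, InterruptedException {
                  StringTokenizer tokens = new StringTokenizer(line.toString());
                  while (tokens.hasMoreTokens()) {
                      word.set(tokens.nextToken().toLowerCase());
                      context.write(word, ONE);              // e.g. ("the", 1)
                  }
              }
          }

          // Reduce: after Shuffle & Sort each word arrives with all of its 1s,
          // e.g. ("the", [1, 1, 1]); summing them gives the final count.
          public static class IntSumReducer
                  extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text word, Iterable<IntWritable> ones, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable one : ones) {
                      sum += one.get();
                  }
                  context.write(word, new IntWritable(sum)); // e.g. ("the", 3)
              }
          }
      }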
  16. MapReduce philosophy – hide complexity – make it scalable – make it cheap
  17. MapReduce popularized by the Apache Hadoop project
  18. Hadoop Overview: open-source implementation of the Google MapReduce paper and the Google File System (GFS) paper; first release in 2008 by Yahoo!, with wide adoption by Facebook, Twitter, Amazon, etc.
  19. Hadoop Core: MapReduce (Job Scheduling / Execution System) layered on top of the Hadoop Distributed File System (HDFS)
  20. Hadoop Core (HDFS): • Name Node stores file metadata • files split into 64 MB blocks • blocks replicated across 3 Data Nodes
  21. Hadoop Core (HDFS): layer diagram – the Name Node and Data Nodes sit in the HDFS layer, beneath MapReduce (Job Scheduling / Execution System)
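As a rough illustration of what that metadata gives you, the sketch below asks the Name Node (through the Java FileSystem API) where each block of a file lives; the path /example/count simply reuses the sample location from the Pig and Hive slides later in the deck:

      // Sketch: list each block of an HDFS file and the Data Nodes holding its replicas.
      import java.util.Arrays;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class ListBlocks {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
              FileSystem fs = FileSystem.get(conf);

              FileStatus file = fs.getFileStatus(new Path("/example/count"));
              // One BlockLocation per block; each lists the Data Nodes holding a replica
              // (3 of them with the default replication factor).
              for (BlockLocation block : fs.getFileBlockLocations(file, 0, file.getLen())) {
                  System.out.println("offset " + block.getOffset()
                          + ", length " + block.getLength()
                          + ", hosts " + Arrays.toString(block.getHosts()));
              }
          }
      }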
  22. Hadoop Core (MapReduce): • Job Tracker distributes tasks and handles failures • tasks are assigned based on data locality • Task Trackers can execute multiple tasks
  23. Hadoop Core (MapReduce): layer diagram – the Job Tracker and Task Trackers sit in the MapReduce layer, above the Name Node and Data Nodes in HDFS
  24. Hadoop Core (Job submission): diagram – the Client submits a job to the Job Tracker, which hands tasks to Task Trackers; the Name Node and Data Nodes provide the underlying data
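A sketch of the client's side of that picture: a small driver builds a Job around the WordCount mapper and reducer sketched earlier and submits it to the cluster (it uses the newer Job.getInstance call; input and output paths are illustrative):

      // Sketch: submitting the word-count job from a client.
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCountDriver {
          public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "word count");
              job.setJarByClass(WordCountDriver.class);
              job.setMapperClass(WordCount.TokenizerMapper.class);
              job.setCombinerClass(WordCount.IntSumReducer.class); // optional local pre-aggregation
              job.setReducerClass(WordCount.IntSumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path("/example/input"));
              FileOutputFormat.setOutputPath(job, new Path("/example/count"));
              // Blocks until the cluster reports the job finished (or failed).
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }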
  25. Hadoop Ecosystem: Pig (ETL), Hive (BI), Sqoop (RDBMS), Zookeeper, Avro and HBase around the two core layers – MapReduce (Job Scheduling / Execution System) and the Hadoop Distributed File System (HDFS)
  26. JavaScript MapReduce
      var map = function (key, value, context) {
          var words = value.split(/[^a-zA-Z]/);
          for (var i = 0; i < words.length; i++) {
              if (words[i] !== "") {
                  context.write(words[i].toLowerCase(), 1);
              }
          }
      };

      var reduce = function (key, values, context) {
          var sum = 0;
          while (values.hasNext()) {
              sum += parseInt(values.next());
          }
          context.write(key, sum);
      };
  27. Pig
      words = LOAD '/example/count' AS (word: chararray, count: int);
      popular_words = ORDER words BY count DESC;
      top_popular_words = LIMIT popular_words 10;
      DUMP top_popular_words;
  28. Hive
      CREATE EXTERNAL TABLE WordCount (
          word string,
          count int
      )
      ROW FORMAT DELIMITED
          FIELDS TERMINATED BY '\t'
          LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION '/example/count';

      SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
  29. Über Demo: Hadoop in the Cloud
  30. Thanks! Questions?
