Intro to Big Data using Hadoop
Introduction to Big Data and the Apache Hadoop project, with a MapReduce visualization.

  • So, which is really the “enterprise” now?
  • Volume – data exceeds the physical limits of vertical scalability. Velocity – the decision window is small compared to the rate at which the data changes. Variety – many different formats make integration expensive. Variability – many options or variable interpretations confound analysis.
  • -- run MapReduce
    runJs("/example/mr/WordCount.js", "/example/data/davinci.txt", "/example/count");
    -- create Hive table for the existing data
    CREATE EXTERNAL TABLE WordCount (word string, count int)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
    STORED AS TEXTFILE LOCATION "/example/count";
    -- select top words
    SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
    -- execute Hive select
    hive.exec("select * from WordCount order by count desc limit 10;");
    -- execute LINQ-style Hive query
    hive.from("WordCount").orderBy("count DESC").take(10).run();
    -- execute Pig script
    words = LOAD '/example/count' AS (word: chararray, count: int);
    popular_words = ORDER words BY count DESC;
    top_popular_words = LIMIT popular_words 10;
    DUMP top_popular_words;
    -- execute LINQ-style Pig script
    pig.from("/example/count", "word: chararray, count: int").orderBy("count DESC").take(10).run();
  • Transcript

    • 1. Intro to Big Data using Hadoop Sergejus Barinovas sergejus.blogas.lt fb.com/ITishnikai @sergejusb
    • 2. Information is powerful… but it is how we use it that will define us
    • 3. Data Explosion: text, audio, video, images, relational (picture from Big Data Integration)
    • 4. Big Data (globally) – creates over 30 billion pieces of content per day – stores 30 petabytes of data – produces over 90 million tweets per day
    • 5. Big Data (our example) – logs over 300 gigabytes of transactions per day – stores more than 1.5 terabytes of aggregated data
    • 6. 4 Vs of Big Data: volume, velocity, variety, variability
    • 7. Big Data Challenges: sorting 10 TB takes 2.5 days on 1 node, but only about 35 mins on a 100-node cluster
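A quick sanity check of the numbers on that slide, assuming ideal linear scaling across the cluster (real sorts pay some coordination overhead, which is why the slide quotes ~35 minutes rather than 36):

```python
# Back-of-the-envelope check of the sort benchmark:
# a 2.5-day single-node sort, split evenly across 100 nodes.
single_node_minutes = 2.5 * 24 * 60   # 2.5 days expressed in minutes
nodes = 100
cluster_minutes = single_node_minutes / nodes
print(round(cluster_minutes))  # 36 -- close to the ~35 min on the slide
```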
    • 8. Big Data Challenges: “Fat” servers imply high cost – use cheap commodity nodes instead. A large number of cheap nodes implies frequent failures – leverage automatic fault-tolerance.
    • 9. Big Data Challenges: We need a new data-parallel programming model for clusters of commodity machines
    • 10. MapReduce to the rescue!
    • 11. MapReduce: Published in 2004 by Google – MapReduce: Simplified Data Processing on Large Clusters. Popularized by the Apache Hadoop project – used by Yahoo!, Facebook, Twitter, Amazon, …
    • 12. MapReduce
    • 13. Word Count Example – diagram: the input lines “the quick brown fox”, “the fox ate the mouse”, and “how now brown cow” flow through Map, Shuffle & Sort, and Reduce to the final word counts.
    • 14. Word Count Example – diagram: each Map emits a (word, 1) pair for every word in its input split, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1).
    • 15. Word Count Example – diagram: Shuffle & Sort groups the pairs by key, e.g. the, [1,1,1]; Reduce sums each list, yielding the, 3; brown, 2; fox, 2; how, 1; now, 1; quick, 1; ate, 1; mouse, 1; cow, 1.
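The three stages of the word-count example can be simulated in a few lines of Python. This is only an illustration of the dataflow on the slides, not Hadoop's actual API:

```python
from collections import defaultdict

lines = ["the quick brown fox",
         "the fox ate the mouse",
         "how now brown cow"]

# Map: emit a (word, 1) pair for every word in every input line
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & Sort: group the pairs by key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce: sum the list of ones for each word
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts["the"], counts["brown"], counts["fox"])  # 3 2 2
```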
    • 16. MapReduce philosophy – hide complexity – make it scalable – make it cheap
    • 17. MapReduce popularized by Apache Hadoop project
    • 18. Hadoop Overview: Open source implementation of – Google MapReduce paper – Google File System (GFS) paper. First release in 2008 by Yahoo! – wide adoption by Facebook, Twitter, Amazon, etc.
    • 19. Hadoop CoreMapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS)
    • 20. Hadoop Core (HDFS): MapReduce (Job Scheduling / Execution System) on top of the Hadoop Distributed File System (HDFS) • Name Node stores file metadata • files split into 64 MB blocks • blocks replicated across 3 Data Nodes
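Given those defaults from the slide (64 MB blocks, replication factor 3 – both configurable in a real cluster), the storage footprint of a file is easy to estimate. A minimal sketch:

```python
import math

BLOCK_MB = 64        # default block size from the slide
REPLICATION = 3      # default replication factor from the slide

def hdfs_blocks(file_mb):
    """Number of blocks and total block replicas HDFS stores for a file."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICATION

print(hdfs_blocks(1024))  # a 1 GB file -> (16, 48)
```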
    • 21. Hadoop Core (HDFS)MapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
    • 22. Hadoop Core (MapReduce): • Job Tracker distributes tasks and handles failures • tasks are assigned based on data locality • Task Trackers can execute multiple tasks – all running over MapReduce (Job Scheduling / Execution System), the Hadoop Distributed File System (HDFS), the Name Node, and the Data Nodes
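Data locality in one sentence: the Job Tracker prefers to run a map task on a Task Tracker that already holds the task's input block, falling back to a remote read only when no local node is idle. A minimal sketch of that preference (the node and block names are hypothetical; this is not Hadoop's real scheduler API):

```python
# Which Data Nodes hold each block (replication factor 3)
block_locations = {"block-1": ["node-a", "node-b", "node-c"],
                   "block-2": ["node-b", "node-d", "node-e"]}

def assign_task(block, idle_nodes):
    """Pick an idle node that holds the block; fall back to any idle node."""
    for node in block_locations[block]:
        if node in idle_nodes:
            return node           # data-local assignment
    return idle_nodes[0]          # remote read as a last resort

print(assign_task("block-2", ["node-a", "node-d"]))  # node-d (holds a replica)
```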
    • 23. Hadoop Core (MapReduce) Job Tracker Task TrackerMapReduce (Job Scheduling / Execution System) Hadoop Distributed File System (HDFS) Name Node Data Node
    • 24. Hadoop Core (Job submission) – diagram: the Client submits a job to the Job Tracker, which schedules tasks on Task Trackers; the Name Node and Data Nodes serve the input data.
    • 25. Hadoop Ecosystem: Pig (ETL), Hive (BI), and Sqoop (RDBMS) sit on top of MapReduce (Job Scheduling / Execution System) and the Hadoop Distributed File System (HDFS); Zookeeper, Avro, and HBase run alongside.
    • 26. JavaScript MapReduce
      var map = function (key, value, context) {
          var words = value.split(/[^a-zA-Z]/);
          for (var i = 0; i < words.length; i++) {
              if (words[i] !== "") {
                  context.write(words[i].toLowerCase(), 1);
              }
          }
      };
      var reduce = function (key, values, context) {
          var sum = 0;
          while (values.hasNext()) {
              sum += parseInt(values.next());
          }
          context.write(key, sum);
      };
    • 27. Pig
      words = LOAD '/example/count' AS (word: chararray, count: int);
      popular_words = ORDER words BY count DESC;
      top_popular_words = LIMIT popular_words 10;
      DUMP top_popular_words;
    • 28. Hive
      CREATE EXTERNAL TABLE WordCount (word string, count int)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
      STORED AS TEXTFILE
      LOCATION "/example/count";
      SELECT * FROM WordCount ORDER BY count DESC LIMIT 10;
    • 29. Über Demo – Hadoop in the Cloud
    • 30. Thanks! Questions?
