Hands on Hadoop and pig
More details at http://sudarmuthu.com/blog/getting-started-with-hadoop-and-pig

Published in: Technology
Transcript

  • 1. BigData using Hadoop and Pig Sudar Muthu Research Engineer Yahoo Labs http://sudarmuthu.com http://twitter.com/sudarmuthu
  • 2. Who am I? Research Engineer at Yahoo Labs Mines useful information from huge datasets Has worked on both structured and unstructured data Builds robots as a hobby ;)
  • 3. What will we see today? What is BigData? Get our hands dirty with Hadoop See some code Try out Pig Glimpse of HBase and Hive
  • 4. What is BigData?
  • 5. “Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools” http://en.wikipedia.org/wiki/Big_data
  • 6. How big is BigData?
  • 7. 1GB today is not the same as 1GB just 10 years ago
  • 8. Anything that doesn’t fit into the RAM of a single machine
  • 9. Types of Big Data
  • 10. Data in Movement (streams) Twitter/Facebook comments Stock market data Access logs of a busy web server Sensors: vital signs of a newborn
  • 11. Data at rest (Oceans) Collection of what has streamed Emails or IM messages Social Media Unstructured documents: forms, claims
  • 12. We have all this data and need to find a way to process it
  • 13. Traditional way of scaling (Scaling up) Make the machine more powerful  Add more RAM  Add more cores to CPU It is going to be very expensive Will be limited by disk seek and read time Single point of failure
  • 14. New way to scale (Scale out) Add more instances of the same machine Cost is less compared to scaling up Immune to failure of a single node or a set of nodes Disk seek and write time is not going to be a bottleneck Future safe (to some extent)
  • 15. Is it a fit for ALL types of problems?
  • 16. Divide and conquer
  • 17. Hadoop
  • 18. A scalable, fault-tolerant grid operating system for data storage and processing
  • 19. What is Hadoop? Runs on commodity hardware HDFS: Fault-tolerant high-bandwidth clustered storage MapReduce: Distributed data processing Works with structured and unstructured data Open source, Apache license Master (NameNode) – Slave architecture
  • 20. Design Principles System shall manage and heal itself Performance shall scale linearly Algorithm should move to data  Lower latency, lower bandwidth Simple core, modular and extensible
  • 21. Components of Hadoop HDFS Map Reduce PIG HBase Hive
  • 22. Getting started with Hadoop
  • 23. What I am not going to cover? Installation or setting up Hadoop  Will be running all the code in a single node instance Monitoring of the clusters Performance tuning User authentication or quota
  • 24. Before we get into code, let’s understand some concepts
  • 25. Map Reduce
  • 26. Framework for distributed processing of large datasets
  • 27. MapReduce consists of two functions: Map – Filter and transform the input into something the reducer can understand; Reduce – Aggregate over the input provided by the Map function
  • 28. Formal definition: Map: <k1, v1> -> list(<k2, v2>) Reduce: <k2, list(v2)> -> list(<k3, v3>)
  • 29. Let’s see some examples
  • 30. Count number of words in files: Map: <file_name, file_contents> => list<word, count> Reduce: <word, list(count)> => <word, sum_of_counts>
  • 31. Count number of words in files: Map: <“file1”, “to be or not to be”> => {<“to”,1>, <“be”,1>, <“or”,1>, <“not”,1>, <“to”,1>, <“be”,1>}
  • 32. Count number of words in files: Reduce: {<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>, <“not”,<1>>} => {<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
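The word-count flow in slides 30–32 can be simulated in a few lines of plain Java, outside Hadoop, to see the data move through each phase. This is a minimal sketch, not Hadoop code: `WordCountSketch` and its `map`/`shuffle`/`reduce` methods are illustrative names, and the explicit shuffle step stands in for the grouping Hadoop performs between the map and reduce phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {

    // Map: <file_name, file_contents> => list<word, 1>
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(contents);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle: group emitted pairs by key, as Hadoop does between map and reduce
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce: <word, list(count)> => <word, sum_of_counts>
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> result = reduce(shuffle(map("to be or not to be")));
        System.out.println(result); // prints {be=2, not=1, or=1, to=2}
    }
}
```

Running the sketch on the slide's example input reproduces slide 32's result exactly; the only piece Hadoop adds on top of this shape is distributing the map and reduce calls across machines.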
  • 33. Max temperature in a year: Map: <file_name, file_contents> => <year, temp> Reduce: <year, list(temp)> => <year, max_temp>
  • 34. HDFS
  • 35. HDFS Distributed file system Data is distributed over different nodes Will be replicated for failover Is abstracted out for the algorithms
  • 36. HDFS Commands
  • 37. HDFS Commands hadoop fs -mkdir <dir_name> hadoop fs -ls <dir_name> hadoop fs -rmr <dir_name> hadoop fs -put <local_file> <remote_dir> hadoop fs -get <remote_file> <local_dir> hadoop fs -cat <remote_file> hadoop fs -help
  • 38. Let’s write some code
  • 39. Count Words Demo Create a mapper class  Override map() method Create a reducer class  Override reduce() method Create a main method Create JAR Run it on Hadoop
  • 40. Map Method
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        context.write(new Text(itr.nextToken()), new IntWritable(1));
      }
    }
  • 41. Reduce Method
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  • 42. Main Method
    Job job = new Job();
    job.setJarByClass(CountWords.class);
    job.setJobName("Count Words");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(CountWordsMapper.class);
    job.setReducerClass(CountWordsReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1); // submit the job and wait for it to finish
  • 43. Run it on Hadoop
    hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
  • 44. Output
    at 1
    be 3
    can 7
    cant 1
    code 2
    command 1
    connect 1
    consider 1
    continued 1
    control 4
    could 1
    couple 1
    courtesy 1
    desktop, 1
    detailed 1
    details 1
    …..
  • 45. Pig
  • 46. What is Pig? Pig provides an abstraction for processing large datasets. Consists of: Pig Latin – Language to express data flows; Execution environment
  • 47. Why do we need Pig? MapReduce can get complex if your data needs a lot of processing/transformations MapReduce provides primitive data structures Pig provides rich data structures Supports complex operations like joins
  • 48. Running Pig programs In an interactive shell called Grunt As a Pig Script Embedded into Java programs (like JDBC)
  • 49. Grunt – Interactive Shell
  • 50. Grunt shell fs commands – like hadoop fs: fs -ls, fs -mkdir, fs -copyToLocal <file> <local_dir>, fs -copyFromLocal <local_file> <dest>; exec – execute Pig scripts; sh – execute shell commands
  • 51. Let’s see them in action
  • 52. Pig Latin LOAD – Read files DUMP – Dump data to the console JOIN – Join data sets FILTER – Filter data sets ORDER BY – Sort data STORE – Store data back in files
  • 53. Let’s see some code
  • 54. Sort words based on count
  • 55. Filter words present in a list
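The transcript does not include the code for the two demos above. A minimal Pig Latin sketch of both, assuming the word counts from the earlier Hadoop job sit in a tab-separated file `output/part-r-00000` (word, count) and the word list is a one-word-per-line file `words.txt` — the file names, field names, and layout are assumptions for illustration:

```pig
-- Sort words based on count (descending)
counts = LOAD 'output/part-r-00000' AS (word:chararray, cnt:int);
sorted = ORDER counts BY cnt DESC;
DUMP sorted;

-- Filter: keep only words present in the list, via a join on the word
wordlist = LOAD 'words.txt' AS (w:chararray);
joined = JOIN counts BY word, wordlist BY w;
filtered = FOREACH joined GENERATE counts::word, counts::cnt;
DUMP filtered;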
  • 56. HBase
  • 57. What is HBase? Distributed, column-oriented database built on top of HDFS Useful when real-time read/write random access to very large datasets is needed Can handle billions of rows with millions of columns
  • 58. Hive
  • 59. What is Hive? Useful for managing and querying structured data Provides SQL-like syntax Metadata is stored in an RDBMS Extensible with types, functions, scripts etc.
  • 60. Hadoop vs Relational Databases
    Hadoop: Affordable Storage/Compute; Structured or Unstructured; Resilient; Auto Scalability
    Relational Databases: Interactive response times; ACID; Structured data; Cost/Scale prohibitive
  • 61. Thank You
