Hadoop & Ecosystem - An Overview



Data volumes continue to rise rapidly: structured data, unstructured data, social data, mobile data, log data - just about everything! Hadoop and its ecosystem have given us a viable approach to data storage, management, analytics and insight. By now most techies know about Hadoop and its basic concepts. In this session we provide a quick overview of Hadoop, HDFS and MapReduce, then explore the wider Hadoop ecosystem: data ingestion, query languages, cluster management, machine learning, job schedulers, monitoring, and integrations with existing tools and cloud platforms.

This session will delve into the larger Hadoop ecosystem. We will take a closer look at the tools and techniques available for unlocking the value in your data!

The audience for the talk was new to Hadoop. The idea was to move beyond "Hadoop = MapReduce + HDFS" and talk about the larger ecosystem and how it can help you solve your Big Data problems.



  1. Innovation → Execution → Solution → Delivered
     Hadoop & Ecosystem
     US: +1 408 556 9645 | India: +91 20 661 43 400
     Web: http://www.clogeny.com | Email: contact@clogeny.com
     © 2012 Clogeny Technologies
  2. Need for Hadoop?
     • Store terabytes and petabytes of data
       • Reliable, inexpensive, accessible, fault-tolerant
       • Distributed File System (HDFS)
     • Distributed applications are hard to develop and manage
       • MapReduce makes distributed computing tractable
     • Datasets too large for an RDBMS
     • Scale, scale, scale, with cheap(er) hardware
     • SoCoMo (social, cloud, mobile) data
     Hadoop = MapReduce + HDFS
  3. What types of problems does Hadoop solve?
     • Risk Modeling: discover fraud patterns; collateral optimization and risk modeling
     • E-Commerce: mine web click logs to detect patterns; build and test various customer behavioral models; reporting and post-analysis
     • Recommendation Engines: collaborative filtering (user, item, content); product, show, movie and job recommendations
     • Ad Targeting: clickstream analysis, forecasting and optimization; aggregate various data sources (the social graph, for example); report generation and post-analysis
     • Text Analysis: improve search quality using pattern recognition; spam filtering
  4. MapReduce
     Basic functional operations:
     • Map
     • Reduce
     Neither operation modifies its input; they generate new data, and the original data remains unmodified.
  5. HDFS
     • Based on the Google File System
     • Write-once, read-many access model
     • Splits input data into blocks of 64MB/128MB (small-file IO is bad, yes)
     • Blocks are replicated across nodes to handle node failures
     • NameNode: filesystem metadata, block replication, read/write access
     • DataNode: stores the actual data (a client reads from the nearest DataNode)
     • Data pipelining: the client writes to the first DataNode, which replicates to the next DataNode in the pipeline
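     A few basic HDFS shell commands illustrate the access model (the paths are made up):

       $ hadoop fs -mkdir /user/demo
       $ hadoop fs -put access.log /user/demo/    # write once
       $ hadoop fs -ls /user/demo                 # browse the namespace
       $ hadoop fs -cat /user/demo/access.log     # read many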
  6. HDFS Architecture (diagram)
  7. Hadoop Architecture
     • JobTracker: farms out MR tasks to nodes in the cluster, ideally those holding the data or at least in the same rack; talks to the NameNode for the location of the data
     • TaskTracker: a node in the cluster that accepts tasks (Map, Reduce, Shuffle) from a JobTracker
  8. Example
     WORDCOUNT: read text files and count how often words occur
     • Input: text files
     • Output: a text file, each line holding word, tab, count
     • Map: produce pairs of (word, 1)
     • Reduce: for each word, sum up the counts
     GREP: search input files for a given pattern
     • Map: emits a line if the pattern is matched
     • Reduce: copies results to the output
  9. Wordcount Mapper

     public static class Map extends MapReduceBase
         implements Mapper<LongWritable, Text, Text, IntWritable> {
       private final static IntWritable one = new IntWritable(1);
       private Text word = new Text();

       public void map(LongWritable key, Text value,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
         String line = value.toString();
         StringTokenizer tokenizer = new StringTokenizer(line);
         while (tokenizer.hasMoreTokens()) {
           // emit (word, 1) for every token in the line
           word.set(tokenizer.nextToken());
           output.collect(word, one);
         }
       }
     }
  10. Wordcount Reducer

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          // sum the counts emitted for this word by the mappers
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
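      The slides show only the mapper and reducer; a minimal driver for the same old-style API is sketched below, assuming an enclosing class named WordCount and the usual org.apache.hadoop.mapred imports:

        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);  // WordCount: assumed enclosing class
          conf.setJobName("wordcount");
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          conf.setMapperClass(Map.class);
          conf.setCombinerClass(Reduce.class);          // the reducer doubles as a combiner
          conf.setReducerClass(Reduce.class);
          FileInputFormat.setInputPaths(conf, new Path(args[0]));
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          JobClient.runJob(conf);                       // submit and wait for completion
        }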
  11. Apache Hive
      • Data warehousing system for Hadoop
      • Facilitates data summarization, ad-hoc queries and analysis of large datasets
      • Query using an SQL-like language called HiveQL; can also use custom mappers and reducers
      • Hive tables can be defined directly on HDFS files via SerDes and customized formats
      • Tables can be partitioned, with data loaded separately into each partition, for scale
      • Tables can be clustered on certain columns for query performance
      • The schema is stored in an RDBMS
      • Has complex column types (map, array, struct) in addition to atomic types
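      As a sketch of the partitioning and clustering mentioned above (the table, column and path names are made up):

        CREATE TABLE pageview (uhash BIGINT, page_url STRING)
        PARTITIONED BY (ds STRING)
        CLUSTERED BY (uhash) INTO 32 BUCKETS;

        LOAD DATA INPATH '/user/logs/2012-01-01.txt'
        INTO TABLE pageview PARTITION (ds = '2012-01-01');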
  12. Apache Hive Storage Format Example

      CREATE TABLE mylog (uhash BIGINT, page_url STRING, unix_time INT)
      STORED AS TEXTFILE;
      LOAD DATA INPATH '/user/log.txt' INTO TABLE mylog;

      • Supports text files and sequence files; extensible for other formats
      • Serialized formats: delimited, Thrift protocol
      • Deserialized: Java Integer/String/Array/HashMap, Hadoop Writable classes, Thrift user-defined classes

      CREATE TABLE mylog (uhash BIGINT, page_url STRING, unix_time INT)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
      STORED AS RCFILE;
  13. Apache Hive

      HiveQL – JOIN:
        INSERT OVERWRITE TABLE pv_users
        SELECT pv.pageid, u.age_bkt
        FROM pageview pv JOIN user u ON (pv.uhash = u.uhash);

      HiveQL – GROUP BY:
        SELECT pageid, age_bkt, count(1)
        FROM pv_users
        GROUP BY pageid, age_bkt;
  14. Apache Hive – Group By in Map Reduce (diagram)
  15. Apache Pig
      • High-level interface for Hadoop; Pig Latin is a declarative, SQL-like language for analyzing large data sets
      • The Pig compiler produces sequences of Map-Reduce programs
      • Makes it easy to develop parallel processing programs, much easier than writing MR code directly
      • Interactive shell: Grunt
      • User Defined Functions (UDFs): specify custom processing in Java, Python or Ruby; implement EVAL, AGGREGATE and FILTER functions (a sketch follows this list)
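      A minimal EVAL UDF in Java might look like this (the class name UpperCase is made up):

        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        // upper-cases its first argument; returns null for empty input
        public class UpperCase extends EvalFunc<String> {
          @Override
          public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) return null;
            return ((String) input.get(0)).toUpperCase();
          }
        }

      After REGISTERing its jar in the script, it can be called inside a FOREACH ... GENERATE like any built-in.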
  16. Apache Pig – Word Count

      input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
      -- extract words from each line and put them into a pig bag
      -- datatype, then flatten the bag to get one word on each row
      words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
      -- filter out any words that are just white spaces
      filtered_words = FILTER words BY word MATCHES '\\w+';
      -- create a group for each word
      word_groups = GROUP filtered_words BY word;
      -- count the entries in each group
      word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
      -- order the records by count
      ordered_word_count = ORDER word_count BY count DESC;
      STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
  17. Sqoop
      • Tool to transfer data between Hadoop and relational databases like MySQL or Oracle
      • Sqoop automates the process and takes the schema from the RDBMS
      • Uses Map-Reduce to transfer the data, making it a parallel and fault-tolerant process

        $ sqoop import --connect jdbc:mysql://database.example.com/employees
        $ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared
        $ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

      • Supports incremental import of data, so you can schedule imports (see the sketch below)
      • Can also import data into Hive and HBase
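      An incremental import might look like this (the table and column names are illustrative):

        $ sqoop import --connect jdbc:mysql://db.foo.com/corp --table orders \
            --incremental append --check-column id --last-value 10000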
  18. Flume
      Collecting data from various sources? When? With ugly, unmaintainable scripts?
      Flume: a distributed data collection service
      • Scalable, configurable, robust
      • Collects data in various formats (generally into HDFS)
      • Flows: each flow corresponds to a type of data source
      • Different flows can have different compression and reliability configs
      • Flows are composed of nodes chained together
      • Each node receives data at its source and sends it to a sink
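      For flavor, here is a rough sketch of a single flow in the properties format of Flume NG, the rewrite that replaced the node/collector model described above (all names and paths are made up): tail an application log through a memory channel into HDFS.

        agent1.sources  = src1
        agent1.channels = ch1
        agent1.sinks    = sink1

        # source: tail an application log
        agent1.sources.src1.type = exec
        agent1.sources.src1.command = tail -F /var/log/app.log
        agent1.sources.src1.channels = ch1

        # buffer events in memory between source and sink
        agent1.channels.ch1.type = memory

        # sink: write the events into HDFS
        agent1.sinks.sink1.type = hdfs
        agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/app
        agent1.sinks.sink1.channel = ch1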
  19. Flume – Agents & Collectors; example flows within a Flume service (diagrams)
  20. HCatalog
      • So much data that managing the metadata is difficult
      • Multiple tools & formats (MR, Pig, Hive): how to share data easily?
      • Ops needs to manage data storage, cluster, data expiry, data replication, import, export
      • Register data through HCatalog; it uses Hive's metastore

        A = LOAD 'raw' USING HCatLoader();
        B = FILTER A BY ds == '20110225' AND region == 'us' AND property == 'news';

      Pig no longer cares where 'raw' is, so its location can be changed. No data model needed.
  21. Oozie – Workflow Scheduler
      • Workflow scheduler system to manage Hadoop jobs
      • Workflows are Directed Acyclic Graphs (DAGs) of actions
      • Supports various types of Hadoop jobs (Java MR, Streaming MR, Pig, Hive, Sqoop) as well as shell scripts and Java programs
      • Create a flow-chart of your data analysis process and let Oozie coordinate it
      • Time & data dependencies
      • XML-based config (a minimal sketch follows this list)
      • Simple GUI to track jobs and workflows
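      A minimal workflow.xml for a single map-reduce action, loosely following the standard Oozie examples (the ${...} parameters would come from job.properties):

        <workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.2">
          <start to="wordcount"/>
          <action name="wordcount">
            <map-reduce>
              <job-tracker>${jobTracker}</job-tracker>
              <name-node>${nameNode}</name-node>
              <configuration>
                <property>
                  <name>mapred.input.dir</name>
                  <value>${inputDir}</value>
                </property>
                <property>
                  <name>mapred.output.dir</name>
                  <value>${outputDir}</value>
                </property>
              </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
          </action>
          <kill name="fail">
            <message>Map-reduce action failed</message>
          </kill>
          <end name="end"/>
        </workflow-app>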
  22. Oozie – Workflow Scheduler

      $ oozie job -oozie http://localhost:8080/oozie -config examples/apps/map-reduce/job.properties -run
      job: 14-20090525161321-oozie-tucu

      Check the workflow job status:

      $ oozie job -oozie http://localhost:8080/oozie -info 14-20090525161321-oozie-tucu
      Workflow Name : map-reduce-wf
      App Path      : hdfs://localhost:9000/user/tucu/examples/apps/map-reduce
      Status        : SUCCEEDED
      Run           : 0
      User          : tucu
      Group         : users
      Created       : 2009-05-26 05:01 +0000
      Started       : 2009-05-26 05:01 +0000
      Ended         : 2009-05-26 05:01 +0000
  23. Schedulers
      All teams want access to the Hadoop cluster: the analytics team starts a query and takes down the prod cluster. How do you allocate resources within a cluster across teams and users, and assign priority to jobs?
      Fair Scheduler:
      • Jobs are grouped into pools
      • Each pool has a guaranteed minimum share (see the allocations sketch below)
      • Excess capacity is split between jobs
      Capacity Scheduler:
      • Queues, each allocated a part of the available capacity (Map & Reduce slots)
      • Priorities for jobs; available capacity is shared across queues (e.g. one queue per team)
      • No preemption
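      A minimal Fair Scheduler allocations file in the Hadoop 1.x format (pool names and numbers are made up):

        <?xml version="1.0"?>
        <allocations>
          <pool name="production">
            <minMaps>40</minMaps>
            <minReduces>20</minReduces>
            <weight>2.0</weight>
          </pool>
          <pool name="analytics">
            <minMaps>10</minMaps>
            <minReduces>5</minReduces>
          </pool>
        </allocations>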
  24. Ambari
      • Deployment, monitoring and management of Hadoop clusters is hard!
      • Ambari provides configuration management and deployment using Puppet
      • Cluster configuration is a data object: reliable, repeatable deployment
      • Central point to manage services
      • Uses Ganglia and Nagios for monitoring and alerting
      • AD & LDAP integration
      • Visualization of the cluster state and Hadoop jobs over time
  25. Ambari (screenshots, slides 25–27)
  28. Apache Mahout
      • Machine learning library
      • Leverages MapReduce for clustering, classification and collaborative filtering algorithms
      • Recommendation: takes user behavior as input and from that suggests items the user might like (e.g. Netflix)
      • Clustering: groups objects into clusters (e.g. create groups of users on social media)
      • Classification: learns from categorized documents and can assign unlabeled documents to the correct categories (e.g. recommend ads to users, classify text into categories)
      A small recommender sketch follows.
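      A user-based recommender using Mahout's Taste API (the file name, neighborhood size and user id are made up; this part of Mahout runs in-memory rather than on MapReduce):

        import java.io.File;
        import java.util.List;
        import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
        import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
        import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
        import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
        import org.apache.mahout.cf.taste.model.DataModel;
        import org.apache.mahout.cf.taste.recommender.RecommendedItem;
        import org.apache.mahout.cf.taste.similarity.UserSimilarity;

        public class RecommenderSketch {
          public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,rating triples
            DataModel model = new FileDataModel(new File("ratings.csv"));
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // consider the 10 most similar users
            NearestNUserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);
            // top 5 recommendations for user 123
            List<RecommendedItem> recs = recommender.recommend(123, 5);
            for (RecommendedItem item : recs) {
              System.out.println(item.getItemID() + " " + item.getValue());
            }
          }
        }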
  29. Serengeti
      "I just implemented virtualization in my datacenter. Can I use Hadoop on virtualized resources?" Yes, Serengeti lets you do exactly that!
      • Much more agility, at some cost in performance
      • Improves the availability of your Hadoop cluster
      • Share capacity across Hadoop clusters and other applications
      • Hadoop Virtual Extensions make Hadoop virtualization-aware
      • Hadoop as a Service: integrate with VMware Chargeback and bill accordingly
  30. Platforms - Hortonworks (diagram)
  31. Dell – Cloudera Stack (diagram)
  32. Amazon Elastic MapReduce
      • Cloud platform (EC2 & S3) providing Hadoop as a Service
      • Provision as much capacity as you like, for as long as you like
      • Integration with S3 (cloud storage): store your persistent data in S3
      • Keep your app or service and your Hadoop cluster in the same place
      • Multiple locations: an on-demand Hadoop cluster across 5 continents
      • MapR, Karmasphere and other tools are available on the AWS platform itself
      • Use expensive BI tools in the cloud, pay-as-you-go (a sample job submission follows)
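      Submitting a streaming job with the era's elastic-mapreduce CLI might have looked roughly like this (the bucket and script names are made up):

        $ elastic-mapreduce --create --name "wordcount" --num-instances 3 \
            --stream \
            --input  s3n://my-bucket/input \
            --mapper s3n://my-bucket/wordSplitter.py \
            --reducer aggregate \
            --output s3n://my-bucket/output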