Hadoop Overview & Architecture


            Milind Bhandarkar	

  Chief Scientist, Machine Learning Platforms,	

        Greenplum, A Division of EMC	

            (Twitter: @techmilind)
About Me	

•    http://www.linkedin.com/in/milindb	


•    Founding member of Hadoop team at Yahoo! [2005-2010]	


•    Contributor to Apache Hadoop since v0.1	


•    Built and led Grid Solutions Team at Yahoo! [2007-2010]	


•    Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)	


•    Center for Development of Advanced Computing (C-DAC), National Center
     for Supercomputing Applications (NCSA), Center for Simulation of Advanced
     Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn,
     and EMC-Greenplum
Agenda	

• Motivation	

• Hadoop	

 • Map-Reduce	

 • Distributed File System	

 • Hadoop Architecture	

 • Next Generation MapReduce	

• Q & A	

Hadoop At Scale
      (Some Statistics)	

• 40,000 + machines in 20+ clusters	

• Largest cluster is 4,000 machines	

• 170 Petabytes of storage	

• 1000+ users	

• 1,000,000+ jobs/month	

BEHIND
EVERY CLICK
Hadoop Workflow
Who Uses Hadoop ?
Why Hadoop ?	



Big Datasets
(Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
Cost Per Gigabyte                             
(http://www.mkomo.com/cost-per-gigabyte)
Storage Trends
(Graph by Adam Leventhal, ACM Queue, Dec 2009)
Motivating Examples	



Yahoo! Search Assist
Search Assist	

• Insight: Related concepts appear close
  together in text corpus	

• Input: Web pages	

 • 1 Billion Pages, 10K bytes each	

 • 10 TB of input data	

• Output: List(word, List(related words))	

Search Assist	

// Input: List(URL, Text)
foreach URL in Input :
    Tokens = Tokenize(Text(URL));
    foreach word in Tokens :
        Insert (word, Next(word, Tokens)) in Pairs;
        Insert (word, Previous(word, Tokens)) in Pairs;
// Result: Pairs = List (word, RelatedWord)

Group Pairs by word;
// Result: GroupedPairs = List (word, List(RelatedWords))

foreach word in GroupedPairs :
    Count RelatedWords in GroupedPairs;
// Result: CountedPairs = List (word, List(RelatedWords, count))

foreach word in CountedPairs :
    Sort Pairs(word, *) descending by count;
    Choose Top 5 Pairs;
// Result: List (word, Top5(RelatedWords))
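The pseudocode above can be tried on a toy corpus. What follows is a minimal single-machine sketch in Python, not the production Search Assist code: the `tokenize` helper (lowercase whitespace splitting) and the two sample pages are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Assumption: lowercase whitespace splitting stands in for Tokenize().
    return text.lower().split()

def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [top related words]}."""
    counts = defaultdict(Counter)
    for _url, text in pages:
        tokens = tokenize(text)
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                counts[word][tokens[i + 1]] += 1  # Next(word, Tokens)
            if i > 0:
                counts[word][tokens[i - 1]] += 1  # Previous(word, Tokens)
    # Sort each word's neighbours by count and keep the top N.
    return {w: [r for r, _ in c.most_common(top_n)] for w, c in counts.items()}

# Hypothetical input pages.
pages = [
    ("http://example.com/a", "hadoop mapreduce hadoop dfs"),
    ("http://example.com/b", "hadoop mapreduce"),
]
top = related_words(pages)
print(top["hadoop"])
```

Grouping and counting are fused into one `Counter` per word here, which only works because everything fits in memory; at 10 TB the grouping step is what has to be distributed.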

People You May Know
People You May Know	

• Insight: You might also know Joe Smith if a
  lot of folks you know, know Joe Smith	

 • if you don't know Joe Smith already

• Numbers:	

 • 100 MM users	

 • Average connections per user is 100	

People You May Know	

	
	
// Input: List(UserName, List(Connections))

foreach u in UserList : // 100 MM
    foreach x in Connections(u) : // 100
        foreach y in Connections(x) : // 100
            if (y not in Connections(u)) :
                Count(u, y)++; // 1 Trillion Iterations
    Sort (u,y) in descending order of Count(u,y);
    Choose Top 3 y;
    Store (u, {y0, y1, y2}) for serving;
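The triple loop can be run as-is on a small graph. This is a minimal in-memory sketch in Python; the sample graph and the function name are hypothetical, and the `y != u` guard is an addition not present in the pseudocode (without it, a user would be suggested to themselves, because `u` is always a connection of `x`).

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """connections: {user: set(friends)}. Returns {user: [suggested users]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:
            for y in connections.get(x, ()):
                # Added guard: skip u itself, then skip existing connections.
                if y != u and y not in friends:
                    counts[y] += 1
        # Sort by number of common connections, keep the top N.
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions

# Hypothetical symmetric friendship graph.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "joe"},
    "cat": {"ann", "joe", "dan"},
    "joe": {"bob", "cat"},
    "dan": {"cat"},
}
result = people_you_may_know(graph)
print(result["ann"])
```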
	
	

Performance	

• 101 Random accesses for each user	

 • Assume 1 ms per random access	

 • 100 ms per user	

• 100 MM users	

 • 100 days on a single machine	
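The estimate can be reproduced with a few lines of arithmetic. A quick check in Python, using only the figures on the slide; the exact result is closer to 117 days, which the slide rounds to 100.

```python
accesses_per_user = 101      # one lookup for the user plus 100 for the connections
seconds_per_access = 1e-3    # the slide's assumption: 1 ms per random access
users = 100_000_000          # 100 MM users

seconds_per_user = accesses_per_user * seconds_per_access
total_days = users * seconds_per_user / 86_400  # 86,400 seconds per day

print(f"{seconds_per_user * 1000:.0f} ms per user, about {total_days:.0f} days in total")
```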

MapReduce Paradigm	



Map & Reduce

• Primitives in Lisp (& other functional
  languages), 1970s

• Google Paper 2004

 • http://labs.google.com/papers/mapreduce.html



Map	


Output_List = Map (Input_List)	




Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) =	
	
(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)




Reduce	


Output_Element = Reduce (Input_List)	




Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
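Both primitives exist in most languages with functional features. As an illustration, the Square and Sum examples from the Map and Reduce slides look like this in Python, using the built-in `map` and `functools.reduce`:

```python
from functools import reduce
from operator import add

numbers = range(1, 11)

# Map: one output element per input element.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the whole list into a single element.
total = reduce(add, squares)

print(squares, total)
```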




Parallelism	

• Map is inherently parallel	

 • Each list element processed independently	

• Reduce is inherently sequential	

 • Unless processing multiple lists	

• Grouping to produce multiple lists	
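The three bullets can be seen together in a small sketch: map each element independently, group the results by key, then run one reduce per group. This Python example is an illustration only; the sample records are made up, and a thread pool is used merely to show which steps are independent (CPython threads do not speed up CPU-bound work).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (key, value) records.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

def square(record):
    key, value = record
    return key, value * value

def group_by_key(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

with ThreadPoolExecutor(max_workers=4) as pool:
    # Map: every element can be processed independently.
    mapped = list(pool.map(square, records))
    # Grouping turns one list into several lists, one per key.
    groups = group_by_key(mapped)
    # Reduce: sequential within a list, parallel across lists.
    sums = dict(zip(groups, pool.map(sum, groups.values())))

print(sums)
```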

Search Assist Map	

// Input: http://hadoop.apache.org

Pairs = Tokenize_And_Pair ( Text ( Input ) )
	
	



Search Assist Reduce	

// Input: GroupedList (word, GroupedList(words))	
	
CountedPairs = CountOccurrences (word, RelatedWords)	




Output = {	
(hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4)
(hadoop, mapreduce, 9) ...
}	



Issues with Large Data	


• Map Parallelism: Chunking input data	

• Reduce Parallelism: Grouping related data	

• Dealing with failures & load imbalance



Apache Hadoop	


• January 2006: Subproject of Lucene	

• January 2008: Top-level Apache project	

• Stable Version: 1.0.3	

• Latest Version: 2.0.0 (Alpha)	


Apache Hadoop	

• Reliable, Performant Distributed file system	

• MapReduce Programming framework	

• Ecosystem: HBase, Hive, Pig, Howl, Oozie,
  Zookeeper, Chukwa, Mahout, Cascading,
  Scribe, Cassandra, Hypertable, Voldemort,
  Azkaban, Sqoop, Flume, Avro ...	



Problem: Bandwidth to Data

• Scan 100TB Datasets on 1000 node cluster	

 • Remote storage @ 10MB/s = 165 mins	

 • Local storage @ 50-200MB/s = 33-8 mins	

• Moving computation is more efficient than
  moving data	

  • Need visibility into data placement	
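The scan times follow from dividing the dataset evenly across the nodes. A short check in Python, assuming decimal units and that all 1000 nodes scan their share in parallel; the remote figure comes out at about 167 minutes, which the slide gives as 165.

```python
dataset_mb = 100 * 1_000_000       # 100 TB expressed in MB (decimal units)
nodes = 1000
per_node_mb = dataset_mb / nodes   # 100,000 MB, i.e. 100 GB per node

def scan_minutes(mb_per_second):
    """Time for one node to read its share at the given bandwidth."""
    return per_node_mb / mb_per_second / 60

remote = scan_minutes(10)          # remote storage
local_slow = scan_minutes(50)      # local disk, low end
local_fast = scan_minutes(200)     # local disk, high end

print(f"remote {remote:.0f} min, local {local_slow:.0f} to {local_fast:.0f} min")
```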

Problem: Scaling Reliably	

• Failure is not an option, it's a rule!

 • 1000 nodes, MTBF < 1 day

 • 4000 disks, 8000 cores, 25 switches, 1000
    NICs, 2000 DIMMS (16TB RAM)	

• Need fault tolerant store with reasonable
  availability guarantees	

  • Handle hardware faults transparently	

Hadoop Goals	

• Scalable: Petabytes (10^15 bytes) of data on
  thousands of nodes

• Economical: Commodity components only	

• Reliable	

 • Engineering reliability into every application
    is expensive	



Hadoop MapReduce	



Think MapReduce	


• Record = (Key, Value)	

• Key : Comparable, Serializable	

• Value: Serializable	

• Input, Map, Shuffle, Reduce, Output	


Seems Familiar ?	

	
	
	
	
	
cat /var/log/auth.log* | \
grep "session opened" | cut -d' ' -f10 | \
sort | \
uniq -c > \
~/userlist
	
	
	
	
	
	

Map	


• Input: (Key1, Value1)

• Output: List(Key2, Value2)

• Projections, Filtering, Transformation	



Shuffle	


• Input: List(Key2, Value2)

• Output

 • Sort(Partition(List(Key2, List(Value2))))

• Provided by Hadoop	


Reduce	


• Input: List(Key2, List(Value2))

• Output: List(Key3, Value3)

• Aggregation	
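The Map, Shuffle, and Reduce signatures fit together as one small pipeline. Below is a single-process sketch in Python, not Hadoop's implementation: `run_mapreduce`, the hash partitioner, and the log-line format are all assumptions for illustration. The example counts logins per user, in the spirit of the `auth.log` pipeline.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer, num_partitions=2):
    # Map: (k1, v1) -> list of (k2, v2)
    mapped = [kv for k1, v1 in records for kv in mapper(k1, v1)]
    # Shuffle, step 1: partition by key so equal keys land together.
    partitions = [[] for _ in range(num_partitions)]
    for k2, v2 in mapped:
        partitions[hash(k2) % num_partitions].append((k2, v2))
    output = []
    for part in partitions:
        # Shuffle, step 2: sort each partition and group values by key.
        part.sort(key=itemgetter(0))
        for k2, group in groupby(part, key=itemgetter(0)):
            # Reduce: (k2, list(v2)) -> list of (k3, v3)
            output.extend(reducer(k2, [v for _, v in group]))
    return output

def login_mapper(_offset, line):
    if "session opened" in line:       # filtering
        yield line.split()[-1], 1      # projection (assumes the user is the last field)

def count_reducer(user, ones):
    yield user, sum(ones)              # aggregation

# Hypothetical log records keyed by line number.
log = [
    (0, "session opened for user alice"),
    (1, "session closed for user alice"),
    (2, "session opened for user bob"),
    (3, "session opened for user alice"),
]
counts = dict(run_mapreduce(log, login_mapper, count_reducer))
print(counts)
```

Only `login_mapper` and `count_reducer` are application code; everything inside `run_mapreduce` corresponds to what the framework provides.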



Hadoop Streaming	

• Hadoop is written in Java	

 • Java MapReduce code is native 	

• What about Non-Java Programmers ?	

 • Perl, Python, Shell, R	

 • grep, sed, awk, uniq as Mappers/Reducers	

• Text Input and Output	

Hadoop Streaming	

• Thin Java wrapper for Map & Reduce Tasks

• Forks actual Mapper & Reducer

• IPC via stdin, stdout, stderr

• Key.toString() \t Value.toString() \n

• Slower than Java programs	

 • Allows for quick prototyping / debugging	

Hadoop Streaming	

	
	
$ bin/hadoop jar hadoop-streaming.jar \
      -input in-files -output out-dir \
      -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'
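Because streaming only requires reading lines from stdin and writing lines to stdout, the same word count can be written in any scripting language. Here is a rough Python equivalent of the two shell scripts, as a sketch: the `map`/`reduce` command-line argument is an invented convention for this example, and `split()` breaks on any whitespace rather than only on spaces as the `sed` version does.

```python
import sys
from itertools import groupby

def mapper(lines):
    # Like mapper.sh: emit one non-empty word per output line.
    for line in lines:
        for word in line.split():
            yield word

def reducer(sorted_words):
    # Like reducer.sh: input arrives sorted, so equal words are adjacent.
    words = (w.rstrip("\n") for w in sorted_words)
    for word, group in groupby(words):
        yield f"{word}\t{sum(1 for _ in group)}"

# Assumed convention: run as "script map" or "script reduce".
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    stage = mapper if sys.argv[1] == "map" else reducer
    for out in stage(sys.stdin):
        print(out)
```

The reducer relies on the framework's sort between the two stages, exactly as `uniq -c` does.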
	
	

Hadoop Distributed File System (HDFS)



HDFS	

• Data is organized into files and directories	

• Files are divided into uniform sized blocks
  (default 128MB) and distributed across
  cluster nodes	

• HDFS exposes block placement so that
  computation can be migrated to data	
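To make the block layout concrete, here is a small Python sketch that cuts a file of a given size into block-sized ranges. It uses the 128MB default from the slide; the function name and the 300MB example file are hypothetical. Note that the last block of a file is usually shorter than the others.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size quoted on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each block of a file of file_size bytes."""
    return [
        (offset, min(block_size, file_size - offset))
        for offset in range(0, file_size, block_size)
    ]

# A hypothetical 300 MB file needs three blocks; the last holds 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
for offset, length in blocks:
    print(offset // (1024 * 1024), "MB ->", length // (1024 * 1024), "MB")
```

Each `(offset, length)` range is the unit that gets placed on a node, and therefore the natural unit of work for a map task scheduled next to the data.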



HDFS	

• Blocks are replicated (default 3) to handle
  hardware failure	

• Replication for performance and fault
  tolerance (Rack-Aware placement)	

• HDFS keeps checksums of data for
  corruption detection and recovery	
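Checksum-based corruption detection can be sketched as follows. This Python example is an assumption-laden illustration rather than HDFS's actual scheme: it uses CRC-32 from `zlib` over 512-byte chunks, whereas the real algorithm and chunk size depend on the Hadoop version and configuration.

```python
import zlib

CHUNK = 512  # assumed number of bytes covered by each checksum

def checksums(data, chunk=CHUNK):
    """One CRC-32 per fixed-size chunk of the block."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def corrupt_chunks(data, expected, chunk=CHUNK):
    """Indices of chunks whose checksum no longer matches the stored one."""
    actual = checksums(data, chunk)
    return [i for i, (got, want) in enumerate(zip(actual, expected)) if got != want]

block = bytes(range(256)) * 8     # 2048 bytes, i.e. 4 chunks
stored = checksums(block)         # kept alongside the block when it is written

damaged = bytearray(block)
damaged[700] ^= 0xFF              # flip one byte inside chunk 1
bad = corrupt_chunks(bytes(damaged), stored)
print(bad)
```

Detection is only half of the bullet: recovery works because a reader that hits a bad chunk can fetch the block from another replica.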



HDFS	


• Master-Worker Architecture	

• Single NameNode	

• Many (Thousands) DataNodes	



HDFS Master (NameNode)

• Manages filesystem namespace

• File metadata (i.e. "inode")

• Mapping inode to list of blocks + locations

• Authorization & Authentication

• Checkpoint & journal namespace changes

Namenode	

• Mapping of datanode to list of blocks	

• Monitor datanode health	

• Replicate missing blocks	

• Keeps ALL namespace in memory	

• 60M objects (File/Block) in 16GB	
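The bookkeeping described here (datanode-to-blocks map, health monitoring, re-replication) can be modelled in a few lines. The class below is a toy Python model under simplifying assumptions, not the NameNode's real data structures: names are invented, a lost heartbeat is represented by an explicit `datanode_lost` call, and replica targets are chosen alphabetically rather than by rack.

```python
from collections import defaultdict

class ToyNameNode:
    def __init__(self, replication=3):
        self.replication = replication
        self.blocks_on = defaultdict(set)  # datanode -> block ids it reported
        self.live = set()

    def block_report(self, datanode, block_ids):
        # A block report doubles as a sign of life in this toy model.
        self.live.add(datanode)
        self.blocks_on[datanode] = set(block_ids)

    def locations(self, block_id):
        return sorted(d for d in self.live if block_id in self.blocks_on[d])

    def datanode_lost(self, datanode):
        """Drop a dead node and re-replicate its blocks onto other live nodes."""
        self.live.discard(datanode)
        for block_id in self.blocks_on.pop(datanode, set()):
            holders = set(self.locations(block_id))
            for target in sorted(self.live - holders):
                if len(holders) >= self.replication:
                    break
                self.blocks_on[target].add(block_id)
                holders.add(target)

nn = ToyNameNode()
nn.block_report("dn1", {"b1", "b2"})
nn.block_report("dn2", {"b1", "b2"})
nn.block_report("dn3", {"b1"})
nn.block_report("dn4", {"b2"})
nn.datanode_lost("dn1")
print(nn.locations("b1"), nn.locations("b2"))
```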

Datanodes	

• Handle block storage on multiple volumes &
  block integrity

• Clients access the blocks directly from data
  nodes

• Periodically send heartbeats and block
  reports to Namenode

• Blocks are stored as underlying OS's files

HDFS Architecture
Example: Unigrams	


• Input: Huge text corpus	

 • Wikipedia Articles (40GB uncompressed)	

• Output: List of words sorted in descending
  order of frequency
Unigrams	


$ cat ~/wikipedia.txt | \
sed -e 's/ /\n/g' | grep . | \
sort | \
uniq -c > \
~/frequencies.txt

$ cat ~/frequencies.txt | \
cat | \
sort -n -k1,1 -r | \
cat > \
~/unigrams.txt
MR for Unigrams	


mapper (filename, file-contents):	
    	for each word in file-contents:	
    	    	emit (word, 1)	
	
reducer (word, values):	
    	sum = 0	
    	for each value in values:	
    	    	sum = sum + value	
    	emit (word, sum)
MR for Unigrams	



mapper (word, frequency):	
    	emit (frequency, word)	
	
reducer (frequency, words):	
    	for each word in words:	
    	    	emit (word, frequency)
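The two jobs can be chained on a toy corpus to see the data flow end to end. This is a single-process Python sketch, not Hadoop code: `run_job` is a stand-in for the framework, the corpus is made up, and the `descending` flag is an assumption that stands in for a descending key comparator with a single reducer (the default sort order is ascending).

```python
from collections import defaultdict

def run_job(records, mapper, reducer, descending=False):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in mapper(key, value):
            groups[k].append(v)
    output = []
    # The shuffle hands keys to the reducer in sorted order.
    for k in sorted(groups, reverse=descending):
        output.extend(reducer(k, groups[k]))
    return output

# Job 1: count words.
def count_mapper(_filename, contents):
    for word in contents.split():
        yield word, 1

def count_reducer(word, values):
    yield word, sum(values)

# Job 2: swap key and value so the sort orders by frequency.
def swap_mapper(word, frequency):
    yield frequency, word

def swap_reducer(frequency, words):
    for word in words:
        yield word, frequency

corpus = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
frequencies = run_job(corpus, count_mapper, count_reducer)
unigrams = run_job(frequencies, swap_mapper, swap_reducer, descending=True)
print(unigrams)
```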
Unigrams: Java Mapper	

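// Imports for the enclosing source file (old org.apache.hadoop.mapred API)
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    // Emits (word, 1) for every whitespace-separated token in the line.
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());
            output.collect(word, new IntWritable(1));
        }
    }
}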
Unigrams: Java Reducer	


// Also needs: import java.util.Iterator;
public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums the counts collected for one word and emits (word, total).
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output,
            Reporter reporter) throws IOException {

        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
Unigrams: Driver	


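// Also needs: import org.apache.hadoop.fs.Path;
public void run(String inputPath, String outputPath) throws Exception
{
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    // Declare the output types; the defaults do not match (Text, IntWritable).
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(conf, new Path(inputPath));
    FileOutputFormat.setOutputPath(conf, new Path(outputPath));
    JobClient.runJob(conf);
}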
Configuration	

•  Unified Mechanism for	

  •  Configuring Daemons	

  •  Runtime environment for Jobs/Tasks	

•  Defaults: *-default.xml	

•  Site-Specific: *-site.xml	

•  final parameters
Example	

<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>head.server.node.com:9001</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://head.server.node.com:9000</value>
    </property>
    <property>
        <name>mapred.child.java.opts</name>
        <value>-Xmx512m</value>
        <final>true</final>
    </property>
    ....
</configuration>
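The layering of configuration files and the effect of a final parameter can be modelled in a few lines. The Python sketch below is a simplified model, not Hadoop's `Configuration` class: files are applied in order, later values override earlier ones, and a property marked final keeps its value. The second file and its values are hypothetical.

```python
import xml.etree.ElementTree as ET

def load(xml_text, conf=None, finals=None):
    """Apply one configuration file on top of what has been loaded so far."""
    conf = {} if conf is None else conf
    finals = set() if finals is None else finals
    for prop in ET.fromstring(xml_text).iter("property"):
        name = prop.findtext("name")
        if name in finals:
            continue  # a final value cannot be overridden by later files
        conf[name] = prop.findtext("value")
        if (prop.findtext("final") or "").strip() == "true":
            finals.add(name)
    return conf, finals

SITE = """<configuration>
  <property><name>mapred.child.java.opts</name><value>-Xmx512m</value><final>true</final></property>
  <property><name>fs.default.name</name><value>hdfs://head.server.node.com:9000</value></property>
</configuration>"""

# Hypothetical job-level overrides.
JOB = """<configuration>
  <property><name>mapred.child.java.opts</name><value>-Xmx4g</value></property>
  <property><name>fs.default.name</name><value>hdfs://other:9000</value></property>
</configuration>"""

conf, finals = load(SITE)
conf, finals = load(JOB, conf, finals)
print(conf)
```

The job's attempt to raise the child heap size is ignored because the site file marked that property final, while the non-final filesystem setting is overridden.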
Running Hadoop Jobs
Running a Job	

[milindb@gateway ~]$ hadoop jar \
$HADOOP_HOME/hadoop-examples.jar wordcount \
/data/newsarchive/20080923 /tmp/newsout
input.FileInputFormat: Total input paths to process : 4
mapred.JobClient: Running job: job_200904270516_5709	
mapred.JobClient: map 0% reduce 0%	
mapred.JobClient: map 3% reduce 0%	
mapred.JobClient: map 7% reduce 0%	
....	
mapred.JobClient: map 100% reduce 21%	
mapred.JobClient: map 100% reduce 31%	
mapred.JobClient: map 100% reduce 33%	
mapred.JobClient: map 100% reduce 66%	
mapred.JobClient: map 100% reduce 100%	
mapred.JobClient: Job complete: job_200904270516_5709
Running a Job	


mapred.JobClient: Counters: 18	
mapred.JobClient:   Job Counters	
mapred.JobClient:     Launched reduce tasks=1	
mapred.JobClient:     Rack-local map tasks=10	
mapred.JobClient:     Launched map tasks=25	
mapred.JobClient:     Data-local map tasks=1	
mapred.JobClient:   FileSystemCounters	
mapred.JobClient:     FILE_BYTES_READ=491145085	
mapred.JobClient:     HDFS_BYTES_READ=3068106537	
mapred.JobClient:     FILE_BYTES_WRITTEN=724733409	
mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307
Running a Job	


mapred.JobClient:   Map-Reduce Framework	
mapred.JobClient:     Combine output records=73828180	
mapred.JobClient:     Map input records=36079096	
mapred.JobClient:     Reduce shuffle bytes=233587524	
mapred.JobClient:     Spilled Records=78177976	
mapred.JobClient:     Map output bytes=4278663275	
mapred.JobClient:     Combine input records=371084796	
mapred.JobClient:     Map output records=313041519	
mapred.JobClient:     Reduce input records=15784903
JobTracker WebUI	

	
	
	
	
	
	
	
JobTracker Status
Jobs Status
Job Details
Job Counters
Job Progress
All Tasks
Task Details
Task Counters
Task Logs
MapReduce Dataflow
MapReduce
Job Submission
Initialization
Scheduling
Execution
Map Task
Sort Buffer
Reduce Task
Next Generation MapReduce



MapReduce Today	

    (Courtesy: Arun Murthy, Hortonworks)
Why ?	

• Scalability Limitations today	

 • Maximum cluster size: 4000 nodes	

 • Maximum Concurrent tasks: 40,000	

• Job Tracker SPOF	

• Fixed map and reduce containers (slots)	

 • Punishes pleasantly parallel apps	

Why ? (contd)	

• MapReduce is not suitable for every
  application	

• Fine-Grained Iterative applications	

 • HaLoop: Hadoop in a Loop	

• Message passing applications	

 • Graph Processing	

Requirements	


• Need scalable cluster resource manager

• Separate scheduling from resource
  management	

• Multi-Lingual Communication Protocols	


Bottom Line	

• @techmilind #mrng (MapReduce, Next
  Gen) is in reality, #rmng (Resource Manager,
  Next Gen)	

• Expect different programming paradigms to
  be implemented	

 • Including MPI (soon)	

Architecture	

  (Courtesy: Arun Murthy, Hortonworks)
The New World	

•  Resource Manager	

  •  Allocates resources (containers) to applications	

•  Node Manager	

  •  Manages containers on nodes	

•  Application Master	

  •  Specific to paradigm e.g. MapReduce application master,
      MPI application master etc	
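The division of labour between the three roles can be illustrated with a toy model. The Python classes below are invented for this sketch and do not mirror the real protocol or API: containers are sized by memory only, the resource manager uses a first-fit policy, and requests that cannot be satisfied are simply dropped instead of queued.

```python
class NodeManager:
    """Tracks the containers running on one node."""
    def __init__(self, name, memory_mb):
        self.name, self.free_mb, self.containers = name, memory_mb, []

    def launch(self, app, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append((app, memory_mb))

class ResourceManager:
    """Hands out containers; knows nothing about the programming paradigm."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app, memory_mb):
        # First fit: grant one container on the first node with enough memory.
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(app, memory_mb)
                return node.name
        return None  # nothing free; a real application master would wait

class ApplicationMaster:
    """Paradigm-specific: decides how many containers of what size to ask for."""
    def __init__(self, name, rm):
        self.name, self.rm = name, rm

    def request(self, count, memory_mb):
        grants = (self.rm.allocate(self.name, memory_mb) for _ in range(count))
        return [g for g in grants if g is not None]

rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
mr = ApplicationMaster("mapreduce-job", rm).request(3, 1024)
mpi = ApplicationMaster("mpi-job", rm).request(2, 2048)
print(mr, mpi)
```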



Container	


• In current terminology: A Task Slot	

• Slice of the node's hardware resources

• # of cores, virtual memory, disk size, disk and
  network bandwidth etc

  • Currently, only memory usage is sliced	



More Related Content

What's hot

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
Vigen Sahakyan
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
Mahmood Reza Esmaili Zand
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
tipanagiriharika
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
sudhakara st
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
Harshdeep Kaur
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Simplilearn
 
Sqoop
SqoopSqoop
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
Hive
HiveHive
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
PoojaShah174393
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
Alexey Grishchenko
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Simplilearn
 

What's hot (20)

Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Apache hive introduction
Apache hive introductionApache hive introduction
Apache hive introduction
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Hadoop and Big Data
Hadoop and Big DataHadoop and Big Data
Hadoop and Big Data
 
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
Hive Tutorial | Hive Architecture | Hive Tutorial For Beginners | Hive In Had...
 
Sqoop
SqoopSqoop
Sqoop
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hive
HiveHive
Hive
 
Unit-3_BDA.ppt
Unit-3_BDA.pptUnit-3_BDA.ppt
Unit-3_BDA.ppt
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
Hadoop Ecosystem | Hadoop Ecosystem Tutorial | Hadoop Tutorial For Beginners ...
 
HDFS Architecture
HDFS ArchitectureHDFS Architecture
HDFS Architecture
 

Viewers also liked

HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
sravya raju
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
Atul Kushwaha
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
Kannappan Sirchabesan
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
Jay
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
Phil Young
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
Ricardo Varela
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
Nick Dimiduk
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
Kevin Weil
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
royans
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
Hortonworks
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
DataWorks Summit
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
Hortonworks
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
harithavijay94
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
Peter Wood
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
DataWorks Summit
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
Uwe Printz
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
Abhishek Mallick
 

Viewers also liked (20)

HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Hadoop Technologies
Hadoop TechnologiesHadoop Technologies
Hadoop Technologies
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop Easy
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on Hadoop
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
 

Similar to Hadoop Overview & Architecture

Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
Milind Bhandarkar
 
Hadoop london
Hadoop londonHadoop london
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
EMC
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
Dhanashri Yadav
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
Jazan University
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
Sharjeel Imtiaz
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
Donald Miner
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
Wisely chen
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
ThoughtWorks
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
hansen3032
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
responseteam
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
York University
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
Fabio Fumarola
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
MaruthiPrasad96
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
Fabio Fumarola
 
Hadoop
HadoopHadoop
Apache Spark
Apache SparkApache Spark
Apache Spark
SugumarSarDurai
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
Michael Rainey
 

Similar to Hadoop Overview & Architecture (20)

Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilindHadoop Tutorial with @techmilind
Hadoop Tutorial with @techmilind
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Fundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and HiveFundamental of Big Data with Hadoop and Hive
Fundamental of Big Data with Hadoop and Hive
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Osd ctw spark
Osd ctw sparkOsd ctw spark
Osd ctw spark
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Large scale computing with mapreduce
Large scale computing with mapreduceLarge scale computing with mapreduce
Large scale computing with mapreduce
 
Map reduce and hadoop at mylife
Map reduce and hadoop at mylifeMap reduce and hadoop at mylife
Map reduce and hadoop at mylife
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
Hadoop
HadoopHadoop
Hadoop
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
SQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle ProfessionalSQL on Hadoop for the Oracle Professional
SQL on Hadoop for the Oracle Professional
 

Hadoop Overview & Architecture

  • 1. Hadoop Overview & Architecture Milind Bhandarkar Chief Scientist, Machine Learning Platforms, Greenplum, A Division of EMC (Twitter: @techmilind)
  • 2. About Me •  http://www.linkedin.com/in/milindb •  Founding member of Hadoop team at Yahoo! [2005-2010] •  Contributor to Apache Hadoop since v0.1 •  Built and led Grid Solutions Team at Yahoo! [2007-2010] •  Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) •  Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and EMC-Greenplum
  • 3. Agenda • Motivation • Hadoop • Map-Reduce • Distributed File System • Hadoop Architecture • Next Generation MapReduce • Q & A
  • 4. Hadoop At Scale (Some Statistics) • 40,000+ machines in 20+ clusters • Largest cluster is 4,000 machines • 170 Petabytes of storage • 1000+ users • 1,000,000+ jobs/month
  • 9. Big Datasets (Data-Rich Computing theme proposal. J. Campbell, et al, 2007)
  • 10. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)
  • 11. Storage Trends (Graph by Adam Leventhal, ACM Queue, Dec 2009)
  • 14. Search Assist • Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words))
  • 15. Search Assist
      // Input: List(URL, Text)
      foreach URL in Input :
          Tokens = Tokenize(Text(URL));
          foreach word in Tokens :
              Insert (word, Next(word, Tokens)) in Pairs;
              Insert (word, Previous(word, Tokens)) in Pairs;
      // Result: Pairs = List (word, RelatedWord)
      Group Pairs by word;
      // Result: GroupedPairs = List (word, List(RelatedWords))
      foreach word in GroupedPairs :
          Count RelatedWords in GroupedPairs;
      // Result: CountedPairs = List (word, List(RelatedWords, count))
      foreach word in CountedPairs :
          Sort Pairs(word, *) descending by count;
          choose Top 5 Pairs;
      // Result: List (word, Top5(RelatedWords))
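The pseudocode above can be sketched on a single machine as follows. This is a minimal illustration, not the production Search Assist code: `tokenize` and `related_words` are hypothetical names, and splitting on whitespace is an assumed stand-in for whatever tokenizer the real system used.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Assumed tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def related_words(pages, top_n=5):
    """pages: iterable of (url, text). Returns {word: [top related words]}."""
    counts = defaultdict(Counter)
    for _url, text in pages:
        tokens = tokenize(text)
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                counts[word][tokens[i + 1]] += 1  # Next(word, Tokens)
            if i > 0:
                counts[word][tokens[i - 1]] += 1  # Previous(word, Tokens)
    # Sort each word's neighbours by descending count and keep the top N.
    return {w: [r for r, _ in c.most_common(top_n)] for w, c in counts.items()}


pages = [
    ("http://example.org/1", "hadoop mapreduce hadoop hdfs"),
    ("http://example.org/2", "hadoop mapreduce"),
]
related = related_words(pages)
```

Grouping and counting are folded into one dictionary of counters here; on a cluster those steps are what the shuffle and reduce phases provide.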
  • 17. People You May Know • Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don't know Joe Smith already • Numbers: • 100 MM users • Average connections per user is 100
  • 18. People You May Know
      // Input: List(UserName, List(Connections))
      foreach u in UserList :                  // 100 MM
          foreach x in Connections(u) :        // 100
              foreach y in Connections(x) :    // 100
                  if (y not in Connections(u)) :
                      Count(u, y)++;           // 1 Trillion Iterations
          Sort (u,y) in descending order of Count(u,y);
          Choose Top 3 y;
          Store (u, {y0, y1, y2}) for serving;
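A small in-memory sketch of the triple loop above, assuming the whole graph fits in one dictionary (which is exactly what stops being true at 100 MM users). The function name is hypothetical, and the `y != u` check is an addition: the slide's pseudocode leaves it implicit, but without it every user would be suggested to themselves.

```python
from collections import Counter


def people_you_may_know(connections, top_n=3):
    """connections: {user: set of users}. Returns {user: [suggested users]}."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:
            for y in connections.get(x, ()):
                # Skip u itself and people u already knows.
                if y != u and y not in friends:
                    counts[y] += 1
        suggestions[u] = [y for y, _ in counts.most_common(top_n)]
    return suggestions


graph = {
    "a": {"b", "c"},
    "b": {"a", "d"},
    "c": {"a", "d", "e"},
    "d": {"b", "c"},
    "e": {"c"},
}
pymk = people_you_may_know(graph)
```

Here `a` shares two connections with `d` and one with `e`, so `d` is ranked first for `a`.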
  • 19. Performance • 101 Random accesses for each user • Assume 1 ms per random access • About 100 ms per user • 100 MM users • More than 100 days on a single machine
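The estimate can be checked with a few lines of arithmetic. Reading the 101 accesses as one lookup for the user plus one per connection is an assumption; the other numbers come from the slide.

```python
USERS = 100_000_000        # 100 MM users
ACCESSES_PER_USER = 101    # assumed: 1 lookup for the user + 100 connections
MS_PER_ACCESS = 1          # 1 ms per random access

ms_per_user = ACCESSES_PER_USER * MS_PER_ACCESS
total_seconds = USERS * ms_per_user / 1000
total_days = total_seconds / 86_400

print(f"{ms_per_user} ms per user, {total_days:.0f} days in total")
```

The exact figure is a little under 117 days, which the slide rounds down to 100.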
  • 21. Map Reduce • Primitives in Lisp (& other functional languages), 1970s • Google Paper, 2004 • http://labs.google.com/papers/mapreduce.html
  • 22. Map Output_List = Map (Input_List) Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  • 23. Reduce Output_Element = Reduce (Input_List) Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
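The two primitives map directly onto Python's built-in `map` and `functools.reduce`, shown here with the same numbers as the slides:

```python
from functools import reduce

numbers = range(1, 11)

# Map: apply Square to each element independently.
squares = list(map(lambda x: x * x, numbers))

# Reduce: fold the whole list into a single element with Sum.
total = reduce(lambda acc, x: acc + x, squares, 0)
```

Each call to the mapped function depends on one element only, while the fold carries an accumulator from element to element.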
  • 24. Parallelism • Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists
  • 25. Search Assist Map // Input: http://hadoop.apache.org Pairs = Tokenize_And_Pair ( Text ( Input ) ) // Example
  • 26. Search Assist Reduce // Input: GroupedList (word, GroupedList(words)) CountedPairs = CountOccurrences (word, RelatedWords) Output = { (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ... }
  • 27. Issues with Large Data • Map Parallelism: Chunking input data • Reduce Parallelism: Grouping related data • Dealing with failures & load imbalance
  • 29. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 1.0.3 • Latest Version: 2.0.0 (Alpha)
  • 30. Apache Hadoop • Reliable, Performant Distributed file system • MapReduce Programming framework • Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
  • 31. Problem: Bandwidth to Data • Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33 to 8 mins • Moving computation is more efficient than moving data • Need visibility into data placement
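The scan times follow from dividing each node's share of the data by its read bandwidth. A quick sketch, assuming decimal units (1 TB = 10^6 MB) and an evenly spread dataset; `scan_minutes` is a hypothetical helper name:

```python
def scan_minutes(dataset_tb, nodes, mb_per_sec):
    """Minutes for every node to scan its equal share of the dataset."""
    per_node_mb = dataset_tb * 1_000_000 / nodes
    return per_node_mb / mb_per_sec / 60


remote = scan_minutes(100, 1000, 10)       # remote storage
local_slow = scan_minutes(100, 1000, 50)   # local disk, low end
local_fast = scan_minutes(100, 1000, 200)  # local disk, high end
```

With these assumptions the remote case comes to roughly 167 minutes, close to the slide's rounded 165, and the local cases to about 33 and 8 minutes.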
  • 32. Problem: Scaling Reliably • Failure is not an option, it's a rule! • 1000 nodes, MTBF 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently
  • 33. Hadoop Goals • Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive
  • 35. Think MapReduce • Record = (Key, Value) • Key : Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output
  • 36. Seems Familiar ?
      cat /var/log/auth.log* | \
        grep "session opened" | cut -d' ' -f10 | \
        sort | \
        uniq -c > ~/userlist
  • 37. Map • Input: (Key1, Value1) • Output: List(Key2, Value2) • Projections, Filtering, Transformation
  • 38. Shuffle • Input: List(Key2, Value2) • Output • Sort(Partition(List(Key2, List(Value2)))) • Provided by Hadoop
  • 39. Reduce • Input: List(Key2, List(Value2)) • Output: List(Key3, Value3) • Aggregation
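The three phases can be modelled in one process to make the data flow concrete. This is a toy sketch, not Hadoop's implementation: `run_mapreduce` is a hypothetical helper, hash partitioning is an assumed default, and the log line layout is invented so that the mapper can mirror the `grep | cut` stages of the shell pipeline and the reducer the `uniq -c` stage.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer, num_partitions=2):
    """Toy single-process model of Input -> Map -> Shuffle -> Reduce -> Output."""
    # Map: each (key1, value1) record yields zero or more (key2, value2) pairs.
    mapped = [pair for key1, value1 in records for pair in mapper(key1, value1)]

    # Shuffle: partition by key, then group values per key in each partition.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for key2, value2 in mapped:
        partitions[hash(key2) % num_partitions][key2].append(value2)

    # Reduce: keys are visited in sorted order within a partition.
    output = []
    for partition in partitions:
        for key2 in sorted(partition):
            output.extend(reducer(key2, partition[key2]))
    return output


def login_mapper(offset, line):
    # Filtering and projection; the user name is assumed to be the last field.
    if "session opened" in line:
        yield line.split()[-1], 1


def login_reducer(user, ones):
    # Aggregation.
    yield user, sum(ones)


log = [
    (0, "sshd: session opened for user alice"),
    (1, "sshd: session closed for user alice"),
    (2, "sshd: session opened for user bob"),
    (3, "sshd: session opened for user alice"),
]
logins = sorted(run_mapreduce(log, login_mapper, login_reducer))
```

The user code is limited to the two small functions; partitioning, grouping, and sorting sit in the framework part, which is the piece Hadoop provides.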
  • 40. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is native • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output
  • 41. Hadoop Streaming • Thin Java wrapper for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() \t Value.toString() \n • Slower than Java programs • Allows for quick prototyping / debugging
  • 42. Hadoop Streaming
      $ bin/hadoop jar hadoop-streaming.jar \
          -input in-files -output out-dir \
          -mapper mapper.sh -reducer reducer.sh
      # mapper.sh
      sed -e 's/ /\n/g' | grep .
      # reducer.sh
      uniq -c | awk '{print $2 "\t" $1}'
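The same mapper and reducer could be written in Python, since Streaming only requires programs that read lines and write lines. The sketch below keeps the logic in two functions over iterables of lines so it can be tried without a cluster; the function names are hypothetical, and `split()` breaks on any whitespace, whereas the `sed` version breaks on single spaces only.

```python
from itertools import groupby


def map_lines(lines):
    """Like mapper.sh: emit one word per output line, dropping empty ones."""
    for line in lines:
        for word in line.split():
            yield word


def reduce_lines(sorted_words):
    """Like reducer.sh: count runs of identical adjacent words.

    Relies on the input being sorted, which the shuffle guarantees.
    """
    words = (w.rstrip("\n") for w in sorted_words)
    for word, run in groupby(words):
        yield f"{word}\t{sum(1 for _ in run)}"


words = list(map_lines(["a b  a\n", "\n", "b c\n"]))
counts = list(reduce_lines(sorted(words)))
```

In an actual Streaming job each function would live in its own script, be fed from standard input, and print its results to standard output.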
  • 43. Hadoop Distributed File System (HDFS)
  • 44. HDFS • Data is organized into files and directories • Files are divided into uniform sized blocks (default 128MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data
  • 45. HDFS • Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery
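To see what the block size and replication factor mean for one file, the counts can be worked out directly. A minimal sketch using the defaults quoted on the slides; `block_layout` is a hypothetical helper, and the last block is assumed to hold only the remaining bytes rather than being padded.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default block size from the slide
REPLICATION = 3                 # default replication factor from the slide


def block_layout(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = -(-file_size_bytes // block_size)  # integer ceiling division
    return blocks, blocks * replication, file_size_bytes * replication


one_gib = 1024 ** 3
layout = block_layout(one_gib)
```

A 1 GiB file therefore becomes 8 blocks and 24 block replicas spread over the cluster, occupying 3 GiB of raw disk.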
  • 47. HDFS Master (NameNode) • Manages filesystem namespace • File metadata (i.e. "inode") • Mapping inode to list of blocks + locations • Authorization & Authentication • Checkpoint & journal namespace changes
  • 48. Namenode • Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB
  • 49. Datanodes • Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS's files
  • 51. Example: Unigrams • Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency
  • 52. Unigrams
      $ cat ~/wikipedia.txt | sed -e 's/ /\n/g' | grep . | sort | uniq -c > ~/frequencies.txt
      $ cat ~/frequencies.txt | sort -n -k1,1 -r > ~/unigrams.txt
  • 53. MR for Unigrams
      mapper (filename, file-contents):
          for each word in file-contents:
              emit (word, 1)
      reducer (word, values):
          sum = 0
          for each value in values:
              sum = sum + value
          emit (word, sum)
  • 54. MR for Unigrams
      mapper (word, frequency):
          emit (frequency, word)
      reducer (frequency, words):
          for each word in words:
              emit (word, frequency)
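The two jobs above chain together: the first counts words, the second swaps key and value so that the framework's sort orders words by frequency. The sketch below runs both stages locally. `run_job` is a hypothetical stand-in for the framework, and the `descending` flag stands in for a reverse sort comparator, which the pseudocode does not show but the required descending output implies.

```python
from itertools import groupby


def run_job(records, mapper, reducer, descending=False):
    """Toy local job: map, sort by key, group by key, reduce."""
    mapped = [pair for record in records for pair in mapper(*record)]
    mapped.sort(key=lambda pair: pair[0], reverse=descending)
    output = []
    for key, group in groupby(mapped, key=lambda pair: pair[0]):
        output.extend(reducer(key, [value for _, value in group]))
    return output


def count_mapper(filename, contents):
    for word in contents.split():
        yield word, 1


def count_reducer(word, values):
    yield word, sum(values)


def swap_mapper(word, frequency):
    yield frequency, word


def swap_reducer(frequency, words):
    for word in words:
        yield word, frequency


docs = [("a.txt", "to be or not to be"), ("b.txt", "to do")]
frequencies = run_job(docs, count_mapper, count_reducer)
unigrams = run_job(frequencies, swap_mapper, swap_reducer, descending=True)
```

The second job does no computation of its own; the sorting work is done entirely by the shuffle.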
  • 55. Unigrams: Java Mapper
      // Slides 55-57 are members of one WordCount class. They need
      // java.io.IOException, java.util.Iterator, java.util.StringTokenizer,
      // org.apache.hadoop.fs.Path, org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
      public static class MapClass extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          // Emit (word, 1) for every whitespace-separated token in the line.
          String line = value.toString();
          StringTokenizer itr = new StringTokenizer(line);
          while (itr.hasMoreTokens()) {
            Text word = new Text(itr.nextToken());
            output.collect(word, new IntWritable(1));
          }
        }
      }
  • 56. Unigrams: Java Reducer
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          // Sum the counts collected for this word.
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
  • 57. Unigrams: Driver
      public void run(String inputPath, String outputPath) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        // Not on the original slide: declare the (Text, IntWritable) output
        // types, because the framework defaults do not match them.
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(conf, new Path(inputPath));
        FileOutputFormat.setOutputPath(conf, new Path(outputPath));
        JobClient.runJob(conf);
      }
  • 58. Configuration •  Unified Mechanism for •  Configuring Daemons •  Runtime environment for Jobs/Tasks •  Defaults: *-default.xml •  Site-Specific: *-site.xml •  final parameters
  • 59. Example
      <configuration>
        <property>
          <name>mapred.job.tracker</name>
          <value>head.server.node.com:9001</value>
        </property>
        <property>
          <name>fs.default.name</name>
          <value>hdfs://head.server.node.com:9000</value>
        </property>
        <property>
          <name>mapred.child.java.opts</name>
          <value>-Xmx512m</value>
          <final>true</final>
        </property>
        ....
      </configuration>
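The effect of a final parameter can be illustrated with a simplified model of the layering: defaults first, then site-specific values, then whatever a job asks for, with final keys ignoring the last layer. This is a sketch of the idea only, not Hadoop's Configuration class; `effective_config` is a hypothetical function and the default and job values are illustrative.

```python
def effective_config(defaults, site, job, final_keys):
    """Merge three layers; keys in final_keys keep their site value."""
    conf = dict(defaults)
    conf.update(site)
    for name, value in job.items():
        if name in final_keys:
            continue  # final parameters ignore job-level overrides
        conf[name] = value
    return conf


defaults = {"mapred.child.java.opts": "-Xmx200m", "mapred.reduce.tasks": "1"}
site = {"mapred.child.java.opts": "-Xmx512m"}
job = {"mapred.child.java.opts": "-Xmx2g", "mapred.reduce.tasks": "8"}

conf = effective_config(defaults, site, job, final_keys={"mapred.child.java.opts"})
```

The job's request for a larger heap is dropped because the site file marked that property final, while its request for more reduce tasks goes through.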
  • 61. Running a Job
      [milindb@gateway ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/newsarchive/20080923 /tmp/newsout
      input.FileInputFormat: Total input paths to process : 4
      mapred.JobClient: Running job: job_200904270516_5709
      mapred.JobClient:  map 0% reduce 0%
      mapred.JobClient:  map 3% reduce 0%
      mapred.JobClient:  map 7% reduce 0%
      ....
      mapred.JobClient:  map 100% reduce 21%
      mapred.JobClient:  map 100% reduce 31%
      mapred.JobClient:  map 100% reduce 33%
      mapred.JobClient:  map 100% reduce 66%
      mapred.JobClient:  map 100% reduce 100%
      mapred.JobClient: Job complete: job_200904270516_5709
  • 62. Running a Job
      mapred.JobClient: Counters: 18
      mapred.JobClient:   Job Counters
      mapred.JobClient:     Launched reduce tasks=1
      mapred.JobClient:     Rack-local map tasks=10
      mapred.JobClient:     Launched map tasks=25
      mapred.JobClient:     Data-local map tasks=1
      mapred.JobClient:   FileSystemCounters
      mapred.JobClient:     FILE_BYTES_READ=491145085
      mapred.JobClient:     HDFS_BYTES_READ=3068106537
      mapred.JobClient:     FILE_BYTES_WRITTEN=724733409
      mapred.JobClient:     HDFS_BYTES_WRITTEN=377464307
  • 63. Running a Job
      mapred.JobClient:   Map-Reduce Framework
      mapred.JobClient:     Combine output records=73828180
      mapred.JobClient:     Map input records=36079096
      mapred.JobClient:     Reduce shuffle bytes=233587524
      mapred.JobClient:     Spilled Records=78177976
      mapred.JobClient:     Map output bytes=4278663275
      mapred.JobClient:     Combine input records=371084796
      mapred.JobClient:     Map output records=313041519
      mapred.JobClient:     Reduce input records=15784903
  • 83. Next Generation MapReduce
  • 84. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
  • 85. Why ? • Scalability Limitations today • Maximum cluster size: 4000 nodes • Maximum Concurrent tasks: 40,000 • Job Tracker SPOF • Fixed map and reduce containers (slots) • Punishes pleasantly parallel apps
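One way to read the last two bullets: with fixed slots, a map-only ("pleasantly parallel") job can never use the capacity reserved for reducers. The toy comparison below makes that concrete. Both functions and the node sizes are hypothetical, and real schedulers weigh far more than task counts.

```python
def busy_with_slots(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Tasks running at once when slots are typed as map or reduce."""
    return min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)


def busy_with_containers(containers, map_tasks, reduce_tasks):
    """Tasks running at once when any container can host any task."""
    return min(containers, map_tasks + reduce_tasks)


# A node with capacity for six tasks, running a map-only job of 100 tasks.
with_slots = busy_with_slots(map_slots=4, reduce_slots=2, map_tasks=100, reduce_tasks=0)
with_containers = busy_with_containers(containers=6, map_tasks=100, reduce_tasks=0)
```

Under the fixed split a third of this node sits idle for the whole job, while untyped containers keep all six busy.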
  • 86. Why ? (contd) • MapReduce is not suitable for every application • Fine-Grained Iterative applications • HaLoop: Hadoop in a Loop • Message passing applications • Graph Processing
  • 87. Requirements • Need scalable cluster resource manager • Separate scheduling from resource management • Multi-Lingual Communication Protocols
  • 88. Bottom Line • @techmilind #mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen) • Expect different programming paradigms to be implemented • Including MPI (soon)
  • 89. Architecture (Courtesy: Arun Murthy, Hortonworks)
  • 90. The New World •  Resource Manager •  Allocates resources (containers) to applications •  Node Manager •  Manages containers on nodes •  Application Master •  Specific to paradigm e.g. MapReduce application master, MPI application master etc
  • 91. Container • In current terminology: A Task Slot • Slice of the node's hardware resources • # of cores, virtual memory, disk size, disk and network bandwidth etc • Currently, only memory usage is sliced