Hadoop Product Family
and Ecosystem
PC Liao
Agenda

• What is BigData?
• What is the problem?
• Hadoop
  – Introduction to Hadoop
  – Hadoop components
  – What sort of problems can be solved with Hadoop?
• Hadoop ecosystem
• Conclusion
What is BigData?

• A set of files
• A database
• A single file
The Data-Driven World

• Modern systems have to deal with far more data than
  was the case in the past
  – Organizations are generating huge amounts of data
  – That data has inherent value, and cannot be discarded
• Examples:
  – Yahoo – over 170PB of data
  – Facebook – over 30PB of data
  – eBay – over 5PB of data




• Many organizations are generating data at a rate of
  terabytes per day
What is the problem?

• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the
  computing power of a single machine
  – Faster processor, more RAM
• Distributed systems evolved to allow developers to use
  multiple machines for a single job
  – At compute time, data is copied to the compute nodes
What is the problem?

• Getting the data to the processors
  becomes the bottleneck


• Quick calculation
   – Typical disk data transfer rate:
      • 75MB/sec
   – Time taken to transfer 100GB of data
     to the processor:
      • approx. 22   minutes!
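       (Spelled out, taking 1 GB ≈ 1,024 MB: 100 GB ≈ 102,400 MB;
        102,400 MB ÷ 75 MB/sec ≈ 1,365 sec ≈ 22.8 minutes)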
What is the problem?

• Failure of a component may cost a lot
• What do we need when a job fails?
  – May result in a graceful degradation of application performance,
    but entire system does not completely fail
  – Should not result in the loss of any data
  – Would not affect the outcome of the job
Big Data Solutions by Industries
The most common problems Hadoop can solve
Threat Analysis/Trade Surveillance

• Challenge:
  – Detecting threats in the form of fraudulent activity or attacks
     • Large data volumes involved
     • Like looking for a needle in a haystack

• Solution with Hadoop:
  – Parallel processing over huge datasets
  – Pattern recognition to identify anomalies
      • i.e., threats

• Typical Industry:
  – Security, Financial Services
Big Data Use Case
Smart Protection Network
• Challenge
   – Information accessibility and transparency problems for
     threat researchers due to the size and source of data
     (volume, variety, and velocity)

• Size of Data
   – Overall Data
        • Data sources: 20+
        • Data fields: 1000+
        • Daily new records: 23 Billion+
        • Daily new data size: 4TB+
   – SPN Smart Feedback
        • Feedback components: 26
        • Data fields: 300+
        • Daily new file counts: 6 Million+
        • Daily new records: 90 Million+
        • Daily new data size: 261GB+
Index=“vsapi” zbot
Hadoop Family and Ecosystem
Recommendation Engine

• Challenge:
  – Using user data to predict which products to recommend
• Solution with Hadoop:
  – Batch processing framework
     • Allows execution in parallel over large datasets
  – Collaborative filtering
     • Collecting 'taste' information from many users
     • Utilizing that information to predict what similar users like

• Typical Industry
  – ISP, Advertising
Walmart Case

• Diapers, Beer, Friday → Revenue ?
Hadoop!

• Apache Hadoop project
  – Inspired by Google's MapReduce and Google File System
    papers
• Open sourced, flexible and available architecture for
  large scale computation and data processing on a
  network of commodity hardware
• Open source software + commodity hardware
  – Reduces IT costs
Hadoop Concepts

• Distribute the data as it is initially stored in the system
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
Hadoop Components

• Hadoop consists of two core components
  – The Hadoop Distributed File System (HDFS)
  – MapReduce Software Framework
• There are many other projects based around core
  Hadoop
  – Often referred to as the 'Hadoop Ecosystem'
  – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc

Hadoop Components: HDFS

• HDFS, the Hadoop Distributed File System, is
  responsible for storing data on the cluster
• Two roles in HDFS
  – Namenode: Record metadata
  – Datanode: Store data



How Files Are Stored: Example

• NameNode holds metadata for the data files
• DataNodes hold the actual blocks
  – Each block is replicated three times on the cluster
HDFS: Points To Note

• When a client application wants to read a file:
  – It communicates with the NameNode to determine which blocks
    make up the file, and which DataNodes those blocks reside on
  – It then communicates directly with the DataNodes to read
    the data
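As a rough sketch of that read path (not from the original deck), a minimal client using Hadoop's FileSystem API; the NameNode URI and file path are made-up examples:

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The client asks the NameNode for block locations behind this call...
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
    // ...and then streams the blocks directly from the DataNodes
    InputStream in = fs.open(new Path("/user/demo/input.txt"));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}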
Hadoop Components: MapReduce

• MapReduce is a method for distributing a task across
  multiple nodes
• It works like a Unix pipeline:
  – cat input | grep | sort     | uniq -c | cat > output
  – Input | Map | Shuffle & Sort | Reduce | Output
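  As a concrete single-machine analogue (illustrative only; assumes whitespace-separated words in input.txt), the word-count version of that pipeline would be:

    cat input.txt | tr -s ' ' '\n' | sort | uniq -c > output.txt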



Features of MapReduce

• Automatic parallelization and distribution
• Automatic re-execution on failure
• Locality optimizations
• MapReduce abstracts all the 'housekeeping' away from
  the developer
  – Developer can concentrate simply on writing the Map and
    Reduce functions
Example: Word Count

• Word count is challenging over massive amounts of
  data
  – Using a single compute node would be too time-consuming
  – Number of unique words can easily exceed the RAM
• MapReduce breaks complex tasks down into smaller
  elements which can be executed in parallel
• More nodes, faster results
Word Count Example

• Map input – Key: offset, Value: line
    0: The cat sat on the mat
    22: The aardvark sat on the sofa
• Map output – Key: word, Value: count
• Reduce output – Key: word, Value: sum of counts
The Hadoop Ecosystem
Growing Hadoop Ecosystem

• The term 'Hadoop' is taken to be the combination of
  HDFS and MapReduce
• There are numerous other projects surrounding Hadoop
  – Typically referred to as the 'Hadoop Ecosystem'
     •   Zookeeper
     •   Hive and Pig
     •   HBase
     •   Flume
     •   Other Ecosystem Projects
          – Sqoop
          – Oozie
          – Hue
          – Mahout
The Ecosystem is the System

• Hadoop has become the kernel of the distributed
  operating system for Big Data
• No one uses the kernel alone
• A collection of projects at Apache
Relation Map

• Hue (Web Console)
• Mahout (Data Mining)
• Oozie (Job Workflow & Scheduling)
• Zookeeper (Coordination)
• Sqoop/Flume (Data Integration)
• Pig/Hive (Analytical Language)
• MapReduce Runtime (Distributed Programming Framework)
• HBase (Column NoSQL DB)
• Hadoop Distributed File System (HDFS) – the storage layer underneath all of the above
Zookeeper – Coordination Framework

What is ZooKeeper

• A centralized service for
  – Maintaining configuration information
  – Providing distributed synchronization
• A set of tools to build distributed applications that can
  safely handle partial failures
• ZooKeeper was designed to store coordination data
  – Status information
  – Configuration
  – Location information
Why use ZooKeeper?

• Manage configuration across nodes
• Implement reliable messaging
• Implement redundant services
• Synchronize process execution
ZooKeeper Architecture




   – All servers store a copy of the data (in memory)
   – A leader is elected at startup
   – 2 roles – leader and follower
      • Followers service clients; all updates go through the leader
     • Update responses are sent when a majority of servers have persisted the
        change
   – HA support
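A minimal sketch (not from the original deck) using the standard ZooKeeper Java client; the connection string, znode path, and data are invented for illustration:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
  public static void main(String[] args) throws Exception {
    // Connect to the ensemble; reads can be served by any server, writes go through the leader
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, null); // no watcher in this simple example
    // Store a small piece of configuration as a znode
    zk.create("/demo-config", "v1".getBytes(),
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    // Read it back
    byte[] data = zk.getData("/demo-config", false, null);
    System.out.println(new String(data));
    zk.close();
  }
}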
HBase – Column NoSQL DB

Structured-data vs Raw-data
HBase – Inspired by

• Apache open source project
• Inspired by Google BigTable
• Non-relational, distributed database written in Java
• Coordinated by Zookeeper
Row & Column Oriented
HBase – Data Model

• Cells are “versioned”
• Table rows are sorted by row key
• Region – a row range [start-key:end-key]
Architecture

• Master Server (HMaster)
  – Assigns regions to regionservers
  – Monitors the health of regionservers
• RegionServers
  – Contain regions and handle client read/write requests
HBase – Workflow
When to use HBase

• Need random, low latency access to the data
• Application has a variable schema where each row is
  slightly different
• Columns can be added as needed
• Most columns are NULL in each row
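A minimal sketch (not from the original deck) using the classic HTable-based Java client of that era; the table name 'users', column family 'info', and values are invented and assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");   // assumes table 'users' with family 'info' exists
    // Random, low-latency write
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("a@example.com"));
    table.put(put);
    // Random, low-latency read
    Result r = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        r.getValue(Bytes.toBytes("info"), Bytes.toBytes("email"))));
    table.close();
  }
}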
Flume / Sqoop – Data Integration Framework

What's the problem with data collection?

• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own
  collection path
What is Flume (and how can it help?)



• A distributed data collection service
• Efficiently collects, aggregates, and moves large
  amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
Flume: High-Level Overview
• Logical Node
• Source
• Sink
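As a reference point, a minimal sketch of a single flow in the newer Flume NG properties format (not the master/node model described above); the agent name, source command, and HDFS path are invented:

# one agent: a source that tails a log, a memory channel, and an HDFS sink
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /var/log/app.log
agent1.sources.r1.channels = c1

agent1.channels.c1.type = memory

agent1.sinks.k1.type = hdfs
agent1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
agent1.sinks.k1.channel = c1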
Architecture

• Basic diagram
  – One master controls multiple nodes
Architecture

• Multiple masters control multiple nodes
An example flow
Flume / Sqoop – Data Integration Framework

Sqoop

• Easy, parallel database import/export
• What do you want to do?
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into an RDBMS
What is Sqoop

• A suite of tools that connect Hadoop and database
  systems
• Import tables from databases into HDFS for deep
  analysis
• Export MapReduce results back to a database for
  presentation to end-users
• Provides the ability to import from SQL databases
  straight into your Hive data warehouse
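For example, an import and a matching export might look like the following sketch (the JDBC URL, tables, and directories are placeholders):

  sqoop import --connect jdbc:mysql://dbhost/sales --table orders \
    --target-dir /user/demo/orders -m 4

  sqoop export --connect jdbc:mysql://dbhost/sales --table order_stats \
    --export-dir /user/demo/order_stats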
How Sqoop helps

• The Problem
  – Structured data in traditional databases cannot be easily
    combined with complex data stored in HDFS
• Sqoop (SQL-to-Hadoop)
  – Easy import of data from many databases to HDFS
  – Generate code for use in MapReduce applications
Sqoop - import process
Sqoop - export process

• Exports are performed in parallel using MapReduce
Why Sqoop

• JDBC-based implementation
  – Works with many popular database vendors
• Auto-generation of tedious user-side code
  – Write MapReduce applications to work with your data, faster
• Integration with Hive
  – Allows you to stay in a SQL-based environment
Sqoop - JOB
• Job management options




• E.g., sqoop job --create myjob -- import --connect xxxxxxx
  --table mytable
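• A job saved this way can later be run with sqoop job --exec myjob,
  and existing saved jobs listed with sqoop job --list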
Pig / Hive – Analytical Language

Why Hive and Pig?

• Although MapReduce is very powerful, it can also be
  complex to master
• Many organizations have business or data analysts who
  are skilled at writing SQL queries, but not at writing Java
  code
• Many organizations have programmers who are skilled
  at writing code in scripting languages
• Hive and Pig are two projects which evolved separately
  to help such people analyze huge amounts of data via
  MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
Hive – Developed by Facebook

• What is Hive?
  – An SQL-like interface to Hadoop
• Data Warehouse infrastructure that provides data
  summarization and ad hoc querying on top of Hadoop
  – MapReduce for execution
  – HDFS for storage
• Hive Query Language
  – Basic-SQL : Select, From, Join, Group-By
  – Equi-Join, Muti-Table Insert, Multi-Group-By
  – Batch query
 SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid
Pig – Initiated by Yahoo!




• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• A simpler way to write MapReduce programs
• Easy to understand
• Easy to debug

                     A = LOAD 'a.txt' AS (id, name, age, ...);
                     B = LOAD 'b.txt' AS (id, address, ...);
                     C = JOIN A BY id, B BY id;
                     STORE C INTO 'c.txt';
Hive vs. Pig

                      Hive                     Pig
Language              HiveQL (SQL-like)        Pig Latin, a scripting language
Schema                Table definitions        A schema is optionally defined
                      stored in a metastore    at runtime
Programmatic Access   JDBC, ODBC               PigServer
WordCount Example

• Input
  Hello World Bye World
  Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  < Hello, 1>
  < World, 1>
  < Bye, 1>
  < World, 1>
  < Hello, 1>
  < Hadoop, 1>
  < Goodbye, 1>
  < Hadoop, 1>
• The reduce just sums up the values
  < Bye, 1>
  < Goodbye, 1>
  < Hadoop, 2>
  < Hello, 2>
  < World, 2>
WordCount Example In MapReduce
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every token
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);          // so the cluster can locate this code
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)

    job.waitForCompletion(true);
  }
}
WordCount Example By Pig


-- assumes one token per line in wordcount/input
A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);

B = GROUP A BY token;

C = FOREACH B GENERATE group, COUNT(A) AS count;

DUMP C;
WordCount Example By Hive

CREATE TABLE wordcount (token STRING);

LOAD DATA LOCAL INPATH 'wordcount/input'
OVERWRITE INTO TABLE wordcount;


SELECT token, count(*) FROM wordcount GROUP BY token;
Oozie – Job Workflow & Scheduling

What is Oozie?

• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
                       (Diagram: a small workflow made up of Job 1 through Job 5)
Why Oozie?

• Why use Oozie instead of just cascading jobs one
  after another?
• Major flexibility
  – Start, Stop, Suspend, and re-run jobs
• Oozie allows you to restart from a failure
  – You can tell Oozie to restart a job from a specific node in the
    graph or to skip specific failed nodes
High Level Architecture

• Web Service API
• Database stores:
  – Workflow definitions
  – Currently running workflow instances, including instance states
    and variables



                 (Diagram: clients call the WS API of the Oozie Tomcat web-app,
                  which submits work to Hadoop/Pig/HDFS and persists state in a DB)
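As a rough sketch (not from the original deck), a workflow is described in a workflow.xml; the action below is a plain map-reduce action, and the class names, property values, and parameters are illustrative placeholders:

<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-wf">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- old-API mapper/reducer and I/O directories; values are placeholders -->
        <property><name>mapred.mapper.class</name><value>org.example.WordCountMapper</value></property>
        <property><name>mapred.reducer.class</name><value>org.example.WordCountReducer</value></property>
        <property><name>mapred.input.dir</name><value>${inputDir}</value></property>
        <property><name>mapred.output.dir</name><value>${outputDir}</value></property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>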
How is it triggered?

• Time
   – Execute your workflow every 15 minutes

     00:15        00:30        00:45        01:00

• Time and Data
   – Materialize your workflow every hour, but only run it when
     the input data is ready ("Input Data Exists?")

     01:00        02:00        03:00        04:00
Example Workflow
Oozie use criteria

• Need to launch, control, and monitor jobs from your Java apps
  – Java Client API / Command Line Interface
• Need to control jobs from anywhere
  – Web Service API
• Have jobs that you need to run every hour, day, or week
• Need to receive notification when a job is done
  – Email when a job is complete
Hue – Web Console

Hue – Developed by Cloudera

• Hadoop User Experience
• Open source project (Apache-licensed)
• HUE is a web UI for Hadoop
• Platform for building custom applications with a nice UI
  library
Hue

• HUE comes with a suite of applications
  – File Browser: Browse HDFS; change permissions and
    ownership; upload, download, view and edit files.
  – Job Browser: View jobs, tasks, counters, logs, etc.
  – Beeswax: Wizards to help create Hive tables, load data, run and
    manage Hive queries, and download results in Excel format.
Hue: File Browser UI
Hue: Beeswax UI
Mahout – Data Mining

What is Mahout?

• Machine-learning tool
• Distributed and scalable machine learning algorithms on
  the Hadoop platform
• Makes building intelligent applications easier and faster
Why Mahout?

• Current state of ML libraries
  –   Lack Community
  –   Lack Documentation and Examples
  –   Lack Scalability
  –   Are Research oriented
Mahout – scale

• Scale to large datasets
  – Hadoop MapReduce implementations that scale linearly with
    data
• Scalable to support your business case
  – Mahout is distributed under a commercially friendly Apache
    Software license
• Scalable community
  – Vibrant, responsive and diverse
Mahout – four use cases

• Mahout machine learning algorithms
  – Recommendation mining: takes users' behavior and finds items
    that a specified user might like
  – Clustering: takes e.g. text documents and groups them based
    on related document topics
  – Classification: learns from existing categorized documents what
    documents of a specific category look like and is able to assign
    unlabeled documents to the appropriate category
  – Frequent itemset mining: takes a set of item groups (e.g. terms
    in a query session, shopping cart contents) and identifies which
    individual items typically appear together
Use case Example

• Predict what the user likes based on
  – His/her historical behavior
  – The aggregate behavior of people similar to them
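A minimal sketch of that idea (not from the original deck) using Mahout's Taste recommender API; the ratings file, user id, and neighborhood size are invented for illustration:

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
  public static void main(String[] args) throws Exception {
    // ratings.csv: userID,itemID,preference
    DataModel model = new FileDataModel(new File("ratings.csv"));
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    // Top 3 items for user 42, based on the behavior of similar users
    List<RecommendedItem> items = recommender.recommend(42, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " " + item.getValue());
    }
  }
}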
Conclusion

Today, we introduced:
• Why Hadoop is needed
• The basic concepts of HDFS and MapReduce
• What sort of problems can be solved with Hadoop
• What other projects are included in the Hadoop
  ecosystem
Recap – Hadoop Ecosystem

• Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling),
  Zookeeper (Coordination), Sqoop/Flume (Data Integration), Pig/Hive (Analytical
  Language), MapReduce Runtime (Distributed Programming Framework), HBase (Column NoSQL DB)
• All built on top of the Hadoop Distributed File System (HDFS)
Questions?
Thank you!

More Related Content

What's hot

Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop IntroductionDzung Nguyen
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemInSemble
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Thanh Nguyen
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAmazon Web Services
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentContinuent
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview EMC
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & HadoopEdureka!
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop TechnologyManish Borkar
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaEdureka!
 

What's hot (20)

Hadoop
HadoopHadoop
Hadoop
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Introduction To Hadoop Ecosystem
Introduction To Hadoop EcosystemIntroduction To Hadoop Ecosystem
Introduction To Hadoop Ecosystem
 
Apache Hadoop at 10
Apache Hadoop at 10Apache Hadoop at 10
Apache Hadoop at 10
 
Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1Overview of Big data, Hadoop and Microsoft BI - version1
Overview of Big data, Hadoop and Microsoft BI - version1
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by IntelAWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
AWS Summit Sydney 2014 | Secure Hadoop as a Service - Session Sponsored by Intel
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Keynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at ContinuentKeynote: Getting Serious about MySQL and Hadoop at Continuent
Keynote: Getting Serious about MySQL and Hadoop at Continuent
 
An Introduction to the World of Hadoop
An Introduction to the World of HadoopAn Introduction to the World of Hadoop
An Introduction to the World of Hadoop
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
Hadoop Overview
Hadoop Overview Hadoop Overview
Hadoop Overview
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Introduction to Hadoop Technology
Introduction to Hadoop TechnologyIntroduction to Hadoop Technology
Introduction to Hadoop Technology
 
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | EdurekaWhat are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
What are Hadoop Components? Hadoop Ecosystem and Architecture | Edureka
 

Viewers also liked

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01gianmerlino
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with HadoopOReillyStrata
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDataWorks Summit
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Sudhir Tonse
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on HadoopYuta Imai
 

Viewers also liked (6)

Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01Druid at SF Big Analytics 2015-12-01
Druid at SF Big Analytics 2015-12-01
 
Large scale ETL with Hadoop
Large scale ETL with HadoopLarge scale ETL with Hadoop
Large scale ETL with Hadoop
 
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming DataDruid: Sub-Second OLAP queries over Petabytes of Streaming Data
Druid: Sub-Second OLAP queries over Petabytes of Streaming Data
 
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
Big Data Pipeline and Analytics Platform Using NetflixOSS and Other Open Sour...
 
OLAP options on Hadoop
OLAP options on HadoopOLAP options on Hadoop
OLAP options on Hadoop
 
Scalable Real-time analytics using Druid
Scalable Real-time analytics using DruidScalable Real-time analytics using Druid
Scalable Real-time analytics using Druid
 

Similar to Hadoop Family and Ecosystem

Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandRichard McDougall
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Hortonworks
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemAL500745425
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop StoryMichael Rys
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceSteve Loughran
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopSteve Watt
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataCyanny LIANG
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft PlatformJesus Rodriguez
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data Mindgrub Technologies
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברגTaldor Group
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 

Similar to Hadoop Family and Ecosystem (20)

Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011Keynote from ApacheCon NA 2011
Keynote from ApacheCon NA 2011
 
NYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop EchosystemNYC-Meetup- Introduction to Hadoop Echosystem
NYC-Meetup- Introduction to Hadoop Echosystem
 
Big data
Big dataBig data
Big data
 
Microsoft's Hadoop Story
Microsoft's Hadoop StoryMicrosoft's Hadoop Story
Microsoft's Hadoop Story
 
HDP-1 introduction for HUG France
HDP-1 introduction for HUG FranceHDP-1 introduction for HUG France
HDP-1 introduction for HUG France
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Big Data in the Microsoft Platform
Big Data in the Microsoft PlatformBig Data in the Microsoft Platform
Big Data in the Microsoft Platform
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
Bw tech hadoop
Bw tech hadoopBw tech hadoop
Bw tech hadoop
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
 
4. hadoop גיא לבנברג
4. hadoop  גיא לבנברג4. hadoop  גיא לבנברג
4. hadoop גיא לבנברג
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 

More from tcloudcomputing-tw

Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Futuretcloudcomputing-tw
 
Session 4 - News from ACS Community
Session 4 - News from ACS CommunitySession 4 - News from ACS Community
Session 4 - News from ACS Communitytcloudcomputing-tw
 
Session 3 - CloudStack Test Automation and CI
Session 3 - CloudStack Test Automation and CISession 3 - CloudStack Test Automation and CI
Session 3 - CloudStack Test Automation and CItcloudcomputing-tw
 
Session 2 - CloudStack Usage and Application (2013.Q3)
Session 2 - CloudStack Usage and Application (2013.Q3)Session 2 - CloudStack Usage and Application (2013.Q3)
Session 2 - CloudStack Usage and Application (2013.Q3)tcloudcomputing-tw
 
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)tcloudcomputing-tw
 
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-22012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2tcloudcomputing-tw
 
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-12012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1tcloudcomputing-tw
 

More from tcloudcomputing-tw (7)

Hadoop Security Now and Future
Hadoop Security Now and FutureHadoop Security Now and Future
Hadoop Security Now and Future
 
Session 4 - News from ACS Community
Session 4 - News from ACS CommunitySession 4 - News from ACS Community
Session 4 - News from ACS Community
 
Session 3 - CloudStack Test Automation and CI
Session 3 - CloudStack Test Automation and CISession 3 - CloudStack Test Automation and CI
Session 3 - CloudStack Test Automation and CI
 
Session 2 - CloudStack Usage and Application (2013.Q3)
Session 2 - CloudStack Usage and Application (2013.Q3)Session 2 - CloudStack Usage and Application (2013.Q3)
Session 2 - CloudStack Usage and Application (2013.Q3)
 
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)
Session 1 - CloudStack Plugin Structure and Implementation (2013.Q3)
 
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-22012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-2
 
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-12012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1
2012 CloudStack Design Camp in Taiwan--- CloudStack Overview-1
 

Recently uploaded

UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8UiPath Studio Web workshop series - Day 8
UiPath Studio Web workshop series - Day 8DianaGray10
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Hadoop Family and Ecosystem

  • 17. Hadoop Concepts • Distribute the data as it is initially stored in the system • Individual nodes can work on data local to those nodes • Users can focus on developing applications.
  • 18. Hadoop Components • Hadoop consists of two core components – The Hadoop Distributed File System (HDFS) – The MapReduce software framework • There are many other projects based around core Hadoop – Often referred to as the 'Hadoop Ecosystem' – Pig, Hive, HBase, Flume, Oozie, Sqoop, etc. [Ecosystem diagram: Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), Zookeeper (Coordination), Sqoop/Flume (Data Integration), Pig/Hive (Analytical Language), MapReduce Runtime (Dist. Programming Framework), HBase (Column NoSQL DB), HDFS]
  • 19. Hadoop Components: HDFS • HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster • Two roles in HDFS – NameNode: records metadata – DataNode: stores data [Ecosystem diagram highlighting HDFS]
  • 20. How Files Are Stored: Example • NameNode holds metadata for the data files • DataNodes hold the actual blocks • Each block is replicated three times on the cluster
  • 21. HDFS: Points To Note • When a client application wants to read a file: • It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those blocks reside on • It then communicates directly with the DataNodes to read the data
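A minimal client-side sketch of this read path using the Hadoop FileSystem Java API (the cluster address and file path below are illustrative assumptions): the library performs the NameNode lookup and then streams block data directly from the DataNodes.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // fs.open() asks the NameNode for block locations, then reads from the DataNodes
        try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/data/sample.txt"))))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
      }
    }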
  • 22. Hadoop Components: MapReduce • MapReduce is a method for distributing a task across multiple nodes • It works like a Unix pipeline: – cat input | grep | sort | uniq -c | cat > output – Input | Map | Shuffle & Sort | Reduce | Output [Ecosystem diagram highlighting MapReduce Runtime]
  • 23. Features of MapReduce • Automatic parallelization and distribution • Automatic re-execution on failure • Locality optimizations • MapReduce abstracts all the 'housekeeping' away from the developer – The developer can concentrate simply on writing the Map and Reduce functions [Ecosystem diagram highlighting MapReduce Runtime]
  • 24. Example: word count • Word count is challenging over massive amounts of data – Using a single compute node would be too time-consuming – The number of unique words can easily exceed available RAM • MapReduce breaks complex tasks down into smaller elements which can be executed in parallel • More nodes mean faster processing
  • 25. Word Count Example • Map input – Key: byte offset, Value: line of text (e.g., 0: "The cat sat on the mat", 22: "The aardvark sat on the sofa") • Map output – Key: word, Value: count • Reduce output – Key: word, Value: sum of counts
  • 27. Growing Hadoop Ecosystem • The term 'Hadoop' is taken to be the combination of HDFS and MapReduce • There are numerous other projects surrounding Hadoop – Typically referred to as the 'Hadoop Ecosystem' • Zookeeper • Hive and Pig • HBase • Flume • Other Ecosystem Projects – Sqoop – Oozie – Hue – Mahout
  • 28. The Ecosystem is the System • Hadoop has become the kernel of the distributed operating system for Big Data • No one uses the kernel alone • A collection of projects at Apache
  • 29. Relation Map [Ecosystem diagram: Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), Zookeeper (Coordination), Sqoop/Flume (Data Integration), Pig/Hive (Analytical Language), MapReduce Runtime (Dist. Programming Framework), HBase (Column NoSQL DB), HDFS (Hadoop Distributed File System)]
  • 30. Zookeeper – Coordination Framework [Ecosystem diagram highlighting Zookeeper]
  • 31. What is ZooKeeper • A centralized service for maintaining configuration information and providing distributed synchronization • A set of tools to build distributed applications that can safely handle partial failures • ZooKeeper was designed to store coordination data – Status information – Configuration – Location information
  • 32. Why use ZooKeeper? • Manage configuration across nodes • Implement reliable messaging • Implement redundant services • Synchronize process execution
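As an illustration of the configuration-management use case, here is a minimal sketch with the ZooKeeper Java client (the ensemble address and znode name are assumptions for the example): one process writes a configuration value to a znode and any node in the cluster can read it back.

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigExample {
      public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble address and session timeout in milliseconds
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 3000, event -> {});
        String path = "/app-config";                      // hypothetical configuration znode
        byte[] value = "jdbc:mysql://dbhost/app".getBytes();
        if (zk.exists(path, false) == null) {
          zk.create(path, value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } else {
          zk.setData(path, value, -1);                    // -1 means "any version"
        }
        byte[] current = zk.getData(path, false, null);   // other nodes read the same znode
        System.out.println(new String(current));
        zk.close();
      }
    }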
  • 33. ZooKeeper Architecture – All servers store a copy of the data (in memory) – A leader is elected at startup – 2 roles – leader and follower • Followers service clients, all updates go through leader • Update responses are sent when a majority of servers have persisted the change – HA support
  • 34. HBase – Column NoSQL DB [Ecosystem diagram highlighting HBase]
  • 36. HBase – inspired by • An Apache open source project • Inspired by Google's BigTable • A non-relational, distributed database written in Java • Coordinated by ZooKeeper
  • 37. Row & Column Oriented
  • 38. Hbase – Data Model • Cells are “versioned” • Table rows are sorted by row key • Region – a row range [start-key:end-key]
  • 39. Architecture • Master Server (HMaster) – Assigns regions to RegionServers – Monitors the health of RegionServers • RegionServers – Contain regions and handle client read/write requests
  • 41. When to use HBase • You need random, low-latency access to the data • The application has a variable schema where each row is slightly different • Columns can be added on the fly • Most columns are NULL in any given row (a basic read/write sketch follows)
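A minimal sketch of random reads and writes with the HBase Java client API (the table name "users", the "cf" column family, and the "email" qualifier are assumptions for the example; the table is assumed to already exist):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {
          // Write one cell: row key "user42", column family "cf", qualifier "email"
          Put put = new Put(Bytes.toBytes("user42"));
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("email"),
              Bytes.toBytes("user42@example.com"));
          table.put(put);
          // Random, low-latency read of the same row
          Result result = table.get(new Get(Bytes.toBytes("user42")));
          byte[] email = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("email"));
          System.out.println(Bytes.toString(email));
        }
      }
    }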
  • 42. Flume / Sqoop – Data Integration Framework [Ecosystem diagram highlighting Sqoop/Flume]
  • 43. What's the problem for data collection • Data collection is currently a priori and ad hoc • A priori – decide what you want to collect ahead of time • Ad hoc – each kind of data source goes through its own collection path
  • 44. What is Flume (and how can it help?) • A distributed data collection service • It efficiently collects, aggregates, and moves large amounts of data • Fault tolerant, with many failover and recovery mechanisms • A one-stop solution for data collection of all formats
  • 45. Flume: High-Level Overview • Logical Node • Source • Sink
  • 46. Architecture • Basic diagram – one master controls multiple nodes
  • 47. Architecture • Multiple masters control multiple nodes
  • 49. Flume / Sqoop – Data Integration Framework [Ecosystem diagram highlighting Sqoop/Flume]
  • 50. Sqoop • Easy, parallel database import/export • What do you want to do? – Import data from an RDBMS into HDFS – Export data from HDFS back into an RDBMS
  • 51. What is Sqoop • A suite of tools that connect Hadoop and database systems • Import tables from databases into HDFS for deep analysis • Export MapReduce results back to a database for presentation to end-users • Provides the ability to import from SQL databases straight into your Hive data warehouse
  • 52. How Sqoop helps • The Problem – Structured data in traditional databases cannot be easily combined with complex data stored in HDFS • Sqoop (SQL-to-Hadoop) – Easy import of data from many databases to HDFS – Generate code for use in MapReduce applications
  • 53. Sqoop - import process
  • 54. Sqoop - export process • Exports are performed in parallel using MapReduce
  • 55. Why Sqoop • JDBC-based implementation – Works with many popular database vendors • Auto-generation of tedious user-side code – Write MapReduce applications to work with your data, faster • Integration with Hive – Allows you to stay in a SQL-based environment
  • 56. Sqoop - JOB • Job management options • E.g. sqoop job --create myjob -- import --connect xxxxxxx --table mytable
  • 57. Pig / Hive – Analytical Language [Ecosystem diagram highlighting Pig/Hive]
  • 58. Why Hive and Pig? • Although MapReduce is very powerful, it can also be complex to master • Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code • Many organizations have programmers who are skilled at writing code in scripting languages • Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce – Hive was initially developed at Facebook, Pig at Yahoo!
  • 59. Hive – Developed by Facebook • What is Hive? – An SQL-like interface to Hadoop • Data warehouse infrastructure that provides data summarization and ad hoc querying on top of Hadoop – MapReduce for execution – HDFS for storage • Hive Query Language – Basic SQL: Select, From, Join, Group-By – Equi-Join, Multi-Table Insert, Multi-Group-By – Batch queries, e.g. SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid
  • 60. Pig – Initiated by Yahoo! • A high-level scripting language (Pig Latin) • Processes data one step at a time • Simple to write MapReduce programs • Easy to understand • Easy to debug
    A = LOAD 'a.txt' AS (id, name, age, ...);
    B = LOAD 'b.txt' AS (id, address, ...);
    C = JOIN A BY id, B BY id;
    STORE C INTO 'c.txt';
  • 61. Hive vs. Pig – Language: Hive uses HiveQL (SQL-like); Pig uses Pig Latin, a scripting language – Schema: Hive table definitions are stored in a metastore; a Pig schema is optionally defined at runtime – Programmatic access: Hive via JDBC/ODBC; Pig via PigServer (a JDBC sketch follows)
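As an illustration of the JDBC route, a minimal sketch using the HiveServer2 JDBC driver (the host, port, and table name are assumptions; the Hive JDBC jar must be on the classpath):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Assumed HiveServer2 endpoint and default database
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://hiveserver:10000/default");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT storeid, SUM(price) FROM purchases WHERE price > 100 GROUP BY storeid")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
          }
        }
      }
    }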
  • 62. WordCount Example • Input (two lines): Hello World Bye World / Hello Hadoop Goodbye Hadoop • For the given sample input the map emits: <Hello, 1> <World, 1> <Bye, 1> <World, 1> <Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1> • The reduce just sums up the values: <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
  • 63. WordCount Example In MapReduce
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

      // Mapper: emit (word, 1) for every token in the input line
      public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer: sum the counts for each word
      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setJarByClass(WordCount.class);   // ships this class to the cluster nodes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
      }
    }
  • 64. WordCount Example By Pig
    A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
    B = GROUP A BY token;
    C = FOREACH B GENERATE group, COUNT(A) AS count;
    DUMP C;
  • 65. WordCount Example By Hive
    CREATE TABLE wordcount (token STRING);
    LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
    SELECT token, count(*) FROM wordcount GROUP BY token;
  • 66. Oozie – Job Workflow & Scheduling [Ecosystem diagram highlighting Oozie]
  • 67. What is Oozie? • A Java web application • A workflow scheduler for Hadoop • 'crond' for Hadoop [Workflow diagram: Job 1 → Job 2 → Job 3 → Job 4 → Job 5]
  • 68. Why • Why use Oozie instead of just cascading jobs one after another? • Major flexibility – Start, stop, suspend, and re-run jobs • Oozie allows you to restart from a failure – You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes
  • 69. High Level Architecture • Web Service API • Database store: – Workflow definitions – Currently running workflow instances, including instance states and variables [Architecture diagram: Oozie web-app running in Tomcat, backed by a DB, calling the Hadoop/Pig/HDFS APIs]
  • 70. How it is triggered • Time – Execute your workflow every 15 minutes (00:15, 00:30, 00:45, 01:00, ...) • Time and Data – Materialize your workflow every hour (01:00, 02:00, 03:00, 04:00), but only run it when the Hadoop input data exists
  • 72. Oozie use criteria • Need to launch, control, and monitor jobs from your Java apps – Java Client API / Command Line Interface (see the client sketch below) • Need to control jobs from anywhere – Web Service API • Have jobs that you need to run every hour, day, or week • Need to receive a notification when a job is done – Email when a job is complete
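A minimal sketch of the Java client route using the OozieClient API (the Oozie URL, HDFS application path, and cluster addresses are assumptions; a workflow.xml is assumed to already be deployed at that path):

    import java.util.Properties;
    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class OozieSubmitExample {
      public static void main(String[] args) throws Exception {
        // Assumed Oozie server endpoint
        OozieClient oozie = new OozieClient("http://oozieserver:11000/oozie");
        Properties conf = oozie.createConfiguration();
        // Assumed HDFS path holding workflow.xml
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/demo/wordcount-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8021");
        String jobId = oozie.run(conf);             // submit and start the workflow
        System.out.println("Submitted: " + jobId);
        WorkflowJob job = oozie.getJobInfo(jobId);  // poll job status
        System.out.println("Status: " + job.getStatus());
      }
    }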
  • 73. Hue – Web Console [Ecosystem diagram highlighting Hue]
  • 74. Hue – developed by Cloudera • Hadoop User Experience • An open source project (Apache-licensed) • HUE is a web UI for Hadoop • A platform for building custom applications with a nice UI library
  • 75. Hue • HUE comes with a suite of applications – File Browser: Browse HDFS; change permissions and ownership; upload, download, view and edit files. – Job Browser: View jobs, tasks, counters, logs, etc. – Beeswax: Wizards to help create Hive tables, load data, run and manage Hive queries, and download results in Excel format.
  • 78. Mahout – Data Mining [Ecosystem diagram highlighting Mahout]
  • 79. What is Mahout • A machine-learning tool • Distributed and scalable machine learning algorithms on the Hadoop platform • Makes building intelligent applications easier and faster
  • 80. Why Mahout • Current state of ML libraries – Lack community – Lack documentation and examples – Lack scalability – Are research oriented
  • 81. Mahout – scale • Scales to large datasets – Hadoop MapReduce implementations that scale linearly with data • Scalable to support your business case – Mahout is distributed under a commercially friendly Apache Software license • Scalable community – Vibrant, responsive and diverse
  • 82. Mahout – four use cases • Mahout machine learning algorithms – Recommendation mining: takes users' behavior and finds items that a specified user might like – Clustering: takes e.g. text documents and groups them based on related document topics – Classification: learns from existing categorized documents what documents of a specific category look like and assigns unlabeled documents to the appropriate category – Frequent itemset mining: takes a set of item groups (e.g. terms in a query session, shopping cart contents) and identifies which individual items typically appear together
  • 83. Use case example • Predict what a user likes based on – His/her historical behavior – The aggregate behavior of people similar to him/her (see the recommender sketch below)
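A minimal, single-machine sketch of this idea with Mahout's Taste recommender API (the ratings file name, its userID,itemID,rating CSV format, and the neighborhood size are assumptions for the example; Mahout also provides distributed MapReduce implementations for larger datasets):

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderExample {
      public static void main(String[] args) throws Exception {
        // Assumed CSV of userID,itemID,rating lines
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Consider the 10 most similar users as the neighborhood
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Top 3 recommendations for user 42
        List<RecommendedItem> recs = recommender.recommend(42, 3);
        for (RecommendedItem item : recs) {
          System.out.println(item.getItemID() + " : " + item.getValue());
        }
      }
    }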
  • 84. Conclusion Today, we introduced: • Why Hadoop is needed • The basic concepts of HDFS and MapReduce • What sort of problems can be solved with Hadoop • What other projects are included in the Hadoop ecosystem
  • 85. Recap – Hadoop Ecosystem [Ecosystem diagram: Hue (Web Console), Mahout (Data Mining), Oozie (Job Workflow & Scheduling), Zookeeper (Coordination), Sqoop/Flume (Data Integration), Pig/Hive (Analytical Language), MapReduce Runtime (Dist. Programming Framework), HBase (Column NoSQL DB), HDFS (Hadoop Distributed File System)]