Hadoop Technologies
Zahid Mian
Part of the Brown-bag Series
 Core Technologies
 HDFS
 MapReduce
 YARN
 Spark
 Data Processing
 Pig
 Mahout
 Hadoop Streaming
 MLLib
 Security
 Sentry
 Kerberos
 Knox
 ETL
 Sqoop
 Flume
 DistCp
 Storm
 Monitoring
 Ambari
 HCatalog
 Nagios
 Puppet
 Chef
 ZooKeeper
 Oozie
 Ganglia
 Databases
 Cassandra
 HBase
 Accumulo
 Memcached
 Blur
 Solr
 MongoDB
 Hive
 SparkSQL
 Giraph
 Hadoop Distributed File System (HDFS)
 Runs on clusters of inexpensive disks
 Write-once data
 Stores data in blocks across multiple disks
 NameNode responsible for managing
metadata about the actual data
 Linux-like CLI for management of files
 Since it’s Open Source, customization is
possible
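On the command line these file operations look like hdfs dfs -put and hdfs dfs -ls; the same operations are available from Java. A minimal sketch, assuming a hypothetical NameNode address and paths:

    // Copy a local file into HDFS, then list the directory.
    // The fs.defaultFS value and all paths are hypothetical.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:8020"); // hypothetical NameNode
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/tmp/data.csv"),        // local source
                                 new Path("/user/demo/data.csv")); // HDFS destination
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath() + "  " + status.getLen());
            }
            fs.close();
        }
    }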
MapReduce
 Solves computations by breaking work into Map and Reduce jobs
 Input and output of jobs is always in Key/Value pairs
 Map Input might be a line from a file <LineNumber, LineText>:
 <224, “HelloWorld. HelloWorld”>
 Map Output might be instance of each word:
 <“Hello”, 1>, <“World”, 1>, <“Hello”, 1>, <“World”, 1>
 Reduce input would be the output from the Mapper
 Reduce output might be the count of occurrences of each word:
 <“Hello”, 2>, <“World”, 2>
 Generally MapReduce jobs are written in Java
 Internally Hadoop does a lot of processing to make this seamless
 All data stored in HDFS (except log files)
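The word-count example above maps onto the Java API roughly like this (a sketch; note that in practice the mapper's input key is the byte offset of the line, not a line number):

    // Word count: the mapper emits <word, 1>; the reducer sums counts per word.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                for (String word : line.toString().split("\\s+")) {
                    if (!word.isEmpty()) ctx.write(new Text(word), ONE); // <"Hello", 1>
                }
            }
        }
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                ctx.write(word, new IntWritable(sum)); // <"Hello", 2>
            }
        }
    }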
YARN
 Yet Another Resource Negotiator
 Does little by itself
 Allows a variety of tools to conveniently run
within the Hadoop cluster (MapReduce,
HBase, Spark, Storm, Solr, etc.)
 Think of YARN as the operating system for
Hadoop
 Users generally interact with individual tools
within YARN rather than directly with YARN
Spark
 MapReduce doesn’t perform well with iterative
algorithms (e.g., graph analysis)
 Spark overcomes that flaw …
 Supports multipass/iterative algorithms by
reducing/eliminating reads/writes to disk
 A replacement for MapReduce
 Three principles of Spark operations:
 Resilient Distributed Dataset (RDD): the data
 Transformation: modifies an RDD or creates a new RDD
 Action: analyzes an RDD and returns a single result
 Scala is the preferred language for Spark
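Scala is preferred, but the same RDD / transformation / action flow can be sketched with the Java API; the input path here is hypothetical, and word count is used for continuity with the MapReduce slide:

    // An RDD, three transformations, and one action, in the Java API.
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("wordcount"));
            JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input"); // RDD: the data
            JavaPairRDD<String, Integer> counts = lines
                .flatMap(l -> Arrays.asList(l.split("\\s+")).iterator())    // transformation
                .mapToPair(w -> new Tuple2<>(w, 1))                         // transformation
                .reduceByKey((a, b) -> a + b);                              // transformation
            System.out.println(counts.count());   // action: returns a single result
            sc.stop();
        }
    }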
Tez
 Part of Apache Hadoop YARN
 Performance gains
 Optimal resource management
 Plan reconfiguration at runtime
 Dynamic physical data flow decisions
Pig
 An abstraction built on top of Hadoop
 Essentially an ETL tool
 Use “simple” Pig Latin scripts to create ETL jobs
 Pig will convert jobs to Hadoop M/R jobs
 Takes away the “pain” of writing Java M/R jobs
 Can perform joins, summaries, etc.
 Input/Output all within HDFS
 Can also write user-defined functions (UDFs) and call
them from Pig Latin
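A sketch of driving Pig Latin from Java through the PigServer API; the script, paths, and schema are made up for illustration:

    // Register Pig Latin statements, then trigger execution with store().
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigEtlExample {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE); // runs as M/R jobs
            pig.registerQuery("logs = LOAD '/user/demo/logs' USING PigStorage('\\t') "
                    + "AS (user:chararray, bytes:long);");
            pig.registerQuery("by_user = GROUP logs BY user;");
            pig.registerQuery("totals = FOREACH by_user GENERATE group, SUM(logs.bytes);");
            pig.store("totals", "/user/demo/totals");          // kicks off the M/R job(s)
            pig.shutdown();
        }
    }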
Hadoop Streaming
 Allows the use of stdin and stdout (Linux) as
input and output for your M/R jobs
 What this means is that you can use C,
Python, and other languages
 All the internal work (e.g., shuffling) still
happens within the Hadoop cluster
 Mainly useful when Java skills are limited
Mahout
 Collection of machine-learning algorithms
that run on Hadoop
 Possible to write your own algorithms in
traditional Java M/R jobs …
 … why bother when they exist in Mahout?
 Algorithms include: k-means clustering,
latent Dirichlet allocation, logistic-regression-based
classifiers, random forest decision tree
classifiers, etc.
 Machine Learning Library (MLLib) for Spark
 Similar to Mahout, but specifically for Spark
 (Remember Spark is not MapReduce)
 Algorithms include: linear SVM and logistic
regression, k-means clustering, multinomial
naïve Bayes, dimensionality reduction, etc.
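A sketch of k-means clustering through MLlib's Java-facing API; the points and parameter values are arbitrary toy data:

    // Train a k-means model on a tiny in-memory dataset, then predict a cluster.
    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.mllib.clustering.KMeans;
    import org.apache.spark.mllib.clustering.KMeansModel;
    import org.apache.spark.mllib.linalg.Vector;
    import org.apache.spark.mllib.linalg.Vectors;

    public class MllibKMeansExample {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("kmeans"));
            JavaRDD<Vector> points = sc.parallelize(Arrays.asList(
                Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
                Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)));
            KMeansModel model = KMeans.train(points.rdd(), 2, 20); // k=2, 20 iterations
            System.out.println(model.predict(Vectors.dense(0.05, 0.05))); // cluster id
            sc.stop();
        }
    }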
Sentry
 Still not fully developed
 Provides basic authorization in Hadoop
 Provides role-based authorization
 Works at the application level (the application
needs to call the APIs)
 Works with Hive, Solr and Impala
 Drawback: it is possible to write an M/R job to
access non-authorized data
Kerberos
 Provides secure authentication
 Tedious to set up and maintain
Knox
 Security gateway to manage access
 History of Hadoop suggests that security was
an afterthought
 Each tool had its own security implementation
 Knox overcomes that complexity
 Provides gateway between external (to Hadoop)
apps and internal apps
 Authorization, authentication, and auditing
 Works with AD and LDAP
Sqoop
 Transfers data between HDFS and relational
DBs
 A very simple command line tool
 Export data from HDFS to RDBMS
 Import data from RDBMS to HDFS
 Transfers are executed as M/R jobs in Hadoop
 Filtering possible
 Additional options for file formats, delimiters, etc.
Flume
 Data collection and aggregation
 Works well with log data
 Moves large data files from various servers
into Hadoop cluster
 Supports “complex” multihop flows
 Key implementation features: source,
channel, sink
 Agent configuration done via a config file (see the sketch below)
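A sketch of what such a config file might look like, wiring one source to one sink through a memory channel; the agent name, log path, and HDFS path are hypothetical:

    # Flume agent: exec source -> memory channel -> HDFS sink
    agent1.sources  = tail1
    agent1.channels = mem1
    agent1.sinks    = hdfs1

    agent1.sources.tail1.type = exec
    agent1.sources.tail1.command = tail -F /var/log/app.log
    agent1.sources.tail1.channels = mem1

    agent1.channels.mem1.type = memory
    agent1.channels.mem1.capacity = 10000

    agent1.sinks.hdfs1.type = hdfs
    agent1.sinks.hdfs1.hdfs.path = /flume/events/%Y-%m-%d
    agent1.sinks.hdfs1.channel = mem1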
DistCp
 Data movement between Hadoop clusters
 Basically, it can copy an entire cluster
 Primary Usage:
 Moving data from test to dev environments
 “Dual Ingestion” using two clusters in case one
fails
Storm
 Stream ingestion (instead of batch processing)
 Quickly performs transformations of a very
large number of small records
 A workflow, called a topology, includes spouts
as inputs and bolts as transformations
 Usage:
 transform a stream of tweets
into a stream of trending
topics
 Bolts can do a lot of work: aggregations, joins,
communication with databases, etc. (see the
topology sketch below)
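A sketch of wiring the tweets-to-trending-topics topology in Java; TweetSpout, HashtagBolt, and TrendingTopicsBolt are hypothetical classes standing in for real spout/bolt implementations:

    // Wire a spout to two bolts and submit the topology to the cluster.
    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    public class TrendingTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("tweets", new TweetSpout(), 2);     // input stream (hypothetical spout)
            builder.setBolt("extract", new HashtagBolt(), 4)
                   .shuffleGrouping("tweets");                   // transformation (hypothetical bolt)
            builder.setBolt("trending", new TrendingTopicsBolt(), 1)
                   .globalGrouping("extract");                   // aggregation (hypothetical bolt)
            StormSubmitter.submitTopology("trending", new Config(),
                                          builder.createTopology());
        }
    }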
Kafka
 A distributed messaging framework
 Fast, scalable, and durable
 Single cluster can serve as central data
backbone
 Messages are persisted on disk and replicated
across clusters
 Uses include: traditional messaging, website
activity tracking, centralized feeds of
operational data
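A sketch of publishing an activity-tracking message with Kafka's Java producer API; the broker address and topic name are hypothetical:

    // Publish one string message to a Kafka topic.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // hypothetical broker
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("site-activity", "user42", "page_view"));
            }
        }
    }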
Ambari
 Provisioning, monitoring, and management of a
Hadoop cluster
 GUI-based tool
 Features
 Step by step wizard for installing services
 Start, stop, configure services
 Dashboard for monitoring health and status
 Ganglia for metrics collection
 Nagios for system alerts
HCatalog
 Another data abstraction layer
 Use HDFS files as tables
 Almost SQL-like, but more Hive-like
 Add partitions
 Users don’t have to worry about location or
format of data
Nagios
 IT infrastructure monitoring
 Web-based interface
 Detection of outages and problems
 Send alerts via email or SMS
 Automatic restart provisioning
Puppet
 Node management tool
 Puppet uses declarative
syntax
 A configuration file declares
required programs; Puppet
ensures they are available
 Broken down into
resources, manifests, and
modules
Chef
 Node management tool
 Chef uses imperative
syntax
 A resource might specify a
certain requirement (e.g., a
specific directory must
exist)
 Broken down into
resources, recipes, and
cookbooks
ZooKeeper
 Allows coordination between nodes
 Sharing “small” amounts of state and config
data
 For example, sharing a connection string (see the sketch below)
 Highly scalable and reliable
 Some built-in protection against using it as a
general datastore
 Use the API to extend it to other areas, like
implementing security
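A sketch of sharing a connection string through a znode with the standard ZooKeeper Java client; the server address, znode path, and connection string are hypothetical:

    // Write a small config value once; any node in the cluster can read it.
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkSharedConfig {
        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 10000, event -> { });
            byte[] conn = "jdbc:mysql://db1:3306/app".getBytes("UTF-8");
            if (zk.exists("/config", false) == null) {       // parent znode must exist
                zk.create("/config", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            if (zk.exists("/config/db", false) == null) {
                zk.create("/config/db", conn,
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
            byte[] read = zk.getData("/config/db", false, null); // read from any node
            System.out.println(new String(read, "UTF-8"));
            zk.close();
        }
    }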
Oozie
 A workflow scheduler
 Like typical schedulers, you can create
relatively complex rules around jobs
 Start, stop, suspend, restart jobs
 Control both jobs and tasks
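Workflows are defined in XML. A minimal, hypothetical workflow.xml with a single M/R action might look like the sketch below; a real action would also carry mapper/reducer configuration properties:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
      <start to="count"/>
      <action name="count">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <kill name="fail">
        <message>Workflow failed</message>
      </kill>
      <end name="end"/>
    </workflow-app>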
Ganglia
 Another monitoring tool
 Provides a high-level overview of the cluster
 Computing capability, data transfers, storage
usage
 Supports add-ins for additional features
 Used within Ambari
Falcon
 Feed management and data processing
platform
 Feed retention, replication, archival
 Supports workflows
 Integration with Hive/HCatalog
 Feeds can be any type of data (e.g., Emails)
Cassandra
 Key-value store
 Scales well, with efficient storage
 Distributed database
 Peer-to-peer system
HBase
 NoSQL database with random access
 Excellent for sparse data
 Behaves like a key-value store
 Key + number of bins/columns
 Only one datatype: byte string
 Concept of column families for similar data
 Has a CLI, but can be accessed from Java and Pig (see the sketch below)
 Not meant for transactional systems
 Limited built-in functionality
 Key functions must be added at application level
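A sketch of a Put and a Get from the Java client, showing the row key, column family, qualifier, and byte-string value; the table and names are hypothetical, and the table must already exist:

    // Write one cell, then read it back.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                Put put = new Put(Bytes.toBytes("row1"));    // row key
                put.addColumn(Bytes.toBytes("info"),         // column family
                              Bytes.toBytes("name"),         // qualifier
                              Bytes.toBytes("Zahid"));       // value (byte string)
                table.put(put);
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
            }
        }
    }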
Accumulo
 Name-value DB with cell-level security
 Developed by the NSA, but now with Apache
 Excellent for multitenant storage
 Set column visibility rules for user “labels”
 Scales well, to petabytes of data
 Retrieval operations in seconds
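A sketch of writing a cell with a visibility label via the Java client; the instance, table, and credentials are hypothetical. Only readers whose authorizations satisfy the label "admin|audit" would see this cell:

    // Insert one cell protected by a column visibility expression.
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class AccumuloExample {
        public static void main(String[] args) throws Exception {
            Connector conn = new ZooKeeperInstance("instance", "zkhost:2181")
                    .getConnector("user", new PasswordToken("secret"));
            BatchWriter writer =
                conn.createBatchWriter("records", new BatchWriterConfig());
            Mutation m = new Mutation("row1");
            m.put("info", "ssn", new ColumnVisibility("admin|audit"), // label rule
                  new Value("123-45-6789".getBytes()));
            writer.addMutation(m);
            writer.close();
        }
    }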
Memcached
 In-memory cache
 Fast access to large data for a short time
 Traditional approach to sharing data in HDFS
is to use replicated join (send data to each
node)
 Memcached provides a “pool” of memory
across the nodes and stores data in that pool
 Effectively a distributed memory pool
 Much more efficient than replicating data
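Instead of replicating a lookup table to every node, tasks can fetch it from the shared pool. A sketch with the spymemcached Java client (one common client; the host and key are hypothetical):

    // Cache a shared lookup table; any node can fetch it by key.
    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import net.spy.memcached.MemcachedClient;

    public class MemcachedExample {
        public static void main(String[] args) throws Exception {
            MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("cachehost", 11211));
            HashMap<String, String> lookup = new HashMap<>();
            lookup.put("US", "United States");
            client.set("country-codes", 3600, lookup);  // cache for one hour
            @SuppressWarnings("unchecked")
            HashMap<String, String> cached =
                (HashMap<String, String>) client.get("country-codes");
            System.out.println(cached.get("US"));
            client.shutdown();
        }
    }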
Blur and Solr
 Document warehouse
 Allows searching of text documents
 Blur uses HDFS stack; Solr doesn’t
 Users can query data based on indexing
MongoDB
 JSON document-oriented database
 Most popular NoSQL db
 Supports secondary indexes
 Does not run on Hadoop Stack
 Concept of documents (rows) and collections
(tables)
 Very scalable … extends simple key-value
storage
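A sketch with the MongoDB Java driver: insert a document into a collection, add a secondary index, and query by it (database, collection, and field names are hypothetical):

    // Insert, index, and query a JSON document.
    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.eq;

    public class MongoExample {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("mongohost", 27017);
            MongoCollection<Document> users =
                client.getDatabase("app").getCollection("users");
            users.insertOne(new Document("name", "Zahid").append("city", "NYC"));
            users.createIndex(new Document("city", 1));   // secondary index
            Document found = users.find(eq("city", "NYC")).first();
            System.out.println(found.toJson());
            client.close();
        }
    }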
Hive
 Interact directly with HDFS data using HQL
 HQL similar to SQL (syntax and commands)
 HQL queries converted to M/R jobs
 HQL does not support:
 Updates/Deletes
 Transactions
 Non-equality joins
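Besides the Hive shell, HQL can be issued over JDBC from Java; a sketch with a hypothetical connection URL and table (the query still runs as M/R jobs underneath):

    // Run an HQL query through the HiveServer2 JDBC driver.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcExample {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver:10000/default", "user", "");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }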
SparkSQL
 SQL access to Hadoop data
 In-memory model for execution (like Spark)
 No MapReduce functionality
 Much faster than traditional HDFS access
 Supports HQL; also supports Java and Scala
APIs
 Can also run MLLib algorithms
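A sketch using the SparkSession entry point (an API from later Spark versions; the table name is hypothetical):

    // Run HQL-style SQL through SparkSQL's in-memory engine.
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkSqlExample {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sparksql")
                    .enableHiveSupport()   // read Hive tables directly
                    .getOrCreate();
            Dataset<Row> top = spark.sql(
                "SELECT category, COUNT(*) AS n FROM sales "
                + "GROUP BY category ORDER BY n DESC");
            top.show();                    // executes in memory, no MapReduce
            spark.stop();
        }
    }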
Giraph
 A graph database (think extended relationships)
 Facebook, LinkedIn, Twitter, etc. use graphs to
determine your friends and likely friends
 The science of graph theory is a bit complicated
 If John is a friend of Mary; Mary is a friend of
Tom; Tom is a friend of Alice …
 Find friends who are two paths (degrees) from
John; a nightmare to do in SQL
 Finding relationships from email exchanges
Phoenix
 Relational database layer over HBase
 Provides JDBC driver to access data
 SQL query converted into HBase scans
 Produces regular JDBC resultsets
 Versioning support to ensure correct schema
is used
 Good performance
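A sketch over JDBC: DDL, an UPSERT (Phoenix's combined insert/update statement), and a regular JDBC ResultSet; the ZooKeeper quorum and table are hypothetical:

    // SQL over HBase through the Phoenix JDBC driver.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:phoenix:zkhost:2181");
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("CREATE TABLE IF NOT EXISTS users "
                        + "(id BIGINT PRIMARY KEY, name VARCHAR)");
                stmt.executeUpdate("UPSERT INTO users VALUES (1, 'Zahid')");
                conn.commit();                          // Phoenix batches upserts
                try (ResultSet rs = stmt.executeQuery("SELECT * FROM users")) {
                    while (rs.next()) {                 // regular JDBC ResultSet
                        System.out.println(rs.getLong(1) + " " + rs.getString(2));
                    }
                }
            }
        }
    }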