Hadoop Solutions

   By Zenyk Matchyshyn
  Staff Engineer @ Lohika
Agenda
   •    Why?
   •    Data in / Data out
   •    Data Formats
   •    Tools
   •    Providers
   •    Future
   •    Q/A




1/14/2013                    2
Why?
   •    Smart meter analysis
   •    Genome processing
   •    Sentiment & social media analysis
   •    Network capacity trending & management
   •    Ad targeting
   •    Fraud detection




DATA IN / DATA OUT


Flume

   •    Apache Flume is a distributed system for
        collecting streaming data.
   •    Developed by Cloudera, now an Apache project
   •    Popular & supported
   •    Features:
            •   Centralized config
            •   Failover
            •   Reliability

Flume - Responsibilities
•   Node – path from source to sink
•   Agent – collects data from the local host and
    forwards it to a Collector
•   Collector – collects data from Agents and writes
    it into HDFS
•   Master – manages configuration and supports
    data flow
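
The roles above can be pictured with a toy in-memory model. This is only a sketch: the class names `Agent` and `Collector` mirror the slide, but their methods, the host names, and the list standing in for HDFS are hypothetical and are not Flume's API.

```python
class Collector:
    """Receives events from agents and writes them to a sink (a list standing in for HDFS)."""

    def __init__(self, sink):
        self.sink = sink

    def receive(self, event):
        self.sink.append(event)


class Agent:
    """Collects events on a local host and forwards them to a collector."""

    def __init__(self, host, collector):
        self.host = host
        self.collector = collector

    def collect(self, line):
        # Tag each event with its origin before forwarding it.
        self.collector.receive({"host": self.host, "body": line})


hdfs = []  # hypothetical stand-in for files in HDFS
collector = Collector(hdfs)
for host in ("web1", "web2"):
    Agent(host, collector).collect("GET /index.html")
```

Many agents feed one collector here, which is the fan-in shape the slide describes; the Master role (configuration) is left out of the sketch.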




Data in / Data out - other solutions


   •    Scribe https://github.com/facebook/scribe –
        similar to Flume
   •    Chukwa http://incubator.apache.org/chukwa/
        – similar to Flume
   •    Oozie http://oozie.apache.org/ – workflow
        scheduler




Sqoop

   •    Apache project, originally from Cloudera
        http://sqoop.apache.org/
   •    Uses metadata to describe structure in HDFS
   •    Transports bulk data into & out of relational
        databases
   •    Alternative: read & write the database directly
        from Map/Reduce
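
A rough picture of a metadata-driven bulk export, assuming an in-memory SQLite table in place of a real relational database and a string buffer in place of an HDFS file. This is not Sqoop itself; the `student` table and its rows are made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source table; a real import would target MySQL, Oracle, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [("John", 18, 4.0), ("Mary", 19, 3.8)],
)

cursor = conn.execute("SELECT name, age, gpa FROM student")
# Column names come from database metadata, not from hand-written code.
columns = [d[0] for d in cursor.description]

out = io.StringIO()  # stand-in for a delimited text file in HDFS
writer = csv.writer(out, lineterminator="\n")
writer.writerows(cursor)
exported = out.getvalue()
```

The point of the sketch is that the record structure is discovered from the database rather than declared by hand.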



DATA FORMATS


Formats

   •    Input and Output matter
   •    Data in files is split
   •    XML and JSON are supported
   •    Keep one document per line, or suffer the
        consequences ;)
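
Why one document per line matters can be sketched as follows. Assuming newline-delimited JSON, a reader handed an arbitrary byte range can skip to the next newline and parse whole records on its own. The function `read_split` is a hypothetical helper written for this sketch; it follows the usual line-reader convention (skip the first partial line, read one line past the end) but is not Hadoop code.

```python
import json


def read_split(data: bytes, start: int, end: int):
    """Return the JSON records owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # Discard the first (possibly partial) line; the previous split reads it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Read every line that begins at or before `end`.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            nl = len(data)
        records.append(json.loads(data[pos:nl]))
        pos = nl + 1
    return records


data = b"".join(json.dumps({"id": i}).encode() + b"\n" for i in range(10))
split_size = 25  # deliberately cuts through the middle of records
ids = []
for start in range(0, len(data), split_size):
    end = min(start + split_size, len(data))
    ids.extend(r["id"] for r in read_split(data, start, end))
```

Every record is read exactly once even though the split boundaries fall mid-record. A multi-line XML or pretty-printed JSON document offers no such cheap resynchronization point.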




Serialization frameworks
   •    Binary in nature, makes things a bit more
        complicated
   •    Thrift & Protobuf vs SequenceFile & Avro
   •    Native formats support splittability and
        compression
   •    Avro supports code generation and
        versioning, just like Thrift & Protobuf
   •    Out-of-the-box support in Hadoop


Compression

   •    Deflate (zlib)
   •    Gzip
   •    Bzip2 – splittable with additional work, slow
   •    LZO – block based
   •    LZOP – splittable with additional work
   •    Snappy – from Google, fast, but not splittable
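
A quick way to get a feel for the size trade-offs is to run the codecs that ship with the Python standard library over the same input. This sketch covers only Deflate, Gzip and Bzip2; LZO, LZOP and Snappy need third-party libraries and are left out. The log line used as sample data is made up.

```python
import bz2
import gzip
import zlib

# Hypothetical, highly repetitive log data.
data = b"2013-01-14 INFO request served in 12 ms\n" * 10_000

codecs = {
    "deflate": (zlib.compress, zlib.decompress),
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

sizes = {}
for name, (compress, decompress) in codecs.items():
    packed = compress(data)
    assert decompress(packed) == data  # lossless round trip
    sizes[name] = len(packed)
    print(f"{name:8s} {len(packed):8d} bytes")
```

Ratios on real data will differ, so it is worth repeating the measurement on a sample of the actual dataset.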



Testing
   •    MRUnit – unit testing for Map/Reduce jobs
        http://mrunit.apache.org/
   •    Data sampling for testing
   •    Data spike detection
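
One possible approach to data sampling is reservoir sampling, which draws a fixed-size uniform sample in a single pass without knowing the input size up front. The slide does not name a technique, so this choice is an assumption; `reservoir_sample` and its `seed` parameter are names invented for the sketch.

```python
import random


def reservoir_sample(records, k, seed=0):
    """Return up to k records chosen uniformly at random from an iterable."""
    rng = random.Random(seed)  # fixed seed keeps test data reproducible
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample


test_input = reservoir_sample(range(1_000_000), 100)
```

The resulting sample is small enough to run through a job locally before launching it on the full dataset.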




Small files

   •    Small files are problematic: HDFS blocks are
        large, and every file costs NameNode memory
        and typically its own map task
   •    Can pack them into bigger Avro files
   •    Can move to HBase
   •    Hadoop Archives (HAR) files
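
The packing idea can be sketched with a simple length-prefixed container. This stands in for Avro, which is not in the Python standard library; the record layout (two big-endian 32-bit lengths, then name, then content) and the file names are invented for illustration.

```python
import io
import struct


def pack(files, out):
    """Write each (name, content) pair as one length-prefixed record."""
    for name, content in files.items():
        encoded = name.encode("utf-8")
        out.write(struct.pack(">II", len(encoded), len(content)))
        out.write(encoded)
        out.write(content)


def unpack(stream):
    """Yield (name, content) pairs back from a packed stream."""
    while True:
        header = stream.read(8)
        if not header:
            return
        name_len, content_len = struct.unpack(">II", header)
        name = stream.read(name_len).decode("utf-8")
        yield name, stream.read(content_len)


small_files = {"a.log": b"first", "b.log": b"second"}
container = io.BytesIO()  # stand-in for one large file in HDFS
pack(small_files, container)
container.seek(0)
restored = dict(unpack(container))
```

Storing the original file name as part of each record keeps the small files individually addressable after packing.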




TOOLS


Pig
    •    High-level language for data analysis
    •    Uses Pig Latin to describe data flows
         (translates into MapReduce)
    •    Filters, Joins, Projections, Groupings, Counts,
         etc.
    •    Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
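
For readers new to Pig Latin, the same data flow can be written step by step in plain Python. The contents of the `student` file are an assumption chosen to match the output shown above; `PigStorage()` splits fields on tabs by default.

```python
# Hypothetical contents of the 'student' file (tab-delimited).
student = "John\t18\t4.0\nMary\t19\t3.8\n"

# A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
A = [
    (name, int(age), float(gpa))
    for name, age, gpa in (line.split("\t") for line in student.splitlines())
]

# B = FOREACH A GENERATE name;
B = [(name,) for name, _, _ in A]

# DUMP B;
for (name,) in B:
    print(f"({name})")
```

Each Pig statement names an intermediate relation, which is why the script reads as a pipeline of transformations rather than as a single query.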


Hive


   •    SQL-like interface - HiveQL
   •    Has its own structure
   •    Unlike Pig, not a pipeline language
   •    Basically a distributed data warehouse
   •    Has execution optimization




HBase


•    Distributed, column-oriented store
•    Independent of Hadoop
•    No translation into Map/Reduce
•    Stores data in MapFiles (indexed SequenceFiles)
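
The idea of an indexed, sorted file can be sketched in a few lines: keep the records sorted by key, keep a sparse index of every Nth key, then binary-search the index and scan a short run of records. This is a conceptual model only, not the MapFile on-disk format; the row keys, values, `INTERVAL` and `lookup` are all invented for the sketch.

```python
import bisect

# Sorted key/value records, standing in for a SequenceFile on disk.
records = sorted((f"row{i:04d}", f"value{i}") for i in range(1000))

# Sparse index: every INTERVAL-th key (hypothetical interval for this sketch).
INTERVAL = 128
index_keys = [records[i][0] for i in range(0, len(records), INTERVAL)]


def lookup(key):
    """Seek via the sparse index, then scan at most INTERVAL records."""
    slot = bisect.bisect_right(index_keys, key) - 1
    if slot < 0:
        return None  # key sorts before the first record
    start = slot * INTERVAL
    for k, v in records[start:start + INTERVAL]:
        if k == key:
            return v
    return None
```

Because only the sparse index has to sit in memory, random reads stay cheap even when the data file is large.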




PROVIDERS


Apache


   •    Umbrella for Hadoop projects
   •    No commercial support
   •    Active community
   •    Most recent builds




Cloudera

   •    Has its own tuned build – CDH
   •    Commercial support
   •    Certification & Training
   •    Has products on top of Hadoop (like Cloudera
        Manager etc.)
   •    Very high visibility




Amazon Elastic MapReduce (EMR)
   •    Custom build tailored for AWS environment
   •    Very easy
   •    Uses S3 for storage
   •    Uses SimpleDB for job flow state information
   •    Supports HBase




Hortonworks


   •    Own platform on top of Hadoop
   •    Big backers like Microsoft and Yahoo
   •    Offers training & certification




FUTURE


Future

 •    Percolator for incremental indexing and
      analysis of frequently changing datasets
 •    Dremel for ad hoc analytics
 •    Pregel for analyzing graph data
 •    ZooKeeper & Hadoop de-coupling with new
      execution engines to the rescue!




Q/A


            ?
