Hadoop Solutions

By Zenyk Matchyshyn
Staff Engineer @ Lohika

Agenda
• Why?
• Data in / Data out
• Data Formats
• Tools
• Providers
• Future
• Q/A

1/14/2013 2

Why?
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection

DATA IN / DATA OUT

Flume

• Apache Flume is a distributed system for collecting streaming data
• Developed by Cloudera, now an Apache project
• Popular and well supported
• Features:
  • Centralized config
  • Failover
  • Reliability

Flume - Responsibilities
• Node – a path from source to sink
• Agent – collects data from the local host and forwards it to a Collector
• Collector – collects the data and writes it into HDFS
• Master – manages configuration and supports the data flow
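
The roles above can be pictured with a toy in-memory model. This is only a sketch of the data flow, not Flume's API: the `Agent` and `Collector` classes, the batch size, and the list standing in for HDFS are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Collector:
    """Gathers events from agents and writes them to a sink in batches."""
    batch_size: int
    sink: List[List[str]] = field(default_factory=list)  # stand-in for HDFS files
    _buffer: List[str] = field(default_factory=list)

    def receive(self, event: str) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Write whatever is buffered as one "file" in the sink.
        if self._buffer:
            self.sink.append(list(self._buffer))
            self._buffer.clear()


@dataclass
class Agent:
    """Reads events on its own host and forwards them to a collector."""
    host: str
    collector: Collector

    def forward(self, lines: Iterable[str]) -> None:
        for line in lines:
            self.collector.receive(f"{self.host}\t{line}")


collector = Collector(batch_size=2)
Agent("web-1", collector).forward(["GET /a", "GET /b"])
Agent("web-2", collector).forward(["GET /c"])
collector.flush()  # push out the final partial batch
print(collector.sink)
```

In this model the Master's job would be to decide which collector each agent is wired to; here that wiring is simply hard-coded.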

Data in / Data out - other solutions

• Scribe https://github.com/facebook/scribe – similar to Flume
• Chukwa http://incubator.apache.org/chukwa/ – similar to Flume
• Oozie http://oozie.apache.org/ – workflow scheduler

Sqoop

• Apache project, originally from Cloudera http://sqoop.apache.org/
• Uses metadata to describe structure in HDFS
• Transfers bulk data into and out of relational databases
• Alternative: read from and write to the database directly in Map/Reduce jobs

DATA FORMATS

Formats

• Input and Output matter
• Data in files is split for parallel processing
• XML and JSON are supported
• Keep one document per line, or suffer the consequences ;)
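
Why one document per line matters can be shown with a small sketch. It assumes a line-oriented reader that hands each line to the parser independently; the sample records and the `parse_lines` helper are made up for illustration.

```python
import json

records = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": []}]

# One document per line: every line is a complete JSON value.
json_lines = "\n".join(json.dumps(r) for r in records)

# Pretty-printed: a single document spans many lines.
pretty = json.dumps(records, indent=2)


def parse_lines(text):
    """Parse each line on its own, as a line-oriented reader would."""
    parsed, failed = [], 0
    for line in text.splitlines():
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError:
            failed += 1
    return parsed, failed


ok_docs, ok_failed = parse_lines(json_lines)
bad_docs, bad_failed = parse_lines(pretty)

print(ok_docs == records, ok_failed)    # every line parsed cleanly
print(bad_docs == records, bad_failed)  # records lost, many lines rejected
```

With the pretty-printed input most lines fail to parse, and the few fragments that do parse are stray values rather than the original records.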

Serialization frameworks
• Binary in nature, which makes things a bit more complicated
• Thrift & Protobuf vs SequenceFile & Avro
• The native formats (SequenceFile, Avro) support splittability and compression
• Avro supports code generation and versioning, just like Thrift & Protobuf
• Out-of-the-box support in Hadoop

Compression

• Deflate (zlib)
• Gzip
• Bzip2 – splittable with additional work, slow
• LZO – block based
• LZOP – splittable with additional work
• Snappy – from Google, fast, but not splittable
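
A rough way to get a feel for these codecs is Python's standard library, which ships Deflate (zlib), Gzip and Bzip2; LZO and Snappy need third-party packages and are left out. The sample data is made up, and the "second half" check is only an illustration of why a plain gzip stream cannot be picked up from an arbitrary offset, not a model of Hadoop's input splits.

```python
import bz2
import gzip
import zlib

# Made-up, highly repetitive sample data.
data = b"2013-01-14 meter=42 reading=1.5\n" * 2000

compressed = {
    "deflate (zlib)": zlib.compress(data),
    "gzip": gzip.compress(data),
    "bzip2": bz2.compress(data),
}

for name, blob in compressed.items():
    print(f"{name}: {len(data)} -> {len(blob)} bytes")

# A reader handed only the second half of a gzip stream has no header
# and no decoder state to start from.
gz = compressed["gzip"]
second_half = gz[len(gz) // 2:]
try:
    gzip.decompress(second_half)
    readable_from_middle = True
except (OSError, EOFError, zlib.error):
    readable_from_middle = False

print("gzip readable from the middle:", readable_from_middle)
```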

Testing
• MRUnit – unit testing for Map/Reduce jobs
http://mrunit.apache.org/
• Data sampling for testing
• Data spikes detection
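
The idea behind unit testing a Map/Reduce job can be sketched without a cluster: feed known input to the map and reduce functions and compare against expected output. The code below is plain Python, not the MRUnit API; the word-count `mapper`, `reducer` and the in-memory `run_job` driver are assumptions made for illustration.

```python
from collections import defaultdict


def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_job(lines):
    """Tiny in-memory driver: map, group by key (the shuffle), then reduce."""
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


result = run_job(["Hadoop Pig", "pig hive"])
assert result == [("hadoop", 1), ("hive", 1), ("pig", 2)]
print(result)
```

Running the same functions over a small sample of production data is a cheap way to catch malformed records before a full job is launched.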

Small files

• Small files are problematic: blocks are big, yet every file, however small, takes its own block entry in NameNode memory and usually its own map task
• Can pack them into bigger Avro files
• Can move to HBase
• Hadoop Archives (HAR) files
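
A back-of-the-envelope calculation shows what packing buys. The figures are assumptions, not measurements: a 64 MB block size, the commonly quoted rule of thumb of roughly 150 bytes of NameNode memory per file or block object, and a made-up workload of 10,000 files of 100 KB each.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size (64 MB)
BYTES_PER_OBJECT = 150         # assumed rule-of-thumb NameNode cost per object


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """One object per file plus one per block the file occupies."""
    return sum(1 + max(1, math.ceil(size / block_size)) for size in file_sizes)


small_files = [100 * 1024] * 10_000  # 10,000 files of 100 KB each
packed = [sum(small_files)]          # the same bytes in one container file

small_objects = namenode_objects(small_files)
packed_objects = namenode_objects(packed)

print(small_objects, small_objects * BYTES_PER_OBJECT)    # 20000 objects
print(packed_objects, packed_objects * BYTES_PER_OBJECT)  # 17 objects
```

The same ratio applies to map tasks when each small file becomes its own input split.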

Pig
• High-level language for data analysis
• Uses Pig Latin to describe data flows (translates into MapReduce)
• Filters, Joins, Projections, Groupings, Counts, etc.
• Example:
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
(John)
(Mary)
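
For readers new to Pig Latin, the three statements above correspond roughly to the following Python. It is only an analogy: the file contents (ages and GPAs) are made up, and the tab delimiter reflects PigStorage's default field separator.

```python
import csv
import io

# Hypothetical contents of the 'student' file: tab-delimited name, age, gpa.
student = "John\t18\t4.0\nMary\t19\t3.8\n"

# A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
A = [
    (name, int(age), float(gpa))
    for name, age, gpa in csv.reader(io.StringIO(student), delimiter="\t")
]

# B = FOREACH A GENERATE name;   (a projection onto one column)
B = [(name,) for name, _age, _gpa in A]

# DUMP B;
for row in B:
    print("(" + ",".join(row) + ")")
```

The difference is that Pig compiles the same data flow into MapReduce jobs, so it scales past what fits in one process.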

Hive

• SQL-like interface - HiveQL
• Has its own structure
• Not a pipeline like Pig
• Basically a distributed data warehouse
• Has execution optimization

HBase

• Distributed, column oriented store
• Independent of Hadoop
• No translation into Map/Reduce
• Stores data in HFiles (early releases used MapFiles, i.e. indexed SequenceFiles)

Apache

• Umbrella for Hadoop projects
• No commercial support
• Active community
• Most recent builds

Cloudera

• Has its own tuned build – CDH
• Commercial support
• Certification & Training
• Has products on top of Hadoop (such as Cloudera Manager)
• Very high visibility

Amazon Elastic MapReduce (EMR)
• Custom build tailored for the AWS environment
• Very easy to get started with
• Uses S3 for storage
• Uses SimpleDB for job flow state information
• Supports HBase

Hortonworks

• Own platform on top of Hadoop
• Big backers like Microsoft and Yahoo
• Offers training & certification

Future

• Percolator for incremental indexing and analysis of frequently changing datasets
• Dremel for ad hoc analytics
• Pregel for analyzing graph data
• ZooKeeper & Hadoop de-coupling with new execution engines to the rescue!
