Hadoop Solutions

Overview of common Hadoop-based solutions for typical tasks and problems.

Transcript

  • 1. Hadoop Solutions By Zenyk Matchyshyn Staff Engineer @ Lohika
  • 2. Agenda • Why? • Data in / Data out • Data Formats • Tools • Providers • Future • Q/A (1/14/2013)
  • 3. Why? • Smart meter analysis • Genome processing • Sentiment & social media analysis • Network capacity trending & management • Ad targeting • Fraud detection
  • 4. DATA IN / DATA OUT
  • 5. Flume • Apache Flume is a distributed system for collecting streaming data • Developed by Cloudera, now an Apache project • Popular & supported • Features: • Centralized config • Failover • Reliability
  • 6. Flume - Responsibilities • Node – path from source to sink • Agent – collects data from the local host and forwards it to the Collector • Collector – collects the data and writes it into HDFS • Master – manages configuration and supports data flow
  • 7. Data in / Data out - other solutions • Scribe https://github.com/facebook/scribe – similar to Flume • Chukwa http://incubator.apache.org/chukwa/ – similar to Flume • Oozie http://oozie.apache.org/ – workflow scheduler
  • 8. Sqoop • Apache project, originally from Cloudera http://sqoop.apache.org/ • Uses metadata to describe the structure of data in HDFS • Transfers bulk data into and out of relational databases • Reading from & writing to the database directly in Map/Reduce is an alternative
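As a rough illustration of what a bulk import does conceptually, the sketch below reads column metadata and rows from a relational table and writes them out as delimited text, the kind of file that would then be stored in HDFS. It does not use Sqoop itself: Python's built-in `sqlite3` stands in for the database, and the table `students`, its rows, and the helper `export_table` are hypothetical.

```python
import csv
import io
import sqlite3


def export_table(conn, table, out):
    """Write every row of `table` to `out` as comma-delimited text.

    Returns the column names taken from the query metadata.
    """
    # Table names cannot be bound as SQL parameters, so validate first.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    writer = csv.writer(out, lineterminator="\n")
    for row in cursor:
        writer.writerow(row)
    return columns


# Hypothetical demo data standing in for a production database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("John", 18, 4.0), ("Mary", 19, 3.8)],
)

buffer = io.StringIO()
columns = export_table(conn, "students", buffer)
```

A real import would additionally partition the table across parallel tasks; this sketch only shows the metadata-plus-rows idea.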
  • 9. DATA FORMATS
  • 10. Formats • Input and output formats matter • Data in files is split • XML and JSON are supported • Put one document per line or suffer the consequences ;)
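Why one document per line matters can be sketched as follows: when a file is cut at an arbitrary byte offset, a reader can still recover whole records by aligning to the next newline. The code below is a simplified imitation of how a line-oriented record reader treats a split, not Hadoop's implementation; `read_split` and the sample documents are hypothetical.

```python
import json


def read_split(data, start, end):
    """Return the JSON documents whose first byte lies in [start, end)."""
    pos = start
    if start > 0:
        # If the split begins mid-line, that line belongs to the previous
        # split, so skip ahead to the byte after the next newline.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # A record that starts inside the split may finish beyond `end`.
        records.append(json.loads(data[pos:newline]))
        pos = newline + 1
    return records


docs = [{"id": i, "text": "document number %d" % i} for i in range(10)]
data = b"".join(json.dumps(doc).encode("utf-8") + b"\n" for doc in docs)

# Cut the file in the middle, without looking at record boundaries.
middle = len(data) // 2
first_half = read_split(data, 0, middle)
second_half = read_split(data, middle, len(data))
```

Together the two splits yield every document exactly once. A pretty-printed JSON or XML document spanning several lines offers no such cheap boundary.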
  • 11. Serialization frameworks • Binary in nature, makes things a bit more complicated • Thrift & Protobuf vs SequenceFile & Avro • Native formats support splittability and compression • Avro supports code generation and versioning, just like Thrift & Protobuf • Out-of-the-box support in Hadoop
  • 12. Compression • Deflate (zlib) • Gzip • Bzip2 – splittable with additional work, slow • LZO – block based • LZOP – splittable with additional work • Snappy – from Google, fast, but no splittability
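The link between block-based compression and splittability can be sketched like this: if each block is compressed independently and its offset is recorded in an index, a task can decompress one block without reading the ones before it. The example uses `zlib` from the Python standard library purely for illustration; the index layout and the names `compress_blocks` and `read_block` are assumptions, not the format of LZO or any Hadoop codec.

```python
import zlib


def compress_blocks(data, block_size):
    """Compress `data` in independent blocks.

    Returns (payload, index), where index[i] is the (offset, length)
    of compressed block i inside payload.
    """
    payload = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        block = zlib.compress(data[start:start + block_size])
        index.append((len(payload), len(block)))
        payload += block
    return bytes(payload), index


def read_block(payload, index, block_number):
    """Decompress a single block without touching the others."""
    offset, length = index[block_number]
    return zlib.decompress(payload[offset:offset + length])


BLOCK_SIZE = 16 * 1024
data = b"".join(b"record %06d\n" % i for i in range(10000))
payload, index = compress_blocks(data, BLOCK_SIZE)

# Jump straight to the fourth block, as a task assigned that split would.
fourth_block = read_block(payload, index, 3)
```

A single compressed stream without such an index has to be decompressed from the beginning, which is why a whole file in that form ends up in one task.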
  • 13. Testing • MRUnit – unit testing for Map/Reduce jobs http://mrunit.apache.org/ • Data sampling for testing • Data spike detection
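MRUnit is a Java library, so the following is only a Python analogue of the same idea: keep the map and reduce logic in plain functions and test them in isolation with small, hand-written inputs. The word-count job and all names (`map_words`, `reduce_counts`, `run_job`) are hypothetical.

```python
import unittest
from collections import defaultdict


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reducer: sum all counts seen for one word."""
    return word, sum(counts)


def run_job(lines):
    """Tiny in-memory driver: map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(list(map_words("Pig Hive pig")),
                         [("pig", 1), ("hive", 1), ("pig", 1)])

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("pig", [1, 1, 1]), ("pig", 3))

    def test_whole_job(self):
        self.assertEqual(run_job(["Pig Hive", "pig"]), {"pig": 2, "hive": 1})
```

Saved as a module, the tests can be run with `python -m unittest`.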
  • 14. Small files • Small files are problematic: each is far smaller than an HDFS block, yet still costs NameNode metadata and usually its own map task • Can pack them into bigger Avro files • Can move to HBase • Hadoop Archives (HAR) files
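A back-of-the-envelope sketch of why packing helps: the NameNode keeps one in-memory object per file and one per block, so a million tiny files cost far more metadata than the same bytes packed into one large file. The figure of roughly 150 bytes per object is a commonly quoted rule of thumb and is used here as an assumption, as are the file sizes and the 128 MiB block size.

```python
BYTES_PER_OBJECT = 150           # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024 * 1024   # assumed HDFS block size (128 MiB)


def namenode_objects(file_sizes, block_size=BLOCK_SIZE):
    """Count metadata objects: one per file plus one per block."""
    blocks = sum(-(-size // block_size) for size in file_sizes)  # ceiling division
    return len(file_sizes) + blocks


# One million hypothetical 10 KB files ...
small_files = [10_000] * 1_000_000
# ... versus the same bytes packed into a single container file.
packed_file = [sum(small_files)]

small_objects = namenode_objects(small_files)
packed_objects = namenode_objects(packed_file)

print("small files:", small_objects, "objects,",
      small_objects * BYTES_PER_OBJECT, "bytes of NameNode memory")
print("packed file:", packed_objects, "objects,",
      packed_objects * BYTES_PER_OBJECT, "bytes of NameNode memory")
```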
  • 15. TOOLS
  • 16. Pig • High level language for data analysis • Uses Pig Latin to describe data flows (translated into MapReduce) • Filters, Joins, Projections, Groupings, Counts, etc. • Example:
        -- load tab-delimited records with a typed schema
        A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
        -- keep only the name column
        B = FOREACH A GENERATE name;
        DUMP B;
        -- output, one tuple per line:
        (John)
        (Mary)
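For readers who do not know Pig Latin, the same data flow (load with a schema, project one column, dump) can be approximated in plain Python. This is only an analogy to show what the script expresses, not how Pig executes it; the ages and GPAs in the sample input are made up, and tab is used as the delimiter because that is the PigStorage default.

```python
import io


def load(source):
    """Parse tab-delimited lines into (name, age, gpa) tuples."""
    for line in source:
        name, age, gpa = line.rstrip("\n").split("\t")
        yield name, int(age), float(gpa)


def project_names(rows):
    """Keep only the name field, like FOREACH ... GENERATE name."""
    for name, _age, _gpa in rows:
        yield (name,)


# Hypothetical contents of the 'student' file.
student = io.StringIO("John\t18\t4.0\nMary\t19\t3.8\n")

names = list(project_names(load(student)))
for row in names:
    print("(" + ",".join(row) + ")")
```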
  • 17. Hive • SQL-like interface – HiveQL • Has its own structure • Not a pipeline like Pig • Basically a distributed data warehouse • Has execution optimization
  • 18. HBase • Distributed, column oriented store • Independent of Hadoop • No translation into Map/Reduce • Stores data in HFiles (earlier versions used MapFiles, i.e. indexed SequenceFiles)
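The idea behind an indexed, sorted key/value file such as a MapFile can be sketched as follows: records are kept sorted by key, and a sparse index of every Nth key lets a lookup jump close to the target and scan only a few records. The class below is an in-memory illustration under those assumptions; its name, the index interval, and the row keys are hypothetical, and it does not reproduce any on-disk format.

```python
import bisect


class IndexedSortedFile:
    """Sorted key/value records plus a sparse index of every Nth key."""

    def __init__(self, records, index_interval=4):
        self.records = sorted(records, key=lambda record: record[0])
        self.index_interval = index_interval
        self.index_keys = [key for key, _ in self.records[::index_interval]]

    def get(self, key):
        # Locate the last indexed key that is <= the requested key ...
        slot = bisect.bisect_right(self.index_keys, key) - 1
        if slot < 0:
            return None  # key sorts before the first record
        # ... then scan at most one index interval of records.
        start = slot * self.index_interval
        for candidate, value in self.records[start:start + self.index_interval]:
            if candidate == key:
                return value
        return None


rows = [("row%03d" % i, i * i) for i in reversed(range(20))]
store = IndexedSortedFile(rows)
value = store.get("row007")
```

Keeping only a sparse index trades a short scan for a much smaller index, which is the design choice that makes such files practical for large sorted datasets.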
  • 19. PROVIDERS
  • 20. Apache • Umbrella for Hadoop projects • No commercial support • Active community • Most recent builds
  • 21. Cloudera • Has its own tuned build – CDH • Commercial support • Certification & Training • Has products on top of Hadoop (like Cloudera Manager etc.) • Very high visibility
  • 22. Amazon Elastic MapReduce (EMR) • Custom build tailored for the AWS environment • Very easy to use • Uses S3 for storage • Uses SimpleDB for job flow state information • Supports HBase
  • 23. Hortonworks • Own platform on top of Hadoop • Big backers like Microsoft and Yahoo • Offers training & certification
  • 24. FUTURE
  • 25. Future • Percolator for incremental indexing and analysis of frequently changing datasets • Dremel for ad hoc analytics • Pregel for analyzing graph data • ZooKeeper & Hadoop de-coupling with new execution engines to the rescue!
  • 26. Q/A?