Hadoop Solutions

Overview of common Hadoop-based solutions for typical tasks and problems.

Published in: Technology

  1. Hadoop Solutions, by Zenyk Matchyshyn, Staff Engineer @ Lohika
  2. Agenda (1/14/2013)
       • Why?
       • Data in / Data out
       • Data Formats
       • Tools
       • Providers
       • Future
       • Q/A
  3. Why?
       • Smart meter analysis
       • Genome processing
       • Sentiment & social media analysis
       • Network capacity trending & management
       • Ad targeting
       • Fraud detection
  4. DATA IN / DATA OUT
  5. Flume
       • Apache Flume is a distributed system for collecting streaming data
       • Developed by Cloudera, now an Apache project
       • Popular & supported
       • Features:
           • Centralized config
           • Failover
           • Reliability
  6. Flume - Responsibilities
       • Node – path from source to sink
       • Agent – collects data from the local host and forwards it to the Collector
       • Collector – collects the data and writes it into HDFS
       • Master – manages configuration and supports data flow
  7. Data in / Data out - other solutions
       • Scribe https://github.com/facebook/scribe – similar to Flume
       • Chukwa http://incubator.apache.org/chukwa/ – similar to Flume
       • Oozie http://oozie.apache.org/ – workflow scheduler
  8. Sqoop
       • Apache project, originally from Cloudera http://sqoop.apache.org/
       • Uses metadata to describe the structure of data in HDFS
       • Transports bulk data into and out of relational databases
       • Alternative: read from and write to the database directly in Map/Reduce jobs
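As a rough illustration of the bulk-transfer idea on the Sqoop slide, the sketch below reads a table's column metadata and rows from a relational database and turns them into delimited text lines, the kind of output that would then be written to HDFS. It uses Python's built-in sqlite3 module as a stand-in database; the function name `export_table` and the `students` table are hypothetical, and this is not Sqoop's implementation or API.

```python
import sqlite3


def export_table(conn, table, delimiter=","):
    """Return (column names, delimited lines) for one table.

    The table name is interpolated into the SQL text, so it must come
    from a trusted source; this is only a sketch.
    """
    cursor = conn.execute(f"SELECT * FROM {table} ORDER BY rowid")
    # The cursor metadata gives the column names, i.e. the record structure.
    columns = [description[0] for description in cursor.description]
    lines = [delimiter.join(str(value) for value in row) for row in cursor]
    return columns, lines


# Hypothetical sample table used only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO students VALUES (?, ?)", [("John", 18), ("Mary", 19)])
columns, lines = export_table(conn, "students")
```

Keeping the column list next to the exported lines is what lets a later job interpret each delimited field.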
  9. DATA FORMATS
  10. Formats
       • Input and output formats matter
       • Data in files is split
       • XML and JSON are supported
       • Keep one document per line or suffer the consequences ;)
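The one-document-per-line advice can be illustrated with a small sketch of how line-oriented input survives splitting at arbitrary byte offsets. The rule assumed here is that a line belongs to the split in which its first byte falls, which is similar in spirit to how line-based Hadoop input is read; `read_split` and the sample records are hypothetical names for illustration.

```python
import json


def read_split(data: bytes, start: int, end: int):
    """Yield every line whose first byte lies in the byte range [start, end)."""
    if start == 0:
        pos = 0
    else:
        # Skip the partial line at the start: the previous split owns it.
        newline = data.find(b"\n", start - 1)
        if newline == -1:
            return  # no line starts at or after `start`
        pos = newline + 1
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)  # last line without a trailing newline
        # A line may run past `end`; it is still read in full by this split.
        yield data[pos:newline]
        pos = newline + 1


# Ten JSON documents, one per line.
records = [{"id": i, "text": "x" * i} for i in range(10)]
data = "".join(json.dumps(r) + "\n" for r in records).encode("utf-8")

# Cut the file into 17-byte splits that ignore record boundaries.
split_size = 17
recovered = []
for start in range(0, len(data), split_size):
    for line in read_split(data, start, start + split_size):
        recovered.append(json.loads(line))
```

Because every record ends at a newline, each split can find its own record boundaries without parsing the whole file. A pretty-printed JSON or XML document spanning many lines offers no such boundary.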
  11. Serialization frameworks
       • Binary in nature, which makes things a bit more complicated
       • Thrift & Protobuf vs SequenceFile & Avro
       • Native formats (SequenceFile, Avro) support splittability and compression
       • Avro supports code generation and versioning, just like Thrift & Protobuf
       • Out-of-the-box support in Hadoop
  12. Compression
       • Deflate (zlib)
       • Gzip
       • Bzip2 – splittable with additional work, slow
       • LZO – block based
       • LZOP – splittable with additional work
       • Snappy – from Google, fast, but no splittability
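To make "block based" and "splittable with additional work" more concrete, here is a minimal sketch in which each block is compressed independently and an index of block offsets is kept alongside. It uses zlib from the Python standard library as a stand-in codec (not LZO), and the names `compress_blocks` and `read_block` are hypothetical.

```python
import zlib


def compress_blocks(data: bytes, block_size: int):
    """Compress fixed-size blocks independently; return (bytes, index).

    The index holds one (offset, length) pair per compressed block.
    """
    blobs, index, offset = [], [], 0
    for i in range(0, len(data), block_size):
        blob = zlib.compress(data[i:i + block_size])
        index.append((offset, len(blob)))
        blobs.append(blob)
        offset += len(blob)
    return b"".join(blobs), index


def read_block(compressed: bytes, index, n: int) -> bytes:
    """Decompress block n without touching any other block."""
    offset, length = index[n]
    return zlib.decompress(compressed[offset:offset + length])


data = b"".join(b"line %d\n" % i for i in range(1000))
block_size = 1024
compressed, index = compress_blocks(data, block_size)
third_block = read_block(compressed, index, 2)
```

The index is the "additional work": with it, separate tasks can each start at a block boundary. A single compressed stream without such boundaries has to be read from the beginning.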
  13. Testing
       • MRUnit – unit testing for Map/Reduce jobs http://mrunit.apache.org/
       • Data sampling for testing
       • Detection of data spikes
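MRUnit itself is a Java library; the sketch below only shows the underlying idea in Python, under the assumption that the mapper and reducer are written as plain functions. A tiny in-memory driver groups the mapper output by key and feeds it to the reducer, so the job logic can be checked without a cluster. All names here (`mapper`, `reducer`, `run_job`) are hypothetical.

```python
from collections import defaultdict


def mapper(_offset, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Sum the counts collected for one word."""
    yield word, sum(counts)


def run_job(records, map_fn, reduce_fn):
    """Run map, an in-memory shuffle, and reduce over (key, value) records."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):  # reducers see keys in sorted order
        output.extend(reduce_fn(key, groups[key]))
    return output


result = run_job([(0, "Hadoop hadoop pig"), (18, "pig hive")], mapper, reducer)
```

The same small inputs can also serve as sampled test data for the real job.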
  14. Small files
       • Small files are problematic: HDFS is built around large blocks, and every file costs NameNode memory and usually its own map task
       • Can pack them into bigger Avro files
       • Can move them to HBase
       • Can use Hadoop Archives (HAR) files
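The packing idea can be sketched as follows: many small files become (name, content) records inside one larger container. The length-prefixed layout below is a made-up stand-in chosen so the example needs only the standard library; it is not the Avro, SequenceFile, or HAR format, and `pack` and `unpack` are hypothetical names.

```python
import io
import struct


def pack(files: dict) -> bytes:
    """Pack {name: content} into one blob of length-prefixed records."""
    buffer = io.BytesIO()
    for name, content in files.items():
        name_bytes = name.encode("utf-8")
        # Header: two big-endian unsigned 32-bit lengths (name, content).
        buffer.write(struct.pack(">II", len(name_bytes), len(content)))
        buffer.write(name_bytes)
        buffer.write(content)
    return buffer.getvalue()


def unpack(blob: bytes):
    """Yield (name, content) records from a blob produced by pack()."""
    view = io.BytesIO(blob)
    while True:
        header = view.read(8)
        if not header:
            break  # end of container
        name_length, content_length = struct.unpack(">II", header)
        name = view.read(name_length).decode("utf-8")
        yield name, view.read(content_length)


small_files = {"a.log": b"first\n", "b.log": b"second\n", "empty.log": b""}
container = pack(small_files)
restored = dict(unpack(container))
```

One container file means one entry in the file system namespace instead of thousands, while the original names remain available as record keys.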
  15. TOOLS
  16. Pig
       • High-level language for data analysis
       • Uses Pig Latin to describe data flows (translated into MapReduce)
       • Filters, Joins, Projections, Groupings, Counts, etc.
       • Example:
           A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
           B = FOREACH A GENERATE name;
           DUMP B;
         Output:
           (John)
           (Mary)
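For readers unfamiliar with Pig Latin, the data flow in the Pig example corresponds roughly to the following plain Python, assuming tab-separated input (the default delimiter of PigStorage). The sample rows are hypothetical values chosen to match the output shown on the slide; this is an analogy and not how Pig executes the script.

```python
import csv
import io


def load(text):
    """Parse tab-separated rows into (name, age, gpa) tuples with typed fields."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [(name, int(age), float(gpa)) for name, age, gpa in reader]


# Hypothetical contents of the 'student' input.
student = "John\t18\t4.0\nMary\t19\t3.8\n"

A = load(student)                         # A = LOAD 'student' ... AS (name, age, gpa);
B = [(name,) for name, _age, _gpa in A]   # B = FOREACH A GENERATE name;
for row in B:                             # DUMP B;
    print("(" + ",".join(row) + ")")
```

Each Pig statement names a relation (a collection of tuples), which is why the projection yields one-element tuples instead of bare strings.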
  17. Hive
       • SQL-like interface: HiveQL
       • Imposes its own table structure on the data
       • Not a pipeline like Pig
       • Basically a distributed data warehouse
       • Has execution optimization
  18. HBase
       • Distributed, column-oriented store
       • Runs on top of HDFS but does not depend on Map/Reduce
       • Requests are served directly, with no translation into Map/Reduce
       • Stores data in HFiles (older versions used MapFiles, i.e. indexed SequenceFiles)
  19. PROVIDERS
  20. Apache
       • Umbrella for Hadoop projects
       • No commercial support
       • Active community
       • Most recent builds
  21. Cloudera
       • Has its own tuned build: CDH
       • Commercial support
       • Certification & training
       • Has products on top of Hadoop (such as Cloudera Manager)
       • Very high visibility
  22. Amazon Elastic MapReduce (EMR)
       • Custom build tailored for the AWS environment
       • Very easy to use
       • Uses S3 for storage
       • Uses SimpleDB for job flow state information
       • Supports HBase
  23. Hortonworks
       • Own platform on top of Hadoop
       • Big backers like Microsoft and Yahoo
       • Offers training & certification
  24. FUTURE
  25. Future
       • Percolator for incremental indexing and analysis of frequently changing datasets
       • Dremel for ad hoc analytics
       • Pregel for analyzing graph data
       • ZooKeeper & Hadoop decoupling with new execution engines to the rescue!
  26. Q/A
