Hadoop Ecosystem Overview
Transcript

  • 1. BigData Ecosystem Overview
    ● Log Formats
    ● Compression
    ● Collecting Data
    ● Distributed Storage
    ● Distributed Processing
    ● Workflow Management
    ● Realtime Read/Write Storage
    ● Others
  • 2. Log Formats
    ● Google Protocol Buffers
    ● Thrift
    Both provide: typed data (an int is an int, etc.), named fields, a flexible schema, and backwards compatibility.
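
    A minimal sketch of writing and reading a typed log record with Protocol Buffers in Java. AccessLog is a hypothetical class generated from a .proto schema (sketched in the comment); it is not something defined in the slides:

      // Hypothetical class generated from a schema sketch like:
      //   message AccessLog { string ip = 1; int64 timestamp = 2; }
      import java.io.IOException;

      public class ProtobufLogExample {
          public static void main(String[] args) throws IOException {
              // Build a typed, named record and serialize it to bytes.
              AccessLog record = AccessLog.newBuilder()
                      .setIp("10.0.0.1")
                      .setTimestamp(System.currentTimeMillis())
                      .build();
              byte[] wire = record.toByteArray();

              // Parse it back. Fields added in newer schema versions are
              // simply skipped by old readers: backwards compatibility.
              AccessLog parsed = AccessLog.parseFrom(wire);
              System.out.println(parsed.getIp());
          }
      }
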
  • 3. Compression
    Default (slowish)
    ● GZIP
    ● Bzip2 (splittable, but slowest)
    ● 7zip (only for long-term archival)
    Fastest
    ● Snappy (not splittable)
    ● LZO (not splittable)
  • 4. Compression: LZO/Snappy
    ● Saves disk space
    ● CPU/RAM impact is negligible
    ● Fewer bytes on disk make for faster reads and writes (see the sketch below)
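
    A minimal sketch of turning on Snappy compression for a MapReduce job through the standard Hadoop API; the job name is an assumption and the rest of the job setup is elided:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.io.compress.SnappyCodec;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class CompressionConfig {
          public static Job newCompressedJob() throws Exception {
              Job job = Job.getInstance(new Configuration(), "snappy-output");
              // Compress the final job output with Snappy: fewer bytes on
              // disk, negligible CPU cost compared to GZIP/Bzip2.
              FileOutputFormat.setCompressOutput(job, true);
              FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
              // Also compress intermediate map output to cut shuffle I/O.
              job.getConfiguration()
                 .setBoolean("mapreduce.map.output.compress", true);
              return job;
          }
      }
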
  • 5. Collecting Data
    File collection
    ● BigStreams
    Pub/Subscribe collection
    ● Kafka + Kafka-Collector
    ● Scribe
    ● Flume
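
    A minimal sketch of publishing log lines to Kafka; this uses the modern org.apache.kafka.clients producer API (newer than the client current when this talk was given), and the broker address and topic name are assumptions:

      import java.util.Properties;
      import org.apache.kafka.clients.producer.KafkaProducer;
      import org.apache.kafka.clients.producer.ProducerRecord;

      public class LogPublisher {
          public static void main(String[] args) {
              Properties props = new Properties();
              props.put("bootstrap.servers", "localhost:9092"); // assumed broker
              props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
              props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

              try (KafkaProducer<String, String> producer =
                           new KafkaProducer<>(props)) {
                  // Each log line becomes one message on the assumed topic.
                  producer.send(new ProducerRecord<>("access-logs",
                          "10.0.0.1 GET /index.html"));
              }
          }
      }
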
  • 6. Data Pipeline
    ● Reliable
    ● Auto-healing
    ● Caches data for N days
    ● Automatic
    ● Fast
    ● Distributed
  • 7. Distributed Storage
    ● HDFS (master/slaves)
    ● S3 (bucket, key → blob, no master)
    ● GridFS?
    ● NFS (not for high writes or big data)
  • 8. Hadoop Distributed File System
    Advantages
    ● Simple (more or less)
    ● Works with everyday hardware (cheap to scale)
    ● Proven scalability to petabytes
    ● Lends itself to efficient distributed batch processing
    Disadvantages
    ● Single point of failure (HA is a work in progress)
    ● All metadata must fit in the master's RAM
    ● No random reads/writes
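
    A minimal sketch of the sequential write/read model through the HDFS Java FileSystem API; the file path is an assumption and the cluster address comes from the default configuration:

      import java.nio.charset.StandardCharsets;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FSDataInputStream;
      import org.apache.hadoop.fs.FSDataOutputStream;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      public class HdfsExample {
          public static void main(String[] args) throws Exception {
              // Picks up fs.defaultFS from core-site.xml on the classpath.
              FileSystem fs = FileSystem.get(new Configuration());
              Path path = new Path("/logs/2013/01/01/access.log"); // assumed

              // Sequential write: HDFS is append-oriented, no random writes.
              try (FSDataOutputStream out = fs.create(path)) {
                  out.write("10.0.0.1 GET /index.html\n"
                          .getBytes(StandardCharsets.UTF_8));
              }

              // Sequential read back.
              try (FSDataInputStream in = fs.open(path)) {
                  byte[] buf = new byte[128];
                  int n = in.read(buf);
                  System.out.println(new String(buf, 0, n, StandardCharsets.UTF_8));
              }
          }
      }
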
  • 9. S3
    Advantages
    ● Cheap
    ● Distributed
    ● Good for data archival
    Disadvantages
    ● Data is stored externally
    ● Does not lend itself to batch processing of large volumes of data
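
    A minimal sketch of archiving a file to S3 with the AWS SDK for Java; the bucket name, key, and local file are assumptions, and credentials/region come from the SDK's default provider chain:

      import java.io.File;
      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.AmazonS3ClientBuilder;

      public class S3Archive {
          public static void main(String[] args) {
              AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
              // S3 is a flat (bucket, key) -> blob store: no master node.
              s3.putObject("my-log-archive",               // assumed bucket
                      "logs/2013/01/01/access.log.gz",     // assumed key
                      new File("/tmp/access.log.gz"));     // assumed file
          }
      }
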
  • 10. Distributed Processing
    ● Hadoop MapReduce
    ● GridGain
    ● Storm
    ● Akka Actors
  • 11. Hadoop MapReduce
    ● Used for distributed serial batch processing
    ● Works with HDFS
    ● Simple concept but complex APIs
    ● Lots of higher-level APIs for querying (Pig/Hive)
    ● Not for random indexed reads
    ● Not for small data, i.e. < 10 GB
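
    The canonical word count as a minimal sketch of the MapReduce API: the mapper emits (word, 1) pairs and the reducer sums them. Input and output paths are taken from the command line; everything else is the standard Hadoop API:

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {
          public static class TokenMapper
                  extends Mapper<Object, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(Object key, Text value, Context ctx)
                      throws IOException, InterruptedException {
                  for (String token : value.toString().split("\\s+")) {
                      if (token.isEmpty()) continue;
                      word.set(token);
                      ctx.write(word, ONE);  // emit (word, 1)
                  }
              }
          }

          public static class SumReducer
                  extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values,
                      Context ctx) throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) sum += v.get();
                  ctx.write(key, new IntWritable(sum)); // emit (word, total)
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word-count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenMapper.class);
              job.setCombinerClass(SumReducer.class);
              job.setReducerClass(SumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }
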
  • 12. GridGain
    ● Fast in-memory queries
    ● Not tied to any specific data storage
    ● API is Java/script based
  • 13. Storm
    ● In-memory
    ● Distributed
    ● Stream-based aggregation/processing
    ● Supports sending partially aggregated data to backends like HBase/Cassandra
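
    A minimal sketch of wiring a Storm topology, using the older backtype.storm package names from the period of this talk. LogSpout and CountBolt are hypothetical components (real ones would implement IRichSpout/IRichBolt), and the names and parallelism hints are assumptions:

      import backtype.storm.Config;
      import backtype.storm.LocalCluster;
      import backtype.storm.topology.TopologyBuilder;
      import backtype.storm.tuple.Fields;

      public class LogTopology {
          public static void main(String[] args) {
              TopologyBuilder builder = new TopologyBuilder();
              // Hypothetical spout emitting log lines (e.g. from Kafka).
              builder.setSpout("logs", new LogSpout(), 2);
              // Hypothetical bolt keeping partial counts in memory and
              // periodically flushing them to HBase/Cassandra.
              builder.setBolt("counts", new CountBolt(), 4)
                     .fieldsGrouping("logs", new Fields("word"));

              new LocalCluster().submitTopology("log-counts", new Config(),
                      builder.createTopology());
          }
      }
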
  • 14. Akka Actors
    ● Concurrent processing constructs based on the Erlang actor model
    ● Latest versions support distributed RPC communication via Netty or ZeroMQ
    ● Used for building fast distributed processing systems
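
    A minimal sketch of an actor in Akka's classic Java API (the UntypedActor style current when this talk was given); the actor and message are assumptions for illustration:

      import akka.actor.ActorRef;
      import akka.actor.ActorSystem;
      import akka.actor.Props;
      import akka.actor.UntypedActor;

      public class AkkaExample {
          // Hypothetical worker: processes one log line per message.
          public static class LineCounter extends UntypedActor {
              private long count = 0;

              @Override
              public void onReceive(Object message) {
                  if (message instanceof String) {
                      count++; // state is confined to the actor: no locks
                  } else {
                      unhandled(message);
                  }
              }
          }

          public static void main(String[] args) {
              ActorSystem system = ActorSystem.create("pipeline");
              ActorRef counter =
                      system.actorOf(Props.create(LineCounter.class), "counter");
              // Messages are queued and processed one at a time.
              counter.tell("10.0.0.1 GET /index.html", ActorRef.noSender());
          }
      }
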
  • 15. M/R High-Level Languages
    ● SQL → Hive
    ● Imperative → Pig
    ● Lisp → Cascalog
    ● R → Hive JDBC connection
  • 16. Apache Pig
    Advantages
    ● Simple and programmable
    ● UDF and Loader/Store APIs are simple
    ● Spills to disk to avoid OOM
    Disadvantages
    ● Low level
    ● Schema-less
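
    A minimal sketch of running Pig Latin from Java through the PigServer API; the input path, relation names, and output path are assumptions:

      import org.apache.pig.ExecType;
      import org.apache.pig.PigServer;

      public class PigExample {
          public static void main(String[] args) throws Exception {
              PigServer pig = new PigServer(ExecType.MAPREDUCE);
              // Schema-on-read: the AS clause names and types fields at load time.
              pig.registerQuery("logs = LOAD '/logs/access' USING PigStorage(' ') "
                      + "AS (ip:chararray, path:chararray);");
              pig.registerQuery("grouped = GROUP logs BY ip;");
              pig.registerQuery("hits = FOREACH grouped GENERATE group, COUNT(logs);");
              // Triggers the MapReduce job(s) and writes the result to HDFS.
              pig.store("hits", "/output/hits-per-ip");
          }
      }
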
  • 17. Hive
    Advantages
    ● SQL interface
    ● Server mode
    ● Fast
    Disadvantages
    ● Complex UDF and Load/Store (SerDe) APIs
    ● Does not spill to disk like Pig to avoid OOM
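
    A minimal sketch of the Hive JDBC connection mentioned on slide 15, against a HiveServer2 endpoint; the host, port, and table name are assumptions:

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.ResultSet;
      import java.sql.Statement;

      public class HiveJdbcExample {
          public static void main(String[] args) throws Exception {
              // HiveServer2 JDBC driver; hive-jdbc must be on the classpath.
              Class.forName("org.apache.hive.jdbc.HiveDriver");
              try (Connection conn = DriverManager.getConnection(
                           "jdbc:hive2://localhost:10000/default", "", "");
                   Statement stmt = conn.createStatement();
                   // Plain SQL, compiled down to MapReduce jobs by Hive.
                   ResultSet rs = stmt.executeQuery(
                           "SELECT ip, COUNT(*) FROM access_logs GROUP BY ip")) {
                  while (rs.next()) {
                      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                  }
              }
          }
      }
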
  • 18. Workflow Management
    ● Glue
    ● Oozie
    ● Azkaban
    ● Bash
  • 19. Glue
    ● Workflows for devops
    ● No XML
    ● Polyglot language approach: supports Groovy, Scala, Ruby (JRuby), Python (Jython), Clojure, JavaScript
    ● Data-driven and cron-based workflows
    ● Separates configuration from workflows
  • 20. Oozie
    ● XML
    ● UI for building workflows from blocks (you still have to program the components)
    ● Buy another pair of glasses
  • 21. Azkaban
    ● Based on flows, which consist of binaries described by a job text file
    ● Concentrates on generic scheduling and retries in the traditional sense
    ● Flow UI
  • 22. Bash
    ● Don't do workflows in bash
    ● Know your bash for simple ad-hoc searches and processing
    ● Again: do not do workflows in bash
  • 23. Realtime Read/Write Storage
    ● HBase and Accumulo
    ● Cassandra
  • 24. HBase and Accumulo
    ● Both are based on the BigTable paper from Google
    ● Column-based storage
    ● Integrates with HDFS
    ● Tables act as distributed indexes
    ● Region servers are single points of failure
    ● Aimed at faster reads than writes
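
    A minimal sketch of a put and an indexed point read through the HBase Java client (the modern Connection/Table API, which postdates this talk); the table, column family, and row key are assumptions:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.TableName;
      import org.apache.hadoop.hbase.client.Connection;
      import org.apache.hadoop.hbase.client.ConnectionFactory;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.Table;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseExample {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              try (Connection conn = ConnectionFactory.createConnection(conf);
                   Table table = conn.getTable(TableName.valueOf("metrics"))) {
                  // Write one cell: row key + column family "d" + qualifier.
                  Put put = new Put(Bytes.toBytes("pageviews:2013-01-01"));
                  put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("count"),
                          Bytes.toBytes(42L));
                  table.put(put);

                  // The table is a distributed index: point reads by row key.
                  Result r = table.get(
                          new Get(Bytes.toBytes("pageviews:2013-01-01")));
                  long count = Bytes.toLong(
                          r.getValue(Bytes.toBytes("d"), Bytes.toBytes("count")));
                  System.out.println(count);
              }
          }
      }
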
  • 25. Cassandra
    ● Based on the Dynamo paper from Amazon
    ● No single point of failure
    ● Aimed at faster writes than reads
    ● Eventually consistent by default, with configurable consistency/durability options (at the cost of write speed)
    ● Column counters
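
    A minimal sketch of a plain write and a counter increment using the DataStax Java driver's CQL API; the node address, keyspace, and table definitions are assumptions:

      import com.datastax.driver.core.Cluster;
      import com.datastax.driver.core.Session;

      public class CassandraExample {
          public static void main(String[] args) {
              Cluster cluster = Cluster.builder()
                      .addContactPoint("127.0.0.1") // any node: no master
                      .build();
              try {
                  Session session = cluster.connect("analytics"); // assumed keyspace
                  // Plain write: eventually consistent at the default level.
                  session.execute(
                      "INSERT INTO events (id, ip) VALUES (now(), '10.0.0.1')");
                  // Counter column: distributed increments, no read-modify-write.
                  session.execute(
                      "UPDATE pageviews SET hits = hits + 1 "
                      + "WHERE page = '/index.html'");
              } finally {
                  cluster.close(); // also closes its sessions
              }
          }
      }
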
  • 26. Others
    ● Lucene (API for building fast indexes)
    ● Solr and Elasticsearch
      – Built on top of Lucene
      – Distributed indexes
      – Fast query times
    ● MongoDB (document DB)
    ● Redis (fast in-memory DB)
      – Lots of basic constructs; easy to build bloom filters (see the sketch below)
      – Great for realtime
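
    A minimal sketch of the bloom-filter point: Redis's SETBIT/GETBIT primitives are enough to hand-roll one. This uses the Jedis client; the key name, filter size, and the toy hash function are assumptions (a real bloom filter needs k independent, well-distributed hashes):

      import redis.clients.jedis.Jedis;

      public class RedisBloomSketch {
          private static final long BITS = 1 << 20; // assumed 1M-bit filter

          // Toy hash; a real filter would use k independent hash functions.
          private static long bit(String value, int seed) {
              return Math.floorMod(value.hashCode() * 31L + seed, BITS);
          }

          public static void main(String[] args) {
              try (Jedis redis = new Jedis("localhost")) {
                  // Add a member: set its k bits.
                  for (int seed = 0; seed < 3; seed++) {
                      redis.setbit("seen-ips", bit("10.0.0.1", seed), true);
                  }
                  // Membership test: all bits set means "probably seen".
                  boolean maybeSeen = true;
                  for (int seed = 0; seed < 3; seed++) {
                      maybeSeen &= redis.getbit("seen-ips", bit("10.0.0.1", seed));
                  }
                  System.out.println(maybeSeen); // true (possibly a false positive)
              }
          }
      }
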
  • 27. References
    http://cassandra.apache.org/
    http://hbase.apache.org/
    http://hadoop.apache.org/
    http://hive.apache.org/
    http://pig.apache.org/
    http://storm-project.net/
    http://lucene.apache.org/core/
    http://lucene.apache.org/solr/
    http://www.elasticsearch.org/
    https://github.com/nathanmarz/cascalog
    http://akka.io/
    http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
    http://gerritjvv.github.io/glue/
    https://code.google.com/p/bigstreams/
    https://github.com/facebook/scribe
    http://flume.apache.org/
    http://redis.io/
    http://www.mongodb.org/
    http://netty.io/
    http://zeromq.org/
    http://kafka.apache.org/
    http://www.gridgain.com/
    http://docs.mongodb.org/manual/core/gridfs/
    http://aws.amazon.com/s3/