Hadoop Ecosystem Overview


Published in: Technology

  1. Big Data Ecosystem overview
     ● Log Formats
     ● Compression
     ● Collecting Data
     ● Distributed Storage
     ● Distributed Processing
     ● Workflow Management
     ● Realtime Read Write Storage
     ● Others
  2. Log Formats
     ● Google Protocol Buffers
     ● Thrift
     Typed data (an int is an int, etc.)
     Named, flexible schema
     Backwards compatibility
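The three properties on this slide (typed data, a named flexible schema, backwards compatibility) can be illustrated with a small sketch. This is not the Protocol Buffers or Thrift API; it uses JSON and standard-library dataclasses only to mimic the behaviour, and `LogEventV1`, `LogEventV2` and `decode` are hypothetical names made up for the example.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class LogEventV1:
    user_id: int
    url: str


@dataclass
class LogEventV2:
    user_id: int
    url: str
    referrer: str = ""  # field added later; the default keeps old records readable


def decode(cls, payload):
    """Decode a JSON record into cls, ignoring unknown fields and checking types."""
    raw = json.loads(payload)
    # Unknown names are skipped, so an old reader tolerates records from a new writer.
    known = {f.name: raw[f.name] for f in fields(cls) if f.name in raw}
    event = cls(**known)  # missing fields fall back to their declared defaults
    for f in fields(cls):
        if not isinstance(getattr(event, f.name), f.type):
            raise TypeError(f"{f.name} must be {f.type.__name__}")
    return event
```

A record written with the old schema decodes with the new class (the added field takes its default), and a record written with the new schema decodes with the old class (the extra field is dropped).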
  3. Compression
     Default (slowish)
     ● GZIP (not splittable)
     ● Bzip2 (slowest, splittable)
     ● 7zip (only for long-term archival)
     Fastest
     ● Snappy (not splittable)
     ● LZO (not splittable)
  4. Compression LZO/Snappy
     ● Saves disk space
     ● CPU/RAM impact is negligible
     ● Fewer bytes on disk make for faster reads and writes
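The speed versus size trade-off on the two compression slides can be measured directly. The sketch below is an illustration only: LZO and Snappy are not in the Python standard library, so `zlib` at level 1 stands in as the "fast" codec (an assumption, not a benchmark of Snappy itself), and the sample log line is made up.

```python
import bz2
import time
import zlib


def measure(name, compress, data):
    """Return (name, compressed size in bytes, seconds taken) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(packed), elapsed


# Repetitive, log-like input (hypothetical sample data).
sample = b"2013-05-01 12:00:00 GET /index.html 200 user=42\n" * 20000

results = [
    measure("zlib level 1 (fast)", lambda d: zlib.compress(d, 1), sample),
    measure("zlib level 9 (small)", lambda d: zlib.compress(d, 9), sample),
    measure("bz2", bz2.compress, sample),
]

print(f"raw input: {len(sample)} bytes")
for name, size, seconds in results:
    print(f"{name:<22} {size:>8} bytes  {seconds * 1000:8.2f} ms")
```

Timings depend on the machine and the data, so the numbers are only useful for comparing codecs against each other on your own logs.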
  5. Collecting Data
     File collection
     ● BigStreams
     Publish/subscribe collection
     ● Kafka + Kafka-Collector
     ● Scribe
     ● Flume
  6. Data Pipeline
     ● Reliable
     ● Auto healing
     ● Cache data for N days
     ● Automatic
     ● Fast
     ● Distributed
  7. Distributed Storage
     ● HDFS (master/slaves)
     ● S3 (bucket, key → blob, no master)
     ● GridFS ?
     ● NFS (not for high write volumes or big data)
  8. Hadoop Distributed File System
     Advantages
     ● Simple (more or less)
     ● Works with everyday hardware (cheap to scale)
     ● Proven scalability to petabytes
     ● Lends itself to efficient distributed batch processing
     Disadvantages
     ● Single point of failure (HA is a work in progress)
     ● All metadata must fit in the master's RAM
     ● No random reads/writes
  9. S3
     Advantages
     ● Cheap
     ● Distributed
     ● Good for data archival
     Disadvantages
     ● Data is stored externally
     ● Does not lend itself to batch processing of large volumes of data
  10. Distributed Processing
     ● Hadoop MapReduce
     ● GridGain
     ● Storm
     ● Akka Actors
  11. Hadoop MapReduce
     ● Used for distributed serial batch processing
     ● Works with HDFS
     ● Simple concept but complex APIs
     ● Lots of higher-level APIs for querying (Pig/Hive)
     ● Not for random indexed reads
     ● Not for small data, i.e. < 10 GB
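The "simple concept" behind MapReduce can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases using a word count; it does not use the Hadoop API, and the function names are chosen for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is where the batch orientation and the unsuitability for small data come from.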
  12. GridGain
     ● Fast in-memory queries
     ● Not attached to any specific data store
     ● API is Java/script based
  13. Storm
     ● In-memory
     ● Distributed
     ● Stream-based aggregation/processing
     ● Supports sending partially aggregated data to backends like HBase/Cassandra
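The last point, sending partially aggregated data to a backend, can be sketched as follows. This is not the Storm API: `PartialAggregator` is a hypothetical class, and a plain dict stands in for a backend such as HBase or Cassandra.

```python
from collections import Counter


class PartialAggregator:
    """Count events in memory and periodically merge the partial counts into a backend."""

    def __init__(self, backend, flush_every):
        self.backend = backend          # dict standing in for a remote store
        self.flush_every = flush_every  # number of events between flushes
        self.pending = Counter()
        self.seen = 0

    def on_event(self, key):
        """Handle one stream event; flush after every flush_every events."""
        self.pending[key] += 1
        self.seen += 1
        if self.seen % self.flush_every == 0:
            self.flush()

    def flush(self):
        """Add the partial counts to the backend totals and reset the buffer."""
        for key, count in self.pending.items():
            self.backend[key] = self.backend.get(key, 0) + count
        self.pending.clear()
```

Batching the writes this way trades a short delay in the backend totals for far fewer write operations than one write per event.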
  14. Akka Actors
     ● Concurrent processing constructs based on the Erlang actor model
     ● Latest versions support distributed RPC communication via Netty or ZeroMQ
     ● Used for building fast distributed processing systems
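The actor model itself is small enough to sketch without Akka. The example below is an assumption-laden illustration in Python, using a thread and a queue as the mailbox; `CounterActor` and its methods are hypothetical names and do not mirror the Akka API.

```python
import queue
import threading


class CounterActor:
    """An actor that owns its state and changes it only by processing messages."""

    _STOP = object()  # sentinel message used to shut the actor down

    def __init__(self):
        self._mailbox = queue.Queue()
        self.total = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Send a message asynchronously; the caller never touches actor state."""
        self._mailbox.put(message)

    def stop(self):
        """Ask the actor to finish the queued messages and wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        # Messages are handled one at a time, so no locks are needed on self.total.
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.total += message
```

Because only the actor's own thread mutates `total`, any number of senders can call `tell` concurrently without explicit locking.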
  15. M/R High level languages
     SQL
     • Hive
     Imperative
     • Pig
     Lisp
     • Cascalog
     R
     • Hive JDBC Connection
  16. Apache Pig
     Advantages
     ● Simple and programmable
     ● UDF and Loader/Store APIs are simple
     ● Spills to disk to avoid OOM
     Disadvantages
     ● Low level
     ● Schema-less
  17. Hive
     Advantages
     ● SQL interface
     ● Server mode
     ● Fast
     Disadvantages
     ● Complex UDF Load/Store (SerDe) API
     ● Does not spill to disk like Pig to avoid OOM
  18. Workflow Management
     ● Glue
     ● Oozie
     ● Azkaban
     ● Bash
  19. Glue
     ● Workflows for devops
     ● No XML
     ● Polyglot language approach supports Groovy, Scala, Ruby (JRuby), Python (Jython), Clojure, JavaScript
     ● Data-driven and cron-based workflows
     ● Separates configuration from workflows
  20. Oozie
     ● XML
     ● UI for building workflows from blocks (you still have to program the components)
     ● Buy another pair of glasses
  21. Azkaban
     ● Based on flows
       – These consist of binaries described by a job text file
     ● Concentrates on generic scheduling and retries in a traditional sense
     ● Flow UI
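The idea of a flow, meaning jobs with dependencies that are scheduled in order and retried on failure, can be sketched with the standard library. This is not Azkaban's job file format or API; `run_flow`, `jobs` and `dependencies` are hypothetical names, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def run_flow(jobs, dependencies, max_retries=2):
    """Run callables in dependency order, retrying each failed job a few times.

    jobs:         dict mapping job name -> zero-argument callable
    dependencies: dict mapping job name -> set of job names it depends on
    """
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                jobs[name]()
                completed.append(name)
                break
            except Exception:
                # Give up and abort the flow once the retry budget is spent.
                if attempt == max_retries:
                    raise
    return completed
```

A flow such as `{"transform": {"extract"}, "load": {"transform"}}` runs extract, then transform, then load, and a job that fails once is simply run again before the flow moves on.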
  22. Bash
     ● Don't do workflows in bash
     ● Know your bash for simple ad hoc searches and processing
     ● Again, do not do workflows in bash
  23. Realtime Read Write Storage
     ● HBase and Accumulo
     ● Cassandra
  24. HBase and Accumulo
     ● Both are based on the BigTable paper from Google
     ● Column-based storage
     ● Integrates with HDFS
     ● Tables act as distributed indexes
     ● Region Servers are single points of failure
     ● Aimed at faster reads than writes
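"Tables act as distributed indexes" follows from rows being kept sorted by row key, which makes range scans cheap. The sketch below shows that idea on a single machine; `SortedTable` is a hypothetical class and not the HBase or Accumulo client API.

```python
import bisect


class SortedTable:
    """Keep rows ordered by row key so that key-range scans are cheap."""

    def __init__(self):
        self._keys = []  # row keys, always kept in sorted order
        self._rows = {}  # row key -> value

    def put(self, row_key, value):
        """Insert or overwrite a row, keeping the key list sorted."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def scan(self, start, stop):
        """Return rows with start <= row key < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(key, self._rows[key]) for key in self._keys[lo:hi]]
```

Choosing row keys so that related rows sort next to each other (for example a user id followed by a timestamp) turns "all events for one user" into a single contiguous scan.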
  25. Cassandra
     ● Based on the Dynamo paper from Amazon
     ● No single point of failure
     ● Aimed at faster writes than reads
     ● Eventual consistency by default, with configurable durability options (at the cost of write speed)
     ● Column counters
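The trade-off between consistency and write speed in Dynamo-style stores is usually expressed with the replica count N, the number of replicas that must acknowledge a write W, and the number consulted on a read R. The sketch below encodes the standard overlap rule R + W > N; the function names are hypothetical and no Cassandra driver is involved.

```python
def quorum(replicas):
    """Smallest number of replicas that forms a majority."""
    return replicas // 2 + 1


def reads_see_latest_write(replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > replicas
```

With three replicas, writing and reading at quorum (2 and 2) guarantees overlap, while writing and reading a single replica (1 and 1) is faster but only eventually consistent.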
  26. Others
     ● Lucene (API for building fast indexes)
     ● Solr and Elasticsearch
       – Built on top of Lucene
       – Distributed indexes
       – Fast query times
     ● MongoDB (document db)
     ● Redis (fast in-memory db)
       – Lots of basic constructs, easy to build bloom filters
       – Great for realtime
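A bloom filter of the kind mentioned for Redis needs only a bit array and a few hash functions. The sketch below keeps the bits in a Python integer so that it runs on its own; in Redis the same bit positions could be set and tested with the `SETBIT` and `GETBIT` commands. The class name, sizes and hashing scheme are assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, a small chance of false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bit array

    def _positions(self, item):
        """Derive num_hashes bit positions for an item from seeded SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        """False means definitely absent; True means probably present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

A typical use is to check the filter before an expensive lookup and skip the lookup whenever `might_contain` returns False.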
  27. References
     http://cassandra.apache.org/
     http://hbase.apache.org/
     http://hadoop.apache.org/
     http://hive.apache.org/
     http://pig.apache.org/
     http://storm-project.net/
     http://lucene.apache.org/core/
     http://lucene.apache.org/solr/
     http://www.elasticsearch.org/
     https://github.com/nathanmarz/cascalog
     http://akka.io/
     http://www.datastax.com/dev/blog/big-analytics-with-r-cassandra-and-hive
     http://gerritjvv.github.io/glue/
     https://code.google.com/p/bigstreams/
     https://github.com/facebook/scribe
     http://flume.apache.org/
     http://redis.io/
     http://www.mongodb.org/
     http://netty.io/
     http://zeromq.org/
     http://kafka.apache.org/
     http://www.gridgain.com/
     http://docs.mongodb.org/manual/core/gridfs/
     http://aws.amazon.com/s3/