Overview of Big data zoo

Explains different open-source big data tools and where they fit.

Published in: Technology, Education

  1. Data Analysis as a Service. Iou Fag(halv)dag, 2014. Gurvinder Singh, Uninett
  2. Data is the King
  3. Big-Data is ...... ?
  4. Big-Data is relative
  5. What the hype is: cheap commodity hardware with amazing computing and storage capacity ... but this time software is also catching up with the hardware.
  6. The hype ingredient list: cheap commodity hardware; good network capacity; software based on the principle of "Divide and Conquer", and thus able to scale out horizontally.
  7. Storage
  8. Unstructured Storage: store data reliably, cheaply and scalably; Hadoop Distributed File System (HDFS); divide data into smaller chunks; heterogeneous storage medium support; similar DFSs, e.g. Lustre, IBM GPFS, Ceph, MooseFS.
  9. Structured Storage: store structured data reliably, scalably and indexed; NoSQL databases to store structured data; HBase and Accumulo store their underlying data in HDFS; many more in the big data zoo: Cassandra, VoltDB, NuoDB...; BlinkDB offers a tradeoff between accuracy and response time; full-text search is offered by Elasticsearch and Solr.
  10. Processing: MapReduce methodology to process data in a distributed fashion; data locality with Hadoop MapReduce and HDFS; Spark supports MapReduce and utilizes the system's and cluster's RAM; supports machine learning algorithms; supports Python, Scala, Java; supports R, a framework for data scientists; Hive, Shark, Pig to process structured data in a distributed way.
  11. Some performance numbers to guide: L1 cache reference 0.5 ns; L2 cache reference 7 ns; RAM reference 100 ns (Queen); Flash IO card reference 75,000 ns (Princess); RTT within same datacenter 500,000 ns; Disk reference 10,000,000 ns.
  12. THE END. By Gurvinder Singh
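
The "divide data into smaller chunks" point on the unstructured storage slide can be illustrated with a toy sketch. This is not HDFS code: the function names `split_into_chunks` and `place_chunks`, the tiny chunk size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks and its own replica placement policy.

```python
from itertools import cycle


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size chunks (the last one may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(chunks: list[bytes], nodes: list[str]) -> dict[str, list[int]]:
    """Assign chunk indices to nodes round-robin (a hypothetical policy)."""
    placement: dict[str, list[int]] = {node: [] for node in nodes}
    for index, node in zip(range(len(chunks)), cycle(nodes)):
        placement[node].append(index)
    return placement


# Illustrative input: 10 bytes cut into chunks of 4 bytes, spread over 2 nodes.
data = b"abcdefghij"
chunks = split_into_chunks(data, chunk_size=4)
placement = place_chunks(chunks, ["node-a", "node-b"])
```

Because each chunk is independent, more nodes can be added to hold more chunks, which is the horizontal scale-out the "Divide and Conquer" slide refers to.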
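
The MapReduce methodology named on the processing slide can be sketched in a single process. The example below is a plain-Python word count with explicit map, shuffle, and reduce phases. It does not use Hadoop or Spark, and the function names and sample input are assumptions chosen for illustration only.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is relative", "data is the king"])
```

In a distributed setting each input line (or chunk) could be mapped on the node that stores it, which is the data locality the slide mentions. Here everything runs locally to keep the sketch self-contained.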
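
To put the performance numbers slide in perspective, the short sketch below expresses each latency as a multiple of a RAM reference. The figures are copied from the slide. The dictionary name `LATENCY_NS` and the helper `relative_to` are assumptions introduced for this example.

```python
# Latencies in nanoseconds, as listed on the performance numbers slide.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "RAM reference": 100,
    "Flash IO card reference": 75_000,
    "RTT within same datacenter": 500_000,
    "Disk reference": 10_000_000,
}


def relative_to(baseline: str) -> dict[str, float]:
    """Express every latency as a multiple of the chosen baseline entry."""
    base = LATENCY_NS[baseline]
    return {name: ns / base for name, ns in LATENCY_NS.items()}


ratios = relative_to("RAM reference")
for name, ratio in ratios.items():
    print(f"{name:<28} {ratio:>14,.3f} x RAM")
```

On these figures a disk reference costs as much as 100,000 RAM references and a flash reference as much as 750, which helps explain the appeal of keeping working data in cluster RAM.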
