The ABC of Big Data

  1. Big Data ABC @andrefaria
  2. Concepts
  3. Relational
  4. Key Value Stores: Riak, Memcached, Berkeley DB, HamsterDB, Couchbase, Voldemort, DynamoDB
  5. Document Stores: think of document databases as key-value stores where the value is examinable; they add a rich query language and indexes. MongoDB, CouchDB, Terrastore, OrientDB, RavenDB
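The key-value model behind all of the products listed above can be sketched in a few lines: the store only supports get/put/delete by key and treats the value as an opaque blob. This is a hypothetical in-process stand-in, not the API of any particular product:

```python
# Minimal sketch of the key-value model: lookup by key only,
# value is an opaque blob the store never inspects.

class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", b'{"name": "Ana"}')  # the store cannot query inside this
print(store.get("user:42"))
```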
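The difference from a plain key-value store is that the value is examinable, so the store can answer queries against fields inside the document. A hypothetical in-memory stand-in (real document stores add indexes and a full query language):

```python
# Sketch: documents are structured values, and queries can
# match on fields inside them.

docs = {
    "u1": {"name": "Ana", "city": "Lisbon", "age": 34},
    "u2": {"name": "Bruno", "city": "Porto", "age": 28},
}

def find(collection, **criteria):
    """Return documents whose fields match all the given criteria."""
    return [d for d in collection.values()
            if all(d.get(k) == v for k, v in criteria.items())]

print(find(docs, city="Porto"))  # [{'name': 'Bruno', 'city': 'Porto', 'age': 28}]
```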
  6. Column Family Stores: Each column family can be compared to a container of rows in an RDBMS table, where the key identifies the row and the row consists of multiple columns. The difference is that rows do not have to have the same columns, and a column can be added to any row at any time without having to add it to other rows. Cassandra, HBase, Hypertable
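The flexible-rows property described above can be sketched with plain dicts: a row key maps to a set of named columns, and adding a column to one row does not affect any other row.

```python
# Sketch of a column family: row key -> {column name: cell value}.
# Rows in the same family may hold different columns.

users = {}

users["row1"] = {"name": "Ana", "email": "ana@example.com"}
users["row2"] = {"name": "Bruno"}           # fewer columns is fine
users["row2"]["phone"] = "+351-000-0000"    # add a column to one row only

print(sorted(users["row1"]))  # ['email', 'name']
print(sorted(users["row2"]))  # ['name', 'phone']
```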
  7. Graph Databases: Neo4J, Infinite Graph, OrientDB, FlockDB
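The graph model behind these products stores nodes connected by typed edges and answers queries by traversal rather than joins. A tiny in-memory illustration (not any product's real API):

```python
# Sketch of the graph model: typed edges, queried by traversal.

edges = [
    ("alice", "FOLLOWS", "bob"),
    ("bob",   "FOLLOWS", "carol"),
    ("alice", "FOLLOWS", "carol"),
]

def neighbours(node, rel):
    """Nodes reachable from `node` over one edge of type `rel`."""
    return {dst for src, r, dst in edges if src == node and r == rel}

print(neighbours("alice", "FOLLOWS"))  # {'bob', 'carol'}
```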
  8. Map Reduce
  9. Sharding
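The sharding concept above can be sketched as routing each key to one of N shards with a stable hash, so data and load spread across machines. This is a naive hash-mod scheme for illustration; real systems often prefer consistent hashing so that resharding moves fewer keys:

```python
# Sketch of sharding: a stable hash of the key, modulo the shard count.
import zlib

N_SHARDS = 4

def shard_for(key):
    """Route a key to one of N_SHARDS shards, deterministically."""
    return zlib.crc32(key.encode()) % N_SHARDS

print(shard_for("user:42"))  # the same key always routes to the same shard
```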
  10. MongoDB: Document-oriented system; handles large data sets; records are similar to JSON; automatic sharding and MapReduce; queries are written in JavaScript
  11. CouchDB: Document-oriented system; JavaScript interface; multi-version concurrency control approach, so the client side needs to handle clashes on writes; no good built-in method for horizontal scalability (but there are external solutions like BigCouch, Lounge, and Pillow)
  12. Cassandra: Originally an internal Facebook project; keyspaces and column families; similar to the data model used by BigTable; data is sharded and balanced automatically
  13. Redis: Keeps the entire database in RAM; its values can be complex data structures
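The in-memory store described above (the description matches Redis) keeps every value in RAM, and a value can be a rich structure such as a list or a set rather than a plain string. Plain Python structures as a stand-in for the real server:

```python
# Sketch: the whole "database" lives in memory, and values
# can be structured, not just strings.

db = {}

db["recent:logins"] = []                 # a list value
db["recent:logins"].append("ana")
db["recent:logins"].append("bruno")

db["tags:post1"] = {"bigdata", "nosql"}  # a set value
db["tags:post1"].add("inmemory")

print(db["recent:logins"])  # ['ana', 'bruno']
```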
  14. BigTable: Structure of tables, row keys, column families, column names, timestamps, and cell values; designed to handle very large data loads by running on big clusters of commodity hardware; uses GFS (the Google File System) as its underlying storage
  15. HBase: Open source clone of BigTable; same structure as BigTable; uses HDFS instead of GFS
  16. Hypertable: Another open source BigTable clone, written in C++ with a focus on high performance
  17. DynamoDB: Key-value system; large distributed clusters; versioning
  18. AWS S3: Blobs over HTTP
  19. Riak: Inspired by Amazon's Dynamo; open source and commercial versions; key-value system; large distributed clusters; queries in Erlang or JavaScript; consistent hashing and a gossip protocol to avoid a centralized index server
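The consistent hashing mentioned above can be sketched as a hash ring: each node owns a point on the ring, and a key is stored on the first node clockwise from the key's own hash, so adding or removing a node only moves the keys in its arc. A toy single-process version:

```python
# Sketch of consistent hashing: nodes and keys share one hash space.
import bisect
import hashlib

def ring_hash(value):
    """Deterministic position on the ring for a node name or a key."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self._ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        """First node at or after the key's position, wrapping around."""
        points = [p for p, _ in self._ring]
        i = bisect.bisect(points, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))
```

Real systems such as Riak also place multiple virtual points per node on the ring to even out the load; that refinement is omitted here.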
  20. MapReduce takes care of running your code across a cluster of machines: chunking up the input data, sending it to each machine, running your code on each chunk, checking that the code ran, passing any results to the next stage, sorting between stages, sending each chunk of that sorted data to the right machine, and writing debugging information on each job's progress
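The steps above can be sketched in miniature: a map function emits (key, value) pairs from each input chunk, the framework groups pairs by key (the "sorting between stages"), and a reduce function folds each group into a result. A single-process toy word count; Hadoop does the same across machines:

```python
# Toy MapReduce word count: map, shuffle/sort, reduce.
from collections import defaultdict

def map_phase(chunk):
    for word in chunk.split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

chunks = ["big data is big", "data is everywhere"]

# shuffle/sort: group the mapped pairs by key
groups = defaultdict(list)
for chunk in chunks:
    for word, count in map_phase(chunk):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```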
  21. With Hive, you can program Hadoop jobs using HiveQL, a SQL-like language.
  22. Apache Pig: A procedural data processing language designed for Hadoop; provides a set of functions that help with common data processing problems
  23. PigPen is map-reduce for Clojure, or distributed Clojure. It compiles to Apache Pig, but you don't need to know much about Pig to use it.
  24. Hadoop for Logs The Flume project is designed to make the data gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files.
  25. The R project is both a specialized language and a toolkit of modules aimed at anyone working with statistics.
  26. Lucene is a Java library that handles indexing and searching large collections of documents, and Solr is an application that uses the library to build a search engine server.
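The core structure behind Lucene-style search is the inverted index: a map from each term to the documents that contain it, which makes "which documents mention these words?" a set intersection instead of a scan. A toy version (Lucene adds tokenizers, scoring, and on-disk segments):

```python
# Toy inverted index: term -> set of document ids containing it.
from collections import defaultdict

documents = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Documents containing every query term (AND search)."""
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("quick", "fox"))  # {1, 3}
```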
  27. Mahout is an open source framework that can run common machine learning algorithms on massive datasets. The framework makes it easy to use analysis techniques to implement features such as “People who bought this also bought” recommendation engine on your own site.
  28. ZooKeeper Coordinates work and configuration of different Clusters
  29. Serialization: as you pass data between systems, you need to store it in files at some point
  30. JSON; BSON (binary JSON); Apache Thrift (predefined structure); Apache Avro (predefined structure)
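The trade-off in the list above can be shown with the stdlib: JSON is a self-describing text format, while schema-based formats like Thrift and Avro trade that flexibility for compactness and a predefined structure. A round trip through JSON:

```python
# Serialize a record to a string for the wire or a file, then restore it.
import json

record = {"user": "andrefaria", "followers": 1200, "active": True}

wire = json.dumps(record)    # serialize
print(wire)
restored = json.loads(wire)  # deserialize on the other side
assert restored == record    # the round trip is lossless for JSON types
```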
  31. @andrefaria