Systems for Big Data Processing


Published in: Technology
  • A distributed design is usually a second- or third-generation architectural choice. Most systems begin their journey as a simple application running on a web server talking to an RDBMS.

    Examples:

    Amazon started out more than 15 years ago as an application based on this simple web-app-talking-to-db architecture. This C++ application, called Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon is now famous for: similarities, recommendations, reviews, etc. For years the scaling efforts at Amazon focused on making the back-end databases scale to hold more items, more customers, and more orders, and to support multiple international sites. This went on until 2001, when it became clear that this monolithic application couldn’t scale anymore.

    Twitter launched in 2006 and hit the mainstream in 2008. It started out as a simple Ruby on Rails application talking to a MySQL database.
  • The watershed moment in the history of most large-scale systems comes when their teams stop thinking of the system as just a simple web app and shift to viewing it as a fully distributed, decentralized platform. The benefits of this approach are many.

    First, service orientation offers a whole new level of isolation. By clearly articulating ownership boundaries, it enables teams to take greater control and ownership of the services they develop and operate. It also allows dependencies to be clearly specified, makes testing easier, and allows small teams to work independently.

    Second, preventing direct access to the underlying data store used by a service gives us the freedom to change the service's underlying implementation without modifying the rest of the system. This comes in really handy when you want to make reliability and scalability improvements without having to involve your clients. As long as the information contracts and SLAs are adhered to, the clients of a service do not have to be involved in the process.
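The point about hiding the data store behind a service contract can be illustrated with a minimal sketch. Everything here is hypothetical and chosen for illustration only: `ReviewService`, the `Store` contract and the two backends are not taken from Amazon's or anyone else's real system. The sketch assumes a service whose clients only ever call `add_review` and `get_review`.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class Store(Protocol):
    """Private storage contract (hypothetical); clients never see it."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class DictStore:
    """First-generation backend: a plain in-memory dictionary."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class LogStore:
    """Replacement backend: an append-only log, scanned newest-first."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, str]] = []

    def put(self, key: str, value: str) -> None:
        self._log.append((key, value))

    def get(self, key: str) -> Optional[str]:
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return None


class ReviewService:
    """Public contract: callers use only add_review and get_review."""

    def __init__(self, store: Store) -> None:
        self._store = store  # implementation detail, free to change

    def add_review(self, item_id: str, text: str) -> None:
        self._store.put(item_id, text)

    def get_review(self, item_id: str) -> Optional[str]:
        return self._store.get(item_id)
```

Swapping `DictStore()` for `LogStore()` when constructing `ReviewService` changes nothing for callers, which is the freedom the note describes.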
  • The best way to understand distributed systems is to look at the real, production-grade systems that have been built in recent times. In the remainder of this talk I am going to walk through what the landscape of Big Data systems looks like. Along the way I’ll also describe some of the key systems in each category, and hopefully one of them will arouse your curiosity and spark further experimentation in the area of your choice.
  • A DFS is any file system that allows files to be accessed from multiple hosts over a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources. Clients don't have direct access to the underlying block storage and instead interact with it over the network using a protocol. Some of the more modern DFSs have started to include facilities for replication and fault tolerance. In fact, replication and fault tolerance account for the bulk of the features they have.

    Trends in this space:

    Several research-ware DFSs have been built and continue to be redesigned in light of feedback and experience. But the systems that brought DFSs into the mainstream were first the Google File System (GFS) and, following closely on its heels, the Hadoop Distributed File System (HDFS). The differentiating aspect of GFS is that its design was motivated by actual application workload characteristics.

    Both GFS and HDFS (an open-source implementation inspired by GFS) are designed around a master-slave model. The master, which is responsible for managing metadata, is called the Name Node (in HDFS terminology), and the slaves, which actually store the data, are called Data Nodes. The whole system has only one Name Node, with which multiple Data Nodes coordinate.

    Most distributed file systems have, or are exploring, truly distributed implementations of the namespace/name node. For instance, the Ceph filesystem implements the metadata service as a cluster of metadata servers. GFS has also moved to a distributed namespace implementation with hundreds of namespace servers, each master managing about 100 million files. So exploring newer ways of scaling the name node/metadata service could be a good topic for active research.
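The split between a metadata master and block-storing data nodes can be sketched as a toy, single-process model. This is not the HDFS or GFS API: the class names, `BLOCK_SIZE`, `REPLICAS`, the round-robin placement and the `failed` parameter are all assumptions made for illustration. The sketch only shows the division of labour: the name node tracks where blocks live, while bytes are written to and read from the data nodes.

```python
from typing import Dict, FrozenSet, List, Tuple

BLOCK_SIZE = 4   # bytes per block; tiny on purpose (assumption)
REPLICAS = 2     # copies kept of every block (assumption)


class DataNode:
    """Stores block contents, keyed by block id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[str, bytes] = {}


Layout = List[Tuple[str, List[DataNode]]]


class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""

    def __init__(self, data_nodes: List[DataNode]) -> None:
        self.data_nodes = data_nodes
        self.files: Dict[str, Layout] = {}

    def allocate(self, path: str, num_blocks: int) -> Layout:
        layout: Layout = []
        for index in range(num_blocks):
            # Round-robin placement; replicas land on distinct nodes as
            # long as there are at least REPLICAS data nodes.
            targets = [
                self.data_nodes[(index + r) % len(self.data_nodes)]
                for r in range(REPLICAS)
            ]
            layout.append((f"{path}#{index}", targets))
        self.files[path] = layout
        return layout

    def locate(self, path: str) -> Layout:
        return self.files[path]


def write_file(name_node: NameNode, path: str, data: bytes) -> None:
    """Client side: ask the name node for a layout, then write blocks."""
    num_blocks = (len(data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    layout = name_node.allocate(path, num_blocks)
    for index, (block_id, targets) in enumerate(layout):
        chunk = data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
        for node in targets:
            node.blocks[block_id] = chunk


def read_file(name_node: NameNode, path: str,
              failed: FrozenSet[str] = frozenset()) -> bytes:
    """Client side: read each block from the first replica still alive."""
    out = bytearray()
    for block_id, targets in name_node.locate(path):
        live = [node for node in targets if node.name not in failed]
        if not live:
            raise IOError(f"all replicas of {block_id} are unavailable")
        out += live[0].blocks[block_id]
    return bytes(out)
```

With three data nodes and two replicas per block, a file stays readable when any single data node is listed in `failed`. The single `NameNode` object is also the scalability bottleneck the note points to.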
  • Just as with a DFS, a distributed database is an engine that allows storage and retrieval of records across different machines in a network rather than on just one node. Nowadays the term is also used to describe the NoSQL class of databases, although some NoSQL databases are distributed and others are not. Examples of distributed databases are Hive, HadoopDB, Amazon's Dynamo, Apache Cassandra and Google Bigtable/Megastore.

    These databases mostly address points such as being non-relational, distributed, open-source and horizontally scalable. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), support for huge amounts of data, and more.

    Trends in this space:

    By and large this space is quite hyperactive. Both industry and the open-source community are creating more domain-specific, or in some cases data-access-pattern-specific, storage engines. Amazon's Dynamo was motivated by the fact that about 70% of data across the whole platform was accessed by primary key. The Dynamo paper describes an absolutely fabulous and seminal piece of work. Strongly recommended for distributed systems people.

    While this trend continues, the current NoSQL market satisfies the three characteristics of a monopolistically competitive market: the barriers to entry and exit are low; there are many small suppliers; and these suppliers produce technically heterogeneous, highly differentiated products. So, as you can see, the conditions are not ripe for perfect competition to occur. Hence, in the long run, monopolistically competitive firms will make zero economic profit.

    In the early 1970s, the database world was in a similar sorry state. The landscape changed radically when Ted Codd proposed a new data model based on the mathematical concept of relations and foreign-/primary-key relationships, which in turn gave rise to a structured query language (SQL). Codd's relational model and SQL allowed implementations from different vendors to be (near) perfect substitutes, and hence provided the conditions for perfect competition.

    Today, the relational database market is a classic example of an oligopoly. The market has a few large players (Oracle, IBM, Microsoft, MySQL), the barriers to entry are high, and all existing SQL-based relational database products are largely indistinguishable. Oligopolies can retain high profits in the long run; today the database industry is worth an estimated $32 billion and is still growing in the double digits.

    So the million-dollar question is: can someone come up with a mathematical model for NoSQL databases? There is already work out there: "A co-Relational Model of Data for Large Shared Data Banks" by Erik Meijer and Gavin Bierman, Microsoft.

    We need more such models for the different categories of NoSQL databases.
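The earlier description of a distributed database, records stored and retrieved across several machines and mostly accessed by primary key, can be made concrete with a small sketch. It uses plain modulo hashing over a fixed list of nodes, which is an assumption made for brevity and not the partitioning scheme of Dynamo, Cassandra or any other system named above. `PartitionedStore` and the node names are hypothetical.

```python
import hashlib
from typing import Any, Dict, List, Optional


class PartitionedStore:
    """Toy key-value store that spreads records over named nodes."""

    def __init__(self, node_names: List[str]) -> None:
        self._names = list(node_names)
        # Each "node" is just a dictionary in this single-process sketch.
        self.nodes: Dict[str, Dict[str, Any]] = {n: {} for n in self._names}

    def node_for(self, key: str) -> str:
        # A stable hash of the primary key decides which node owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self._names[int(digest, 16) % len(self._names)]

    def put(self, key: str, record: Any) -> None:
        self.nodes[self.node_for(key)][key] = record

    def get(self, key: str) -> Optional[Any]:
        # A lookup by primary key touches exactly one node.
        return self.nodes[self.node_for(key)].get(key)


store = PartitionedStore(["node-a", "node-b", "node-c"])
for customer_id in ("c1", "c2", "c3", "c4", "c5"):
    store.put(customer_id, {"id": customer_id})
```

Modulo hashing remaps most keys whenever a node is added or removed, so treat this only as an illustration of key-based routing.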
  • This is another hyperactive area and is witnessing unprecedented growth. Some themes include:

    1) Distributed crawlers
    2) Data parallel programming frameworks such as MapReduce, Dryad, Haystack etc.
    3) Large-scale graph processing engines - Pregel from Google and HipG lead the pack
    4) Peer-to-peer architectures - Spotify uses a peer-to-peer architecture to deliver large-scale, low-latency, on-demand music streaming. The service is not web-based, but instead uses a proprietary client and protocol. At the heart of the system is a custom music streaming protocol that is optimized for accessing a large library of tracks.
    5) Multi-tenanted SaaS applications
    6) Content delivery networks
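As an illustration of the data parallel programming style named in theme 2, here is a minimal single-process sketch of the map, shuffle and reduce steps applied to word counting. The function names are hypothetical and this is not the Hadoop or Google MapReduce API; a real framework would run the map and reduce calls on many machines in parallel.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word: str, counts: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents: List[str]) -> Dict[str, int]:
    # Each map call is independent of the others, which is what lets a
    # framework distribute them; here they simply run in a loop.
    emitted = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(w, c) for w, c in shuffle(emitted).items())
```

For example, `word_count(["big data", "Big systems"])` counts "big" twice and "data" and "systems" once each.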
  • Systems for Big Data Processing

    1. Systems for Big Data - Srihari Srinivasan, ThoughtWorks
    2. Web 2.x Platform for software development
    3. It's Cloud Computing
    4. It's Machine Learning; It's Cloud Computing
    5. It's Distributed and Service Oriented; It's Machine Learning; It's Cloud Computing
    6. Coming up
        • Going distributed - When and Why?
        • The landscape of Big Data systems - What are the apps?
    7. When do we go distributed?
        • A truly distributed design is usually a second/third generation solution
        • Amazon started off as a simple web application talking to a database 15+ years ago
        • Twitter started out as a simple Ruby on Rails application talking to MySQL in 2006
    8. When do we go distributed?
        • As the application’s / organization’s complexity grows
        • Data, request volume is too large for a single machine
        • Your software needs to be deployed in multiple data centers
        • Your teams deliver software in the form of services
        Courtesy: Jeff Dean’s LADIS 2009 Keynote
    9. Designing systems for scale
        • Many production grade systems have been built and written about in recent times
        • Need for a taxonomy that describes the big data systems landscape
    10. A taxonomy for distributed systems
        • Distributed Storage Systems
        • Distributed Applications
        • Monitoring & Management
        • Personalization & Recommendation
    11. Distributed Storage
        • Distributed Filesystems
        • Distributed/Parallel Databases
        • Messaging and Notification engines
    12. Distributed Filesystems
        • Allow clients to access files from multiple networked hosts
        • Clients don’t access underlying block storage directly, go through protocols
        • Modern DFSs are good at providing replication & fault tolerance
    13. Distributed Databases
        • A database engine that allows storage and retrieval across different machines in a network, a.k.a. NoSQL databases
        • Apache Hive, Amazon Dynamo, HadoopDB, FB Cassandra, Google Bigtable
        • They tend to be non-relational, distributed, open-source and horizontally scalable
        • Are schema free, easy support for replication, eventually consistent (BASE over ACID)
    14. Distributed Apps
        • Data parallel programming frameworks
        • Graph processing engines
        • P2P content delivery
        • Multi-tenanted SaaS applications
        • Content delivery networks
    15. Monitoring and Management
        • Distributed debuggers, tracers and profiling applications
        • Monitoring systems
    16. Personalization & Recommendation
        • Recommendation engines
        • Sentiment analyzers
        • Personalized news & content discovery systems
    17. </presentation> Follow on Twitter @systems_we_make