Facebook runs a customized LAMP stack surrounded by additional back-end services. PHP and MySQL power the web and data tiers but have limitations at large scale, so dedicated services, built with Thrift for cross-language RPC and Scribe for distributed logging, move code closer to the data and allow compiled, optimized languages where needed. The News Feed and Search architectures distribute work across tiers with Thrift calls and aggregate results in services.
Lessons from Highly Scalable Architectures at Social Networking Sites (Patrick Senti)
What are the techniques and technologies used by popular social networking sites such as Facebook, Twitter, Tumblr, Pinterest, or Instagram? How do they architect their systems to scale to hundreds of millions of visits per day?
4. At a Glance
The Social Graph
120M+ active users
50B+ PVs per month
10B+ Photos
1B+ connections
50K+ Platform Apps
400K+ App Developers
5. General Design Principles
▪ Use open source where possible
▪ Explore making optimizations where needed
▪ Unix Philosophy
▪ Keep individual components simple yet performant
▪ Combine as necessary
▪ Concentrate on clean interface points
▪ Build everything for scale
▪ Try to minimize failure points
▪ Simplicity, Simplicity, Simplicity!
7. PHP
▪ Good web programming language
▪ Extensive library support for web development
▪ Active developer community
▪ Good for rapid iteration
▪ Dynamically typed, interpreted scripting language
8. PHP: What we Learnt
▪ Tough to scale for large code bases
▪ Weak typing
▪ Limited opportunities for static analysis, code optimizations
▪ Not necessarily optimized for large website use case
▪ E.g. No dynamic reloading of files on web server
▪ Linearly increasing cost per included file
▪ Extension framework is difficult to use
10. MySQL
▪ Fast, reliable
▪ Used primarily as <key,value> store
▪ Data randomly distributed amongst large set of logical instances
▪ Most data access based on global id
▪ Large number of logical instances spread out across physical nodes
▪ Load balancing at physical node level
▪ No read replication
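The id-to-shard routing described on this slide can be sketched as follows; the counts, node names, and modulo hashing are illustrative assumptions, not Facebook's actual implementation:

```python
# Illustrative sketch of the partitioning scheme: a global id maps to one of
# many logical databases, and logical databases map onto fewer physical nodes.
# All counts, names, and the hash scheme here are hypothetical.

N_LOGICAL = 4096
PHYSICAL_NODES = ["db01", "db02", "db03", "db04"]

def logical_db(global_id):
    """Spread data across a large, fixed set of logical instances."""
    return global_id % N_LOGICAL

def physical_node(logical):
    """Load balancing happens at the physical-node level: a lookup
    (here a simple modulo) says which node hosts each logical db."""
    return PHYSICAL_NODES[logical % len(PHYSICAL_NODES)]

def locate(global_id):
    return physical_node(logical_db(global_id))
```

Because only the logical-to-physical mapping changes when capacity is added, rebalancing moves whole logical dbs between nodes instead of re-hashing every row.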
11. MySQL: What We Learnt (ing)
▪ Logical migration of data is very difficult
▪ Create a large number of logical dbs, load balance them over a varying number of physical nodes
▪ No joins in production
▪ Logically difficult (because data is distributed randomly)
▪ Easier to scale CPU on web tier
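With joins banned in production, a "join" becomes two round trips assembled in the web tier, where CPU scales more easily. A minimal sketch, with invented helper names and in-memory stand-ins for the shards:

```python
# Sketch of an application-level "join" under the no-joins rule.
# USERS/FRIENDS and the fetch_* helpers are hypothetical stand-ins
# for per-shard queries, not a real API.

USERS = {1: "alice", 2: "bob", 3: "carol"}
FRIENDS = {1: [2, 3]}

def fetch_friend_ids(user_id):
    # first query: read the edge list, keyed by user id
    return FRIENDS.get(user_id, [])

def fetch_users(ids):
    # second query: point lookups by global id, each routed to its own
    # shard; the "join" happens here, in the web tier's CPU
    return {i: USERS[i] for i in ids if i in USERS}

def friend_names(user_id):
    return sorted(fetch_users(fetch_friend_ids(user_id)).values())
```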
12. MySQL: What we Learnt (ing)
▪ Most data access is for recent data
▪ Optimize table layout for recency
▪ Archive older data
▪ Don’t ever store non-static data in a central db
▪ A central db makes it easier to perform certain aggregated queries
▪ Will not scale
▪ Use services or memcache for global queries
▪ E.g.: What are the most popular groups in my network
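The recency optimization above can be sketched as routing each query to a hot table or an archive by age; the table names and the 90-day cutoff are assumptions for illustration:

```python
# Sketch of recency-based archiving: keep the frequently accessed table
# small and cache-friendly, move old rows to an archive table.
# Table names and the cutoff are hypothetical.
import time

ARCHIVE_AFTER_DAYS = 90

def table_for(row_ts, now=None):
    """Route a read/write to the small 'recent' table or the archive."""
    now = time.time() if now is None else now
    age_days = (now - row_ts) / 86400.0
    return "events_recent" if age_days <= ARCHIVE_AFTER_DAYS else "events_archive"
```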
13. MySQL: Customizations
▪ No extensive native MySQL modifications
▪ Custom partitioning scheme
▪ Global id assigned to all data
▪ Custom archiving scheme
▪ Based on frequency and recency of data on a per-user basis
▪ Extended query engine for cross-data center replication, cache consistency
14. MySQL: Customizations
▪ Graph based data-access libraries
▪ Loosely typed objects (nodes) with limited datatypes (int, varchar, text)
▪ Replicated connections (edges)
▪ Analogous to distributed foreign keys
▪ Some data collocated
▪ Example: User profile data and all of user’s connections
▪ Most data distributed randomly
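A toy version of such a graph data-access layer, with invented class names, might look like this; the reverse-edge write mirrors the "replicated connections" bullet:

```python
# Sketch of a graph-style data-access layer: loosely typed nodes holding
# simple field types, plus typed, replicated edges between global ids.
# Class and method names are invented for illustration.

class Node:
    def __init__(self, global_id, **fields):
        self.id = global_id
        self.fields = fields  # limited datatypes only (int/str analogues)

class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = {}  # (src_id, edge_type) -> list of dst ids

    def put_node(self, node):
        self.nodes[node.id] = node

    def add_edge(self, src, edge_type, dst):
        self.edges.setdefault((src, edge_type), []).append(dst)
        # "replicated connections": also store the reverse edge, so either
        # endpoint's shard can answer without a cross-shard query
        self.edges.setdefault((dst, edge_type), []).append(src)

    def neighbors(self, src, edge_type):
        return self.edges.get((src, edge_type), [])
```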
15. Memcache
▪ High-Performance, distributed in-memory hash table
▪ Used to alleviate database load
▪ Primary form of caching
▪ Over 25TB of in-memory cache
▪ Average latency < 200 microseconds
▪ Cache serialized PHP data structures
▪ Lots and lots of multi-gets to retrieve data spanning across graph edges
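The cache-aside multi-get pattern can be sketched as below. A plain dict stands in for memcache, and JSON stands in for the serialized PHP structures the slide mentions:

```python
# Sketch of cache-aside with multi-get: fetch many keys in one round trip,
# fall back to the db per miss, and refill the cache. CACHE/DB are in-memory
# stand-ins for memcache and MySQL; key names are invented.
import json

CACHE = {}
DB = {"user:1": {"name": "alice"}, "user:2": {"name": "bob"}}

def multi_get(keys):
    # cache hits: deserialize the stored structures
    hits = {k: json.loads(CACHE[k]) for k in keys if k in CACHE}
    for k in keys:
        if k in hits or k not in DB:
            continue
        CACHE[k] = json.dumps(DB[k])  # fill the cache on a miss
        hits[k] = DB[k]
    return hits
```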
16. Memcache: Customizations
▪ Memcache over UDP
▪ Reduce memory overhead of thousands of TCP connection buffers
▪ Application-level flow control (optimization for multi-gets)
▪ On demand aggregation of per-thread stats
▪ Reduces global lock contention
▪ Multiple Kernel changes to optimize for Memcache usage
▪ Distributing network interrupt handling over multiple cores
▪ Opportunistic polling of network interface
18. Under the Covers
▪ Get my profile data
▪ Fetch from cache, potentially go to my DB (based on user-id)
▪ Get friend connections
▪ Cache, if not DB (based on user-id)
▪ In parallel, fetch last 10 photo album ids for each of my friends
▪ Multi-get; individual cache misses fetch data from the db (based on photo-album id)
▪ Fetch data for most recent photo albums in parallel
▪ Execute page-specific rendering logic in PHP
▪ Return data, make user happy
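The steps above can be sketched end to end; the data and helper names are invented, and a thread pool stands in for the parallel fetches:

```python
# End-to-end sketch of the page flow: get friends, fetch each friend's
# recent albums in parallel, then render. FRIENDS/ALBUMS are toy stand-ins;
# production code would hit memcache first, then MySQL by id.
from concurrent.futures import ThreadPoolExecutor

FRIENDS = {"me": ["a", "b"]}
ALBUMS = {"a": ["a1", "a2"], "b": ["b1"]}

def get_friends(user):
    return FRIENDS.get(user, [])

def get_recent_albums(friend):
    # last 10 album ids per friend, as on the slide
    return ALBUMS.get(friend, [])[:10]

def render_profile_page(user):
    friends = get_friends(user)                    # cache, else db
    with ThreadPoolExecutor() as pool:             # fetch in parallel
        per_friend = list(pool.map(get_recent_albums, friends))
    albums = [a for lst in per_friend for a in lst]
    return {"user": user, "friends": friends, "albums": albums}
```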
20. LAMP is not Perfect
▪ PHP+MySQL+Memcache works for a large class of problems but not for everything
▪ PHP is stateless
▪ PHP not the fastest executing language
▪ All data is remote
▪ Reasons why services are written
▪ Store code closer to data
▪ Compiled environment is more efficient
▪ Certain functionality only present in other languages
21. Services Philosophy
▪ Create a service iff required
▪ Real overhead for deployment, maintenance, separate code-base
▪ Another failure point
▪ Create a common framework and toolset that will allow for easier creation of services
▪ Thrift
▪ Scribe
▪ ODS, Alerting service, Monitoring service
▪ Use the right language, library and tool for the task
23. Thrift
▪ Lightweight software framework for cross-language development
▪ Provide IDL, statically generate code
▪ Supported bindings: C++, PHP, Python, Java, Ruby, Erlang, Perl, Haskell, etc.
▪ Transports: Simple Interface to I/O
▪ TSocket, TFileTransport, TMemoryBuffer
▪ Protocols: Serialization Format
▪ TBinaryProtocol, TJSONProtocol
▪ Servers
▪ Non-Blocking, Async, Single Threaded, Multi-threaded
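For reference, a Thrift interface is declared once in the IDL and stubs are generated per language. A minimal, hypothetical service definition:

```thrift
// profile.thrift -- hypothetical example, not a Facebook interface
struct Profile {
  1: i64    id,
  2: string name,
}

service ProfileService {
  Profile getProfile(1: i64 userId),
}
```

Running `thrift --gen py profile.thrift` (or `--gen cpp`, `--gen php`, ...) emits the client and server boilerplate for that language.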
24. Hasn’t this been done before? (yes.)
▪ SOAP
▪ XML, XML, and more XML
▪ CORBA
▪ Bloated? Remote bindings?
▪ COM
▪ Face-Win32ClientSoftware.dll-Book
▪ Pillar
▪ Slick! But no versioning/abstraction.
▪ Protocol Buffers
25. Thrift: Why?
• It’s quick. Really quick.
• Less time wasted by individual developers
• No duplicated networking and protocol code
• Less time dealing with boilerplate stuff
• Write your client and server in about 5 minutes
• Division of labor
• Work on high-performance servers separate from applications
• Common toolkit
• Fosters code reuse and shared tools
26. Scribe
▪ Scalable distributed logging framework
▪ Useful for logging a wide array of data
▪ Search Redologs
▪ Powers news feed publishing
▪ A/B testing data
▪ Weak Reliability
▪ More reliable than traditional logging, but not suitable for database transactions
▪ Simple data model
▪ Built on top of Thrift
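Scribe's fire-and-forget model can be sketched as a client that tags each entry with a category and drops entries rather than block when the transport fails; the class below is invented for illustration (the real client is a generated Thrift stub):

```python
# Sketch of Scribe-style "weak reliability" logging: best-effort delivery
# of (category, message) entries. ScribeLikeClient is a hypothetical name.

class ScribeLikeClient:
    def __init__(self, transport):
        self.transport = transport  # callable(category, message); may fail

    def log(self, category, message):
        try:
            self.transport(category, message)
            return True
        except OSError:
            # drop the entry: losing a log line beats blocking the web tier
            return False

received = []
client = ScribeLikeClient(lambda c, m: received.append((c, m)))
client.log("news_feed", "user 1 posted photo 42")
```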
27. Other Tools
▪ SMC (Service Management Console)
▪ Centralized configuration
▪ Used to determine logical service -> physical node mapping
28. Other Tools
▪ ODS
▪ Used to log and view historical trends for any stats published by service
▪ Useful for service monitoring, alerting
29. Open Source
▪ Thrift
▪ http://developers.facebook.com/thrift/
▪ Scribe
▪ http://developers.facebook.com/scribe/
▪ PHPEmbed
▪ http://developers.facebook.com/phpembed/
▪ More good stuff
▪ http://developers.facebook.com/opensource.php
31. NewsFeed – The Work
▪ The user requests home.php; the PHP web tier logs friends’ actions via Scribe to leaf servers
▪ Aggregators query the leaf servers for friends’ actions, then aggregate and rank them
▪ The resulting view state is returned to the web tier, which renders the HTML
▪ Most arrows in the original diagram indicate Thrift calls
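The leaf/aggregator split can be sketched as a scatter-gather over shards of recent actions, followed by ranking; the data and names are invented:

```python
# Sketch of the News Feed leaf/aggregator pattern: each leaf holds a shard
# of recent actions; the aggregator fans out, merges, and ranks. The sample
# data, scores, and helper names are hypothetical.

LEAVES = [
    [("bob", "posted photo", 9.1), ("carol", "commented", 3.2)],
    [("dave", "shared link", 7.5)],
]

def query_leaf(leaf, friends):
    # each leaf returns only actions taken by the viewer's friends
    return [action for action in leaf if action[0] in friends]

def aggregate_feed(friends, top_n=2):
    results = []
    for leaf in LEAVES:                 # real system: parallel Thrift calls
        results.extend(query_leaf(leaf, friends))
    results.sort(key=lambda action: action[2], reverse=True)  # ranking
    return results[:top_n]
```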
33. Search – The Work
▪ The PHP web tier issues Thrift calls into the search tier: a master index plus slave index servers, each serving index files
▪ Updates from the DB tier flow through Scribe change logs to an indexing service, which rebuilds the index files from the live db