A collection of information taken from previous presentations that was used as drill down for supporting discussion of specific topics during the tutorial.
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
Slides from my talk at AWS Re:Invent November 2012. Describes the architecture, how to make highly available application code and data stores, a taxonomy of failure modes, and actual failures and effects. Ends with a summary of @NetflixOSS projects so others can easily leverage this architecture.
Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.
Latest version of the Netflix Cloud Architecture story was given at Gluecon May 23rd 2012. Gluecon rocks, and lots of Van Halen references were added for the occasion. There tradeoff between developer driven high functionality AWS based PaaS, and operations driven low cost portable PaaS is discussed. The three sections cover the developer view, the operator view and the builder view.
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
Architecture overview of Netflix Cloud Architecture with a focus on the Open Source components that Netflix has put and is planning to release on http://netflix.github.com
Netflix Cloud Platform Building BlocksSudhir Tonse
Architectural Building Blocks of the Netflix Cloud Platform and lessons learned while implementing the same.
Commandments of Web Scale Cloud Deployments
AWS Re:Invent - High Availability Architecture at NetflixAdrian Cockcroft
Slides from my talk at AWS Re:Invent November 2012. Describes the architecture, how to make highly available application code and data stores, a taxonomy of failure modes, and actual failures and effects. Ends with a summary of @NetflixOSS projects so others can easily leverage this architecture.
Same basic flow as the keynote, but with a lot more detail, and we had a lot more interactive discussion rather than a presentation format. See part 2 for some more specific detail and links to other presentations.
Latest version of the Netflix Cloud Architecture story was given at Gluecon May 23rd 2012. Gluecon rocks, and lots of Van Halen references were added for the occasion. There tradeoff between developer driven high functionality AWS based PaaS, and operations driven low cost portable PaaS is discussed. The three sections cover the developer view, the operator view and the builder view.
SV Forum Platform Architecture SIG - Netflix Open Source PlatformAdrian Cockcroft
Architecture overview of Netflix Cloud Architecture with a focus on the Open Source components that Netflix has put and is planning to release on http://netflix.github.com
Netflix Cloud Platform Building BlocksSudhir Tonse
Architectural Building Blocks of the Netflix Cloud Platform and lessons learned while implementing the same.
Commandments of Web Scale Cloud Deployments
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...Amazon Web Services
Netflix's migration to the cloud as our primary streaming control plane was paralleled by our move from traditional IT and centralized operations to a more decentralized DevOps organizational model. In this session, we explore the relationship between technical infrastructure and organization and how to find the right balance of centralized and decentralized operations. We also cover the rationale, goals, strategies, and technologies applied to accomplish this daunting task. We reflect on where we stand today and how we've realized many of our goals.
[Full slides now also available at http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops]
Short summary of why Netflix is running on the Amazon cloud, what is running there, what we have learned and where this is taking us.
This is the introduction section to a series of public presentations that will go into much more detail. The Silicon Valley Cloud Computing Meetup was on Oct 14th, QCon San Francisco November 3rd.
For the Computer Measurement Group workshop in San Diego November 2013. Also presented to a student class at UC Santa Barbara. What is Cloud Native. Capacity and Performance benchmarks. Cost Optimization Techniques - content co-developed with Jinesh Varia of AWS.
Introduction to the Netflix Open Source Software project, explains why Netflix is doing this, how all the parts fit together and what is planned to come next. Presented at the inaugural NetflixOSS Meetup February 6th 2013 at Netflix headquarters in Los Gatos.
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.
Architecture talk aimed at a well informed developer audience (i.e. QConSF Real Use Cases for NoSQL track), focused mainly on availability. Skips the Netflix cloud migration stuff that is in other talks.
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Amazon Web Services
This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012Amazon Web Services
Join the product and cloud computing leaders of Netflix to discuss why and how the company moved to Amazon Web Services. From early experiments for media transcoding, to building the operational skills to optimize costs and the creation of the Simian Army, this session guides business leaders through real world examples of evaluating and adopting cloud computing.
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013CA API Management
Generic Hypermedia and Domain-Specific APIs: RESTing in the ALPS
Track: Building Web APIs: Opening & Linking your Data / Time: Thursday 15:40 - 16:30 / Location: Fleming Room
Hypermedia API is the new catch-phrase, but what is a Hypermedia API? Does this trend lead us toward a debilitating explosion of media types? Can we really create successful hypermedia APIs or is this just the latest hype?
Recently a number of new media types that offer hypermedia support have come into use on the Web including HAL, Collection+JSON, Siren, and more. However, these new designs are not designed to communicate application-specific information (e.g. accounting, microblogging, etc.) in a standard way. Is there a way to resolve this problem?
Drawing on the experience of Dublin Core, Microformats, Activity Streams, and other similar approaches, this talk describes the ALPS (Application-Level Profile Semantics) standard; a way to define the data and workflow details for a Web application and apply these details consistently regardless of the media type in use. Working examples in the talk also show how this standardized definition can make designing, implementing, documenting, and maintaining Web APIs easier and more consistent across multiple media types.
Netflix designed a massive scale cloud based media transcoding system from scratch for processing professionally produced studio content(to meet the unique scale and time constraints of our business). We bucked the common industry trend of vertical scaling and, instead, designed a horizontally scaled elastic system using AWS to meet the unique scale and time constraints of our business. Come hear how we designed this system, how it continues to get less expensive for Netflix, and how AWS represents a transformative opportunity in the wider media owning industry.
AWS Re:Invent - Optimizing Costs with AWSCoburn Watson
AWS Re:Invent 2012 presentation from Netflix which covers how to optimize cost and usage of your AWS resources. Areas of focus are Autoscaling EC2 instances, batch access of SQS, and improved S3 usage.
Asgard, the Grails App that Deploys Netflix to the CloudJoe Sondow
Overview and technical exploration of Asgard, a graphical web console created by Netflix for cloud deployments and operations. Presented at the GR8 Conference in Copenhagen, Denmark, June 8, 2012.
Apple Keynote version with animations available at http://bit.ly/asgardgr8denmark
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...Amazon Web Services
AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.
In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives; DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS.
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Adrian Cockcroft
Flowcon keynote was a few days before CMG, a few tweaks and some extra content added at the start and end. Opening Keynote talk for both conferences on how Speed Wins and how Netflix is doing Continuous Delivery
(ENT209) Netflix Cloud Migration, DevOps and Distributed Systems | AWS re:Inv...Amazon Web Services
Netflix's migration to the cloud as our primary streaming control plane was paralleled by our move from traditional IT and centralized operations to a more decentralized DevOps organizational model. In this session, we explore the relationship between technical infrastructure and organization and how to find the right balance of centralized and decentralized operations. We also cover the rationale, goals, strategies, and technologies applied to accomplish this daunting task. We reflect on where we stand today and how we've realized many of our goals.
[Full slides now also available at http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops]
Short summary of why Netflix is running on the Amazon cloud, what is running there, what we have learned and where this is taking us.
This is the introduction section to a series of public presentations that will go into much more detail. The Silicon Valley Cloud Computing Meetup was on Oct 14th, QCon San Francisco November 3rd.
For the Computer Measurement Group workshop in San Diego November 2013. Also presented to a student class at UC Santa Barbara. What is Cloud Native. Capacity and Performance benchmarks. Cost Optimization Techniques - content co-developed with Jinesh Varia of AWS.
Introduction to the Netflix Open Source Software project, explains why Netflix is doing this, how all the parts fit together and what is planned to come next. Presented at the inaugural NetflixOSS Meetup February 6th 2013 at Netflix headquarters in Los Gatos.
(ISM301) Engineering Netflix Global Operations In The CloudAmazon Web Services
Operating a massively scalable, constantly changing, distributed global service is a daunting task. We innovate at breakneck speed to attract new customers and stay ahead of the competition. This means more features, more experiments, more deployments, more engineers making changes in production environments, and ever-increasing complexity. Simultaneously improving service availability and accelerating rate of change seems impossible on the surface. At Netflix, operations engineering is both a technical and organizational construct designed to accomplish just that by integrating disciplines like continuous delivery, fault injection, regional traffic management, crisis response, best practice automation, and real-time analytics. In this talk, designed for technical leaders seeking a path to operational excellence, we'll explore these disciplines in depth and how they integrate and create competitive advantages.
Architecture talk aimed at a well informed developer audience (i.e. QConSF Real Use Cases for NoSQL track), focused mainly on availability. Skips the Netflix cloud migration stuff that is in other talks.
Netflix Development Patterns for Scale, Performance & Availability (DMG206) |...Amazon Web Services
This session explains how Netflix is using the capabilities of AWS to balance the rate of change against the risk of introducing a fault. Netflix uses a modular architecture with fault isolation and fallback logic for dependencies to maximize availability. This approach allows for rapid independent evolution of individual components to maximize the pace of innovation and A/B testing, and offers nearly unlimited scalability as the business grows. Learn how we balance managing change to (or subtraction from) the customer experience, while aggressively scraping barnacle features that add complexity for little value.
ARC301 Intro to Chaos Monkey & the Simian Army - AWS re: Invent 2012Amazon Web Services
Join the product and cloud computing leaders of Netflix to discuss why and how the company moved to Amazon Web Services. From early experiments for media transcoding, to building the operational skills to optimize costs and the creation of the Simian Army, this session guides business leaders through real world examples of evaluating and adopting cloud computing.
RESTing in the ALPS Mike Amundsen's Presentation from QCon London 2013CA API Management
Generic Hypermedia and Domain-Specific APIs: RESTing in the ALPS
Track: Building Web APIs: Opening & Linking your Data / Time: Thursday 15:40 - 16:30 / Location: Fleming Room
Hypermedia API is the new catch-phrase, but what is a Hypermedia API? Does this trend lead us toward a debilitating explosion of media types? Can we really create successful hypermedia APIs or is this just the latest hype?
Recently a number of new media types that offer hypermedia support have come into use on the Web including HAL, Collection+JSON, Siren, and more. However, these new designs are not designed to communicate application-specific information (e.g. accounting, microblogging, etc.) in a standard way. Is there a way to resolve this problem?
Drawing on the experience of Dublin Core, Microformats, Activity Streams, and other similar approaches, this talk describes the ALPS (Application-Level Profile Semantics) standard; a way to define the data and workflow details for a Web application and apply these details consistently regardless of the media type in use. Working examples in the talk also show how this standardized definition can make designing, implementing, documenting, and maintaining Web APIs easier and more consistent across multiple media types.
Netflix designed a massive scale cloud based media transcoding system from scratch for processing professionally produced studio content(to meet the unique scale and time constraints of our business). We bucked the common industry trend of vertical scaling and, instead, designed a horizontally scaled elastic system using AWS to meet the unique scale and time constraints of our business. Come hear how we designed this system, how it continues to get less expensive for Netflix, and how AWS represents a transformative opportunity in the wider media owning industry.
AWS Re:Invent - Optimizing Costs with AWSCoburn Watson
AWS Re:Invent 2012 presentation from Netflix which covers how to optimize cost and usage of your AWS resources. Areas of focus are Autoscaling EC2 instances, batch access of SQS, and improved S3 usage.
Asgard, the Grails App that Deploys Netflix to the CloudJoe Sondow
Overview and technical exploration of Asgard, a graphical web console created by Netflix for cloud deployments and operations. Presented at the GR8 Conference in Copenhagen, Denmark, June 8, 2012.
Apple Keynote version with animations available at http://bit.ly/asgardgr8denmark
(SPOT302) Under the Covers of AWS: Core Distributed Systems Primitives That P...Amazon Web Services
AWS and Amazon.com operate some of the world's largest distributed systems infrastructure and applications. In our past 18 years of operating this infrastructure, we have come to realize that building such large distributed systems to meet the durability, reliability, scalability, and performance needs of AWS requires us to build our services using a few common distributed systems primitives. Examples of these primitives include a reliable method to build consensus in a distributed system, reliable and scalable key-value store, infrastructure for a transactional logging system, scalable database query layers using both NoSQL and SQL APIs, and a system for scalable and elastic compute infrastructure.
In this session, we discuss some of the solutions that we employ in building these primitives and our lessons in operating these systems. We also cover the history of some of these primitives; DHTs, transactional logging, materialized views and various other deep distributed systems concepts; how their design evolved over time; and how we continue to scale them to AWS.
Flowcon (added to for CMG) Keynote talk on how Speed Wins and how Netflix is ...Adrian Cockcroft
Flowcon keynote was a few days before CMG, a few tweaks and some extra content added at the start and end. Opening Keynote talk for both conferences on how Speed Wins and how Netflix is doing Continuous Delivery
Summary of past Cassandra benchmarks performed by Netflix and description of how Netflix uses Cassandra interspersed with a live demo automated using Jenkins and Jmeter that created two 12 node Cassandra clusters from scratch on AWS, one with regular disks and one with SSDs. Both clusters were scaled up to 24 nodes each during the demo.
Harnessing the Power of Library Loan Stars - Jennifer Hubbs - Tech Forum 2017BookNet Canada
Librarians all over Canada are ready to read, nominate, and champion your titles. Are you ready to help them? With almost a full year’s worth of top 10 lists under its belt, the Loan Stars program is geared up to continue shining a spotlight on exciting new titles for library staff and their patrons. Find out how the program works, how it can help you reach readers, and the most useful ways to get involved.
March 24, 2017
Analítica Web para Un Concesionario de automóviles - eshow Barcelona 2017Eduardo Sánchez González
Cuando la conversión no acaba en la web. Analítica accionable para webs de leads.
Ponencia del eshow de Barcelona 2017 donde vemos a través de un ejemplo concreto en un concesionario de automóviles como usar la analítica digital para sacar información de valor sobre la efectividad de los equipos de ventas, rentabilidad de las campañas publicitarias, efectividad del call center y a conocer el interés de los usuarios en los distintos modelos, acabados de vehículo y más.
Careggi Smart Hospital - candidatura premio forum pa 2017CareggiMobile
Careggi Smart Hospital, la sanità del futuro.
Pagare il ticket sanitario con lo smartphone in totale mobilità, prenotare un prelievo del sangue scegliendo data ed ora a piacimento e senza attesa, ricevere il proprio piano terapeutico e tutti i referti comodamente a casa sul tablet, orientarsi all’interno del campus cercando dipartimenti, unità operative o personale
sanitario, non è mai stato così facile !
C* Summit 2013: Netflix Open Source Tools and Benchmarks for Cassandra by Adr...DataStax Academy
Netflix has updated and added new tools and benchmarks for Cassandra in the last year. In this talk we will cover the latest additions and recipes for the Astyanax Java client, updates to Priam to support Cassandra 1.2 Vnodes, plus newly released and upcoming tools that are all part of the NetflixOSS platform. Following on from the Cassandra on SSD on AWS benchmark that was run live during the 2012 Summit, we've been benchmarking a large write intensive multi-region cluster to see how far we can push it. Cassandra is the data storage and global replication foundation for the Cloud Native architecture that runs Netflix streaming for 36 Million users. Netflix is also offering a Cloud Prize for open source contributions to NetflixOSS, and there are ten categories including Best Datastore Integration and Best Contribution to Performance Improvements, with $10K cash and $5K of AWS credits for each winner. We'd like to pay you to use our free software!
Arc305 how netflix leverages multiple regions to increase availability an i...Ruslan Meshenberg
Learn how to make your services more resilient and available by embracing principles of isolation and redundancy. See details of 2 projects - Isthmus and Active/Active to learn how Netflix architects for availability in multi-regional environment.
ARC203 Highly Available Architecture at Netflix - AWS re: Invent 2012Amazon Web Services
This talk describes a set of architectural patterns that support highly available services that are also scalable, low cost, low latency and allow agile continuous deployment development practices. The building blocks for these patterns have been released at netflix.github.com as open source projects for others to use.
Cloud Architecture Tutorial - Running in the Cloud (3of3)Adrian Cockcroft
Part 3 of the talk covers how to transition to cloud, how to bootstrap developers, how to run cloud services including Cassandra, capacity planning and workload analysis, and organizational structure
A recap of some of the most interesting things learned from the AWS re:Invent 2013 Conference. Easily the most intense and educational conference I've ever attended.
How Netflix’s Tools Can Help Accelerate Your Start-up (SVC202) | AWS re:Inven...Amazon Web Services
You're on the verge of a new startup and you need to build a world-class, high-scale web application on AWS so it can handle millions of users. How do you build it quickly without having to reinvent and re-implement the best-practices of large successful Internet companies? NetflixOSS is your answer. In this session, we’ll cover how an emerging startup can leverage the different open source tools that Netflix has developed and uses every day in production, ranging from baking and deploying applications (Asgard, Aminator), to hardening resiliency to failures (Hystrix, Simian Army, Zuul), making them highly distributed and load balanced (Eureka, Ribbon, Archaius) and managing your AWS resources efficiently and effectively (Edda, Ice). You’ll learn how to get started using these tools, learn best practices from engineers who actually created them, so, like Netflix, you can too unleash the power of AWS and scale your application processes as you grow.
Neutron Done the SDN Way
Dragonflow is an open source distributed control plane implementation of Neutron which is an integral part of OpenStack. Dragonflow introduces innovative solutions and features to implement networking and distributed network services in a manner that is both lightweight and simple to extend, yet targeted towards performance-intensive and latency-sensitive applications. Dragonflow aims at solving the performance
Disaster Recovery and Business Continuity - Toronto FSI Symposium - October 2016Amazon Web Services
Felix Candelario
Global Financial Services Solutions Architect explains the high level AWS Cloud Architecture, the concepts behind Availability Zones, Regions and how they relate with the traditional concept of Data Centers, Pods. He concludes with a presentation on how applications can be architected for the AWS Cloud, and how Mission Critical, Disaster Recovery and Business Continuity architectures are in use by Financial Services customer today.
Docker is a key player in the microservices movement and is arguably the leader in containerization technology.
That said, there are many ways to “do Docker”.
Between the leading cloud providers AWS, Azure, and Google; plus other platform stacks like Docker/Swarm, Apache Mesos – DC/OS, and Kubernetes; it can get confusing.In this session, Michele will bring her customer experiences building solutions across most of these platforms – to provide you with the highlights, the architecture topologies, and some perspective on the way she helps her customers choose the right platform for their cloud, on premise or hybrid solutions.
Cassandra Summit 2014: Cassandra Compute Cloud: An elastic Cassandra Infrastr...DataStax Academy
Presenter: Gurashish Brar, Member of Technical Staff at Bloomreach
Dynamically scaling Cassandra to serve hundreds of map-reduce jobs that come at an unpredictable rate and at the same time providing access to the data in real time to front-end application with strict TP95 latency guarantees is a hard problem. We present a system for managing Cassandra clusters which provide following functionality: 1) Dynamic scaling of capacity to serve high throughput map-reduce jobs 2) Provide access to data generated by map-reduce jobs in realtime to front-end applications while providing latency SLAs for TP95 3) Maintain a low cost by leveraging Amazon Spot Instances and through demand based scaling. At the heart of this infrastructure lies a custom data replication service that makes it possible to stream data to new nodes as needed.
Laine Campbell, CEO of Blackbird, will explain the options for running MySQL at high volumes at Amazon Web Services, exploring options around database as a service, hosted instances/storages and all appropriate availability, performance and provisioning considerations using real-world examples from Call of Duty, Obama for America and many more. Laine will show how to build highly available, manageable and performant MySQL environments that scale in AWS—how to maintain then, grow them and deal with failure. Some of the specific topics covered are:
* Overview of RDS and EC2 – pros, cons and usage patterns/antipatterns.
* Implementation choices in both offerings: instance sizing, ephemeral SSDs, EBS, provisioned IOPS and advanced techniques (RAID, mixed storage environments, etc…)
* Leveraging regions and availability zones for availability, business continuity and disaster recovery.
* Scaling patterns including read/write splitting, read distribution, functional dataset partitioning and horizontal dataset partitioning (aka sharding)
* Common failure modes – AZ and Region failures, EBS corruption, EBS performance inconsistencies and more.
* Managing and mitigating cost with various instance and storage options
Improving Availability & Lowering Costs with Auto Scaling & Amazon EC2 (CPN20...Amazon Web Services
Running your Amazon EC2 instances in Auto Scaling groups allows you to improve your application's availability right out of the box. Auto Scaling replaces impaired or unhealthy instances automatically to maintain your desired number of instances (even if that number is one). You can also use Auto Scaling to automate the provisioning of new instances and software configurations as well as to track of usage and costs by app, project, or cost center. Of course, you can also use Auto Scaling to adjust capacity as needed - on demand, on a schedule, or dynamically based on demand. In this session, we show you a few of the tools you can use to enable Auto Scaling for the applications you run on Amazon EC2. We also share tips and tricks we've picked up from customers such as Netflix, Adobe, Nokia, and Amazon.com about managing capacity, balancing performance against cost, and optimizing availability.
Building cross-region and cross could high availability into your app, a real life use case by Gigaspaces, Nati Shalom, Funder & CTO, Gigaspaces
Achieving high levels of availability and disaster recovery in a cloud environment requires the implementation of patterns and practices that introduce redundancy through multi-zone, multi-region, and multi-cloud deployments. As we move towards implementing higher availability, we cannot escape the direct increase in the accidental complexity of the deployment architecture resulting from lack of cloud portability and deployment lifecycle automation. We present how high availability and disaster recovery were achieved in reality by using the Cloudify open source framework on top of AWS. This approach applies to not just AWS but also other public clouds and private cloud environments such as Eucalyptus. The resulting reference architecture provides portable PostgreSQL replication and disaster recovery as well as application tier scalability across zones, regions, and public/private clouds through a unified deployment workflow.
These days, EVERY workload is considered critical by someone in the organization. As a result, SLAs are shrinking. IT is challenged to meet these SLAs, but there isn’t enough budget to provide services like disaster recovery (DR) using traditional methods and infrastructure. The good news is that public cloud platforms, like AWS, are becoming the de facto infrastructure choice for DR. However, workload portability solutions that simplify cross-platform or cloud recovery are required to meet most RTO & RPO SLAs in the cloud. AWS provides the infrastructure we need to bring DR to tier 2 and tier 3 workloads that have never been able to afford it before. Now, we need orchestration and automation to make it scalable and reliable.
In this session you will learn key considerations and practical steps for getting to the AWS cloud and how you can leverage Amazon S3 storage for cost-effective disaster recovery. Dow Jones will also share details on their migration to AWS Cloud, the benefits realized there, and what the future looks like. Session sponsored by Commvault.
Cloud Architecture Tutorial - Why and What (1of 3) Adrian Cockcroft
Introduction to the Netflix Cloud Architecture Tutorial - discusses the why and what of cloud including the thinking behind Netflix choice of AWS, and the product features that Netflix runs in the cloud.
This is the meat of the presentation, it describes in detail how do use anti-architecture to define what gets done, then discusses patterns, type systems, PaaS frameworks, services and components. There is a detailed explanation of Cassandra as a data store and open source components.
Slides from QConSF Nov 19th, 2011 focusing this time on describing the globally distributed and scaled industrial strength Java Platform as a Service that Netflix has built and run on top of AWS and Cassandra. Parts of that platform are being released as open source - Curator, Priam and Astyanax.
Global Netflix - HPTS Workshop - Scaling Cassandra benchmark to over 1M write...Adrian Cockcroft
Presentation given in October 2011 at the High Performance Transaction Systems Workshop http://hpts.ws - describes how Netflix used AWS to run a set of highly scalable Cassandra benchmarks on hundreds of instances in only a few hours.
The Netflix recipe for migrating your organization from building a datacenter based product to a cloud based product. First presented at the Silicon Valley Cloud Computing Meetup "Speak Cloudy to Me" on Saturday April 30th, 2011
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
1. Details:
Building Using The NetflixOSS
Architecture
May 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
2. Architectures for High Availability
Cassandra Storage and Replication
NetflixOSS Components
4. Three Balanced Availability Zones
Test with Chaos Gorilla
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
5. Triple Replicated Persistence
Cassandra maintenance affects individual replicas
Cassandra and Evcache
Replicas
Zone A
Cassandra and Evcache
Replicas
Zone B
Cassandra and Evcache
Replicas
Zone C
Load Balancers
6. Isolated Regions
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
7. Failure Modes and Effects
Failure Mode Probability Current Mitigation Plan
Application Failure High Automatic degraded response
AWS Region Failure Low Active-Active Using Denominator
AWS Zone Failure Medium Continue to run on 2 out of 3 zones
Datacenter Failure Medium Migrate more functions to cloud
Data store failure Low Restore from S3 backups
S3 failure Low Restore from remote archive
Until we got really good at mitigating high and medium
probability failures, the ROI for mitigating regional failures
didn’t make sense. Working on Active-Active in 2013.
9. Run What You Wrote
• Make developers responsible for failures
– Then they learn and write code that doesn’t fail
• Use Incident Reviews to find gaps to fix
– Make sure its not about finding “who to blame”
• Keep timeouts short, fail fast
– Don’t let cascading timeouts stack up
• Dynamic configuration options - Archaius
– http://techblog.netflix.com/2012/06/annoucing-archaius-dynamic-properties.html
12. Monkeys
Edda – Configuration History
http://techblog.netflix.com/2012/11/edda-learn-stories-of-your-cloud.html
Edda
AWS
Instances,
ASGs, etc.
Eureka
Services
metadata
AppDynamics
Request flow
13. Edda Query Examples
Find any instances that have ever had a specific public IP address
$ curl "http://edda/api/v2/view/instances;publicIpAddress=1.2.3.4;_since=0"
["i-0123456789","i-012345678a","i-012345678b”]
Show the most recent change to a security group
$ curl "http://edda/api/v2/aws/securityGroups/sg-0123456789;_diff;_all;_limit=2"
--- /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351040779810
+++ /api/v2/aws.securityGroups/sg-0123456789;_pp;_at=1351044093504
@@ -1,33 +1,33 @@
{
…
"ipRanges" : [
"10.10.1.1/32",
"10.10.1.2/32",
+ "10.10.1.3/32",
- "10.10.1.4/32"
…
}
16. Zone Failure Modes
• Power Outage
– Instances lost, ephemeral state lost
– Clean break and recovery, fail fast, “no route to host”
• Network Outage
– Instances isolated, state inconsistent
– More complex symptoms, recovery issues, transients
• Dependent Service Outage
– Cascading failures, misbehaving instances, human errors
– Confusing symptoms, recovery issues, byzantine effects
17. Zone Power Failure
• June 29, 2012 AWS US-East - The Big Storm
– http://aws.amazon.com/message/67457/
– http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
• Highlights
– One of 10+ US-East datacenters failed generator startup
– UPS depleted -> 10min power outage for 7% of instances
• Result
– Netflix lost power to most of a zone, evacuated the zone
– Small/brief user impact due to errors and retries
18. Zone Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Zone Power Outage
Zone Network Outage
Zone Dependent
Service Outage
19. Regional Failure Modes
• Network Failure Takes Region Offline
– DNS configuration errors
– Bugs and configuration errors in routers
– Network capacity overload
• Control Plane Overload Affecting Entire Region
– Consequence of other outages
– Lose control of remaining zones infrastructure
– Cascading service failure, hard to diagnose
20. Regional Control Plane Overload
• April 2011 – “The big EBS Outage”
– http://techblog.netflix.com/2011/04/lessons-netflix-learned-from-aws-outage.html
– Human error during network upgrade triggered cascading failure
– Zone level failure, with brief regional control plane overload
• Netflix Infrastructure Impact
– Instances in one zone hung and could not launch replacements
– Overload prevented other zones from launching instances
– Some MySQL slaves offline for a few days
• Netflix Customer Visible Impact
– Higher latencies for a short time
– Higher error rates for a short time
– Outage was at a low traffic level time, so no capacity issues
21. Regional Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Regional Network Outage
Control Plane Overload
22. Dependent Services Failure
• June 29, 2012 AWS US-East - The Big Storm
– Power failure recovery overloaded EBS storage service
– Backlog of instance startups using EBS root volumes
• ELB (Load Balancer) Impacted
– ELB instances couldn’t scale because EBS was backlogged
– ELB control plane also became backlogged
• Mitigation Plans Mentioned
– Multiple control plane request queues to isolate backlog
– Rapid DNS based traffic shifting between zones
23. Application Routing Failure
June 29, 2012 AWS US-East - The Big Storm
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Zone Power Outage
Applications not using
Zone-aware routing kept
trying to talk to dead
instances and timing out
Eureka service directory failed to mark down
dead instances due to a configuration error
Effect: higher latency and errors
Mitigation: Fixed config, and made
zone aware routing the default
24. Dec 24th 2012
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Partial Regional ELB Outage
• ELB (Load Balancer) Impacted
– ELB control plane database state accidentally corrupted
– Hours to detect, hours to restore from backups
• Mitigation Plans Mentioned
– Tighter process for access to control plane
– Better zone isolation
25. Global Failure Modes
• Software Bugs
– Externally triggered (e.g. leap year/leap second)
– Memory leaks and other delayed action failures
• Global configuration errors
– Usually human error
– Both infrastructure and application level
• Cascading capacity overload
– Customers migrating away from a failure
– Lack of cross region service isolation
26. Global Software Bug Outages
• AWS S3 Global Outage in 2008
– Gossip protocol propagated errors worldwide
– No data loss, but service offline for up to 9hrs
– Extra error detection fixes, no big issues since
• Microsoft Azure Leap Day Outage in 2012
– Bug failed to generate certificates ending 2/29/13
– Failure to launch new instances for up to 13hrs
– One line code fix.
• Netflix Configuration Error in 2012
– Global property updated to broken value
– Streaming stopped worldwide for ~1hr until we changed back
– Fix planned to keep history of properties for quick rollback
27. Global Failure Modes
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
US-East Load Balancers
Cassandra Replicas
Zone A
Cassandra Replicas
Zone B
Cassandra Replicas
Zone C
EU-West Load Balancers
Cascading Capacity Overload
Software Bugs and Global
Configuration Errors
Capacity Demand Migrates
“Oops…”
28. Denominator
Portable DNS management
API Models (varied
and mostly broken)
DNS Vendor Plug-in
Common Model
Use Cases
Edda, Multi-
Region
Failover
Denominator
AWS Route53
IAM Key Auth
REST
DynECT
User/pwd
REST
UltraDNS
User/pwd
SOAP
Etc…
Currently being built by Adrian Cole (the jClouds guy, he works for Netflix now…)
30. Micro-Service Pattern
One keyspace, replaces a single table or materialized view
Single function Cassandra
Cluster Managed by Priam
Between 6 and 72 nodes
Stateless Data Access REST Service
Astyanax Cassandra Client
Optional
Datacenter
Update Flow
Many Different Single-Function REST Clients
Appdynamics Service Flow Visualization
Each icon represents a horizontally scaled service of three to
hundreds of instances deployed over three availability zones
31. Stateless Micro-Service Architecture
Linux Base AMI (CentOS or Ubuntu)
Optional
Apache
frontend,
memcached,
non-java apps
Monitoring
Log rotation
to S3
AppDynamics
machineagent
Epic/Atlas
Java (JDK 6 or 7)
AppDynamics
appagent
monitoring
GC and thread
dump logging
Tomcat
Application war file, base
servlet, platform, client
interface jars, Astyanax
Healthcheck, status
servlets, JMX interface,
Servo autoscale
32. Astyanax
Available at http://github.com/netflix
• Features
– Complete abstraction of connection pool from RPC protocol
– Fluent Style API
– Operation retry with backoff
– Token aware
• Recipes
– Distributed row lock (without zookeeper)
– Multi-DC row lock
– Uniqueness constraint
– Multi-row uniqueness constraint
– Chunked and multi-threaded large file storage
33. Initializing Astyanax
// Configuration either set in code or nfastyanax.properties
platform.ListOfComponentsToInit=LOGGING,APPINFO,DISCOVERY
netflix.environment=test
default.astyanax.readConsistency=CL_QUORUM
default.astyanax.writeConsistency=CL_QUORUM
MyCluster.MyKeyspace.astyanax.servers=127.0.0.1
// Must initialize platform for discovery to work
NFLibraryManager.initLibrary(PlatformManager.class, props, false, true);
NFLibraryManager.initLibrary(NFAstyanaxManager.class, props, true, false);
// Open a keyspace instance
Keyspace keyspace = KeyspaceFactory.openKeyspace(”MyCluster”,”MyKeyspace");
34. Astyanax Query Example
Paginate through all columns in a row
ColumnList<String> columns;
int pageize = 10;
try {
RowQuery<String, String> query = keyspace
.prepareQuery(CF_STANDARD1)
.getKey("A")
.setIsPaginating()
.withColumnRange(new RangeBuilder().setMaxSize(pageize).build());
while (!(columns = query.execute().getResult()).isEmpty()) {
for (Column<String> c : columns) {
}
}
} catch (ConnectionException e) {
}
35. Astyanax - Cassandra Write Data Flows
Single Region, Multiple Availability Zone, Token Aware
Token
Aware
Clients
Cassandra
•Disks
•Zone A
Cassandra
•Disks
•Zone B
Cassandra
•Disks
•Zone C
Cassandra
•Disks
•Zone A
Cassandra
•Disks
•Zone B
Cassandra
•Disks
•Zone C
1. Client Writes to local
coordinator
2. Coodinator writes to
other zones
3. Nodes return ack
4. Data written to
internal commit log
disks (no more than
10 seconds later)
If a node goes offline,
hinted handoff
completes the write
when the node comes
back up.
Requests can choose to
wait for one node, a
quorum, or all nodes to
ack the write
SSTable disk writes and
compactions occur
asynchronously
1
4
4
4
2
3
3
3
2
36. Data Flows for Multi-Region Writes
Token Aware, Consistency Level = Local Quorum
US
Clients
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
1. Client writes to local replicas
2. Local write acks returned to
Client which continues when
2 of 3 local nodes are
committed
3. Local coordinator writes to
remote coordinator.
4. When data arrives, remote
coordinator node acks and
copies to other remote zones
5. Remote nodes ack to local
coordinator
6. Data flushed to internal
commit log disks (no more
than 10 seconds later)
If a node or region goes offline, hinted handoff
completes the write when the node comes back up.
Nightly global compare and repair jobs ensure
everything stays consistent.
EU
Clients
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
Cassandra
• Disks
• Zone A
Cassandra
• Disks
• Zone B
Cassandra
• Disks
• Zone C
6
5
5
6 6
4
4
4
1
6
6
6
2
2
2
3
100+ms latency
37. Cassandra Instance Architecture
Linux Base AMI (CentOS or Ubuntu)
Tomcat and
Priam on JDK
Healthcheck,
Status
Monitoring
AppDynamics
machineagent
Epic/Atlas
Java (JDK 7)
AppDynamics
appagent
monitoring
GC and thread
dump logging
Cassandra Server
Local Ephemeral Disk Space – 2TB of SSD or 1.6TB disk
holding Commit log and SSTables
38. Priam – Cassandra Automation
Available at http://github.com/netflix
• Netflix Platform Tomcat Code
• Zero touch auto-configuration
• State management for Cassandra JVM
• Token allocation and assignment
• Broken node auto-replacement
• Full and incremental backup to S3
• Restore sequencing from S3
• Grow/Shrink Cassandra “ring”
39. ETL for Cassandra
• Data is de-normalized over many clusters!
• Too many to restore from backups for ETL
• Solution – read backup files using Hadoop
• Aegisthus
– http://techblog.netflix.com/2012/02/aegisthus-bulk-data-pipeline-out-of.html
– High throughput raw SSTable processing
– Re-normalizes many clusters to a consistent view
– Extract, Transform, then Load into Teradata
41. Datacenter to Cloud Transition Goals
• Faster
– Lower latency than the equivalent datacenter web pages and API calls
– Measured as mean and 99th percentile
– For both first hit (e.g. home page) and in-session hits for the same user
• Scalable
– Avoid needing any more datacenter capacity as subscriber count increases
– No central vertically scaled databases
– Leverage AWS elastic capacity effectively
• Available
– Substantially higher robustness and availability than datacenter services
– Leverage multiple AWS availability zones
– No scheduled down time, no central database schema to change
• Productive
– Optimize agility of a large development team with automation and tools
– Leave behind complex tangled datacenter code base (~8 year old architecture)
– Enforce clean layered interfaces and re-usable components
44. Netflix Datacenter vs. Cloud Arch
Central SQL Database Distributed Key/Value NoSQL
Sticky In-Memory Session Shared Memcached Session
Chatty Protocols Latency Tolerant Protocols
Tangled Service Interfaces Layered Service Interfaces
Instrumented Code Instrumented Service Patterns
Fat Complex Objects Lightweight Serializable Objects
Components as Jar Files Components as Services
45. Tangled Service Interfaces
• Datacenter implementation is exposed
– Oracle SQL queries mixed into business logic
• Tangled code
– Deep dependencies, false sharing
• Data providers with sideways dependencies
– Everything depends on everything else
Anti-pattern affects productivity, availability
46. Untangled Service Interfaces
Two layers:
• SAL - Service Access Library
– Basic serialization and error handling
– REST or POJO’s defined by data provider
• ESL - Extended Service Library
– Caching, conveniences, can combine several SALs
– Exposes faceted type system (described later)
– Interface defined by data consumer in many cases
48. NetflixOSS Details
• Platform entities and services
• AWS Accounts and access management
• Upcoming and recent NetflixOSS components
• In-depth on NetflixOSS components
49. Basic Platform Entities
• AWS Based Entities
– Instances and Machine Images, Elastic IP Addresses
– Security Groups, Load Balancers, Autoscale Groups
– Availability Zones and Geographic Regions
• NetflixOS Specific Entities
– Applications (registered services)
– Clusters (versioned Autoscale Groups for an App)
– Properties (dynamic hierarchical configuration)
50. Core Platform Services
• AWS Based Services
– S3 storage, to 5TB files, parallel multipart writes
– SQS – Simple Queue Service. Messaging layer.
• Netflix Based Services
– EVCache – memcached based ephemeral cache
– Cassandra – distributed persistent data store
51. Security Architecture
• Instance Level Security baked into base AMI
– Login: ssh only allowed via portal (not between instances)
– Each app type runs as its own userid app{test|prod}
• AWS Security, Identity and Access Management
– Each app has its own security group (firewall ports)
– Fine grain user roles and resource ACLs
• Key Management
– AWS Keys dynamically provisioned, easy updates
– High grade app specific key management support
53. Accounts Isolate Concerns
• paastest – for development and testing
– Fully functional deployment of all services
– Developer tagged “stacks” for separation
• paasprod – for production
– Autoscale groups only, isolated instances are terminated
– Alert routing, backups enabled by default
• paasaudit – for sensitive services
– To support SOX, PCI, etc.
– Extra access controls, auditing
• paasarchive – for disaster recovery
– Long term archive of backups
– Different region, perhaps different vendor
54. Reservations and Billing
• Consolidated Billing
– Combine all accounts into one bill
– Pooled capacity for bigger volume discounts
http://docs.amazonwebservices.com/AWSConsolidatedBilling/1.0/AWSConsolidatedBillingGuide.html
• Reservations
– Save up to 71% on your baseline load
– Priority when you request reserved capacity
– Unused reservations are shared across accounts
55. Cloud Access Gateway
• Datacenter or office based
– A separate VM for each AWS account
– Two per account for high availability
– Mount NFS shared home directories for developers
– Instances trust the gateway via a security group
• Manage how developers login to cloud
– Access control via ldap group membership
– Audit logs of every login to the cloud
– Similar to awsfabrictasks ssh wrapper
http://readthedocs.org/docs/awsfabrictasks/en/latest/
58. Dashboards with Pytheas (Explorers)
http://techblog.netflix.com/2013/05/announcing-pytheas.html
• Cassandra Explorer
– Browse clusters, keyspaces, column families
• Base Server Explorer
– Browse service endpoints configuration, perf
• Anything else you want to build…
63. Slideshare NetflixOSS Details
• Lightning Talks Feb S1E1
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-open-house-lightning-talks
• Asgard In Depth Feb S1E1
– http://www.slideshare.net/joesondow/asgard-overview-from-netflix-oss-open-house
• Lightning Talks March S1E2
– http://www.slideshare.net/RuslanMeshenberg/netflixoss-meetup-lightning-talks-and-
roadmap
• Security Architecture
– http://www.slideshare.net/jason_chan/resilience-and-security-scale-lessons-learned
• Cost Aware Cloud Architectures – with Jinesh Varia of AWS
– http://www.slideshare.net/AmazonWebServices/building-costaware-architectures-jinesh-
varia-aws-and-adrian-cockroft-netflix
64. Amazon Cloud Terminology Reference
See http://aws.amazon.com/ This is not a full list of Amazon Web Service features
• AWS – Amazon Web Services (common name for Amazon cloud)
• AMI – Amazon Machine Image (archived boot disk, Linux, Windows etc. plus application code)
• EC2 – Elastic Compute Cloud
– Range of virtual machine types m1, m2, c1, cc, cg. Varying memory, CPU and disk configurations.
– Instance – a running computer system. Ephemeral, when it is de-allocated nothing is kept.
– Reserved Instances – pre-paid to reduce cost for long term usage
– Availability Zone – datacenter with own power and cooling hosting cloud instances
– Region – group of Avail Zones – US-East, US-West, EU-Eire, Asia-Singapore, Asia-Japan, SA-Brazil, US-Gov
• ASG – Auto Scaling Group (instances booting from the same AMI)
• S3 – Simple Storage Service (http access)
• EBS – Elastic Block Storage (network disk filesystem can be mounted on an instance)
• RDS – Relational Database Service (managed MySQL master and slaves)
• DynamoDB/SDB – Simple Data Base (hosted http based NoSQL datastore, DynamoDB replaces SDB)
• SQS – Simple Queue Service (http based message queue)
• SNS – Simple Notification Service (http and email based topics and messages)
• EMR – Elastic Map Reduce (automatically managed Hadoop cluster)
• ELB – Elastic Load Balancer
• EIP – Elastic IP (stable IP address mapping assigned to instance or ELB)
• VPC – Virtual Private Cloud (single tenant, more flexible network and security constructs)
• DirectConnect – secure pipe from AWS VPC to external datacenter
• IAM – Identity and Access Management (fine grain role based security keys)