- The document describes a serverless data ingestion and processing architecture using AWS services like SNS, SQS, Lambda, Firehose, and S3.
- Streaming data is collected from SNS into SQS queues and processed by Lambda functions to store raw and refined data in S3 buckets.
- The architecture ensures data is not lost if messages fail processing and developers can independently deploy stream processors.
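The SQS-to-Lambda-to-S3 flow summarized above can be sketched as a minimal handler. This is an illustrative assumption, not code from the document: the bucket name, key layout, and field names are hypothetical, and raising on failure is what lets SQS redeliver a message rather than lose it.

```python
import json

# Hypothetical bucket name; a real deployment would read this from
# an environment variable or configuration.
RAW_BUCKET = "my-raw-data-bucket"

def record_to_object(record):
    """Map one SQS record (carrying an SNS envelope) to an S3 key and body."""
    envelope = json.loads(record["body"])       # SQS body wraps the SNS message
    payload = envelope.get("Message", "")
    key = f"raw/{record['messageId']}.json"     # one object per message
    return key, payload

def handler(event, context, s3_client=None):
    """Lambda entry point: persist each message to S3. Any exception
    propagates, so SQS keeps the message and retries -- no data lost."""
    for record in event["Records"]:
        key, body = record_to_object(record)
        if s3_client is not None:               # injected boto3-style client
            s3_client.put_object(Bucket=RAW_BUCKET, Key=key, Body=body)
```

Because each stream processor is just one such function reading its own queue, teams can deploy them independently, as the summary notes.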
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec (Peter Bakas)
Talk on Netflix Keystone by Peter Bakas at SF Data Engineering Meetup on 2/23/2016.
Topics covered:
- Architectural design and principles for Keystone
- Technologies that Keystone is leveraging
- Best practices
http://www.meetup.com/SF-Data-Engineering/events/228293610/
S3, Cassandra or Outer Space? Dumping Time Series Data Using Spark (Demi Ben-Ari)
A vast volume of our processed data is time series data, and once you start working with distributed systems you start tackling many scale and performance problems. Many questions arise:
How do we handle missing data?
Should the system handle both serving and backend processing, or should they be separated?
Which solution is cheaper? Which gives the best performance for the money?
In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, show some of the problems we’ve handled, and review the multiple data persistence layers we use: S3, MongoDB, Apache Cassandra, and MySQL.
And I’ll try my best NOT to answer the question “Which one of them is the best?”
Sharing our pain and lessons learned is promised!
Bio:
Demi Ben-Ari, Sr. Data Engineer @Windward.
I have over 9 years of experience building various systems, in the fields of both near-real-time applications and Big Data distributed systems.
Co-Founder of the “Big Things” Big Data community: http://somebigthings.com/big-things-intro/
I’m a software development groupie, interested in tackling cutting-edge technologies.
Netflix Keystone Pipeline at Big Data Bootcamp, Santa Clara, Nov 2015 (Monal Daxini)
Keystone - processing over half a trillion events per day, with peaks of 8 million events and 17 GB per second, and at-least-once processing semantics. We will explore in detail how we employ Kafka, Samza, and Docker at scale to implement a multi-tenant pipeline. We will also look at the evolution to its current state and where the pipeline is headed next: offering a self-service stream processing infrastructure atop the Kafka-based pipeline and supporting Spark Streaming.
Slides from a presentation by Monal Daxini at Disney, Glendale, CA about Netflix Open Source Software, cloud data persistence, and Cassandra best practices.
Netflix Keystone: Streaming Data Pipeline @Scale in the Cloud - DBTB 2016 (Monal Daxini)
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. We will explore in detail how we leverage Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. We will also share our plans on offering Stream Processing as a Service for all of Netflix.
Netflix Keystone Pipeline at Samza Meetup 10-13-2015 (Monal Daxini)
The Netflix Keystone Pipeline processes 600 billion events a day; a detailed treatise on the modification and use of Samza for real-time routing of events, including Docker.
Data Streaming Ecosystem Management at Booking.com (Confluent)
(Alex Mironov, Booking.com) Kafka Summit SF 2018
Since its original introduction at Booking.com, Apache Kafka and the overall concept of real-time data streaming have come a long way, from a complicated novelty to a common tool used by a multitude of internal users, ranging from ad-hoc consumers to business-critical services powering our property search engine.
Over the course of this talk we’ll dive deep into how a relatively small team of SREs successfully manages a multi-cluster, multi-tenant setup of Kafka and its surrounding ecosystem, capable of transporting millions of messages per day. We’ll discuss the challenges they faced while building this platform, and take a close look at both the application- and architecture-level decisions they made to overcome them. We will also review the tooling and automation the team uses to stay sane during the day and sleep well at night.
Recently, interest in highly scalable stream processing engines has risen, and many projects have appeared. Apache Samza is a distributed stream-processing framework that uses Apache Kafka for messaging and Apache Hadoop YARN for fault tolerance and resource management. It is one of the most popular stream processing engines, used by many high-profile companies. On the other hand, Amazon Kinesis is a fully managed service for real-time processing of streaming data that allows users to scale the amount of data ingested without worrying about infrastructure details. This presentation gives a brief introduction to the very popular Samza-Kafka integration, then focuses on the new Samza-Kinesis integration and the opportunities it opens up.
http://www.oreilly.com/pub/e/3764
Keystone processes over 700 billion events per day (1 petabyte) with at-least-once processing semantics in the cloud. Monal Daxini details how they used Kafka, Samza, Docker, and Linux at scale to implement a multi-tenant pipeline in the AWS cloud within a year. He'll also share plans on offering Stream Processing as a Service for all of Netflix.
The need to glean answers from data in real time is moving from a nicety to a necessity. There are few options for analyzing a never-ending stream of unbounded data at scale. Let’s compare and contrast the core principles and technologies of the different open source solutions available for this endeavor, and where processing engines need to evolve to meet processing needs at scale. These findings are based on the experience of continuing to build a scalable cloud solution that processes over 700 billion events at Netflix, and on how we are embarking on the next journey to evolve unbounded data processing engines.
DataStax and Esri: Geotemporal IoT Search and Analytics (DataStax Academy)
Internet of Things (IoT) data frequently has a location and time component. Getting value out of this "geotemporal" data can be tricky. We'll explore when and how to leverage Cassandra, DSE Search and DSE Analytics to surface meaningful information from your geotemporal data.
Samza at LinkedIn: Taking Stream Processing to the Next Level (Martin Kleppmann)
Slides from my talk at Berlin Buzzwords, 27 May 2014. Unfortunately Slideshare has screwed up the fonts. See https://speakerdeck.com/ept/samza-at-linkedin-taking-stream-processing-to-the-next-level for a version of the deck with correct fonts.
Stream processing is an essential part of real-time data systems, such as news feeds, live search indexes, real-time analytics, metrics and monitoring. But writing stream processes is still hard, especially when you're dealing with so much data that you have to distribute it across multiple machines. How can you keep the system running smoothly, even when machines fail and bugs occur?
Apache Samza is a new framework for writing scalable stream processing jobs. Like Hadoop and MapReduce for batch processing, it takes care of the hard parts of running your message-processing code on a distributed infrastructure, so that you can concentrate on writing your application using simple APIs. It is in production use at LinkedIn.
This talk will introduce Samza, and show how to use it to solve a range of different problems. Samza has some unique features that make it especially interesting for large deployments, and in this talk we will dig into how they work under the hood. In particular:
• Samza is built to support many different jobs written by different teams. Isolation between jobs ensures that a single badly behaved job doesn't affect other jobs. It is robust by design.
• Samza can handle jobs that require large amounts of state, for example joining multiple streams, augmenting a stream with data from a database, or aggregating data over long time windows. This makes it a very powerful tool for applications.
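The stateful, windowed jobs described in the bullets above can be mimicked in a few lines of plain Python. This is a toy model of the process/window pattern only, not Samza's actual API; the message shape and class name are assumptions for illustration.

```python
from collections import defaultdict

class WindowedCounter:
    """Toy model of a stateful stream task: count events per key during a
    window, then emit the aggregates and reset, in the spirit of a
    Samza-style windowable task."""

    def __init__(self):
        self.counts = defaultdict(int)   # local state, one entry per key

    def process(self, message):
        """Called once per incoming message."""
        self.counts[message["key"]] += 1

    def window(self):
        """Called at the end of each time window: emit aggregates, reset."""
        emitted = dict(self.counts)
        self.counts.clear()
        return emitted

task = WindowedCounter()
for key in ["a", "b", "a", "a"]:
    task.process({"key": key})
print(task.window())   # {'a': 3, 'b': 1}
```

In the real framework the per-key state would live in Samza's local state store, which is what makes long-window aggregation and stream joins practical at scale.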
Harvesting the Power of Samza in LinkedIn's Feed (Mohamed El-Geish)
LinkedIn's Feed is the entry point for hundreds of millions of members who seek to stay informed about their professional interests. The feed strives to provide relevant content to members that's also new and fresh. How does the feed solve this problem at scale? What role does Samza play in this? Join us to find out.
Building Large-Scale Analytics Platform with Storm, Kafka and Cassandra - NYC... (Alexey Kharlamov)
At Integral, we process heavy volumes of click-stream traffic: 50K QPS of ad impressions at peak and close to 200K QPS across all browser calls. We build analytics on these streams of data. Two applications require quite significant computational effort: 'sessionization' and fraud detection.
Sessionization means linking a series of requests from the same browser into a single record. There can be 5 or more requests spread over 15-30 minutes that we need to link to each other.
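One hedged way to sketch that linking step: sort a browser's requests by time and cut a new session whenever the gap exceeds a threshold. The 30-minute cutoff echoes the 15-30 minute range mentioned above, but the threshold and field names are illustrative assumptions, not the speakers' actual logic.

```python
SESSION_GAP_SECONDS = 30 * 60   # close a session after 30 minutes of silence

def sessionize(requests):
    """Group one browser's requests (dicts with a 'ts' epoch-seconds field)
    into sessions: lists of requests linked into a single record."""
    ordered = sorted(requests, key=lambda r: r["ts"])
    sessions, current = [], []
    for req in ordered:
        if current and req["ts"] - current[-1]["ts"] > SESSION_GAP_SECONDS:
            sessions.append(current)   # gap too large: close the session
            current = []
        current.append(req)
    if current:
        sessions.append(current)
    return sessions
```

Doing this in a streaming system rather than an hourly batch is exactly the migration the talk describes: the same grouping, but applied as requests arrive.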
Fraud detection is a process that looks at various signals in browser requests, and at substantial historical evidence, to classify an ad impression as either legitimate or fraudulent.
We had been doing both (as well as all other analytics) in batch mode, once an hour at best. Both processes, and fraud detection in particular, are time sensitive and much more meaningful if done in near real time.
This talk is about our experience migrating once-per-day offline batch processing of impression data using Hadoop to in-memory stream processing using Kafka, Storm and Cassandra. We will touch upon our choices and our reasoning for selecting the products used for this solution.
Hadoop is no longer the only, or always the preferred, option in the Big Data space. In-memory stream processing may be more effective for time series data preparation and aggregation. The ability to scale at a significantly lower cost means more customers, better accuracy, and better business practices: since only in-stream processing allows for low-latency data and insight delivery, it opens entirely new opportunities. However, transitioning non-trivial data pipelines raises a number of questions previously hidden within the offline nature of batch processing. How will you join several data feeds? How will you implement failure recovery? In addition to handling terabytes of data per day, our streaming system has to be guided by the following considerations:
• Recovery time
• Time relativity and continuity
• Geographical distribution of data sources
• Limit on data loss
• Maintainability
The system produces complex cross-correlation analysis of several data feeds and aggregation for client analytics, with an input feed frequency of up to 100K msg/sec.
This presentation will benefit anyone interested in learning an alternative approach to big data analytics, especially the process of joining multiple streams in memory using Cassandra. The presentation will also highlight certain optimization patterns that can be useful in similar situations.
Some notes about Spark Streaming's positioning given the current players: Beam, Flink, Storm, et al. Helpful if you have to choose a streaming engine for your project.
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...) (DataStax)
When we were preparing for the PlayStation 4 launch, we faced one hard problem: how to make our games and videos storage system super-fast, highly available, and fault tolerant. Moving from a relational database to Cassandra is not easy, and it's even harder if you want to support different search and query use cases. Join our talk to learn how we managed to build a highly available platform that supports tens of millions of active users and can execute multiple user-specific queries in less than a millisecond. And all of it without using Solr or Elasticsearch.
About the Speaker
Alexander Filipchik, Sony
Alex spent the last 4 years building the next generation of the PlayStation Network. He is honored to be part of a small team of engineers who built from scratch a platform that scaled from 0 users to 1 million PS4s in just 1 day, has been landing 1.5 million new devices per month since, and has now reached tens of millions of active users. He is passionate about technology, innovations, walking his dog, and building scalable software using Cassandra.
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
Scala-like Distributed Collections - Dumping Time-Series Data with Apache Spark (Demi Ben-Ari)
Spark RDDs are almost identical to Scala collections, just distributed; all of the transformations and actions are derived from the Scala collections API.
As Martin Odersky put it, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power come a great many data problems: at first you’ll start tackling the concept of partitioning, then the actual data becomes the next thing to worry about.
In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL).
When you start tackling many scale and performance problems, many questions arise:
> How to handle missing data?
> Should the system handle both serving and backend processes, or should we separate them out?
> Which solution is cheaper?
> How do we get the best performance for money spent?
In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!
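The "distributed Scala collection" analogy from the abstract above can be made concrete with a toy, single-machine stand-in for an RDD. This is purely illustrative: real RDDs are lazy, partitioned across a cluster, and fault tolerant, none of which this sketch models.

```python
from functools import reduce as _reduce

class ToyRDD:
    """A local stand-in for an RDD: the same map/filter/reduce vocabulary
    as the Scala collections API, minus laziness and distribution."""

    def __init__(self, data):
        self.data = list(data)

    # Transformations return a new "RDD", as in Spark.
    def map(self, f):
        return ToyRDD(f(x) for x in self.data)

    def filter(self, pred):
        return ToyRDD(x for x in self.data if pred(x))

    # Actions return a plain value, as RDD actions do.
    def reduce(self, f):
        return _reduce(f, self.data)

    def collect(self):
        return list(self.data)

squares = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 1)
print(squares.collect())   # [4, 9, 16]
```

The point of the analogy is that the chained call style transfers directly; what does not transfer, as the talk warns, is intuition about partitioning and data placement.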
NASA LandSat data can be stored, transformed, navigated, and visualized. In this session we will explore how the LandSat dataset is stored in Amazon Simple Storage Service (S3), one of the recommended cloud storage services in AWS for storage of petabytes of data, and how data stored in S3 can be processed on the server with the Lambda service, visualized for users, and made available to search engines.
Created by: Ben Snively, Senior Solutions Architect
Riga Dev Day: Lambda Architecture at AWS (Antons Kranga)
My recent talk at Riga DevDay about the Lambda architecture at AWS. It illustrates a few design simplifications that we gain when we implement the Lambda architecture in a cloud-native way.
In this talk, we share our experience building up our data pipeline. We started with MongoDB, migrated to Cassandra, and now use Kafka and Spark to handle our data. We also talk about what problems we encountered, why we selected these solutions, and where we will go next.
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be... (Codemotion)
A vast volume of our processed data is time series data, and once you start working with distributed systems you start tackling many scale and performance problems: How to handle missing data? Should I handle both serving and backend processing, or separate them out? What gives the best performance for the money? In the talk we will tell the tale of all of the transformations we’ve made to our data model @Windward, some of the problems we’ve handled, and review the multiple data persistence layers: S3, MongoDB, Apache Cassandra, MySQL. And I’ll try my best NOT to answer the question “Which one of them is the best?”
Netflix Open Source Meetup Season 4 Episode 2 (aspyker)
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix.
The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs.
The second, Dynomite, is a framework to make any non-distributed data-store, distributed. Netflix's first implementation of Dynomite is based on Redis.
Come learn about the products' features and hear from Thomson Reuters, Diego Pacheco from ilegra, and other third-party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...Amazon Web Services
Startups around the world use AWS services to access the power of the cloud to grow faster and more cost effectively. In this session, Smartsheet talks about how they were able to cost-effectively build their prototype for scale and avoid replatforming at different points in the adoption curve, and Quantcast discusses how they are running a high-performance analytics solution on AWS. They provide several tips and tricks for S3, and show how they removed a traditional MySQL data store from a distributed-image hosting application so that the only required data store is S3. They also show how to avoid common, cumbersome database practices by working with the eventually consistent nature of S3 objects and the fact that objects and directories share the same namespace.
Serverless architecture can eliminate the need to provision and manage servers required to process files or streaming data in real time.
In this session, we will cover the fundamentals of using AWS Lambda to process data from sources such as Amazon DynamoDB Streams, Amazon Kinesis, and Amazon S3. We will walk through sample use cases for real-time data processing and discuss best practices on using these services together. We will then demonstrate how to set up a real-time stream processing solution using just Amazon Kinesis and AWS Lambda, all without the need to run or manage servers.
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Amazon Web Services
Get the most out of Amazon Redshift by learning about cutting-edge data warehousing implementations. Desk.com, a Salesforce.com company, discusses how they maintain a large concurrent user base on their customer-facing business intelligence portal powered by Amazon Redshift. HasOffers shares how they load 60 million events per day into Amazon Redshift with a 3-minute end-to-end load latency to support ad performance tracking for thousands of affiliate networks. Finally, Aggregate Knowledge discusses how they perform complex queries at scale with Amazon Redshift to support their media intelligence platform.
What if there were an easier way to perform big data analysis with less setup, instant scaling, and no servers to provision and manage? With serverless computing, you can perform real-time stream processing of multiple data types without needing to spin up servers or install software. Come learn how you can use AWS Lambda with Amazon Kinesis to analyze streaming data in real-time and then store the results in a managed NoSQL database such as Amazon DynamoDB. You’ll learn tips and tricks for doing in-line processing, data manipulation, and even distributed MapReduce on large data sets.
NetflixOSS Meetup S3 E1, covering latest components in Distributed Databases, Telemetry systems, Big Data tools and more. Speakers from Netflix, IBM Watson, Pivotal and Nike Digital
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptxBrad Spiegel Macon GA
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
This 7-second Brain Wave Ritual Attracts Money To You.!nirahealhty
Discover the power of a simple 7-second brain wave ritual that can attract wealth and abundance into your life. By tapping into specific brain frequencies, this technique helps you manifest financial success effortlessly. Ready to transform your financial future? Try this powerful ritual and start attracting money today!
# Internet Security: Safeguarding Your Digital World
In the contemporary digital age, the internet is a cornerstone of our daily lives. It connects us to vast amounts of information, provides platforms for communication, enables commerce, and offers endless entertainment. However, with these conveniences come significant security challenges. Internet security is essential to protect our digital identities, sensitive data, and overall online experience. This comprehensive guide explores the multifaceted world of internet security, providing insights into its importance, common threats, and effective strategies to safeguard your digital world.
## Understanding Internet Security
Internet security encompasses the measures and protocols used to protect information, devices, and networks from unauthorized access, attacks, and damage. It involves a wide range of practices designed to safeguard data confidentiality, integrity, and availability. Effective internet security is crucial for individuals, businesses, and governments alike, as cyber threats continue to evolve in complexity and scale.
### Key Components of Internet Security
1. **Confidentiality**: Ensuring that information is accessible only to those authorized to access it.
2. **Integrity**: Protecting information from being altered or tampered with by unauthorized parties.
3. **Availability**: Ensuring that authorized users have reliable access to information and resources when needed.
## Common Internet Security Threats
Cyber threats are numerous and constantly evolving. Understanding these threats is the first step in protecting against them. Some of the most common internet security threats include:
### Malware
Malware, or malicious software, is designed to harm, exploit, or otherwise compromise a device, network, or service. Common types of malware include:
- **Viruses**: Programs that attach themselves to legitimate software and replicate, spreading to other programs and files.
- **Worms**: Standalone malware that replicates itself to spread to other computers.
- **Trojan Horses**: Malicious software disguised as legitimate software.
- **Ransomware**: Malware that encrypts a user's files and demands a ransom for the decryption key.
- **Spyware**: Software that secretly monitors and collects user information.
### Phishing
Phishing is a social engineering attack that aims to steal sensitive information such as usernames, passwords, and credit card details. Attackers often masquerade as trusted entities in email or other communication channels, tricking victims into providing their information.
### Man-in-the-Middle (MitM) Attacks
MitM attacks occur when an attacker intercepts and potentially alters communication between two parties without their knowledge. This can lead to the unauthorized acquisition of sensitive information.
### Denial-of-Service (DoS) and Distributed Denial-of-Service (DDoS) Attacks
Multi-cluster Kubernetes Networking- Patterns, Projects and GuidelinesSanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with focus on 4 key topics.
1) Key patterns for Multi-cluster architectures
2) Architectural comparison of several OSS/ CNCF projects to address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations & guidelines for adopting/ deploying these solutions.
APNIC Foundation, presented by Ellisha Heppner at the PNG DNS Forum 2024APNIC
Ellisha Heppner, Grant Management Lead, presented an update on APNIC Foundation to the PNG DNS Forum held from 6 to 10 May, 2024 in Port Moresby, Papua New Guinea.
1.Wireless Communication System_Wireless communication is a broad term that i...JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
2. Principal Data Architect at Home24
Data Services: Search, Recommendations, Ranking
Worked on: Here Maps, Sapo.pt, DataJet, Xing, …
Scala, Perl, Prolog, Java, SQL, R, …
AWS: Step Functions, Lambda, EMR, EC2, Batch, SQS, SNS, Firehose, Athena, API Gateway, ...
5. ● 15 people of 12 nationalities
● Serverless lovers. For data ingestion we have:
● AWS technologies: Step Functions, CloudFormation, Lambda Functions, Athena, EMR, Redshift, S3, ...

                              Production                Development
Number of Lambdas             625                       2311
Number of Step Functions      113                       490
Consumed time (a month)       3,383,525 sec (39 days)   5,371,037 sec (62 days)
Number of requests (a month)  2,014,203                 3,300,118
6. ● The majority of our streams are low-rate message streams
● The big stream doesn’t have an easily predictable rate of messages and can peak at 100 messages/sec
● We will have many more low-rate streams
7. Main requirements
● Store new stream data in a Raw S3 bucket
● Refine Raw S3 bucket data into a Refined S3 bucket
● Wrongly formatted messages shall not stop the flow
● A notification shall be sent on bad data
● Data must be refined in less than 10 minutes
Other
● Able to replay many days of data fast
● For development, every developer shall be able to deploy their own version independently
8. Requirements
● Collect data from SNS
● The data must be stored as received in S3
● File sizes must be easy to process on Lambda (< 10 MB)
● At least 1 file per minute must be created
9-11. Architecture
● An SQS queue collects all the data from the SNS topic
● A Lambda function copies the data from the SQS queue to a Firehose delivery stream
● The Lambda function is invoked once a minute via a CloudWatch Event
● Firehose merges the data and creates files in the Raw S3 bucket
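The copier step above can be sketched as a short boto3 Lambda handler. This is a minimal sketch, not the talk's actual code; the queue URL and delivery stream name are placeholders.

```python
def to_firehose_records(messages):
    # Firehose concatenates records as-is, so delimit each message with a newline
    return [{"Data": (m["Body"] + "\n").encode("utf-8")} for m in messages]

def handler(event, context):
    # Invoked once a minute by a CloudWatch Event rule
    import boto3  # imported lazily so the module loads without AWS dependencies
    sqs = boto3.client("sqs")
    firehose = boto3.client("firehose")
    queue_url = "https://sqs.example.com/raw-queue"  # placeholder
    while True:
        resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            break
        firehose.put_record_batch(
            DeliveryStreamName="raw-delivery-stream",  # placeholder
            Records=to_firehose_records(messages),
        )
        # Delete only after a successful copy; on failure the messages stay in
        # the queue and eventually move to the Dead-Letter Queue
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[{"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                     for i, m in enumerate(messages)],
        )
```

Firehose then takes care of buffering and writing the merged files to the Raw S3 bucket, so the Lambda itself stays small and stateless.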
13-16. Requirement
● When some messages are not processable, send a notification
Architecture
● The data is deleted from the SQS queue only after a successful copy to Firehose
● In case of error, the messages end up in the Dead-Letter Queue
● A non-empty Dead-Letter Queue means there is an error in the data
● After fixing the Lambda function, one can always copy the messages back to the Raw SQS queue
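The redrive step (copying messages from the Dead-Letter Queue back to the Raw SQS queue) can be sketched as follows. The `redrive` helper is an illustration, not code from the talk, and the queue URLs are placeholders.

```python
def redrive(receive_batch, send, delete):
    """Move every message from a dead-letter queue back to the source queue.
    Re-send first, delete second: a crash in between duplicates a message
    but never loses one."""
    moved = 0
    for batch in iter(receive_batch, []):  # receive until an empty batch
        for body, handle in batch:
            send(body)
            delete(handle)
            moved += 1
    return moved

def handler(event, context):
    import boto3  # lazy import; the URLs below are placeholders
    sqs = boto3.client("sqs")
    dlq = "https://sqs.example.com/raw-dlq"
    raw = "https://sqs.example.com/raw-queue"

    def receive_batch():
        msgs = sqs.receive_message(QueueUrl=dlq, MaxNumberOfMessages=10)
        return [(m["Body"], m["ReceiptHandle"]) for m in msgs.get("Messages", [])]

    return redrive(
        receive_batch,
        send=lambda body: sqs.send_message(QueueUrl=raw, MessageBody=body),
        delete=lambda h: sqs.delete_message(QueueUrl=dlq, ReceiptHandle=h),
    )
```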
17-21. Requirements
● Decompress the data (zip, deflate, gz, base64, ...)
● Normalize fields (dates, for example)
● Add metadata
● Convert everything to JSON
● Store the result on S3
Architecture
● When a new file is created in the Raw S3 bucket, a message is sent to SQS via SNS
● The Lambda function is invoked once a minute via a CloudWatch Event and processes all unprocessed files
● A file with the same key as the Raw file is created in the Refined S3 bucket
● Messages that fail to process end up in the Dead-Letter Queue
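The per-record refinement (decompress, normalize dates, add metadata, emit JSON) can be sketched like this. The field names (`ts`, `refined_at`) are illustrative assumptions, not the talk's actual schema; a refiner Lambda would apply this to each record of a Raw file and write the output to the Refined bucket under the same key.

```python
import base64
import gzip
import json
import zlib
from datetime import datetime, timezone

def refine_record(raw: bytes, encoding: str = "gzip") -> str:
    """Decompress one raw record, normalize its date field, add metadata,
    and return it as a JSON string."""
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        raw = zlib.decompress(raw)
    elif encoding == "base64":
        raw = base64.b64decode(raw)
    record = json.loads(raw)
    # Normalize dates: epoch seconds -> ISO-8601 UTC (field name is hypothetical)
    if isinstance(record.get("ts"), (int, float)):
        record["ts"] = datetime.fromtimestamp(record["ts"], tz=timezone.utc).isoformat()
    # Metadata added during refinement
    record["refined_at"] = datetime.now(tz=timezone.utc).isoformat()
    return json.dumps(record)
```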
23-26. Requirements
● Replay multiple days of data
Architecture
● A Lambda function lists the files in the Raw S3 bucket and sends messages to SQS
● Since the files in Raw and Refined have the same key, the replayed files always overwrite the existing ones
● The execution time of the Refiner Lambda rises and the Refiner Lambdas work in parallel
Parallelism:
● Our Lambda goes to ~190 sec, with 3 Lambdas running in parallel
● 9198 S3 objects
● 30 GB of GZip data, 10 GB/hour
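The replay Lambda (list the Raw bucket, enqueue one message per key) can be sketched as below. The bucket, queue URL, and the `days` event field are placeholders, not names from the talk.

```python
def replay_keys(list_pages, send_batch, day_prefixes):
    """Enqueue one message per Raw S3 key under the given day prefixes.
    The Refiner then overwrites the Refined files with the same keys."""
    total = 0
    for prefix in day_prefixes:
        for keys in list_pages(prefix):
            send_batch(keys)
            total += len(keys)
    return total

def handler(event, context):
    import boto3  # lazy import; names below are placeholders
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    bucket = "raw-bucket"
    queue = "https://sqs.example.com/refine-queue"

    def list_pages(prefix):
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            yield [o["Key"] for o in page.get("Contents", [])]

    def send_batch(keys):
        for key in keys:
            sqs.send_message(QueueUrl=queue, MessageBody=key)

    # event["days"] would carry prefixes like "2016/11/01/" (hypothetical layout)
    return replay_keys(list_pages, send_batch, event.get("days", []))
```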
27-30. Requirements
● Developers shall be able to deploy their own stream processors
● No interaction with an external team shall be required
Architecture
● We created an internal SNS topic where we clone the external messages
● SNS can write to multiple SQS queues
● With the same CloudFormation magic, every developer can deploy their own environment
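The talk does this wiring with CloudFormation; purely as an illustration of the SNS-to-SQS fan-out, the same setup can be sketched with boto3. The queue naming scheme is hypothetical.

```python
import json

def queue_policy(queue_arn, topic_arn):
    """IAM policy allowing the internal SNS topic to write into this queue."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    })

def attach_dev_queue(topic_arn, dev_name):
    """Create a per-developer queue and subscribe it to the internal topic."""
    import boto3  # lazy import; no AWS calls happen until this runs
    sqs = boto3.client("sqs")
    sns = boto3.client("sns")
    q = sqs.create_queue(QueueName=f"ingest-{dev_name}")  # hypothetical naming
    attrs = sqs.get_queue_attributes(QueueUrl=q["QueueUrl"],
                                     AttributeNames=["QueueArn"])
    queue_arn = attrs["Attributes"]["QueueArn"]
    sqs.set_queue_attributes(
        QueueUrl=q["QueueUrl"],
        Attributes={"Policy": queue_policy(queue_arn, topic_arn)},
    )
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
    return queue_arn
```

Because SNS fans out to every subscribed queue, each developer's environment receives its own copy of the stream without touching anyone else's.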
31-34. EC2 vs Lambda
CPU / Price:
● EC2: 1 t2.nano (5% of a vCPU and 500 MB): 0.0063 * 24 * 30 = $4.536/month
● Lambda: considering 3 seconds a minute at the highest memory (2 vCPU and 1536 MB): 3 * 60 * 24 * 30 * 10 * ($0.000002501 + $0.0000002) ≈ $3.5/month
DevOps effort:
● EC2: higher
● Lambda: low
Scale:
● EC2: scales while it has credits, up to 1 vCPU; to get more vCPUs you need more expensive instance types or autoscaling
● Lambda: out of the box up to a certain level; 2 vCPU * 5 Lambdas = 10 vCPUs
Price-wise, Lambda seems a good solution. For our problems, 10 vCPUs is clearly more than enough.
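The slide's arithmetic can be checked directly. Prices are the 2016 figures from the slide; note the slide folds the per-request price into the same 100 ms multiplier, which slightly overstates the request cost.

```python
# EC2: one t2.nano at $0.0063/hour, running 24/7 for a 30-day month
ec2_monthly = 0.0063 * 24 * 30  # ≈ $4.54

# Lambda: 3 seconds per minute at 1536 MB, times a factor of 10
# $0.000002501 per 100 ms of compute + $0.0000002 per request
units_100ms = 3 * 60 * 24 * 30 * 10  # 100 ms billing units per month
lambda_monthly = units_100ms * (0.000002501 + 0.0000002)  # ≈ $3.50
```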
35-37. Kinesis vs SQS
We analyzed our 2 types of data streams:
● Slow stream: 1 message/sec (2.6 million requests/month)
● Fast stream: 25 messages/sec (64.8 million requests/month), with spikes of 100 messages/sec
On SQS you pay for PUTs and GETs; on Kinesis you pay for PUTs (plus shard hours).
Slow stream:
● Kinesis: 2 shards $24.5/month, PUTs $0.042/month
● SQS: requests $2.07/month
Fast stream:
● Kinesis: 3 shards $36.7/month, PUTs $1.1/month
● SQS: requests $51.8/month
Errors:
● Kinesis: errors have to be controlled externally
● SQS: errors go to the Dead-Letter Queue
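As a back-of-the-envelope check of the SQS side: the slide's numbers are consistent with two requests per message (one send, one receive) at the then-current price of roughly $0.40 per million requests. These assumptions are inferred from the slide's figures, not stated in it.

```python
def sqs_monthly(messages_per_month, price_per_million=0.40):
    # Each message costs roughly one send plus one receive request
    return 2 * messages_per_month / 1e6 * price_per_million

slow = sqs_monthly(2.6e6)   # ≈ $2.08, matching the slide's ~$2.07
fast = sqs_monthly(64.8e6)  # ≈ $51.84, matching the slide's $51.8
```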
38. ● You just pay for what you use
● Scalability is not an issue at our message volume (peaks of 100 messages/second)
○ SQS and Firehose can easily process that volume of messages
○ Multiple Lambdas can work in parallel in case of high traffic or replay
● Separate Lambdas per stream make the logs easier to understand
● Separate environments simplify the developers’ work
● Data is on S3 and can be queried via Athena, EMR, Redshift Spectrum, ...