Cloud Native Data Pipelines
The document discusses building data pipelines in a cloud-native way using open-source technologies and cloud-native techniques. It describes a message-scoring use case at Agari in which data is ingested from multiple enterprises into S3 and then processed hourly by a Spark job on EMR. The results are written to S3 and trigger downstream processing. Design goals for resilient data pipelines include operability, correctness, timeliness, and cost. Techniques discussed to achieve these goals include using Apache Airflow for workflow management, Auto Scaling groups, and leveraging serverless technologies where possible.
12. Cloud Native Data Pipelines
Big Data companies like LinkedIn, Facebook, Twitter, & Google have large teams (100s of engineers) to manage their data pipelines.
Most start-ups have small teams (10s of engineers) & run in the public cloud. Can they leverage aspects of the public cloud to build comparable pipelines?
13. Cloud Native Data Pipelines
Cloud Native Techniques + Open Source Technologies ~ Data Pipelines seen in Big Data companies
21. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3]
Each enterprise uploads an Avro file to S3 every 15 minutes.
22. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3]
Airflow kicks off a Spark message-scoring job on EMR every hour.
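
For illustration, a minimal boto3 sketch of such an hourly trigger, submitting a Spark step to a long-running EMR cluster. The cluster id, jar, class, and S3 paths are made-up placeholders, not Agari's actual job.

import boto3

emr = boto3.client("emr", region_name="us-east-1")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder EMR cluster id
    Steps=[{
        "Name": "message-scoring",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "com.example.MessageScorer",  # hypothetical class
                "s3://example-artifacts/scorer.jar",     # hypothetical jar
                "--input", "s3://example-ingest/avro/",  # hypothetical paths
                "--output", "s3://example-scored/",
            ],
        },
    }],
)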
23. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3 → Spark → S3]
The Spark job writes scored messages and stats to another S3 bucket.
24. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3 → Spark → S3 → SNS → SQS]
This triggers SNS/SQS event messages.
25. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3 → Spark → S3 → SNS → SQS → Importers (ASG)]
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages.
26. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3 → Spark → S3 → SNS → SQS → Importers (ASG) → DB]
The importers rapidly ingest scored messages and aggregate statistics into the DB.
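
A minimal sketch of what an importer worker loop could look like: long-poll SQS for the S3 event notifications (delivered via SNS), fetch the referenced object, and delete the message only after a successful import. The queue URL and the import helper are hypothetical.

import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/scored"  # placeholder


def import_scored_file(payload):
    pass  # stub: parse scored records and upsert rows/stats into the DB


def run_importer():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            event = json.loads(body["Message"])  # SNS wraps the S3 event
            for rec in event.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                key = rec["s3"]["object"]["key"]
                obj = s3.get_object(Bucket=bucket, Key=key)
                import_scored_file(obj["Body"].read())
            # Delete (ACK) only after a successful import: at-least-once.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])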
27. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → S3 → Spark → S3 → SNS → SQS → Importers (ASG) → DB → web app]
Users receive alerts of untrusted emails & can review them in the web app.
29. Architectural Components
Component | Role | Uses | Salient Features | Operability Model
S3 | Data Lake | All data stored in S3; all processing uses S3 | Scalable, Available, Performant | Serverless
SNS + SQS | Messaging | Reliable, transactional pub/sub | Scalable, Available, Performant | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Performant | Managed
EMR Spark | Data Science Processing | Aggregation, model building, scoring | Nice programming model at the cost of debugging complexity | We Operate
Airflow | Workflow Engine | Coordinates all Spark jobs & complex flows | Lightweight, DAGs as Code, steep learning curve | We Operate
DB | Persistence for WebApp | Holds subset of data needed for Web App | Rails + Postgres, 'nuff said | We Operate
31. Tackling Cost
[Timeline: Between Daily Runs vs. During Daily Runs]
When running daily, for 23 hours of the day we didn't pay for instances in the ASG or EMR.
32. Tackling Cost
[Timeline: Between Hourly Runs vs. During Hourly Runs]
When running daily, for 23 hours of the day we didn't pay for instances in the ASG or EMR. This does not help when runs are hourly, since AWS charges at an hourly rate for EC2 instances!
34. ASG - Overview
What is it?
• A means to automatically scale clusters out/in to handle variable load/traffic
• A means to keep a cluster/service of a fixed size always up
35. ASG - Data Pipeline
[SQS → Importer ASG (importer x 4, scales out/in) → DB]
38. ASG : Queue-based Scaling
Scale-out : when Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.
Scale-in : when Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK'd). This causes the ASG to shrink.
A sketch of these two rules as CloudWatch alarms follows.
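
A hedged boto3 sketch of the two scaling rules above, expressed as CloudWatch alarms on the SQS queue metrics. The queue name and the pre-created scaling policy ARNs are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")
QUEUE = "scored-messages"                                    # placeholder
SCALE_OUT_ARN = "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:out"  # placeholder
SCALE_IN_ARN = "arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:in"    # placeholder

# Scale out: visible messages > 0 (queue depth > 0) -> grow the ASG.
cloudwatch.put_metric_alarm(
    AlarmName="importers-scale-out",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": QUEUE}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[SCALE_OUT_ARN],
)

# Scale in: invisible (in-flight) messages = 0 -> last message ACK'd,
# so shrink the ASG.
cloudwatch.put_metric_alarm(
    AlarmName="importers-scale-in",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesNotVisible",
    Dimensions=[{"Name": "QueueName", "Value": QUEUE}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=[SCALE_IN_ARN],
)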
40. ASG - Build & Deploy
Component | Role | Details
Terraform | Spins up cloud resources | Spins up SQS, Kinesis, EC2, ASG, ELB, etc. and associates them
Ansible | Sets up an EC2 instance | A better version of Chef & Puppet: an agentless, idempotent, & declarative tool that sets up EC2 instances by installing & configuring packages, and more
Packer | Spins up an EC2 instance for the purpose of building an AMI | Can be used with Ansible & Terraform to bake AMIs & launch Auto Scaling Groups
41-45. ASG - Build & Deploy
Step 1 : Packer spins up a temporary EC2 node - a blank canvas!
Step 2 : Packer runs an Ansible role against the EC2 node to set it up.
Step 3 : Packer snapshots the machine & registers the AMI.
Step 4 : Packer terminates the EC2 instance!
Step 5 : Using the AMI, Terraform spins up an auto-scaled compute cluster (ASG).
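
As a small companion sketch (an assumption about the glue, not part of the deck): once Packer has registered the AMI, a deploy step can look up the newest image by name before handing it to Terraform. The name prefix is a placeholder.

import boto3

ec2 = boto3.client("ec2")
images = ec2.describe_images(
    Owners=["self"],
    Filters=[{"Name": "name", "Values": ["importer-ami-*"]}],  # placeholder
)["Images"]
# CreationDate is ISO 8601, so lexicographic max is the newest image.
latest = max(images, key=lambda img: img["CreationDate"])
print(latest["ImageId"])  # feed this AMI id to the ASG launch configuration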
46. Desirable Qualities of a Resilient Data Pipeline
Operability · Correctness · Timeliness · Cost
Daily : ASG + EMR Spark. Hourly : ASG + EMR Spark, but no cost savings.
48. Tackling Operability : Requirements
• A simple way to author, configure, & manage workflows
• Provides visual insight into the state & performance of workflow runs
• Integrates with our alerting and monitoring tools
(See the sketch after this list.)
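
A minimal sketch of how these requirements map onto Airflow: workflows are authored as code (a DAG), the web UI gives visual insight into runs, and alerting hooks in via default_args. The task commands and alert address are placeholders.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "data-pipelines",
    "email": ["oncall@example.com"],  # placeholder alert address
    "email_on_failure": True,         # integrates with alerting/monitoring
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_workflow",
    default_args=default_args,
    schedule_interval="@daily",
    start_date=datetime(2017, 1, 1),
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
score = BashOperator(task_id="score", bash_command="echo score", dag=dag)
extract >> score  # dependencies are declared in code (DAGs as Code)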
57. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → Kinesis]
Each enterprise does a Kinesis batch put every second.
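
A hedged sketch of the producer side: buffer records and flush one batched PutRecords call per second. The stream name and record shape are placeholders; PutRecords accepts up to 500 records per call and is not all-or-nothing.

import json
import time

import boto3

kinesis = boto3.client("kinesis")
buffer = []  # filled elsewhere by the message producer


def flush():
    """Send the current buffer as one batched PutRecords call."""
    if not buffer:
        return
    records = [
        {"Data": json.dumps(m), "PartitionKey": m["enterprise_id"]}
        for m in buffer
    ]
    resp = kinesis.put_records(StreamName="raw-messages", Records=records)
    if resp["FailedRecordCount"]:
        pass  # re-enqueue just the failed entries for the next flush (omitted)
    del buffer[:]


while True:
    flush()
    time.sleep(1)  # "batch put every second"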
58. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → Kinesis → Scorers (ASG)]
An ASG of scorers is scaled up to one process per core per Kinesis shard.
59. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → Kinesis → Scorers (ASG) → Kinesis]
Scorers apply the trust model and send scored messages downstream.
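
A hedged single-shard consumer sketch for one such scorer process. Stream and shard ids are placeholders, and the model/downstream calls are stubs; a production consumer would checkpoint (e.g., via the KCL) rather than start from LATEST.

import time

import boto3

kinesis = boto3.client("kinesis")


def apply_trust_model(data):
    return data  # stub: the real scorer applies the trust model here


def send_downstream(scored):
    pass  # stub: the real scorer puts scored messages onto another stream


it = kinesis.get_shard_iterator(
    StreamName="raw-messages",        # placeholder
    ShardId="shardId-000000000000",   # placeholder
    ShardIteratorType="LATEST",
)["ShardIterator"]

while True:
    resp = kinesis.get_records(ShardIterator=it, Limit=1000)
    for record in resp["Records"]:
        send_downstream(apply_trust_model(record["Data"]))
    it = resp["NextShardIterator"]
    time.sleep(0.2)  # stay under the 5 reads/sec/shard limit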
60. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB]
An ASG of importers is scaled up to rapidly import messages.
61-62. Use-Case : Message Scoring
[enterprise A, enterprise B, enterprise C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; a second Kinesis stream → Alerters (ASG) → Quarantine Email]
Imported messages are also consumed by the alerter.
63. Stream Processing Architecture
Component | Role | Details | Pros | Operability Model
S3 | Data Lake | All data stored in S3 via Kinesis Firehose | Scalable, Available, Performant, Serverless | Serverless
Kinesis | Messaging | Streaming transport modeled on Kafka | Scalable, Available, Serverless | Serverless
Lambda | General Processing | ASG replacement, except for Rails apps | Scalable, Available, Serverless | Serverless
ASG | General Processing | Used for importing, data cleansing, business logic | Scalable, Available, Managed | Managed
Spark | Data Science Processing | Model building | | We Operate
Airflow | Workflow Engine | Nightly model builds + some classic Ops cron workloads | Lightweight, DAGs as Code | We Operate
DB | Persistence for WebApp | Holds smaller subset of data needed for Web App | Rails + Postgres, 'nuff said | We Operate
ES + ElastiCache Redis | Persistence for WebApp | Aggregation + search moved from DB to ES; model-building queries moved to ElastiCache Redis | Faster, more accurate for aggregates; frees up headroom for DB (polyglot persistence) | Managed
66-67. What is Avro?
Avro is a self-describing serialization format that supports:
• primitive data types : int, long, boolean, float, string, bytes, etc.
• complex data types : records, arrays, unions, maps, enums, etc.
• many language bindings : Java, Scala, Python, Ruby, etc.
It is the most common format for storing structured Big Data at rest in HDFS, S3, Google Cloud Storage, etc. Supports Schema Evolution!
69-73. Why is Avro Useful?
Agari is an IoT company! Agari Sensors, deployed at customer sites, stream data to Agari's Cloud SAAS in AWS. Data is sent via Kinesis!
At any point in time, customers run different versions of the Agari Sensor (enterprise A : v1, enterprise B : v2, enterprise C : v3), and these Sensors might send different format versions of the data, while the Agari SAAS reads with the current version (v4).
Avro allows Agari to seamlessly handle the different IoT data format versions:

datum_reader = DatumReader( writers_schema = writers_schema,
                            readers_schema = readers_schema)

Requirements:
• Schemas are backward-compatible
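
A minimal end-to-end sketch of the schema-resolution call above, using the Python avro bindings: a v1 writer schema (no favorite_color) is read with a newer reader schema, which fills the missing field from its default. The schemas are illustrative, patterned on the User example later in the deck; keyword names follow the deck's binding and may differ across avro package versions.

import io

import avro.io
import avro.schema

writers_schema = avro.schema.parse("""
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"}]}
""")

readers_schema = avro.schema.parse("""
{"namespace": "agari", "type": "record", "name": "User",
 "fields": [{"name": "name", "type": "string"},
            {"name": "favorite_color", "type": ["null", "string"],
             "default": null}]}
""")

# Encode a v1 record ...
buf = io.BytesIO()
avro.io.DatumWriter(writers_schema).write(
    {"name": "alice"}, avro.io.BinaryEncoder(buf))

# ... and decode it with the newer reader schema.
buf.seek(0)
datum_reader = avro.io.DatumReader(
    writers_schema=writers_schema, readers_schema=readers_schema)
user = datum_reader.read(avro.io.BinaryDecoder(buf))
print(user)  # {'name': 'alice', 'favorite_color': None}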
74-75. Why is Avro Useful?
Avro Everywhere! Avro is so useful that we don't just use it to communicate between our Sensors & our SAAS infrastructure; we also use it as the common data-interchange format between all services (streaming & batch) within our AWS deployment.
[Agari SAAS in AWS : services S1, S2, S3 ↔ S3 ↔ Spark]
Good language bindings : Data Pipelines services are written in Java, Ruby, & Python.
77. Avro Schema Example
{"namespace": "agari",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
Complex type (record); schema name : User; 3 fields in the record : 1 required, 2 optional.
78. Avro Schema Data File Example
In an Avro data file, the schema above appears once in the header, followed by the data blocks (x 1,000,000,000): the schema is ~0.0001% of the file, the data ~99.999%.
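
A sketch of this "schema once per file" property using the Python avro bindings: DataFileWriter embeds the User schema a single time in the header, no matter how many records follow. Assumes the schema above has been saved to user.avsc.

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("user.avsc").read())  # the User schema above

writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
for i in range(1000):
    writer.append({"name": "user-%d" % i,
                   "favorite_number": i,
                   "favorite_color": None})
writer.close()

# Readers don't need the schema up front; it's read from the file header.
reader = DataFileReader(open("users.avro", "rb"), DatumReader())
for user in reader:
    pass  # each record is decoded with the schema embedded in the header
reader.close()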
79. Avro Schema Streaming Example
In streaming, each message is a small binary data block; shipping the schema with every message would invert the file ratio: schema ~99%, data ~1% of each message.
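
A short sketch of why this matters: encoding one record with a raw BinaryEncoder produces only the data bytes, so the schema must travel some other way - either inline with each message (the ~99% overhead above) or referenced by id, which is what a schema registry enables. Uses the same illustrative user.avsc as before.

import io

import avro.io
import avro.schema

schema = avro.schema.parse(open("user.avsc").read())  # User schema above

buf = io.BytesIO()
avro.io.DatumWriter(schema).write(
    {"name": "alice", "favorite_number": 7, "favorite_color": None},
    avro.io.BinaryEncoder(buf),
)
payload = buf.getvalue()  # schema-less bytes; pair with a schema id instead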
88-89. Avro Schema Registry
[enterprise A, enterprise B, enterprise C → Kinesis → Scorers (ASG) → Kinesis → Importers (ASG) → DB; Kinesis → Alerters (ASG); each consumer (SR) resolves writer schemas via the Avro Schema Registry]
Imported messages are also consumed by the alerter.
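
A hypothetical sketch of the registry lookup each consumer (SR) makes: messages carry a small schema id, and the writer's schema is fetched once and cached. The URL and endpoint shape assume a Confluent-style REST registry, not Agari's actual service.

import avro.schema
import requests

REGISTRY_URL = "https://schema-registry.example.com"  # placeholder
_cache = {}


def writers_schema_for(schema_id):
    """Fetch (and cache) the writer schema registered under schema_id."""
    if schema_id not in _cache:
        resp = requests.get("%s/schemas/ids/%d" % (REGISTRY_URL, schema_id))
        resp.raise_for_status()
        _cache[schema_id] = avro.schema.parse(resp.json()["schema"])
    return _cache[schema_id]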
90. Acknowledgments
None of this work would be possible without the essential contributions of the team below:
Vidur Apparao, Stephen Cattaneo, Jon Chase, Andrew Flury, William Forrester, Chris Haag, Chris Buchanan, Neil Chapin, Wil Collins, Don Spencer, Scot Kennedy, Natia Chachkhiani, Patrick Cockwell, Kevin Mandich, Gabriel Ortiz, Jacob Rideout, Josh Yang, Julian Mehnle, Gabriel Poon, Spencer Sun, Nathan Bryant