This document summarizes a presentation on real-time streaming data on AWS. It discusses Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. The presentation covers an overview of streaming vs batch processing, common streaming data use cases and design patterns, a deep dive on Amazon Kinesis, examples of ingesting and processing streaming data, and a case study of how Sizmek uses these services for their real-time analytics needs.
Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
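To make the EMRFS point concrete, here is a minimal PySpark sketch of querying data directly in Amazon S3 from Spark on EMR; the bucket, path, and column names are hypothetical.

```python
# Minimal PySpark sketch: ad-hoc querying of data directly in Amazon S3.
# On EMR, s3:// paths resolve through EMRFS, so no copy into HDFS is needed.
# Bucket, path, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emrfs-adhoc-query").getOrCreate()

events = spark.read.json("s3://example-bucket/clickstream/2016/")

# Top pages by event count, computed interactively.
(events.groupBy("page")
       .count()
       .orderBy("count", ascending=False)
       .show(10))
```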
Build Real-Time Applications with Databricks Streaming (Databricks)
This document discusses using Databricks, Spark, and Power BI for real-time data streaming. It describes a use case of a fire department needing real-time reporting of equipment locations, personnel statuses, and active incidents. The solution involves ingesting event data using Azure Event Hubs, processing the stream using Databricks and Spark Structured Streaming, storing the results in Delta Lake, and visualizing the data in Power BI dashboards. It then demonstrates the architecture by walking through creating Delta tables, streaming from Event Hubs to Delta Lake, and running a sample event simulator.
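As a rough illustration of the Event Hubs-to-Delta Lake step described above, here is a hedged PySpark Structured Streaming sketch. It assumes the azure-eventhubs-spark connector is attached to the cluster; the connection string and paths are placeholders.

```python
# Hedged sketch: stream events from Azure Event Hubs into a Delta table.
# Assumes the azure-eventhubs-spark connector is installed on the cluster;
# the connection string and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-to-delta").getOrCreate()

conn = "Endpoint=sb://example.servicebus.windows.net/;..."  # placeholder
eh_conf = {
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn)
}

# The Event Hubs source delivers the payload in a binary `body` column.
events = (spark.readStream.format("eventhubs").options(**eh_conf).load()
               .selectExpr("CAST(body AS STRING) AS json", "enqueuedTime"))

(events.writeStream
       .format("delta")
       .outputMode("append")
       .option("checkpointLocation", "/delta/_checkpoints/events")
       .start("/delta/events"))
```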
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... (Flink Forward)
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by Jeff Chao
This document provides an overview of AWS Lake Formation and related services for building a secure data lake. It discusses how Lake Formation provides a centralized management layer for data ingestion, cleaning, security and access. It also describes how Lake Formation integrates with services like AWS Glue, Amazon S3 and ML transforms to simplify and automate many data lake tasks. Finally, it provides an example workflow for using Lake Formation to deduplicate data from various sources and grant secure access for analysis.
Redis is an open source, in-memory data store that delivers sub-millisecond response times enabling millions of requests per second to power real-time applications. It can be used as a fast database, cache, message broker, and queue. Amazon ElastiCache delivers the ease-of-use and power of Redis along with the availability, reliability, scalability, security, and performance suitable for the most demanding applications. We’ll take a close look at Redis and how to use it to power different use cases.
Speaker: Samir Karande - Sr. Manager, Solutions Architecture, AWS
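To ground the "fast database, cache, message broker, and queue" claim above, here is a small redis-py sketch of the cache and queue patterns; the host is a placeholder for an ElastiCache Redis primary endpoint.

```python
# Sketch of the cache and queue patterns with redis-py.
# The host is a placeholder for an ElastiCache Redis primary endpoint.
import redis

r = redis.Redis(host="my-cluster.xxxxxx.use1.cache.amazonaws.com", port=6379)

# Cache: store a computed value with a 60-second TTL.
r.setex("user:42:profile", 60, '{"name": "Ada"}')
profile = r.get("user:42:profile")

# Queue: producers LPUSH work items; consumers BRPOP them.
r.lpush("jobs", "resize:image:1001")
job = r.brpop("jobs", timeout=5)  # blocks up to 5 seconds waiting for work
```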
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges in implementing Data Mesh systems and focus on the role open-source projects can play. Projects like Apache Spark can be a key part of a standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to use in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to make Data Mesh more accessible to engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted at architects, decision-makers, data engineers, and system designers.
Amazon S3 hosts trillions of objects and is used for storing a wide range of data, from system backups to digital media. In this presentation from the Amazon S3 Masterclass webinar, we explain the features of Amazon S3, from static website hosting through server-side encryption to Amazon Glacier integration. The webinar dives deep into the feature set of Amazon S3 to give a rounded overview of its capabilities, looking at common use cases, APIs, and best practices.
See a recording of this video here on YouTube: http://youtu.be/VC0k-noNwOU
Check out future webinars in the Masterclass series here: http://aws.amazon.com/campaigns/emea/masterclass/
View the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
Today’s organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store all of your data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data.
In this webinar, you will discover how AWS gives you fast access to flexible and low-cost IT resources, so you can rapidly scale and build your data lake that can power any kind of analytics such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity and variety of data.
Learning Objectives:
• Discover how you can rapidly scale and build your data lake with AWS.
• Explore the key pillars behind a successful data lake implementation.
• Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake.
• Learn about the recently launched AWS services, Amazon Athena and Amazon Redshift Spectrum, that let customers query the data lake directly (a query sketch follows below).
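As a sketch of that last objective, the snippet below submits an Athena query against a table in the data lake via boto3; the database, table, and result-bucket names are hypothetical.

```python
# Hedged sketch: query data in an S3 data lake with Amazon Athena via boto3.
# Database, table, and result-bucket names are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion
```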
The Heart of the Data Mesh Beats in Real-Time with Apache Kafka (Kai Wähner)
If there were a buzzword of the hour, it would certainly be "data mesh"! This new architectural paradigm unlocks analytic data at scale and enables rapid access to an ever-growing number of distributed domain datasets for various usage scenarios.
As such, the data mesh addresses the most common weaknesses of the traditional centralized data lake or data platform architecture. And the heart of a data mesh infrastructure must be real-time, decoupled, reliable, and scalable.
This presentation explores how Apache Kafka, as an open and scalable decentralized real-time platform, can be the basis of a data mesh infrastructure and - complemented by many other data platforms like a data warehouse, data lake, and lakehouse - solve real business problems.
There is no silver bullet or single technology/product/cloud service for implementing a data mesh. The key outcome of a data mesh architecture is the ability to build data products with the right tool for the job.
A good data mesh combines data streaming technology like Apache Kafka or Confluent Cloud with cloud-native data warehouse and data lake architectures from Snowflake, Databricks, Google BigQuery, et al.
Azure data analytics platform - A reference architecture (Rajesh Kumar)
This document provides an overview of Azure data analytics architecture using the Lambda architecture pattern. It covers Azure data and services, including ingestion, storage, processing, analysis and interaction services. It provides a brief overview of the Lambda architecture including the batch layer for pre-computed views, speed layer for real-time views, and serving layer. It also discusses Azure data distribution, SQL Data Warehouse architecture and design best practices, and data modeling guidance.
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion mapping, and job scheduling so you can spend more of your time querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
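The "automate discovering your data" step can be sketched with boto3; the crawler, IAM role, database, and bucket names here are hypothetical.

```python
# Hedged sketch: create and start a Glue crawler that catalogs data in S3.
# Crawler, IAM role, database, and bucket names are hypothetical.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="sales-crawler",
    Role="AWSGlueServiceRole-example",
    DatabaseName="sales",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
)
glue.start_crawler(Name="sales-crawler")  # tables land in the Glue Data Catalog
```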
Object Storage 1: The Fundamentals of Objects and Object Storage (Hitachi Vantara)
In part 1 of 3, objects and object storage are defined, their key attributes are identified, and the most common use cases for object storage are described. Join Jeff Lundberg, senior product marketing manager at Hitachi Data Systems, to learn the fundamentals of object storage and get answers to your questions. View this WebTech to learn: what makes an object; the difference between block, file, and object storage; and the key attributes and uses of object store solutions. For more information on object storage, please view our white paper: http://www.hds.com/assets/pdf/hitachi-white-paper-introduction-to-object-storage-and-hcp.pdf
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes, in the style sketched after this entry.
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
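For a feel of the generated code mentioned above, here is a sketch in the style of a Glue ETL script, reading a cataloged table and writing Parquet back to S3; the database, table, and path names are hypothetical.

```python
# Sketch in the style of an AWS Glue ETL script: read a cataloged table,
# write Parquet to S3. Database, table, and paths are hypothetical.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table a crawler discovered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="raw_orders")

# Write curated Parquet for querying with Athena or Redshift Spectrum.
glue_context.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet",
)
job.commit()
```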
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...) (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has in the last few years experienced unprecedented growth in popularity in both on-premises and cloud deployments over object stores, HDFS, NoSQL, and RDBMS data stores.
Speaker: Ivan Cheng, Solution Architect, AWS
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
The document discusses building a data lake on AWS. It describes various AWS services that can be used to ingest, store, transform, analyze and visualize data in the data lake. These services include Amazon S3 for storage, AWS Glue for ETL/data cataloging, AWS Lake Formation for governance, Amazon Athena/EMR for analytics and Amazon QuickSight for visualization. The document also covers data movement options from on-premises to the data lake and real-time streaming of data using services like Kinesis. Machine learning workloads can leverage Amazon SageMaker for training and deployment.
This document discusses implementing a data lake on AWS to securely store, categorize, and analyze all types of data in a centralized repository. It describes key attributes of a data lake like decoupled storage and compute, rapid ingestion and transformation, and schema on read. It then outlines various AWS services that can be used to build a data lake like S3, Athena, EMR, Redshift, Glue, and Kinesis. It provides examples of streaming IoT data into a data lake and running queries and analytics on the data.
The document provides an overview of the Databricks platform, which offers a unified environment for data engineering, analytics, and AI. It describes how Databricks addresses the complexity of managing data across siloed systems by providing a single "data lakehouse" platform where all data and analytics workloads can be run. Key features highlighted include Delta Lake for ACID transactions on data lakes, auto loader for streaming data ingestion, notebooks for interactive coding, and governance tools to securely share and catalog data and models.
This document discusses building a modern data analytics architecture on AWS. It provides an overview of AWS services that can be used for ingesting, processing, storing, and analyzing large volumes of data in both real-time and batch scenarios. These include services like Amazon S3, Kinesis, EMR, Redshift, Athena, Elasticsearch, and Glue for ingesting, storing, processing, and querying data. Architectures shown include real-time data pipelines, data lakes, and batch ETL/ELT processes. Performance, cost effectiveness, and scalability benefits of AWS services are highlighted.
Amazon Web Services gives you fast access to flexible and low-cost IT resources, so you can rapidly scale and build virtually any big data application, including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing, regardless of volume, velocity, and variety of data.
https://aws.amazon.com/webinars/anz-webinar-series/
Building Cloud-Native App Series - Part 3 of 11
Microservices Architecture Series
AWS Kinesis Data Streams
AWS Kinesis Firehose
AWS Kinesis Data Analytics
Apache Flink - Analytics
Securing data in hybrid environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. In this talk, we will discuss how companies can use tag-based policies in Apache Ranger to protect access to data both in on-premises environments and in AWS-based cloud environments. We will go into the details of how tag-based policies work and their integration with Apache Atlas and various services. We will also talk through how companies can leverage Ranger’s policies to anonymize or tokenize data while moving into the cloud and de-anonymize it dynamically using Apache Kafka, Apache Hive, Apache Spark, or plain old ETL using MapReduce. We will also deep dive into Ranger’s proposed integration with S3 and other cloud-native systems. We will wrap up with an end-to-end demo showing how tags and tag-based masking policies can be used to anonymize sensitive data, how tags are propagated within the system, and how sensitive data can be protected using tag-based policies.
Speakers
Don Bosco Durai, Chief Security Architect, Privacera
Madhan Neethiraj, Sr. Director of Engineering, Hortonworks
Training for AWS Solutions Architect at http://zekelabs.com/courses/amazon-web-services-training-bangalore/. This slide deck describes CloudWatch key concepts, workflow, dashboards, metrics, the CloudWatch agent, alarms, events, and logs.
___________________________________________________
zekeLabs is a technology training platform. We provide instructor-led corporate training and classroom training on industry-relevant, cutting-edge technologies like Big Data, Machine Learning, Natural Language Processing, Artificial Intelligence, Data Science, Amazon Web Services, DevOps, and Cloud Computing, and frameworks like Django, Spring, Ruby on Rails, Angular 2, and many more, to professionals.
Reach out to us at www.zekelabs.com or call us at +91 8095465880 or drop a mail at info@zekelabs.com
This document outlines an agenda for an AWS Cost Management workshop. The agenda includes introductions and sessions on AWS Cost Explorer, AWS Budgets, AWS Reservations, and AWS Cost & Usage Reports. It provides overviews of AWS cost management products and highlights recent features including budget redesigns, forecasting enhancements, and reserved instance management updates.
DAT302_Deep Dive on Amazon Relational Database Service (RDS) (Amazon Web Services)
Amazon RDS enables customers to launch an optimally configured, secure, and highly available database with just a few clicks. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business. Amazon RDS gives you six database engines to choose from, including Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and MariaDB. In this session, we take a closer look at the capabilities of the RDS service and review the latest features available. We do a deep dive into how RDS works and the best practices for achieving optimal performance, flexibility, and cost savings for your databases.
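The "few clicks" launch can equally be done through the API; a hedged boto3 sketch, where the identifier and credentials are placeholders.

```python
# Hedged sketch: launch a Multi-AZ PostgreSQL instance with boto3.
# Identifier and credentials are placeholders; store secrets securely in practice.
import boto3

rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="app-db",
    Engine="postgres",
    DBInstanceClass="db.m4.large",
    AllocatedStorage=100,
    MasterUsername="dbadmin",
    MasterUserPassword="change-me-please",
    MultiAZ=True,  # synchronous standby in another AZ for high availability
)
```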
According to the ITU, the Internet of Things is defined as a global infrastructure for the information society, enabling advanced services by interconnecting (physical and virtual) things based on existing and evolving interoperable information and communication technologies.
Such a phenomenal infrastructure demands strong skills and presents large opportunities for the AWS ecosystem. With our customers in mind, we will have AWS Principal Business Development Manager Mark Relph presenting IoT case studies and the AWS IoT Platform.
This document provides an overview of big data architectural patterns and best practices on AWS. It discusses challenges of big data and how to simplify big data processing. It covers ingestion, storage, analysis and visualization technologies to use as well as design patterns. Key technologies discussed include Amazon Kinesis, DynamoDB, S3, Redshift, EMR, Lambda and design approaches like decoupled data bus and using the right tool for each job.
Amazon Kinesis is a platform for streaming data ingestion, processing, and analytics on AWS. The presentation discusses three Amazon Kinesis services - Kinesis Streams, Kinesis Firehose, and Kinesis Analytics. It provides an overview of each service and examples of how customers use streaming data and these services for applications like IoT, online gaming, advertising, and financial services. It also includes a demo of building a serverless IoT analytics solution on AWS using these streaming data services.
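The producer side of such an IoT pipeline is small; a hedged boto3 sketch with a hypothetical stream name and payload.

```python
# Hedged sketch: put one IoT reading onto a Kinesis stream with boto3.
# Stream name and payload are hypothetical.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

reading = {"device_id": "sensor-7", "temp_c": 21.4}
kinesis.put_record(
    StreamName="iot-telemetry",
    Data=json.dumps(reading),
    PartitionKey=reading["device_id"],  # keeps each device's events in order on one shard
)
```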
(BDT306) How Hearst Publishing Manages Clickstream Analytics with AWS (Amazon Web Services)
Hearst Publishing uses Amazon Kinesis and Amazon EMR to process clickstream data from over 200 Hearst properties worldwide in near real-time. Originally, Hearst used Pig on EMR to transform and analyze clickstream data in batch mode, but it was too slow, with 15-minute latency. Hearst then migrated to Apache Spark Streaming on EMR to process data from Amazon Kinesis in real-time windows of 5 minutes or less, enabling faster insights. This allowed Hearst to power features like Buzzing that provide instant feedback on article engagement across their properties.
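A sketch of that pattern using the PySpark Streaming Kinesis integration of that era (Spark 1.x/2.x); the application name, stream name, and JSON field are hypothetical.

```python
# Hedged sketch of the Hearst-style pattern: consume a Kinesis stream with
# Spark Streaming and count article views over a sliding 5-minute window.
# App name, stream name, and the JSON field are hypothetical; KinesisUtils
# is the Spark 1.x/2.x integration.
import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="clickstream-windows")
ssc = StreamingContext(sc, 60)  # 1-minute micro-batches

clicks = KinesisUtils.createStream(
    ssc, "clickstream-app", "clickstream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, 60)

# 5-minute window, sliding every minute: near-real-time article engagement.
(clicks.map(lambda rec: (json.loads(rec)["article_id"], 1))
       .reduceByKeyAndWindow(lambda a, b: a + b, None, 300, 60)
       .pprint())

ssc.start()
ssc.awaitTermination()
```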
Using Amazon CloudWatch Events, AWS Lambda and Spark Streaming to Process EC2... (Amazon Web Services)
In this session we will demonstrate various techniques that allow you to easily ingest and analyze heterogeneous log sources on AWS using Amazon Elasticsearch Service & Amazon Kinesis Firehose.
DevOps on AWS: Deep Dive on Continuous Delivery and the AWS Developer Tools (Amazon Web Services)
Today’s cutting-edge companies have software release cycles measured in days instead of months. This agility is enabled by the DevOps practice of continuous delivery, which automates building, testing, and deploying all code changes. This automation helps you catch bugs sooner and accelerates developer productivity. In this session, we’ll share the processes that Amazon’s engineers use to practice DevOps and discuss how you can bring these processes to your company by using a new set of AWS tools (AWS CodePipeline and AWS CodeDeploy). These services were inspired by Amazon's own internal developer tools and DevOps culture.
Automating Software Deployments with AWS CodeDeploy by Matthew Trescot, Manag... (Amazon Web Services)
This document discusses AWS CodeDeploy, a service that automates software deployments to EC2 instances and on-premises servers. It provides an overview of CodeDeploy's key concepts including applications, deployment groups, deployment configurations, and hooks. It also shows examples of how CodeDeploy can be used for automated deployments across development, test, and production environments. The document suggests additional features like CloudFormation support and integration with CI/CD tools.
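To illustrate the application, deployment-group, and revision concepts, here is a hedged boto3 sketch that triggers a deployment from an S3 revision; all names are hypothetical.

```python
# Hedged sketch: trigger a CodeDeploy deployment of an S3 revision.
# Application, group, and bucket names are hypothetical.
import boto3

codedeploy = boto3.client("codedeploy", region_name="us-east-1")

codedeploy.create_deployment(
    applicationName="web-app",
    deploymentGroupName="production",  # the instances tagged for production
    revision={
        "revisionType": "S3",
        "s3Location": {
            "bucket": "example-artifacts",
            "key": "web-app.zip",
            "bundleType": "zip",
        },
    },
)
```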
AWS re:Invent 2016: Securing Container-Based Applications (CON402) (Amazon Web Services)
This document discusses securing container-based applications. It covers container and OS security best practices like using Linux namespaces and cgroups for isolation, reducing the container attack surface, and hardening container images. It also discusses securing the container lifecycle through vulnerability scanning, configuration governance with Amazon ECS, and using secrets management. Finally, it shows how to automate security deployments through the CI/CD pipeline and tools like CloudFormation and CodeDeploy.
AWS Summit Auckland - Smaller is Better - Microservices on AWS (Amazon Web Services)
The document provides an overview of microservices including:
- Defining microservices and comparing them to SOA
- The benefits of a microservices architecture like improved agility, scalability, and innovation
- Common microservice patterns on AWS like serverless and container-based services
- How microservices can address business problems like long feature cycles and technical problems like lack of testability
- A customer story of how MYOB adopted microservices on AWS to support their online products
- Tips for evolving architectures including focusing on automation, organizational structure, and individual service design.
AWS Media and Entertainment - Broadcast and OTT Workloads - Toronto (Amazon Web Services)
In this presentation, we introduce AWS for broadcast and OTT workloads: references, customer stories, and details of broadcast and OTT workloads implemented on the AWS Cloud.
Originally presented at AWS Toronto by Bhavik Vyas.
AWS re:Invent 2016: Large-Scale, Cloud-Based Analysis of Cancer Genomes: Less... (Amazon Web Services)
The PanCancer Analysis of Whole Genomes (PCAWG) project is a large-scale, highly distributed research collaboration designed to identify common patterns of mutations across 2,800 cancer genomes. The use of public and private clouds were instrumental in analyzing this dataset using current best practice containerized pipelines. This session describes the technical infrastructure built for the project, how we leveraged cloud environments to perform the “core” analysis, and the lessons learned along the way.
Data Processing without Servers | AWS Public Sector Summit 2016 (Amazon Web Services)
Process your data immediately after ingest or upload without needing to manage or maintain infrastructure, while achieving cost-optimized scaling that avoids idle compute. Come learn how AWS Lambda can be used to process sensor data as it is produced in real time. This session will feature two demos. The first will show how to use AWS Lambda to automatically process Landsat satellite imagery as it is produced. Development Seed will then introduce how they process geospatial OpenStreetMap data as it is created in real time by contributors around the world. AWS Lambda provides a low-cost and efficient solution for Development Seed by scaling from little activity to thousands of commits per hour during sponsored "mapathons."
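The upload-triggered pattern described above boils down to a small handler; a sketch, assuming an S3 object-created trigger is configured on the bucket.

```python
# Hedged sketch: a Lambda handler invoked by S3 object-created events,
# e.g. for newly landed satellite imagery. The trigger is configured on
# the S3 side; processing is left as a stub.
import urllib.parse

def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        print(f"new object: s3://{bucket}/{key}")
        # ... fetch the object with boto3 and process the scene here ...
```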
This document discusses building a serverless data pipeline using AWS Lambda and other AWS managed services like DynamoDB, Kinesis Firehose and S3. It provides steps to create a DynamoDB table with streams enabled, a Lambda function to read from the DynamoDB streams and write to Kinesis Firehose, and a Kinesis Firehose delivery stream to deliver data to S3. With these serverless components, data can be ingested and processed without having to provision or manage any servers.
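A minimal sketch of the Lambda step in that pipeline, forwarding DynamoDB stream records to a Firehose delivery stream; the delivery stream name is hypothetical.

```python
# Hedged sketch: forward DynamoDB stream records to Kinesis Firehose.
# The delivery stream name is hypothetical.
import json
import boto3

firehose = boto3.client("firehose")

def lambda_handler(event, context):
    records = [
        {"Data": json.dumps(rec["dynamodb"]["NewImage"]) + "\n"}
        for rec in event["Records"]
        if rec["eventName"] in ("INSERT", "MODIFY")
    ]
    if records:
        firehose.put_record_batch(
            DeliveryStreamName="table-to-s3",
            Records=records,  # Firehose buffers and delivers these to S3
        )
    return {"forwarded": len(records)}
```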
The document discusses integrating Hadoop into the enterprise data infrastructure. It describes common uses of Hadoop including enabling new analytics by joining transactional data from databases with interaction data in Hadoop. The document outlines key aspects of integration like data import/export between Hadoop and existing data stores using tools like Sqoop, various ETL tools, and connecting business intelligence and analytics tools to Hadoop. Example architectures are shown integrating Hadoop with databases, data warehouses, and other systems.
This session is recommended for anyone considering using the AWS cloud to augment their current capabilities. Adoption of cloud computing provides access to the benefits of new deployment models with significant cost and agility benefits. But how can the cloud benefit existing government organizations that have invested large amounts of resources in existing on-premises technologies? This session outlines several key factors to consider from the point of view of the large-scale IT shop stakeholder. Because each organization has its unique set of challenges in cloud adoption, this session compares some of the opportunities and risks of several hybrid cloud use-case models and then helps customers understand the cloud-native and third-party vendor options available that bridge the gap to the cloud for large-scale government environments.
Speaker: Craig Roach, Solutions Architect, Amazon Web Services
Alexander Aldev - Co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable achievements in the past include managing the launch of the first triple-play cable service in Bulgaria and designing the architecture and interfaces from legacy systems of DHL Global Forwarding's data warehouse. He has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
Ivo Mitov – Co-founder of Data Fusion Bulgaria, software consultant in the area of EAI and Big Data
"Real-time analytics with HBase" is focused on the usage of coprocessor framework in HBase for event complex processing and simple analytics. The presentation will describe monitoring use case in the context of complex SOA environment.
Big Data: Improving capacity utilization of transport companies (Data Science Society)
Asparuh Koev is a successful serial entrepreneur. Currently, he is a CEO of Transmetrics, a solution for cargo transport companies that uses Big Data and predictive analytics. Asparuh holds a Bachelor’s Degree in Computer Science from the American University in Bulgaria and an MBA degree from the Vlerick Business School in Belgium.
“Big Data: Improving capacity utilization of transport companies” will explore the practical benefits, IT tools, and challenges of implementing a big data solution in a traditional industry (cargo transport), using a predictive analytics project done for Speedy as a showcase.
A presentation on the integration of real-time data into the cloud, with significant potential in the areas of industrial IT, real-time sensor information processing, and smart grids applied to various vertical industries. This is related to my blog post at www.cloudshoring.in
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics... (Amazon Web Services)
It is becoming increasingly important to analyze real time streaming data. It allows organizations to remain competitive by uncovering relevant, actionable insights. AWS makes it easy to capture, store, and analyze real-time streaming data.
In this webinar, we will guide you through some of the proven architectures for processing streaming data, using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming on Amazon Elastic MapReduce (EMR). We will then talk about common use cases and best practices for real-time data analysis on AWS.
Learning Objectives:
Understand how you can analyze real-time data streams using Amazon Kinesis, AWS Lambda, and Spark running on Amazon EMR
Learn use cases and best practices for streaming data applications on AWS
Deep dive and best practices on real time streaming applications nyc-loft_oct... (Amazon Web Services)
This document provides an overview of real-time streaming data on AWS and best practices for using Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. It discusses ingesting streaming data using Kinesis Streams and Firehose; processing data with the Kinesis Client Library, Spark Streaming, and AWS Lambda; and integrating with data stores like S3, Redshift, and Elasticsearch. Example use cases from Sonos, publishers, and gaming companies are also presented.
BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases (Amazon Web Services)
In this session, you will learn best practices for implementing simple to advanced real-time streaming data use cases on AWS. First, we’ll review decision points for near-real-time versus real-time scenarios. Next, we will take a look at streaming data architecture patterns that include Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, Spark Streaming on Amazon EMR, and other open-source libraries. Finally, we will dive deep into the most common of these patterns and cover design and implementation considerations.
Learn best practices for building a real-time streaming data architecture on AWS with Spark Streaming, Amazon Kinesis, and Amazon Elastic MapReduce (EMR). Get a closer look at how to ingest streaming data scalably and durably from data producers like mobile devices, servers, and even web browsers, and design a stream processing application with minimal data duplication and exactly-once processing.
Presented by: Guy Ernest, Principal Business Development Manager, Amazon Web Services
Customer Guest: Harry Koch, Solutions Architecture, Philips
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR, and Redshift can be used for data ingestion, processing, and storage to enable real-time insights and analysis into customer, operational, and machine-generated data and log files. We'll explore system requirements and design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
In this session, we will give a demonstration of air traffic control and defense using real-time processing. We will cover best practices for data ingestion, storage, processing, and visualization using AWS services such as Kinesis, DynamoDB, Lambda, Redshift, QuickSight, and Amazon Machine Learning.
Building Big Data Applications with Serverless Architectures - June 2017 AWS... (Amazon Web Services)
Learning Objectives:
- Use cases and best practices for serverless big data applications
- Leverage AWS technologies such as AWS Lambda and Amazon Kinesis
- Learn to perform ETL, event processing, ad-hoc analysis, real-time processing, and MapReduce with serverless
Building data processing applications is challenging and time-consuming, and often requires specialized expertise to deploy and operate. With serverless computing, you can perform real-time stream processing of multiple data types without needing to spin up servers or install software, allowing you to deploy big data applications quickly and more easily. Come learn how you can use AWS Lambda with Amazon Kinesis to analyze streaming data in real-time and then store the results in a managed NoSQL database such as Amazon DynamoDB. You’ll learn tips and tricks for doing in-line processing, data manipulation, and even distributed MapReduce on large data sets.
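The Kinesis-to-DynamoDB step described above fits in a few lines; a hedged sketch with a hypothetical table name.

```python
# Hedged sketch: a Lambda handler that decodes Kinesis records and stores
# them in DynamoDB. The table name is hypothetical.
import base64
import decimal
import json
import boto3

table = boto3.resource("dynamodb").Table("stream-results")

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        # DynamoDB requires Decimal rather than float for numeric attributes.
        item = json.loads(payload, parse_float=decimal.Decimal)
        table.put_item(Item=item)  # upsert keyed on the table's primary key
    return {"processed": len(event["Records"])}
```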
- Amazon Kinesis Data Streams and Amazon Managed Streaming for Kafka (MSK) are services for stream storage and processing. Kinesis Data Streams uses shards that can scale out, while MSK uses Kafka brokers that require more manual scaling (see the producer sketch after this list).
- Key metrics to monitor for stream processing include request/response queues, produce/consume rates, network traffic, and disk usage. Monitoring helps identify bottlenecks or imbalances.
- Common streaming architectures include using Kinesis/MSK as an event bus, log aggregation from IoT devices, event sourcing with CQRS, and real-time analytics with Kinesis Analytics. These patterns are useful for building real-time applications and analytics.
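The shard-versus-broker contrast shows up directly in producer code; a hedged side-by-side sketch, where the stream, topic, and broker addresses are hypothetical and the Kafka half uses the kafka-python client.

```python
# Hedged sketch: produce the same event to Kinesis Data Streams and to
# MSK (Kafka). Stream, topic, and broker addresses are hypothetical.
import json
import boto3
from kafka import KafkaProducer  # kafka-python client

event = {"order_id": "o-123", "amount": 42}

# Kinesis: the partition key maps the record to a shard.
boto3.client("kinesis").put_record(
    StreamName="orders",
    Data=json.dumps(event),
    PartitionKey=event["order_id"],
)

# MSK/Kafka: the key maps the record to a topic partition on the brokers.
producer = KafkaProducer(
    bootstrap_servers=["b-1.example.kafka.us-east-1.amazonaws.com:9092"])
producer.send("orders",
              key=event["order_id"].encode(),
              value=json.dumps(event).encode())
producer.flush()
```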
Amazon Kinesis is a platform for processing and analyzing real-time streaming data. It consists of three services: Amazon Kinesis Streams for building custom streaming applications, Amazon Kinesis Firehose for loading streaming data into data stores like S3 and Redshift with no administration required, and Amazon Kinesis Analytics for analyzing streaming data using SQL. The document discusses these services and provides an example of how Hearst Publishing uses them to analyze clickstream data across their properties to gain insights and promote popular content.
Choose Right Stream Storage: Amazon Kinesis Data Streams vs MSKSungmin Kim
This presentation compares Amazon Kinesis Data Streams to Managed Streaming for Kafka (MSK) in both architectural perspective and operational perspective. In addition, it shows common architectural patterns: (1) Data Hub: Event-Bus, (2) Log Aggregation, (3) IoT, (4) Event sourcing and CQRS.
Getting Started with Amazon Kinesis | AWS Public Sector Summit 2016Amazon Web Services
Amazon Kinesis provides services for you to work with streaming data on AWS. Learn how to load streaming data continuously and cost-effectively to Amazon S3 and Amazon Redshift using Amazon Kinesis Firehose without writing custom stream processing code. Get an introduction to building custom stream processing applications with Amazon Kinesis Streams for specialized needs.
Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.
Real-time streaming data analysis is increasingly important. It enables organizations to stay competitive by discovering insights relevant to decision making. AWS makes it simple to capture, store, and analyze streaming data in real time. In this session, we will guide you through some of the reference architectures for streaming data processing, using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming on Amazon EMR. We will then discuss common use cases and best practices for real-time data analytics on AWS.
https://aws.amazon.com/pt/big-data/
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017Amazon Web Services
Real-time streaming analytics has become popular across many verticals and use cases. In AdTech, gaming, financial services, and IoT, AWS customers are leveraging the Amazon Kinesis platform to ingest billions of events every day and process them in real time. In this session, we will discuss Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics. We will show best practices and design patterns for integrating the Amazon Kinesis platform with other services like Amazon EMR, Redshift, Amazon Elasticsearch, and AWS Lambda, as well as third-party connectors like Storm, Spark, and more.
Working with big volumes of data is a complicated task, but it's even harder if you have to do everything in real time and try to figure it all out yourself. Over the past decades many open-source projects helped solve problems within the data analytics lifecycle around ingestion, storage, processing and visualisation of data. This session will use practical examples to discuss architectural best practices and lessons learned when solving real-time analytics and data visualisation decision-making problems with open-source at scale with the power of Amazon Web Services. It furthermore dives into a demo, using source code from the AWS Labs to visualise live data streams at scale.
Olivier Klein, Solutions Architect, Amazon Web Services, Greater China
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi...Amazon Web Services
Batch querying and reporting is no longer enough for many organizations. Reducing time to insight – the time it takes to turn data into actionable insights – is becoming increasingly important to remain competitive. That’s why organizations are quickly evolving their data applications to support a broader set of real-time analytic use cases.
In this webinar, we will review some of the common use cases for real-time analytics such as click-stream analysis, event data processing, and real-time analytics. We will show proven architectures for collecting, storing, and processing real-time data using a combination of AWS managed services, including Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon EMR, and AWS Lambda, as well open source tools, such as Apache Spark. Then, we will discuss common approaches and best practices to incorporate real-time analytics into your existing batch applications.
Learning Objectives:
• Understand how to incorporate real-time analytics into existing applications
• Best practices to combine batch with real-time data flows
• Learn common architectures and use cases for real-time analytics
The concept of big data is now familiar, but organizations still need to think carefully about how to apply it to their business and get the most out of it. Easily storing, analyzing, and visualizing valuable data is an important step toward gaining business insight.
This session introduces how to build simpler and faster big data analytics services using the various data analysis tools AWS provides, such as AWS Elastic MapReduce, Amazon Redshift, and Amazon Kinesis.
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017Amazon Web Services
1) The document discusses serverless real-time data processing using AWS Lambda and Amazon Kinesis.
2) It provides an example of how streaming data can be captured in an Amazon Kinesis stream and processed by AWS Lambda functions to output the results to databases or cloud services.
3) The document also discusses how Fannie Mae used a distributed computing approach with AWS Lambda to perform mortgage simulations, achieving a 3x performance increase over their existing process.
Serverless architecture can eliminate the need to provision and manage servers required to process files or streaming data in real time.
In this session, we will cover the fundamentals of using AWS Lambda to process data from sources such as Amazon DynamoDB Streams, Amazon Kinesis, and Amazon S3. We will walk through sample use cases for real-time data processing and discuss best practices on using these services together. We will then demonstrate how to set up a real-time stream processing solution using just Amazon Kinesis and AWS Lambda, all without the need to run or manage servers.
Similar to Deep Dive and Best Practices for Real Time Streaming Applications
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
Forecasting is an important process for many companies and is used in many areas to accurately predict the growth and distribution of a product, the resources needed on production lines, financial projections, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session, we will show how to pre-process data containing a temporal component and then use an algorithm that, based on the type of data analyzed, produces an accurate forecast.
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
The variety and volume of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can appear complex: building large-scale big data clusters looks like an investment accessible only to established companies. But the elasticity of the cloud, and serverless services in particular, let us break through these limits.
We will therefore see how to develop big data applications quickly, without worrying about infrastructure, devoting all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session, we will present the main features of the service and how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. In that time, we learned how changing our approach to application development let us dramatically increase agility and release velocity, and ultimately allowed us to build more reliable and scalable applications. In this session, we will explain how we define modern applications and how building modern apps affects not only application architecture but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
Container adoption keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, yielding average savings of 70% compared to On-Demand Instances. In this session, we will explore the characteristics of Spot Instances and how they can easily be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us the question – how to monetise Open APIs, simplify Fintech integrations and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to Open Finance marketplace presentation on October 20th.
Event Agenda :
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and build the differentiating elements of your offering.
Focusing on machine learning technologies, we will see how to select the artificial intelligence services offered by AWS and, including through a demo, how to build custom machine learning models using SageMaker Studio.
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
With the traditional approach to IT, implementing DevOps techniques was difficult for many years; until now they have often involved manual activities, occasionally leading to application downtime and interrupting user operations. With the advent of the cloud, DevOps techniques are within everyone's reach at low cost for any kind of workload, guaranteeing greater system reliability and yielding significant improvements in business continuity.
AWS provides AWS OpsWorks as a configuration management tool that aims to automate and simplify the management and deployment of EC2 instances through Chef and Puppet workloads.
Learn how to leverage AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we will discuss options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and running Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis using artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar, we will explore the possibilities offered by AWS services for applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event on Wednesday, October 14, from 12:00 to 13:00, dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in VMware vSphere®-based cloud environments and access a broad range of AWS services, fully exploiting the potential of the AWS cloud while protecting existing VMware investments.
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
Many companies today build applications with ledger functionality, for example to verify the history of credits and debits in banking transactions or to track the supply chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB removes the need to build custom, complex systems by providing a fully managed, serverless ledger database.
In this session, we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for delivering a great user experience. In this session, we will learn how to address modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dive into several scenarios, understanding how AppSync can help solve these use cases by building modern APIs with real-time and offline data update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can introduce complexity during application modernization and refactoring, along with performance risks when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads, accelerating the transformation to the cloud; they dive into the architecture and demonstrate how to fully exploit the potential of VMware Cloud™ on AWS.
1) The document discusses building a minimum viable product (MVP) using Amazon Web Services (AWS).
2) It provides an example of an MVP for an omni-channel messenger platform that was built from 2017 to connect ecommerce stores to customers via web chat, Facebook Messenger, WhatsApp, and other channels.
3) The founder discusses how they started with an MVP in 2017 with 200 ecommerce stores in Hong Kong and Taiwan, and have since expanded to over 5000 clients across Southeast Asia using AWS for scaling.
This document discusses pitch decks and fundraising materials. It explains that venture capitalists will typically spend only 3 minutes and 44 seconds reviewing a pitch deck. Therefore, the deck needs to tell a compelling story to grab their attention. It also provides tips on tailoring different types of decks for different purposes, such as creating a concise 1-2 page teaser, a presentation deck for pitching in-person, and a more detailed read-only or fundraising deck. The document stresses the importance of including key information like the problem, solution, product, traction, market size, plans, team, and ask.
This document discusses building serverless web applications using AWS services like API Gateway, Lambda, DynamoDB, S3 and Amplify. It provides an overview of each service and how they can work together to create a scalable, secure and cost-effective serverless application stack without having to manage servers or infrastructure. Key services covered include API Gateway for hosting APIs, Lambda for backend logic, DynamoDB for database needs, S3 for static content, and Amplify for frontend hosting and continuous deployment.
This document provides tips for fundraising from startup founders Roland Yau and Sze Lok Chan. It discusses generating competition to create urgency for investors, fundraising in parallel rather than sequentially, having a clear fundraising narrative focused on what you do and why it's compelling, and prioritizing relationships with people over firms. It also notes how the pandemic has changed fundraising, with examples of deals done virtually during this time. The tips emphasize being fully prepared before fundraising and cultivating connections with investors in advance.
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
This document discusses Amazon's machine learning services for building conversational interfaces and extracting insights from unstructured text and audio. It describes Amazon Lex for creating chatbots, Amazon Comprehend for natural language processing tasks like entity extraction and sentiment analysis, and how they can be used together for applications like intelligent call centers and content analysis. Pre-trained APIs simplify adding machine learning to apps without requiring ML expertise.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session, we will present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
Conversational agents, or chatbots, are increasingly used to access all sorts of services using natural language. While open-domain chatbots - like ChatGPT - can converse on any topic, task-oriented chatbots - the focus of this paper - are designed for specific tasks, like booking a flight, obtaining customer support, or setting an appointment. Like any other software, task-oriented chatbots need to be properly tested, usually by defining and executing test scenarios (i.e., sequences of user-chatbot interactions). However, there is currently a lack of methods to quantify the completeness and strength of such test scenarios, which can lead to low-quality tests, and hence to buggy chatbots.
To fill this gap, we propose adapting mutation testing (MuT) for task-oriented chatbots. To this end, we introduce a set of mutation operators that emulate faults in chatbot designs, an architecture that enables MuT on chatbots built using heterogeneous technologies, and a practical realisation as an Eclipse plugin. Moreover, we evaluate the applicability, effectiveness and efficiency of our approach on open-source chatbots, with promising results.
QA or the Highway - Component Testing: Bridging the gap between frontend appl...zjhamm304
These are the slides for the presentation, "Component Testing: Bridging the gap between frontend applications" that was presented at QA or the Highway 2024 in Columbus, OH by Zachary Hamm.
"What does it really mean for your system to be available, or how to define w...Fwdays
We will talk about system monitoring from a few different angles. We will start by covering the basics, then discuss SLOs, how to define them, and why understanding the business well is crucial for success in this exercise.
ScyllaDB is making a major architecture shift. We’re moving from vNode replication to tablets – fragments of tables that are distributed independently, enabling dynamic data distribution and extreme elasticity. In this keynote, ScyllaDB co-founder and CTO Avi Kivity explains the reason for this shift, provides a look at the implementation and roadmap, and shares how this shift benefits ScyllaDB users.
Getting the Most Out of ScyllaDB Monitoring: ShareChat's TipsScyllaDB
ScyllaDB monitoring provides a lot of useful information. But sometimes it’s not easy to find the root of the problem if something is wrong or even estimate the remaining capacity by the load on the cluster. This talk shares our team's practical tips on: 1) How to find the root of the problem by metrics if ScyllaDB is slow 2) How to interpret the load and plan capacity for the future 3) Compaction strategies and how to choose the right one 4) Important metrics which aren’t available in the default monitoring setup.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
What is an RPA CoE? Session 2 – CoE RolesDianaGray10
In this session, we will review the players involved in the CoE and how each role impacts opportunities.
Topics covered:
• What roles are essential?
• What place in the automation journey does each role play?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge Capture & Transfer
"NATO Hackathon Winner: AI-Powered Drug Search", Taras KlobaFwdays
This is a session that details how PostgreSQL's features and Azure AI Services can be effectively used to significantly enhance the search functionality in any application.
In this session, we'll share insights on how we used PostgreSQL to facilitate precise searches across multiple fields in our mobile application. The techniques include using LIKE and ILIKE operators and integrating a trigram-based search to handle potential misspellings, thereby increasing the search accuracy.
We'll also discuss how the azure_ai extension on PostgreSQL databases in Azure and Azure AI Services were utilized to create vectors from user input, a feature beneficial when users wish to find specific items based on text prompts. While our application's case study involves a drug search, the techniques and principles shared in this session can be adapted to improve search functionality in a wide range of applications. Join us to learn how PostgreSQL and Azure AI can be harnessed to enhance your application's search capability.
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation F...AlexanderRichford
QR Secure: A Hybrid Approach Using Machine Learning and Security Validation Functions to Prevent Interaction with Malicious QR Codes.
Aim of the Study: The goal of this research was to develop a robust hybrid approach for identifying malicious and insecure URLs derived from QR codes, ensuring safe interactions.
This is achieved through:
Machine Learning Model: Predicts the likelihood of a URL being malicious.
Security Validation Functions: Ensures the derived URL has a valid certificate and proper URL format.
This innovative blend of technology aims to enhance cybersecurity measures and protect users from potential threats hidden within QR codes 🖥 🔒
This study was my first introduction to using ML which has shown me the immense potential of ML in creating more secure digital environments!
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how does this fancy AI technology get managed from an infrastructure operations point of view? Is it possible to apply our lovely cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could be beneficial for, or limiting to, your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working in practice.
Keywords: AI, Containers, Kubernetes, Cloud Native
Event Link: https://meine.doag.org/events/cloudland/2024/agenda/#agendaId.4211
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
What is an RPA CoE? Session 1 – CoE VisionDianaGray10
In the first session, we will review the organization's vision and how this has an impact on the COE Structure.
Topics covered:
• The role of a steering committee
• How do the organization’s priorities determine CoE Structure?
Speaker:
Chris Bolin, Senior Intelligent Automation Architect Anika Systems
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, which go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state of the art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. On a commodity server with a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Deep Dive and Best Practices for Real Time Streaming Applications
1. Real-time Streaming Data on AWS
Deep Dive & Best Practices Using Amazon Kinesis,
Spark Streaming, AWS Lambda, and Amazon EMR
Roy Ben-Alta, Sr. Business Development Manager, AWS
Orit Alul, Data and Analytics R&D Director, Sizmek
June 16, 2016
2. Agenda
Real-time streaming overview
Use cases and design patterns
Amazon Kinesis deep dive
Streaming data ingestion & Stream processing
Sizmek Case Study
Q&A
3. It’s All About the Pace
Batch processing:
• Hourly server logs
• Weekly or monthly bills
• Daily website clickstream
• Daily fraud reports
Stream processing:
• Real-time metrics
• Real-time spending alerts/caps
• Real-time clickstream analysis
• Real-time detection
4. Streaming Data Scenarios Across Verticals
Each vertical has scenarios for accelerated ingest-transform-load, continuous metrics generation, and responsive data analysis:
• Digital Ad Tech/Marketing – ingest: publisher and bidder data aggregation; metrics: advertising metrics like coverage, yield, and conversion; analysis: user engagement with ads, optimized bid/buy engines
• IoT – ingest: sensor and device telemetry data; metrics: operational metrics and dashboards; analysis: device operational intelligence and alerts
• Gaming – ingest: online data aggregation, e.g., top 10 players; metrics: massively multiplayer online game (MMOG) live dashboard; analysis: leaderboard generation, player-skill match
• Consumer Online – ingest: clickstream analytics; metrics: metrics like impressions and page views; analysis: recommendation engines, proactive care
5. Customer Use Cases
Sonos runs near-real-time streaming analytics on device data logs from their connected hi-fi audio equipment.
One of the biggest online real estate brokerages in the US built its Hot Homes feature.
Glu Mobile collects billions of gaming-event data points from millions of user devices in real time every single day.
The Nordstrom recommendation team built an online stylist using Amazon Kinesis Streams and AWS Lambda.
6. Streaming Data Challenges: Variety & Velocity
Sample records – a metering record, a common log entry, an MQTT record, and a syslog entry:
{ "payerId": "Joe", "productCode": "AmazonS3", "clientProductCode": "AmazonS3", "usageType": "Bandwidth", "operation": "PUT", "value": "22490", "timestamp": "1216674828" }
{ 127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 }
{ "SeattlePublicWater/Kinesis/123/Realtime" – 412309129140 }
{ <165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"] }
• Streaming data comes in different types and formats – metering records, logs, and sensor data; JSON, CSV, TSV
• Records can vary in size from a few bytes to kilobytes or megabytes
• High velocity and continuous processing
7. Two Main Processing Patterns
Stream processing (real time)
• Real-time response to events in data streams
Examples:
• Proactively detect hardware errors in device logs
• Notify when inventory drops below a threshold
• Fraud detection
Micro-batching (near real time)
• Near real-time operations on small batches of events in data streams
Examples:
• Aggregate and archive events
• Monitor performance SLAs
9. Amazon Kinesis: Streaming Data Made Easy
Services make it easy to capture, deliver, and process streams on AWS:
• Amazon Kinesis Streams – for technical developers; build your own custom applications that process or analyze streaming data
• Amazon Kinesis Firehose – for all developers and data scientists; easily load massive volumes of streaming data into S3, Amazon Redshift, and Amazon Elasticsearch
• Amazon Kinesis Analytics – for all developers and data scientists; easily analyze data streams using standard SQL queries (in preview)
10. Amazon Kinesis Firehose
Load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch.
• Zero administration: capture and deliver streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch without writing an application or managing infrastructure
• Direct-to-data-store integration: batch, compress, and encrypt streaming data for delivery into data destinations in as little as 60 seconds using simple configurations
• Seamless elasticity: seamlessly scales to match data throughput without intervention
Capture and submit streaming data to Firehose; Firehose loads the streaming data continuously into S3, Amazon Redshift, and Amazon Elasticsearch; analyze it using your favorite BI tools.
13. Amazon Kinesis Streams
Managed ability to capture and store data:
• Streams are made of shards
• Each shard ingests up to 1 MB/sec and 1,000 records/sec
• Each shard emits up to 2 MB/sec
• All data is stored for 24 hours by default; storage can be extended for up to 7 days
• Scale Kinesis streams using the scaling utility
• Replay data
14. Amazon Kinesis Firehose vs. Amazon Kinesis Streams
Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1-second processing latency, and a choice of stream processing frameworks.
Amazon Kinesis Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
16. Putting Data into Amazon Kinesis Streams
Determine your partition key strategy
• Managed buffer or streaming MapReduce job
• Ensure high cardinality for your shards
Provision adequate shards
• For ingress needs
• For the egress needs of all consuming applications: if more than two simultaneous applications
• Include headroom for catching up with data in the stream
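To make the provisioning guidance concrete, here is a minimal sketch of the shard arithmetic, assuming the per-shard limits quoted elsewhere in this deck (1 MB/sec and 1,000 records/sec in, 2 MB/sec out); the traffic figures in the example are illustrative only.

import math

def estimate_shards(ingress_mb_per_sec, records_per_sec,
                    consumers, headroom=1.25):
    # A shard accepts up to 1 MB/sec and 1,000 records/sec of writes,
    # and serves up to 2 MB/sec of reads shared by all consumers.
    by_ingress = ingress_mb_per_sec / 1.0
    by_records = records_per_sec / 1000.0
    by_egress = (ingress_mb_per_sec * consumers) / 2.0
    return math.ceil(max(by_ingress, by_records, by_egress) * headroom)

# Example: 10 MB/sec in, 12,000 records/sec, 3 consuming applications.
print(estimate_shards(10, 12000, 3))  # -> 19 shards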
17. Putting Data into Amazon Kinesis
Amazon Kinesis Agent – (supports pre-processing)
• http://docs.aws.amazon.com/firehose/latest/dev/writing-with-agents.html
Pre-batch before Puts for better efficiency
• Consider Flume, Fluentd as collectors/agents
• See https://github.com/awslabs/aws-fluent-plugin-kinesis
Make a tweak to your existing logging
• log4j appender option
• See https://github.com/awslabs/kinesis-log4j-appender
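As a sketch of the "pre-batch before Puts" advice, the snippet below buffers events and ships them with PutRecords, which accepts up to 500 records per call and amortizes per-request overhead. It assumes the boto3 SDK; the stream name and the "user_id" partition-key field are illustrative.

import json
import boto3

kinesis = boto3.client("kinesis")

def put_batched(stream_name, events, batch_size=500):
    # PutRecords accepts at most 500 records per request.
    for i in range(0, len(events), batch_size):
        batch = [{
            "Data": json.dumps(e).encode("utf-8"),
            "PartitionKey": str(e["user_id"]),  # assumed event field
        } for e in events[i:i + batch_size]]
        kinesis.put_records(StreamName=stream_name, Records=batch)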
18. Amazon Kinesis Producer Library
• Writes to one or more Amazon Kinesis streams with an automatic, configurable retry mechanism
• Collects records and uses PutRecords to write multiple records to multiple shards per request
• Aggregates user records to increase payload size and improve throughput
• Integrates seamlessly with the KCL to de-aggregate batched records
• Use the Amazon Kinesis Producer Library with AWS Lambda (New!)
• Submits Amazon CloudWatch metrics on your behalf to provide visibility into producer performance
19. Record Order and Multiple Shards
Unordered processing
• Randomize the partition key to distribute events over many shards and use multiple workers
Exact-order processing
• Control the partition key to ensure events are grouped into the same shard and read by the same worker
Need both? Use a global sequence number.
[Diagram: a producer obtains a global sequence number, writes to an unordered stream, and event metadata is fanned out into campaign-centric and fraud-inspection streams.]
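A minimal sketch of the two partition-key strategies above, assuming boto3; the stream name "orders-stream" and the "basket_id" field are hypothetical.

import json
import uuid
import boto3

kinesis = boto3.client("kinesis")

def put_unordered(event):
    # Random partition key: events spread evenly over all shards, so
    # many workers can process them in parallel (no ordering guarantee).
    kinesis.put_record(StreamName="orders-stream",
                       Data=json.dumps(event).encode("utf-8"),
                       PartitionKey=uuid.uuid4().hex)

def put_ordered(event):
    # Entity-derived partition key: all events for one basket land on
    # the same shard and are read in order by the same worker.
    kinesis.put_record(StreamName="orders-stream",
                       Data=json.dumps(event).encode("utf-8"),
                       PartitionKey=str(event["basket_id"]))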
20. Sample Code for Scaling Shards
java -cp KinesisScalingUtils.jar-complete.jar -Dstream-name=MyStream -Dscaling-action=scaleUp -Dcount=10 -Dregion=eu-west-1 ScalingClient
Options:
• stream-name – the name of the stream to be scaled
• scaling-action – the action to be taken to scale; must be one of "scaleUp", "scaleDown" or "resize"
• count – the number of shards by which to absolutely scale up or down, or to resize to
See https://github.com/awslabs/amazon-kinesis-scaling-utils
21. Amazon Kinesis Client Library
• Build Kinesis applications with the Kinesis Client Library (KCL)
• Open-source client library available for Java, Ruby, Python, and Node.js developers
• Deploy on your EC2 instances
• A KCL application includes three components:
1. Record Processor Factory – creates the record processor
2. Record Processor – the processing unit that processes data from a shard in Amazon Kinesis Streams
3. Worker – the processing unit that maps to each application instance
22. State Management with the Kinesis Client Library
• One record processor maps to one shard and processes data records from that shard
• One worker maps to one or more record processors
• Balances shard-worker associations when worker / instance counts change
• Balances shard-worker associations when shards split or merge
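For orientation, here is a deliberately minimal boto3 polling loop for a single shard. It is a sketch of the low-level work (iterators, positions, pacing) that the KCL and its DynamoDB lease table automate across many shards and workers; it is not how a production consumer should be written.

import time
import boto3

kinesis = boto3.client("kinesis")

def read_shard(stream_name, shard_id):
    # Start from the oldest available record; the KCL would instead
    # resume from the checkpoint stored in its DynamoDB lease table.
    it = kinesis.get_shard_iterator(
        StreamName=stream_name, ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON")["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=1000)
        for rec in resp["Records"]:
            print(rec["SequenceNumber"], rec["Data"])
        it = resp.get("NextShardIterator")
        time.sleep(1)  # pace reads; GetRecords is throttled per shard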
23. Other Options
• Third-party connectors (for example, Apache Storm, Splunk, and more)
• AWS IoT platform
• Amazon EMR with Apache Spark, Pig, or Hive
• AWS Lambda
24. Apache Spark and Amazon Kinesis Streams
Apache Spark is an in-memory analytics cluster that uses RDDs for fast processing.
Spark Streaming can read directly from an Amazon Kinesis stream.
Amazon software license linking – add the ASL dependency to your SBT/Maven project: artifactId = spark-streaming-kinesis-asl_2.10
Example: counting tweets on a sliding window (abbreviated):
KinesisUtils.createStream("twitter-stream")
.filter(_.getText.contains("Open-Source"))
.countByWindow(Seconds(5))
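The Scala snippet above is abbreviated. A fuller, still illustrative, PySpark equivalent follows, assuming the spark-streaming-kinesis-asl package is on the classpath; the application name, stream name, region, and checkpoint directory are placeholders.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="tweet-counter")
ssc = StreamingContext(sc, batchDuration=5)
ssc.checkpoint("/tmp/tweet-counter-ckpt")  # required for window operations

# The KCL application name ("tweet-counter") also names the DynamoDB
# table used for checkpointing shard positions.
lines = KinesisUtils.createStream(
    ssc, "tweet-counter", "twitter-stream",
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, checkpointInterval=5)

# Count matching tweets over a 60-second window sliding every 5 seconds.
counts = (lines.filter(lambda body: "Open-Source" in body)
               .countByWindow(60, 5))
counts.pprint()

ssc.start()
ssc.awaitTermination()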
26. Using Spark Streaming with Amazon Kinesis Streams
1. Use Spark 1.6+ with the EMRFS consistent view option if you use Amazon S3 as storage for Spark checkpoints
2. Amazon DynamoDB table name – make sure there is only one instance of the application running with Spark Streaming
3. Enable Spark-based checkpoints
4. Make the number of Amazon Kinesis receivers a multiple of the number of executors so they are load-balanced
5. Keep total processing time less than the batch interval
6. Use the number of executors and the number of cores per executor to optimize parallelism
7. Spark Streaming uses a default of 1 sec with the KCL
31. The Business Value of Buzzing@Hearst
Real-Time Reactions: instant feedback on articles from our audiences.
Promoting Popular Content Cross-Channel: incremental re-syndication of popular articles across properties (e.g., trending newspaper articles can be adopted by magazines).
Authentic Influence: informs Hearst editors so they write articles that are more relevant to our audiences.
Understanding Engagement: informs editors which channels our audiences use to read Hearst articles.
Incremental revenue: 25% more page views, 15% more visitors.
34. Sizmek – Who? What? How?
Who are we? The 2nd-largest ad management company in the world, operating globally in 50+ countries, with an open ad management platform.
Who are our customers? Advertising agencies and advertisers.
What is our value? We enable our clients to manage their cross-channel campaigns and find their relevant audience seamlessly. We built a fast, high-throughput infrastructure that enables real-time analytics and decision making.
35. Sizmek – Our Customers
Over 13,000 brands use our platform. Over 5,000 media agencies use our platform.
36. Data Processing at Sizmek
• 15B events per day
• 4 TB of data per day
• 150K events per second
• Volatile data traffic load – during the day/week/year and during special events (e.g., Black Friday)
• Increasing demand for real-time insights
40. Why Amazon Kinesis?
Apache Kafka -> Amazon Kinesis = managed service = no ops effort
• Integration with Apache Storm and AWS Lambda
• Integration with Amazon S3, Amazon Redshift, and other AWS services
• Meets our throughput needs
41. Tip 1: Amazon Kinesis Apache Storm Connector
When a bolt has tried to process a message and reached the maximum number of retries, insert the unprocessed message back into the stream as a new message.
42. Tip 2: putRecord = more shards!
Requirements:
• Apache Storm requires processing event by event
• Average event size is 1 KB
• We must use the putRecord API
Testing results:
• We reached 75% of the Amazon Kinesis maximum throughput limit
• 300–500 shards
Conclusion:
• Provision more shards (25% headroom)
• Always test your workload
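As a back-of-the-envelope check on this tip, here is a hypothetical calculation using the per-shard write limits quoted at the end of these notes (1 MB/sec and 1,000 records/sec) and Sizmek's stated 150K events/sec at roughly 1 KB each, assuming record-at-a-time putRecord reaches only ~75% of the nominal limit, as their testing found.

# Sketch: shards needed when putRecord achieves ~75% of nominal limits.
events_per_sec = 150000        # from the Sizmek slides
event_kb = 1.0                 # average event size
usable = 0.75                  # fraction of the limit reached in testing

by_records = events_per_sec / (1000 * usable)             # 200 shards
by_bytes = (events_per_sec * event_kb) / (1024 * usable)  # ~195 shards
print(round(max(by_records, by_bytes)))  # ~200, before spike headroom

The deck reports 300–500 shards in practice, which is the point of the tip: provision beyond the paper math and always test your workload.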
43. Tip 3: Error Handling at the Record Level
A Kinesis putRecords (bulk insert) response reports failures per record:
{
"FailedRecordCount": 2,
"Records": [
{
"SequenceNumber": "49543463076548007577105092703039560359975228518395012686",
"ShardId": "shardId-000000000000"
},
{
"ErrorCode": "ProvisionedThroughputExceededException",
"ErrorMessage": "Rate exceeded for shard shardId-000000000001 in stream <StreamName> under account 111111111111."
},
{
"ErrorCode": "InternalFailure",
"ErrorMessage": "Internal service failure."
}
]
}
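A sketch of consuming that response shape with boto3, retrying only the records that failed; the backoff policy is illustrative.

import time
import boto3

kinesis = boto3.client("kinesis")

def put_records_with_retry(stream_name, records, attempts=3):
    # records: list of {"Data": bytes, "PartitionKey": str}
    for attempt in range(attempts):
        resp = kinesis.put_records(StreamName=stream_name, Records=records)
        if resp["FailedRecordCount"] == 0:
            return
        # The i-th response entry matches the i-th request record;
        # entries carrying an ErrorCode failed and must be resent.
        records = [rec for rec, res in zip(records, resp["Records"])
                   if "ErrorCode" in res]
        time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError("%d records failed after retries" % len(records))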
46. Conclusion
• Amazon Kinesis offers a managed service for building applications, streaming data ingestion, and continuous processing
• Ingest aggregated data using the Amazon Kinesis Producer Library
• Process data using the Amazon Kinesis Client Library and open-source connectors
• Determine your partition key strategy
• Try out Amazon Kinesis at http://aws.amazon.com/kinesis/
47. Reference
Technical documentation:
• Amazon Kinesis Agent
• Amazon Kinesis Streams and Spark Streaming
• Amazon Kinesis Producer Library Best Practices
• Amazon Kinesis Firehose and AWS Lambda
• Building a Near Real-Time Discovery Platform with Amazon Kinesis
Public case studies:
• Glu Mobile – Real-Time Analytics
• Hearst Publishing – Clickstream Analytics
• How Sonos Leverages Amazon Kinesis
• Nordstrom Online Stylist
Speed matters in business!
To capture “Perishable Insights”:
Insights that can provide incredible value but the value expires and evaporates once the moment is gone.
Mike Gualtieri, Principal Analyst at Forrester Research
The benefits of real-time processing apply to pretty much every scenario where new data is being generated on a continual basis.
We see it pervasively across all the big data areas we’ve looked at, and in pretty much every industry segment whose customers we’ve spoken to.
Customers tend to start with mundane, unglamorous applications, such as system log ingestion and processing.
They then progress over time to ever-more sophisticated forms of real-time processing.
Initially they may simply process their data streams to produce simple metrics and reports, and perform simple actions in response, such as emitting alarms over metrics that have breached their warning thresholds.
Eventually they start to do more sophisticated forms of data analysis in order to obtain deeper understanding of their data, such as applying machine learning algorithms.
In the long run they can be expected to apply a multiplicity of complex stream and event processing algorithms to their data.
We then looked at specific use cases
Alert when public sentiment changes
Customers want to respond quickly to social media feeds
Twitter Firehose: ~450 million events per day, or 10.1 MB/sec
Respond to changes in the stock market
Customers want to compute value-at-risk or rebalance portfolios based on stock prices
NASDAQ Ultra Feed: 230 GB/day, or 8.1 MB/sec
Aggregate data before loading in external data store
Customers have billions of click stream records arriving continuously from servers
Large, aggregated PUTs to S3 are cost effective
Internet marketing company: 20MB/sec
Recommendation and Personalization engines
Challenges with Streaming Data
1. Variety: logs, clickstream, sensors, phones, transactions. Different types of data, often in formats such as JSON, CSV, TSV, and more. Streaming data also comes in a variety of sizes: bytes, KB, or MB.
2. Streaming data is not just about volume; it is about velocity. You have data arriving rapidly, continuously, and simultaneously from thousands of data sources.
The challenge is to handle such variety and velocity: you need a fleet of servers to capture the data, buffer it, and batch it reliably and efficiently. And you need to monitor and maintain these servers. Just imagine if one of these servers goes down. Data is still streaming in and needs to be captured and buffered. And when your servers are back up again, you need a mechanism for checkpointing and catching up where you left off.
There are two ways to process streaming data:
1. Stream processing (also called real time): this gives you the ability to respond in real time to events in data streams
Examples: Proactively detect hardware errors in device logs; Notify when inventory drops below a threshold
2. Micro-batching (near real time): operations on small batches of events in data streams
Examples:
Identify fraud from activity logs
Monitor performance SLAs
Kinesis is the platform for Streaming Data on AWS and is usually the central technology component in streaming data and real-time applications.
Since the launch of Amazon Kinesis in 2013, the ecosystem has evolved: we introduced Kinesis Firehose, and Kinesis Analytics will be launched later this year.
[2 minutes]
Streaming data processing has two layers: a storage layer and a processing layer. The storage layer needs to support specialized ordering and consistency semantics that enable fast, inexpensive, and replayable reads and writes of large streams of data. Amazon Kinesis Streams provides the storage layer. The processing layer is responsible for reading data from the storage layer, processing that data, and notifying the storage layer to delete data that is no longer needed. Kinesis supports the processing layer: customers compile the Kinesis client library into their data processing application, and Kinesis notifies the application (the Kinesis worker) when there is new data to process. The Kinesis control plane works with Kinesis workers to solve scalability and fault-tolerance problems in the processing layer.
A moving time window buffers the data; the oldest records expire after 24 hours.
Some types of analysis do not require events to be processed in the exact order in which they were generated. For example, a component that calculates the total number of sales made on an e-Commerce site can process a sale made 5 minutes ago after one made 1 minute ago; the net effect is that 2 sales will be counted. If analysis does not rely on strict ordering of events, the processing can be easily distributed. Different workers on different servers could capture the sales from 5 minutes ago and 1 minute ago. Combining the workers’ counts will give the correct overall picture of 2 sales. This type of analysis can easily be implemented using the Amazon Kinesis Client Library. Events sent to Amazon Kinesis are distributed over many shards. A worker is instantiated for each shard and the results from each worker are combined to produce the final metric.
Other analysis does require strict ordering of events. For example, providing an up-sell offer to a customer may require knowledge of the exact order in which items were added to their basket. This type of analysis can still be distributed across multiple workers, but it is essential that all the events relating to a specific basket be sent to the same worker. The worker will have a complete view of all changes made to the basket during the session and be able to make recommendations. In Amazon Kinesis, the partition key can be used to ensure events are grouped onto the same shard and read by the same worker.
The KCL automatically creates a DynamoDB table for each Kinesis application to keep track and maintain state information.
The KCL uses the name of the Amazon Kinesis application as the name of the DynamoDB table. Each application name must be unique within a region.
Your account is charged for this DynamoDB usage.
Each row in the DynamoDB table represents a shard in the Kinesis stream.
Application State Data:
Shard ID: Hash key for the shard that is being processed by your application.
Checkpoint: The most recent checkpoint sequence number for the shard.
Worker ID: The ID of the worker that is processing the shard.
Lease/Heartbeat: A count that is periodically incremented by the record processor.
The read/write throughput for the DynamoDB table depends on the number of shards and how often your application checkpoints.
Once a shard is completely processed, the record processor that was processing it should receive a call to its shutdown method with the reason "TERMINATE"; clients should call checkpointer.checkpoint() when the shutdown reason is "TERMINATE" so that the end of the shard is checkpointed. If cleanupLeasesUponShardCompletion is set to true in your KinesisClientLibConfiguration, this allows the shard sync task to clean up the lease for that shard.
You can check your lease table in DynamoDB; it will be named the same as the application name provided to your workers. In that table you can confirm whether or not the leases for the closed shards have been checkpointed at SHARD_END.
If the Amazon DynamoDB table for your Amazon Kinesis application does not exist when the application starts up, one of the workers creates the table and calls the describeStream method to populate the table.
Throughput
If your Amazon Kinesis application receives provisioned-throughput exceptions, you should increase the provisioned throughput for the DynamoDB table. The KCL creates the table with a provisioned throughput of 10 reads per second and 10 writes per second, but this might not be sufficient for your application. For example, if your Amazon Kinesis application does frequent checkpointing or operates on a stream that is composed of many shards, you might need more throughput.
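If the defaults prove insufficient, the lease table can be scaled like any other DynamoDB table. A boto3 sketch follows; the table name matches a hypothetical KCL application name.

import boto3

dynamodb = boto3.client("dynamodb")

# The KCL names its lease table after the Kinesis application name.
dynamodb.update_table(
    TableName="my-kinesis-app",  # hypothetical application name
    ProvisionedThroughput={
        "ReadCapacityUnits": 50,   # raised from the KCL default of 10
        "WriteCapacityUnits": 50,
    },
)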
These frameworks make it easy to implement real-time analytics, but understanding how they work together helps you optimize performance.
Amazon Kinesis integration with Apache Spark is via Spark Streaming. Spark Streaming is an extension of the core Spark framework that enables scalable, high-throughput, fault-tolerant stream processing of data streams such as Amazon Kinesis Streams. Spark Streaming provides a high-level abstraction called a Discretized Stream or DStream, which represents a continuous sequence of RDDs. Spark Streaming uses the Amazon Kinesis Client Library (KCL) to consume data from an Amazon Kinesis stream. The KCL takes care of many of the complex tasks associated with distributed computing, such as load balancing, failure recovery, and check-pointing. Think of Spark Streaming as two main components:
Fetching data from the streaming sources into DStreams
Processing data in these DStreams as batches
Ensure that the number of Amazon Kinesis receivers created are a multiple of executors so that they are load balanced evenly across all the executors.
Ensure that the total processing time is less than the batch interval.
Use the number of executors and number of cores per executor parameters to optimize parallelism and use the available resources efficiently.
Be aware that Spark Streaming uses the default of 1 sec with KCL to read data from Amazon Kinesis.
For reliable at-least-once semantics, enable the Spark-based checkpoints and ensure that your processing is idempotent (recommended) or you have transactional semantics around your processing.
Ensure that you’re using Spark version 1.6 or later with the EMRFS consistent view option, when using Amazon S3 as the storage for Spark checkpoints.
Ensure that there is only one instance of the application running with Spark Streaming, and that multiple applications are not using the same DynamoDB table (via the KCL).
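To illustrate how these knobs fit together, here is a hypothetical configuration for a stream with 8 shards: 8 receivers spread across 4 executors (a clean multiple), assuming a YARN cluster; all values are placeholders to be tuned against the batch interval.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

conf = (SparkConf()
        .setAppName("kinesis-tuning-sketch")
        .set("spark.executor.instances", "4")  # 8 receivers / 4 executors
        .set("spark.executor.cores", "4"))     # leaves cores for processing

sc = SparkContext(conf=conf)
# The batch interval must exceed per-batch processing time; watch the
# Streaming UI's scheduling delay and enlarge the interval if it grows.
ssc = StreamingContext(sc, batchDuration=10)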
AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. AWS Lambda starts running your code within milliseconds of an event such as an image upload, in-app activity, website click, or output from a connected device. You can also use AWS Lambda to create new back-end services where compute resources are automatically triggered based on custom requests. With AWS Lambda you pay only for the requests served and the compute time required to run your code. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.
Sizmek – Who? What? How?
Our Challenges
Our Technology Stack
Sizmek Data Platform High Level Architecture
High Level Requirements Of ESP
Apache Storm on top AWS
Lessons Learned
Our journey in data analytics – faster time to insight: Kinesis for raw data; Kinesis for enriched data plus insights; next to be introduced: Kinesis Analytics and Spark Streaming; clean data lands in an S3 data lake.
Apache Storm – stream processing technology. We chose Storm for the following reasons: familiarity with Storm from our on-premises deployment with Kafka; there was a ready-to-use Storm connector for Kinesis; and Spark Streaming wasn't ready.
Added the capability to return the row of data back into the stream after failover (only due to resource unavailability).
Ask Roy – will this code be pushed to the available spout in the open source?
The limits are per shard but are not exposed via APIs.
Because of Storm, we have to work record by record and use the putRecord API.
We eventually had to allocate more shards than the published limits alone would suggest.
Amazon Kinesis Streams write limits:
Up to 1,000 records per second for writes per shard
Up to a maximum total data write rate of 1 MB per second per shard