This document provides an overview of Amazon Elastic MapReduce (EMR), including:
1) EMR allows users to quickly and cost-effectively process vast amounts of data by providing a managed Hadoop framework and supporting popular distributed frameworks like Spark.
2) The document demonstrates how to use EMR for tasks like clickstream analysis, log processing, and genomic research through example use cases.
3) It outlines the agenda which will cover Hadoop fundamentals, EMR features, how to get started, supported tools, and additional resources.
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to set up an Amazon EMR job flow to analyse application logs and perform Hive queries against them. We also review best practices for organising data files on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a MapReduce job.
Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service.
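The job-flow setup described above can be sketched in code. This is an illustrative example only (not taken from the presentation): a cluster definition with a single Hive step, shaped like the boto3 `run_job_flow` request. The bucket name, script path, and release label are hypothetical.

```python
# Sketch of an EMR job flow that runs a Hive script over application logs in S3.
# Parameter names follow the EMR run_job_flow API; all values are illustrative.

def build_log_analysis_cluster(log_bucket: str) -> dict:
    """Return an EMR cluster definition with one Hive step over S3 logs."""
    return {
        "Name": "log-analysis",
        "ReleaseLabel": "emr-5.36.0",                # illustrative release
        "Applications": [{"Name": "Hive"}],
        "LogUri": f"s3://{log_bucket}/emr-logs/",    # cluster logs, for monitoring
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE",   "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,    # transient: terminate after steps
        },
        "Steps": [{
            "Name": "hive-log-query",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", f"s3://{log_bucket}/scripts/analyse_logs.q"],
            },
        }],
    }

spec = build_log_analysis_cluster("my-app-logs")
# To actually launch:  boto3.client("emr").run_job_flow(**spec)
print(spec["Steps"][0]["Name"])
```

Because `KeepJobFlowAliveWhenNoSteps` is false, the cluster terminates itself once the Hive step finishes, so you pay only for the duration of the job.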
See a recording of the webinar based on this presentation on YouTube here:
Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/
See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
Overview of Amazon EMR, its benefits for a wide variety of use cases, how to get started, and an introduction to Apache Zeppelin for interactive data analytics and document collaboration.
This document provides an overview of Amazon Elastic MapReduce (EMR), a service that makes it easy to process large amounts of data using the Hadoop framework. It discusses how EMR allows users to launch Hadoop clusters in minutes, integrate with other AWS services for storage and databases, customize clusters using various Hadoop applications and design patterns, and pay only for the resources used. The document aims to demonstrate how EMR provides an easy, fast, secure and cost-effective way to run Hadoop in the cloud.
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent... – Amazon Web Services
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
Best Practices for Running Amazon EC2 Spot Instances with Amazon EMR - AWS On... – Amazon Web Services
Learning Objectives:
- Learn how to run Amazon EMR clusters on Spot instances and significantly reduce the cost of processing vast amounts of data on managed Hadoop clusters
- Understand key EC2 Spot Instances concepts and common usage patterns for maximum scale and cost optimization for Big Data workloads
- See a few customer examples that show how to leverage the full scale of the AWS cloud for faster results
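The Spot usage pattern in the objectives above can be sketched as an instance-fleet definition mixing baseline On-Demand capacity with cheaper, interruptible Spot capacity. This is a hedged sketch shaped like the EMR InstanceFleets API; the instance types, counts, and timeout values are assumptions for illustration.

```python
# Sketch of an EMR core instance fleet blending On-Demand and Spot capacity.
# Field names follow the EMR InstanceFleets API; all values are illustrative.

def core_fleet(on_demand_units: int, spot_units: int) -> dict:
    return {
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": on_demand_units,   # baseline capacity to keep
        "TargetSpotCapacity": spot_units,            # cheap, interruptible extra
        "InstanceTypeConfigs": [
            # Listing several types widens the Spot pools EMR can draw from.
            {"InstanceType": "m5.xlarge",  "WeightedCapacity": 1},
            {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
            {"InstanceType": "r5.xlarge",  "WeightedCapacity": 1},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 20,            # how long to wait for Spot
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",  # fall back if unavailable
            }
        },
    }

fleet = core_fleet(on_demand_units=4, spot_units=8)
print(fleet["TargetSpotCapacity"])
```

Diversifying across instance types and falling back to On-Demand is what keeps the cluster at its desired capacity even when individual Spot pools are unavailable.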
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ... – Amazon Web Services
Learning Objectives:
- Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data across dynamically scalable Amazon EC2 instances.
- Learn how using EC2 Spot can significantly reduce the cost of running your clusters.
- Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters.
Amazon Elastic MapReduce (Amazon EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, and how to work with other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
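Two of the design patterns mentioned above, bootstrap actions and the transient-versus-active cluster choice, can be illustrated with a short sketch. The script path and workload names are hypothetical; the bootstrap entry follows the shape of the EMR BootstrapActions API.

```python
# Sketch: a bootstrap action (runs once on every node at startup) and a tiny
# helper encoding the transient vs. long-running cluster pattern.
# Script path and workload labels are invented for illustration.

def bootstrap_action(name: str, script_s3_path: str, *args: str) -> dict:
    """EMR-style BootstrapActions entry installing software before Hadoop starts."""
    return {
        "Name": name,
        "ScriptBootstrapAction": {"Path": script_s3_path, "Args": list(args)},
    }

def keep_alive(workload: str) -> bool:
    """Long-running clusters suit interactive query; transient suit scheduled batch."""
    return workload in {"interactive", "ad-hoc"}

action = bootstrap_action("install-deps",
                          "s3://my-bucket/bootstrap/install.sh",
                          "--with-presto")
print(action["ScriptBootstrapAction"]["Path"])
```

A nightly batch job would run on a transient cluster (`keep_alive` false, terminate after steps), while an analyst-facing Presto cluster would stay alive between queries.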
(BDT208) A Technical Introduction to Amazon Elastic MapReduce – Amazon Web Services
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively."
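Using S3 as the persistent data store, as described above, usually goes hand in hand with a partitioned key layout so query engines can prune to only the data a query touches. A minimal sketch of that practice follows; the bucket, prefix, and file names are hypothetical.

```python
# Sketch of Hive-style date partitioning for data stored on S3, so engines
# reading directly from S3 can skip partitions outside a query's date range.
from datetime import date

def partitioned_key(prefix: str, d: date, filename: str) -> str:
    """Build a key like prefix/year=YYYY/month=MM/day=DD/filename."""
    return f"{prefix}/year={d.year}/month={d.month:02d}/day={d.day:02d}/{filename}"

key = partitioned_key("s3://my-data/clickstream", date(2015, 3, 19), "part-0001.gz")
print(key)
```

A query filtered to one day then reads a single `day=` prefix instead of scanning the whole dataset.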
AWS Summit Stockholm 2014 – T3 – Disaster Recovery on AWS – Amazon Web Services
Implementation of a disaster recovery (DR) site is crucial for the business continuity of any enterprise. Due to the fundamental nature of features like elasticity, scalability and geographic distribution, DR implementation on AWS can be done at 10-50% of the conventional cost. In this session, we do a deep dive into proven DR architectures on AWS and the best practices, tools and techniques to get the most out of them.
This session is recommended for attendees who wish to explore options for ensuring the continuity of their business.
Steve McPherson presented on Amazon EMR (Elastic MapReduce) at the Facebook Presto Meetup on March 19, 2015. Amazon EMR makes cluster management easy by handling setup and configuration, node monitoring and replacement, log aggregation, and integration with CloudWatch and Spot instances. Amazon EMR supports various big data frameworks like MapReduce, Spark, Hive, Pig, Presto and data stores like Amazon S3. It allows running different clusters optimized for different workloads like Hive/Pig, Presto or Spark.
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400) – Amazon Web Services
In this advanced technical session you will learn how you can use AWS to build and deploy virtual data centres as fast as you can design them. Learn how to combine CloudFormation templates with best-practice techniques in use by AWS customers today to optimise the design and implementation of your VPCs.
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent... – Amazon Web Services
Growing too quickly may sound like a nice problem to have, unless you are the one having it. A growing business can’t afford not to keep up with customer demand and availability. Don’t be left behind. Come learn how start-ups Chute and Euclid kept up with real-time user-generated data from over 3,000 apps and 2 TB of metadata and stayed ahead of retail peak-time traffic, all with AWS. Hear how they used all that data on their own growth to propel their business even further and deepen relationships with customers. Not planning for growth is just like not planning to grow!
SF Big Analytics: Machine Learning with Presto by Christopher Berner – Chester Chen
PrestoML allows users to perform machine learning tasks like classification and regression directly in SQL queries using Presto. It works by adding machine learning functions as Presto plugins that can be called from SQL. This allows ML tasks to be integrated with querying and analysis of data stored in databases and other data sources. Some key benefits are that ML code does not need to be written separately, and ML models can be stored directly as tables to be queried. The demo shows classifying emails as spam by learning a model directly from a SQL query. Future work may include more algorithms, hyperparameter tuning, and other ML techniques.
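The in-SQL workflow described above can be sketched by composing the query text in code. This is a hedged illustration of the idea using Presto's ML functions (`learn_classifier`, `classify`, `features`); the table and column names (`emails`, `is_spam`, `word_count`, `link_count`) are invented for the example.

```python
# Sketch: build a Presto query that trains a spam classifier and applies it,
# entirely in SQL. Table and column names are hypothetical.

def spam_classifier_query(table: str) -> str:
    return (
        "WITH model AS (\n"
        "  SELECT learn_classifier(is_spam, features(word_count, link_count)) AS m\n"
        f"  FROM {table}\n"
        ")\n"
        "SELECT classify(m, features(word_count, link_count)) AS predicted\n"
        f"FROM {table}, model"
    )

sql = spam_classifier_query("emails")
print(sql)
```

The point of the pattern is that no separate ML pipeline is needed: the model is learned and applied inside the same engine that already holds the data.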
Presto @ Netflix: Interactive Queries at Petabyte Scale – DataWorks Summit
This document summarizes Presto, an open source distributed SQL query engine used at Netflix to run interactive queries against large datasets at petabyte scale. Key points:
- Presto allows Netflix to run interactive SQL queries against 15+ PB of data stored in S3 and Hadoop with fast query times.
- Netflix's deployment includes over 220 Presto worker nodes that integrate Presto with other data platforms like S3, Atlas, Amazon RDS, and HCatalog.
- The document outlines contributions Netflix has made to Presto, including optimizations for complex data types, joins, schema evolution, and a custom S3 file system integration.
This document provides a feature-by-feature comparison of Apache Solr, Elasticsearch, and Amazon CloudSearch. It includes a summary card that highlights key differences in administration operations like backup, patching, and reindexing. Backup in Apache Solr uses replication handlers that require customization for storage options beyond local disk. Elasticsearch includes snapshot APIs and plugins for repositories like S3 and HDFS. Amazon CloudSearch handles backups internally without user involvement. Patching in Apache Solr and Elasticsearch requires rolling restarts, while Amazon CloudSearch manages upgrades automatically. Reindexing may require manual processes in Apache Solr and Elasticsearch, whereas in Amazon CloudSearch it can be initiated directly from the console.
The document provides an overview of disaster recovery concepts and how AWS features can be used for disaster recovery. It discusses various architectural patterns for disaster recovery on AWS ranging from simple backup and restore to fully redundant multi-site configurations. Example patterns include using S3 for backups, running a "pilot light" of reduced infrastructure for quick recovery, and running a fully scaled low-capacity standby environment. The presentation concludes by discussing solutions providers and the advantages of using AWS for disaster recovery planning.
This document discusses using AWS for disaster recovery. It outlines several disaster recovery scenarios that can be implemented on AWS, including backup and restore, pilot light, low-capacity standby, and multi-site hot standby. For each scenario, it describes the advantages, preparation needed, and objectives for recovery time and point objectives. It emphasizes testing disaster recovery plans on AWS and notes that initial steps are simple. The presentation encourages attendees to learn more about AWS disaster recovery resources and consider using AWS for a disaster recovery project.
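The DR scenarios listed above trade recovery time against standby cost. The sketch below encodes that trade-off as a simple lookup and a pattern-selection helper; the RTO ranges and cost labels are indicative orders of magnitude for illustration, not AWS guarantees.

```python
# Sketch of the DR pattern trade-off: faster recovery costs more in standby
# infrastructure. Figures are rough illustrations, not service commitments.

DR_PATTERNS = {
    # pattern:            (typical RTO,       relative standby cost)
    "backup-and-restore": ("hours",           "lowest"),
    "pilot-light":        ("tens of minutes", "low"),
    "warm-standby":       ("minutes",         "medium"),
    "multi-site":         ("seconds-minutes", "highest"),
}

def choose_pattern(max_downtime_minutes: int) -> str:
    """Pick the cheapest pattern whose typical RTO fits the downtime budget."""
    if max_downtime_minutes >= 240:
        return "backup-and-restore"
    if max_downtime_minutes >= 30:
        return "pilot-light"
    if max_downtime_minutes >= 5:
        return "warm-standby"
    return "multi-site"

print(choose_pattern(60))
```

A business that can tolerate an hour of downtime lands on pilot light: keep only the minimal core (data replication, AMIs) running, and scale out the rest during a recovery.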
Workload-Aware Auto-Scaling: A New Paradigm for Big Data Workloads – Vasu S
Learn more about Workload-Aware Auto-Scaling, an alternative architectural approach to auto-scaling that is better suited to the cloud and to applications like Hadoop, Spark, and Presto.
qubole.com/resources/white-papers/workload-aware-auto-scaling-qubole
This document introduces AWS Data Pipeline, which allows users to create workflows to move and transform data between different AWS services and on-premises systems. It provides examples of using Data Pipeline to move data between Amazon S3, DynamoDB, RDS, EMR and Redshift. It also discusses using Data Pipeline for ETL workflows and comparing the cost of running workflows on AWS versus on-premises systems.
Disaster Recovery of on-premises IT infrastructure with AWS – Amazon Web Services
The objective of this session is to enable customers with any level of DR experience to gain actionable guidance to advance their business up the ladder of DR readiness. AWS enables fast disaster recovery of critical on-premises IT systems without incurring the complexity and expense of a second physical site. With 28 availability zones in 11 regions around the world and a broad set of services, AWS can deliver rapid recovery of on-premises IT infrastructure and data. During this session we will walk you through the ascending levels of DR options made possible with AWS and review the technologies and services that help deliver various DR capabilities, starting from cloud backups all the way up to hot site DR. We will also explore various DR architectures and the balance of recovery time and cost.
Disaster Recovery Sites on AWS: Minimal Cost, Maximum Efficiency – Amazon Web Services
Implementation of a disaster recovery (DR) site is crucial for the business continuity of any enterprise. Due to the fundamental nature of features like elasticity, scalability, and geographic distribution, DR implementation on AWS can be done at 10-50% of the conventional cost. In this session, we do a deep dive into proven DR architectures on AWS and the best practices, tools and techniques to get the most out of them.
Hw09 Making Hadoop Easy On Amazon Web Services – Cloudera, Inc.
Amazon Elastic MapReduce enables customers to easily and cost-effectively process vast amounts of data utilizing a hosted Hadoop framework running on Amazon's web-scale infrastructure. It was launched in 2009 and allows customers to spin up large or small Hadoop job flows in minutes without having to manage compute clusters or tune Hadoop themselves. The service provides a reliable, fault-tolerant, and cost-effective way for customers to solve problems like data mining, genome analysis, financial simulation, and web indexing using their existing Hadoop applications.
Cost is often the conversation starter when customers think about moving to the cloud. AWS helps lower costs for customers through its “pay only for what you use” pricing model, frequent price drops, and pricing model choice to support variable & stable workloads. In this session, you will learn about the financial considerations of owning and operating a traditional data center or managed hosting provider versus utilizing AWS. We will detail our TCO methodology and showcase cost comparisons for some common customer use-cases. We’ll also cover a few AWS cost optimization areas, including Spot and Reserved Instances, EC2 Auto Scaling, and consolidated billing.
In this talk from the Dublin Websummit 2014 AWS Technical Evangelist Ian Massingham discusses practices and techniques for optimising and lowering the cost of operations for applications and services that you are running on the AWS cloud.
Includes a discussion of the fundamental tenets of pricing for AWS services, plus tips and tricks for reducing the amount that you need to spend with AWS in order to run your workloads on the AWS cloud.
The webinar based on this presentation discussed strategies that you can adopt to help you save money in the AWS Cloud. From turning systems off at night, to implementing bidding strategies on the spot market, there are many ways in which you can manage and reduce your costs with AWS.
We dive into the differences between instance types and explain how you can reduce costs with Reserved Instances, the spot market, and cost-aware architecture. We'll discuss how to combine on-demand pricing with spot pricing to perform cost-effective big data analysis, and introduce customer examples to illustrate how AWS customers gain the most from AWS whilst at the same time managing their spend.
Topics include:
• Understand different cost optimisation strategies you can employ in the AWS Cloud
• Learn how to take advantage of different instance types
• Discover architectural principles behind cost optimisation in AWS
• Learn about tools to help you keep on top of your AWS spend
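The on-demand/spot blending covered in the topics above comes down to simple arithmetic. The sketch below makes it concrete; the hourly price and discount figures are placeholders, not real AWS rates.

```python
# Sketch of blended cluster cost: a baseline of on-demand nodes plus Spot
# nodes priced at a fraction of the on-demand rate. Prices are invented.

def blended_hourly_cost(od_nodes: int, spot_nodes: int,
                        od_price: float, spot_discount: float) -> float:
    """Total hourly cost with Spot nodes discounted off the on-demand rate."""
    spot_price = od_price * (1 - spot_discount)
    return od_nodes * od_price + spot_nodes * spot_price

# 4 on-demand nodes plus 8 Spot nodes at an assumed 70% discount:
cost = blended_hourly_cost(4, 8, od_price=0.20, spot_discount=0.70)
print(round(cost, 2))
```

Tripling the cluster's capacity here raises the hourly bill by only 60% (from 0.80 to 1.28), which is why Spot works so well for interruption-tolerant big data jobs.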
You can find a recording of this webinar on YouTube here: http://youtu.be/kId90Q7b6kY
Serverless Architectures and Advanced Patterns for AWS Lambda – Amazon Web Services
This document summarizes an AWS webinar on advanced serverless patterns for AWS Lambda. The webinar covered 4 main patterns: 1) Web applications using API Gateway, Lambda, and DynamoDB; 2) Stream processing using Kinesis and Lambda; 3) Data lakes using services like S3, Athena, and Glue; 4) Machine learning workflows using services like SageMaker, Rekognition, and Lex. The webinar provided best practices for each pattern including techniques for performance optimization, security, and architecture design.
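The first pattern above (API Gateway, Lambda, DynamoDB) can be sketched as a minimal handler. The table name is hypothetical, and the DynamoDB write is shown in a comment rather than executed, so the sketch stays self-contained.

```python
# Minimal sketch of an API Gateway -> Lambda -> DynamoDB write path.
# The DynamoDB call is illustrated but not executed; table name is invented.
import json

def handler(event, context=None):
    """Lambda handler for an API Gateway proxy event: parse the JSON body,
    store it, and echo it back in the response."""
    item = json.loads(event.get("body") or "{}")
    # In a real deployment the item would be persisted, e.g.:
    #   boto3.resource("dynamodb").Table("items").put_item(Item=item)
    return {"statusCode": 200, "body": json.dumps({"stored": item})}

resp = handler({"body": '{"id": "42", "name": "widget"}'})
print(resp["statusCode"])
```

API Gateway's proxy integration delivers the request body as a string under `"body"` and expects a `statusCode`/`body` pair back, which is the contract the handler follows.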
The document discusses using Amazon EMR to scale analytics workloads on AWS. It provides an overview of EMR and how it allows users to easily run Hadoop clusters on AWS, including how to tune clusters and reduce costs by using Spot instances. It covers storage options such as Amazon S3 and HDFS, and integrating various Hadoop ecosystem tools on EMR. It provides examples of using EMR for batch processing of logs, as a long-running database, and for ad-hoc analysis of large datasets. It emphasizes using S3 for persistent storage and provides best practices around file sizes, compression, and bootstrap actions.
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Amazon Web Services
Learning Objectives:
- Learn how to use Amazon EMR for easy, fast, and cost-effective processing of vast amounts of data across dynamically scalable Amazon EC2 instances.
- Learn how using EC2 Spot can significantly reduce the cost of running your clusters.
- Learn how Amazon EMR Instance Fleets can make it easier to quickly obtain and maintain your desired capacity for your clusters.
Amazon Elastic MapReduce (Amazon EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, and how to work with other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop Applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; dDeployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively."
AWS Summit Stockholm 2014 – T3 – disaster recovery on AWSAmazon Web Services
Implementation of a disaster recovery (DR) site is crucial for the business continuity of any enterprise. Due to the fundamental nature of features like elasticity, scalability and geographic distribution, DR implementation on AWS can be done at 10-50% of the conventional cost. In this session, we do a deep dive into proven DR architectures on AWS and the best practices, tools and techniques to get the most out of them.
This session is recommended for attendees who wish to explore options for ensuring the continuity of their business.
Steve McPherson presented on Amazon EMR (Elastic MapReduce) at the Facebook Presto Meetup on March 19, 2015. Amazon EMR makes cluster management easy by handling setup and configuration, node monitoring and replacement, log aggregation, and integration with CloudWatch and Spot instances. Amazon EMR supports various big data frameworks like MapReduce, Spark, Hive, Pig, Presto and data stores like Amazon S3. It allows running different clusters optimized for different workloads like Hive/Pig, Presto or Spark.
AWS Summit London 2014 | From One to Many - Evolving VPC Design (400)Amazon Web Services
In this advanced technical session you will learn how you can use AWS to build and deploy virtual data centres as fast as you can design them. Learn how to combine CloudFormation templates together with best practice techniques that are in use by AWS customers today to optimise the design and implementation of your VPCs
Scaling your Application for Growth using Automation (CPN209) | AWS re:Invent...Amazon Web Services
Growing too quickly may sound like a nice problem to have, unless you are the one having it. A growing business can’t afford not to keep up with customer demand and availability. Don’t be left behind. Come learn how start-ups Chute and Euclid kept up with real-time user-generated data from over 3,000 apps and 2 TB of metadata and stayed ahead of retail peak-time traffic, all with AWS. Hear how they used all that data on their own growth to propel their business even further and deepen relationships with customers. Not planning for growth is just like not planning to grow!
SF Big Analytics: Machine Learning with Presto by Christopher BernerChester Chen
PrestoML allows users to perform machine learning tasks like classification and regression directly in SQL queries using Presto. It works by adding machine learning functions as Presto plugins that can be called from SQL. This allows ML tasks to be integrated with querying and analysis of data stored in databases and other data sources. Some key benefits are that ML code does not need to be written separately, and ML models can be stored directly as tables to be queried. The demo shows classifying emails as spam by learning a model directly from a SQL query. Future work may include more algorithms, hyperparameter tuning, and other ML techniques.
Presto @ Netflix: Interactive Queries at Petabyte ScaleDataWorks Summit
This document summarizes Presto, an open source distributed SQL query engine used at Netflix to run interactive queries against large datasets at petabyte scale. Key points:
- Presto allows Netflix to run interactive SQL queries against 15+ PB of data stored in S3 and Hadoop with fast query times.
- Netflix's deployment includes over 220 Presto worker nodes that integrate Presto with other data platforms like S3, Atlas, Amazon RDS, and HCatalog.
- The document outlines contributions Netflix has made to Presto, including optimizations for complex data types, joins, schema evolution, and a custom S3 file system integration.
This document provides a feature-by-feature comparison of Apache Solr, Elasticsearch, and Amazon CloudSearch. It includes a summary card that highlights key differences in administration operations like backup, patching, and reindexing. Backup in Apache Solr uses replication handlers that require customization for storage options beyond local disk. Elasticsearch includes snapshot APIs and plugins for repositories like S3 and HDFS. Amazon CloudSearch handles backups internally without user involvement. Patching in Apache Solr and Elasticsearch requires rolling restarts while Amazon CloudSearch manages upgrades automatically. Reindexing may require manual processes in Apache Solr and Elasticsearch but can be initiated manually from the Amazon CloudSearch console.
The document provides an overview of disaster recovery concepts and how AWS features can be used for disaster recovery. It discusses various architectural patterns for disaster recovery on AWS ranging from simple backup and restore to fully redundant multi-site configurations. Example patterns include using S3 for backups, running a "pilot light" of reduced infrastructure for quick recovery, and running a fully scaled low-capacity standby environment. The presentation concludes by discussing solutions providers and the advantages of using AWS for disaster recovery planning.
This document discusses using AWS for disaster recovery. It outlines several disaster recovery scenarios that can be implemented on AWS, including backup and restore, pilot light, low-capacity standby, and multi-site hot standby. For each scenario, it describes the advantages, preparation needed, and objectives for recovery time and point objectives. It emphasizes testing disaster recovery plans on AWS and notes that initial steps are simple. The presentation encourages attendees to learn more about AWS disaster recovery resources and consider using AWS for a disaster recovery project.
Workload-Aware: Auto-Scaling A new paradigm for Big Data WorkloadsVasu S
Learn more about Workload-Aware-Auto-Scaling-- an alternative architectural approach to Auto-Scaling that is better suited for the Cloud and applications like Hadoop, Spark, and Presto.
qubole.com/resources/white-papers/workload-aware-auto-scaling-qubole
This document introduces AWS Data Pipeline, which allows users to create workflows to move and transform data between different AWS services and on-premises systems. It provides examples of using Data Pipeline to move data between Amazon S3, DynamoDB, RDS, EMR and Redshift. It also discusses using Data Pipeline for ETL workflows and comparing the cost of running workflows on AWS versus on-premises systems.
Disaster Recovery of on-premises IT infrastructure with AWSAmazon Web Services
The objective of this session is to enable customers with any level of DR experience to gain actionable guidance to advance their business up the ladder of DR readiness. AWS enables fast disaster recovery of critical on-premises IT systems without incurring the complexity and expense of a second physical site. With 28 availability zones in 11 regions around the world and a broad set of services, AWS can deliver rapid recovery of on-premises IT infrastructure and data. During this session we will walk you through the ascending levels of DR options made possible with AWS and review the technologies and services that help deliver various DR capabilities, starting from cloud backups all the way up to hot site DR. We will also explore various DR architectures and the balance of recovery time and cost.
Disaster Recovery Sites on AWS: Minimal Cost, Maximum EfficiencyAmazon Web Services
Implementation of a disaster recovery (DR) site is crucial for the business continuity of any enterprise. Due to the fundamental nature of features like elasticity, scalability, and geographic distribution, DR implementation on AWS can be done at 10-50% of the conventional cost. In this session, we do a deep dive into proven DR architectures on AWS and the best practices, tools and techniques to get the most out of them.
Hw09 Making Hadoop Easy On Amazon Web ServicesCloudera, Inc.
Amazon Elastic MapReduce enables customers to easily and cost-effectively process vast amounts of data utilizing a hosted Hadoop framework running on Amazon's web-scale infrastructure. It was launched in 2009 and allows customers to spin up large or small Hadoop job flows in minutes without having to manage compute clusters or tune Hadoop themselves. The service provides a reliable, fault tolerant and cost effective way for customers to solve problems like data mining, genome analysis, financial simulation and web indexing using their existing Hadoop applications.
Cost is often the conversation starter when customers think about moving to the cloud. AWS helps lower costs for customers through its “pay only for what you use” pricing model, frequent price drops, and pricing model choice to support variable & stable workloads. In this session, you will learn about the financial considerations of owning and operating a traditional data center or managed hosting provider versus utilizing AWS. We will detail our TCO methodology and showcase cost comparisons for some common customer use-cases. We’ll also cover a few AWS cost optimization areas, including Spot and Reserved Instances, EC2 Auto Scaling, and consolidated billing.
In this talk from the Dublin Websummit 2014 AWS Technical Evangelist Ian Massingham discusses practices and techniques for optimising and lowering the cost of operations for applications and services that you are running on the AWS cloud.
Includes a discussion of the fundamental tenets of pricing for AWS services, plus tips and tricks for reducing the amount that you need to spend with AWS in order to run your workloads on the AWS cloud.
The webinar based on this presentation discussed strategies that you can adopt to help you save money in the AWS Cloud. From turning systems off at night, to implementing bidding strategies on the spot market, there are many ways in which you can manage and reduce your costs with AWS.
Dive into the differences between instance types; explain how you can reduce costs with Reserved Instances, the spot market and by architecting to reduce costs. We'll discuss how to combine on-demand pricing with spot pricing to perform cost effective big data analysis, and introduce customer examples to illustrate how AWS customers gain the most from AWS whilst at the same time managing their spend.
Topics include:
• Understand different cost optimisation strategies you can employ in the AWS Cloud
• Learn how to take advantage of different instance types
• Discover architectural principles behind cost optimisation in AWS
• Learn about tools to help you keep on top of your AWS spend
You can find a recording of this webinar on YouTube here: http://youtu.be/kId90Q7b6kY
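The pricing arithmetic behind these strategies is simple enough to sketch directly. The rates below are hypothetical placeholders, not real AWS prices; the point is the shape of the comparison between always-on usage and the "turn systems off at night" approach mentioned above.

```python
# Hypothetical hourly prices for a single instance type; real prices vary by
# region, instance type, and time -- check the AWS pricing pages.
ON_DEMAND_HOURLY = 0.10   # pay-as-you-go rate
RESERVED_HOURLY = 0.06    # effective rate after amortizing the upfront fee
SPOT_HOURLY = 0.03        # market rate; can fluctuate and be interrupted

def monthly_cost(hours_per_month: float, hourly_rate: float, instances: int = 1) -> float:
    """Cost of running `instances` nodes for `hours_per_month` at a flat rate."""
    return hours_per_month * hourly_rate * instances

# A fleet that runs only during business hours (8h/day, ~22 days/month)
# pays far less on demand than one left running 24/7.
always_on = monthly_cost(24 * 30, ON_DEMAND_HOURLY, instances=10)
office_hours = monthly_cost(8 * 22, ON_DEMAND_HOURLY, instances=10)
print(f"24/7 on-demand: ${always_on:.2f}")
print(f"Business hours only: ${office_hours:.2f}")
print(f"Savings from switching off: {1 - office_hours / always_on:.0%}")
```

The same function makes the Reserved versus Spot trade-off easy to compare: a steady 24/7 workload favors the reserved rate, while interruptible batch work can run at the spot rate.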
Serverless architectures and advanced patterns for AWS Lambda (Amazon Web Services)
This document summarizes an AWS webinar on advanced serverless patterns for AWS Lambda. The webinar covered 4 main patterns: 1) Web applications using API Gateway, Lambda, and DynamoDB; 2) Stream processing using Kinesis and Lambda; 3) Data lakes using services like S3, Athena, and Glue; 4) Machine learning workflows using services like SageMaker, Rekognition, and Lex. The webinar provided best practices for each pattern including techniques for performance optimization, security, and architecture design.
The document discusses using Amazon EMR to scale analytics workloads on AWS. It provides an overview of EMR and how it lets users easily run Hadoop clusters on AWS, tune those clusters, and reduce costs by using Spot instances. It also covers storage options such as Amazon S3 and HDFS, and integrating various Hadoop ecosystem tools on EMR. It provides examples of using EMR for batch processing of logs, as a long-running database, and for ad-hoc analysis of large datasets, and it emphasizes using S3 for persistent storage, with best practices around file sizes, compression, and bootstrap actions.
Join us for an Amazon Kinesis tutorial webinar. In this session we will provide a reference architecture and instructions for building a system that performs real-time sliding-window analysis over streaming clickstream data. We will use Amazon Kinesis for managed ingestion of streaming data at scale, with the ability to replay past data, and run the sliding-window computation using Apache Storm. We'll demonstrate how to build the system, deploy it on AWS, and walk through all the steps, from ingestion and processing to storing and visualizing the data in real time.
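The sliding-window computation at the heart of that architecture can be illustrated without any AWS infrastructure. The sketch below is a minimal in-memory stand-in for what Kinesis plus Storm would do at scale; the event data and window length are invented for the example.

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key over the last `window_seconds` of stream time.

    A minimal, in-memory stand-in for the kind of sliding-window computation
    the webinar builds with Amazon Kinesis and Apache Storm.
    """
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = deque()  # (timestamp, key) pairs in arrival order

    def record(self, timestamp: float, key: str) -> None:
        self.events.append((timestamp, key))
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()

    def counts(self) -> dict:
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result

# Simulated clickstream: (epoch seconds, page) tuples.
counter = SlidingWindowCounter(window_seconds=60)
for ts, page in [(0, "/home"), (10, "/home"), (30, "/buy"), (70, "/home")]:
    counter.record(ts, page)
# At t=70, the event at t=0 has fallen outside the 60-second window.
print(counter.counts())
```

In the real system, Kinesis shards would feed these events to Storm bolts, and the ability to replay past data from the stream is what lets a failed window be recomputed.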
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
Flink Forward SF 2017: Malo Deniélou - No shard left behind: Dynamic work re... (Flink Forward)
The Apache Beam programming model is designed to support several advanced data processing features such as autoscaling and dynamic work rebalancing. In this talk, we will first explain how dynamic work rebalancing not only provides a general and robust solution to the problem of stragglers in traditional data processing pipelines, but also how it allows autoscaling to be truly effective. We will then present how dynamic work rebalancing works as implemented in the Google Cloud Dataflow runner and which path other Apache Beam runners like Apache Flink can follow to benefit from it.
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR (Amazon Web Services)
by Dario Rivera, Solutions Architect, AWS
Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features. This session will feature Asurion, a provider of device protection and support services for over 280 million smartphones and other consumer electronics devices.
The document discusses data analytics on AWS. It describes how AWS services like Amazon S3, DynamoDB, Redshift, EMR, and Kinesis can be used to generate, store, analyze and share data at scale. It provides examples of how companies are using these services for tasks like processing millions of records per second and adding thousands of new records daily. The document emphasizes that AWS allows users to remove constraints on data analytics by providing elastic, scalable infrastructure without upfront costs.
(BDT303) Running Spark and Presto on the Netflix Big Data Platform (Amazon Web Services)
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
The document provides an overview of Amazon Elastic MapReduce (EMR) and how it can be used to process large amounts of data using Hadoop and other big data technologies in the AWS cloud. Some key points:
- EMR allows users to run Hadoop frameworks and analytics tools like Hive and Pig on AWS using a web service API or command line tools.
- It provides a managed Hadoop cluster and integrates with other AWS services for storage, networking, etc. allowing big data workloads to easily scale up and down based on need.
- Users can launch EMR job flows to run data processing jobs, specifying options like instance types, numbers of nodes, bootstrap actions, and steps to execute on the cluster.
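As a rough illustration of the options listed above, here is what the parameters for such a job flow might look like when built for the boto3 EMR client's `run_job_flow` call. The bucket names, scripts, release label, and instance types are placeholders, and a real cluster would also need IAM roles and a key pair configured.

```python
# Illustrative parameters for launching a transient EMR cluster ("job flow").
# All s3:// paths and the release label below are placeholders.
job_flow_params = {
    "Name": "log-analysis",
    "LogUri": "s3://my-bucket/emr-logs/",             # placeholder bucket
    "ReleaseLabel": "emr-5.30.0",
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 4,                           # 1 master + 3 core nodes
        "KeepJobFlowAliveWhenNoSteps": False,         # terminate after steps finish
    },
    "BootstrapActions": [
        {
            "Name": "install-deps",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install-deps.sh",
            },
        }
    ],
    "Steps": [
        {
            "Name": "hive-query",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["hive-script", "--run-hive-script",
                         "--args", "-f", "s3://my-bucket/queries/logs.q"],
            },
        }
    ],
}
# With AWS credentials configured, this would be submitted as:
#   boto3.client("emr").run_job_flow(**job_flow_params)
print(sorted(job_flow_params))
```

Setting `KeepJobFlowAliveWhenNoSteps` to `True` instead gives a long-running cluster that accepts ad-hoc steps, the other pattern discussed in the document.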
Running Presto and Spark on the Netflix Big Data Platform (Eva Tse)
This document summarizes Netflix's big data platform, which uses Presto and Spark on Amazon EMR and S3. Key points:
- Netflix processes over 50 billion hours of streaming per quarter from 65+ million members across over 1000 devices.
- Their data warehouse contains over 25 PB stored on S3. They read roughly 10% of it daily and write back about 10% of what they read.
- They use Presto for interactive queries and Spark for both batch and iterative jobs.
- They have customized Presto and Spark for better performance on S3 and Parquet, and contributed code back to open source projects.
- Their architecture leverages dynamic EMR clusters with Presto and Spark deployed via bootstrap actions for scalability.
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the... (Amazon Web Services)
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology, and Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. The combination of the two can power advanced analytics that not only explain what has happened in the past, but make intelligent predictions about the future. Please join this webinar to learn how to get the most value from your data for your data-driven business.
Learning Objectives:
• How to scale your Redshift queries with user-defined functions (UDFs)
• How to apply machine learning to historical data in Amazon Redshift
• How to visualize your data with Amazon QuickSight
• Present a reference architecture for advanced analytics
Who Should Attend:
Application developers looking to add UDFs or predictive analytics to their applications; database administrators who need to meet the demands of data-driven organizations; and decision makers looking to derive more insight from their data.
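To make the UDF idea concrete: the body of a Redshift scalar Python UDF is ordinary Python, so its logic can be sketched and tested outside the database. The `churn_score` function and its scoring formula below are entirely hypothetical, shown only to illustrate the shape of such a function.

```python
# A Redshift scalar Python UDF wraps a plain Python body. A function like this
# hypothetical churn-risk score could be registered roughly as:
#
#   CREATE FUNCTION churn_score (months_active int, support_tickets int)
#   RETURNS float IMMUTABLE AS $$
#       ...body below...
#   $$ LANGUAGE plpythonu;
#
# and then called per row in SQL:
#   SELECT customer_id, churn_score(months_active, support_tickets) FROM customers;
def churn_score(months_active, support_tickets):
    """Toy score: newer customers with many tickets score as higher risk."""
    if months_active is None or support_tickets is None:
        return None  # UDFs should handle SQL NULLs explicitly
    tenure_factor = 1.0 / (1 + months_active)
    ticket_factor = min(support_tickets / 10.0, 1.0)
    return round(0.5 * tenure_factor + 0.5 * ticket_factor, 3)

print(churn_score(1, 8))   # short tenure, many tickets: higher risk
print(churn_score(36, 0))  # long tenure, no tickets: lower risk
```

Because the UDF runs inside Redshift, the same logic is applied to every row of a query without moving data out of the warehouse, which is how UDFs help queries scale.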
Business intelligence is often described as a set of methodologies and technologies that transform raw data into meaningful and useful information for business purposes. But this simple description hides many technical challenges IT teams struggle with. This session will show how to build business intelligence applications leveraging AWS, from raw data import, consumption, and storage through to information production. We will also cover best practices for services such as Amazon Redshift and Amazon RDS, and how to use applications such as SAP HANA, Jaspersoft, and others.
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft... (Amazon Web Services)
Amazon EMR is a managed service that makes it easy for customers to use big data frameworks and applications like Apache Hadoop, Spark, and Presto to analyze data stored in HDFS or on Amazon S3, Amazon’s highly scalable object storage service. In this session, we will introduce Amazon EMR and the greater Apache Hadoop ecosystem, and show how customers use them to implement and scale common big data use cases such as batch analytics, real-time data processing, interactive data science, and more. Then, we will walk through a demo to show how you can start processing your data at scale within minutes.
Evolving Your Big Data Use Cases from Batch to Real-Time - AWS May 2016 Webi... (Amazon Web Services)
Batch querying and reporting is no longer enough for many organizations. Reducing time to insight – the time it takes to turn data into actionable insights – is becoming increasingly important to remain competitive. That’s why organizations are quickly evolving their data applications to support a broader set of real-time analytic use cases.
In this webinar, we will review some of the common use cases for real-time analytics, such as clickstream analysis and event data processing. We will show proven architectures for collecting, storing, and processing real-time data using a combination of AWS managed services, including Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon EMR, and AWS Lambda, as well as open-source tools such as Apache Spark. Then, we will discuss common approaches and best practices to incorporate real-time analytics into your existing batch applications.
Learning Objectives:
• Understand how to incorporate real-time analytics into existing applications
• Best practices to combine batch with real-time data flows
• Learn common architectures and use cases for real-time analytics
The document provides an overview of Apache Spark and Hadoop ecosystem tools on Amazon EMR, including Spark, Hive on Tez, and Presto. It discusses building data lakes with Amazon EMR and S3, running jobs and security options, and customer use cases. The demo shows the Zeppelin and Hue interfaces. Examples are given of Netflix using Presto on EMR with a 25 PB dataset and FINRA saving 60% in costs by moving to HBase on EMR.
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR (Amazon Web Services)
The document provides an introduction to Apache Spark, Hive on Tez, and Presto on Amazon EMR. It discusses how to build data lakes using Amazon S3 for storage and Amazon EMR for processing. It also covers running jobs on EMR clusters, security options, and two customer use cases - one by FINRA that saved 60% costs by moving to HBase on EMR, and one by Netflix that uses Presto on EMR for a 25PB dataset in S3.
The document provides an overview of Amazon Elastic MapReduce (EMR) including how to easily launch and manage clusters, leverage Amazon S3 for storage, optimize file formats and storage, and design patterns for batch processing, interactive querying, and server clusters. It also shares lessons learned from Swiftkey including using Parquet and Cascalog for ETL, getting serialization right, avoiding many small files in S3, using spot instances, and experimenting with instance types. The document concludes by mentioning Apache Spark on EMR for faster in-memory processing directly from S3.
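The "avoid many small files in S3" lesson lends itself to a quick sketch: each object becomes at least one mapper task, so thousands of tiny files mean thousands of short, overhead-dominated tasks. The compaction planner below is illustrative only, and the 128 MB target is an assumed figure chosen for the example rather than an official recommendation.

```python
# Plan a compaction pass that groups small S3 objects into larger outputs.
# The target size is an assumption for illustration; pick one appropriate to
# your input format and cluster.
TARGET_FILE_BYTES = 128 * 1024 * 1024

def plan_compaction(file_sizes_bytes):
    """Group small files into batches of roughly TARGET_FILE_BYTES each."""
    batches, current, current_size = [], [], 0
    for size in sorted(file_sizes_bytes, reverse=True):
        if current and current_size + size > TARGET_FILE_BYTES:
            batches.append(current)          # close the full batch
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        batches.append(current)
    return batches

# 1,000 log files of ~1 MB each collapse into a handful of ~128 MB outputs.
sizes = [1024 * 1024] * 1000
batches = plan_compaction(sizes)
print(f"{len(sizes)} input files -> {len(batches)} output files")
```

A job driven by this plan would then launch one copy/merge task per batch, turning a thousand mapper tasks into a handful.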
Amazon Elastic MapReduce (EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores such as Amazon S3, how to take advantage of both transient and active clusters, as well as other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
Slides from the Cloudyna event in Katowice, Poland on November 14th, 2015. Data analysis is being used to transform businesses, increase efficiency, and drive innovation. The AWS Cloud has a comprehensive portfolio of analytics services to help you process data of any volume and automate how you put that data to work for your organization. In this session we'll see how to put those services at work on structured, unstructured and real-time data.
Some thoughts on measuring the impact of developer relations (Ian Massingham)
The document discusses metrics for measuring the impact of developer relations (DevRel). It recommends focusing on three key areas: generating volume by scaling audiences and channels; nurturing community by expanding developer channels, meeting developers where they are, and finding/nurturing leaders; and driving conversion to paid customers by supporting programs to generate leads, facilitating in-account sessions, and impacting C-levels. Common DevRel metrics include event attendees, webinar viewers, calculated impact scores, community members and activity, and leads generated. The goal is to tweak metrics over time to best measure performance and room for improvement in these important areas.
Slides from my talk at the Leeds IoT Meetup on November 20th. Includes links to resources to help you get started with creating connected device applications with the AWS IoT Service
Getting started with AWS Lambda and the Serverless Cloud (Ian Massingham)
Slides from the MongoDB user group meetup talk that I did in March 2017.
A Gist containing a (very simple) code sample is here: https://gist.github.com/ianmas-aws/ce847270ecedf9a58cbcc1ed736cf541
AWS AWSome Day - Getting Started Best Practices (Ian Massingham)
The document outlines eight best practices for getting started with AWS: 1) choose your first use case well, 2) lay out your account structure and foundations, 3) think about security, 4) view AWS as services rather than software, 5) optimize costs, 6) use AWS tools and frameworks, 7) get support, and 8) ensure architectures are well designed. It provides guidance on each practice area, such as setting up billing alerts and consolidated billing, using IAM for access management, leveraging managed services, and following the Well-Architected Framework. Resources for learning more about AWS are also listed.
Ian Massingham from Amazon Web Services discusses designing and building applications for the Internet of Things using AWS services. The document outlines how AWS IoT provides scalable connectivity and management for IoT devices, allows processing of sensor data from devices in AWS using services like DynamoDB and Lambda, and addresses challenges of limited capabilities on edge devices through Greengrass, which runs Lambda functions and messaging locally on devices. Greengrass provides the same programming model for both cloud and edge computing with IoT applications.
This document summarizes announcements from AWS re:Invent 2016 related to transforming applications, security, cost optimization, reliability, and operational excellence. Key services discussed include the Well-Architected Framework course, Amazon CloudFormation, AWS OpsWorks for Chef Automate, Amazon EC2 Systems Manager, AWS CodeBuild, AWS X-Ray, AWS Personal Health Dashboard, AWS Shield, Amazon Pinpoint, AWS Glue, AWS Batch, C# support for AWS Lambda, AWS Lambda@Edge, AWS Step Functions, and several others. Many of these services were generally available or in preview at the time.
The document provides an overview of Amazon Web Services (AWS) and its capabilities across compute, storage, database, analytics, artificial intelligence, developer tools, and other services. It highlights the scalability, reliability, and security of the AWS platform and introduces new and expanded capabilities across compute types, databases, analytics, artificial intelligence, edge computing, data transfer, and migration services. It also summarizes AWS' global infrastructure and support offerings.
Getting Started with AWS Lambda & Serverless Cloud (Ian Massingham)
This document provides an overview of serverless computing using AWS Lambda. It defines serverless computing as running code without servers by paying only for the compute time consumed. AWS Lambda allows triggering functions from events or APIs which makes it easy to build scalable back-ends, perform data processing, and integrate systems. Recent updates include support for Python, scheduled functions, VPC access, and versioning. The document demonstrates using Lambda for building serverless web apps and microservices.
Building Better IoT Applications without Servers (Ian Massingham)
This document discusses using serverless architectures with AWS services like AWS IoT, Lambda, DynamoDB, and S3 to build IoT applications without having to manage servers. It provides examples of how to connect devices to AWS IoT and trigger AWS Lambda functions in response to device events. These functions can then interact with other AWS services like DynamoDB, S3, and external APIs to implement applications like counting item usage from an IoT button and storing the data in DynamoDB, or starting a device when the button is pressed by invoking an external API via Lambda. The document also provides guidance on setting up a Raspberry Pi with sensors for local IoT development and connecting devices to AWS IoT.
This document provides an overview of Amazon Web Services (AWS) presented by Ian Massingham at an AWSome Day event. Some key points:
- AWS has over 1 million active customers including startups, enterprises, and independent software vendors.
- The cloud has become the new normal for companies of all sizes to build and deploy applications faster.
- AWS offers a vast technology platform of infrastructure and services including compute, storage, databases, analytics and more that allows for agility and innovation.
This document provides an introduction and overview of Amazon Web Services (AWS). It discusses that AWS has over 1 million active customers, including startups, enterprises, and independent software vendors. It highlights how AWS allows for agility through quick provisioning, a vast technology platform, and rapid innovation with new features. The document promotes learning more about AWS through blogs, events, training and certification programs. It encourages readers to create an AWS account and try new services.
1. The document discusses Amazon Web Services (AWS) and serverless computing. It highlights how AWS services like AWS Lambda, Amazon S3, and Amazon DynamoDB allow developers to run applications without managing infrastructure.
2. It provides examples of how serverless architectures can be used for web applications, data processing, internet of things applications, and as a connective tissue across AWS environments.
3. The document concludes by demonstrating how to deploy AWS Lambda functions using Terraform to automate infrastructure provisioning and management.
Getting started with AWS IoT on Raspberry Pi (Ian Massingham)
This document discusses getting started with AWS IoT using a Raspberry Pi. It provides an agenda that covers what AWS IoT is, why the Raspberry Pi is a good option for IoT prototyping, necessary hardware, setup instructions, examples, and pricing. The speaker will discuss setting up the Raspberry Pi with an electronics kit and sensors, configuring an AWS IoT device, and provide code examples to emulate an AWS IoT button and control a Raspberry Pi Sense Hat via AWS IoT Device Shadow using Python.
This document provides an introduction and overview of Amazon Web Services (AWS). It summarizes that AWS has over 1 million active customers, including startups, enterprises, and independent software vendors. It describes the vast infrastructure and services available on AWS, including compute, storage, databases, analytics, machine learning and more. It also discusses how AWS supports rapid innovation with new features and services, and provides information on training and certification opportunities to learn AWS.
GOTO Stockholm - AWS Lambda - Logic in the cloud without a back-end (Ian Massingham)
Slides from my session at Goto Stockholm where I talked about AWS Lambda and how it can be used to build reliable, scalable & low-cost applications, without servers for you to manage.
Special thanks to James Hall at Parallax for allowing me to talk about the awesome application that they built using AWS Lambda, Amazon API Gateway & Amazon DynamoDB :)
Advanced Security Masterclass - Tel Aviv Loft (Ian Massingham)
The document provides an overview of advanced security best practices when using AWS. It discusses identity and access management with IAM, defining virtual networks with Amazon VPC, networking and security for Amazon EC2 instances, working with container and abstracted services, and encryption and key management in AWS. The presentation emphasizes sharing security responsibility between AWS and customers, implementing least privilege access, enabling auditing and monitoring, and using services like IAM, VPC, security groups, and AWS Key Management Service to help secure workloads in AWS.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
2. Masterclass
Intended to educate you on how to get the best from AWS services
Show you how things work and how to get things done
A technical deep dive that goes beyond the basics
3. Amazon Elastic MapReduce
Provides a managed Hadoop framework
Makes it easy, fast & cost-effective to process vast amounts of data
Run other popular distributed frameworks such as Spark
5. Amazon EMR: Example Use Cases
Amazon EMR can be used
to process vast amounts of
genomic data and other
large scientific data sets
quickly and efficiently.
Researchers can access
genomic data hosted for
free on AWS.
Genomics
Amazon EMR can be used
to analyze click stream data
in order to segment users
and understand user
preferences. Advertisers
can also analyze click
streams and advertising
impression logs to deliver
more effective ads.
Clickstream Analysis
Amazon EMR can be used
to process logs generated
by web and mobile
applications. Amazon EMR
helps customers turn
petabytes of un-structured
or semi-structured data into
useful insights about their
applications or users.
Log Processing
6. Agenda
Hadoop Fundamentals
Core Features of Amazon EMR
How to Get Started with Amazon EMR
Supported Hadoop Tools
Additional EMR Features
Third Party Tools
Resources where you can learn more
9. Lots of actions by John Smith: very large clickstream logging data (e.g. TBs)
10. Split the log into many small pieces
11. Process in an EMR cluster
12. Aggregate the results from all the nodes
13. Result: what John Smith did
14. Insight in a fraction of the time
23. Amazon S3 + Amazon EMR
Allows you to decouple storage and computing resources
Use Amazon S3 features such as server-side encryption
When you launch your cluster, EMR streams data from S3
Multiple clusters can process the same data concurrently
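A common way to organize input data on S3 for EMR is date-partitioned key prefixes (as in the `dt=...` paths used in the Spark example later in this deck), so a cluster — or several clusters concurrently — can stream only the partitions it needs. A minimal sketch of such a layout helper; the function and bucket layout here are illustrative, not an AWS API:

```python
def partitioned_key(prefix, dt, part, ext="gz"):
    """Build a date-partitioned S3 key, e.g. logs/dt=2015-06-04/part-00001.gz."""
    return "%s/dt=%s/part-%05d.%s" % (prefix, dt, part, ext)

# Example: keys for one day's log shards under a hypothetical "logs" prefix
keys = [partitioned_key("logs", "2015-06-04", i) for i in range(3)]
```

With this layout, a job can take an entire day's partition as its input prefix without listing the whole bucket.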
27. Develop your data processing application
http://aws.amazon.com/articles/Elastic-MapReduce
28. Develop your data processing application
Upload your application and data to Amazon S3
32. Develop your data processing application
Upload your application and data to Amazon S3
Configure and launch your cluster
33. Configure and launch your cluster
Start an EMR cluster using the console, CLI tools or an AWS SDK
34. Configure and launch your cluster
Master instance group created that controls the cluster
35. Configure and launch your cluster
Core instance group created for the life of the cluster
36. Configure and launch your cluster
Core instances run DataNode and TaskTracker daemons (hosting HDFS)
37. Configure and launch your cluster
Optional task instances can be added or subtracted
38. Configure and launch your cluster
Master node coordinates distribution of work and manages cluster state
39. Develop your data processing application
Upload your application and data to Amazon S3
Configure and launch your cluster
Optionally, monitor the cluster
40. Develop your data processing application
Upload your application and data to Amazon S3
Configure and launch your cluster
Optionally, monitor the cluster
Retrieve the output
41. Retrieve the output
Amazon S3 can be used as the underlying ‘file system’ for input/output data
43. Hadoop Streaming
Utility that comes with the Hadoop distribution
Allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer
The mapper reads its input from standard input and the reducer writes its output through standard output
By default, each line of input/output represents a record with a tab-separated key/value pair
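The tab-separated record convention can be seen in a small sketch (plain Python, not a Hadoop API) that splits one streaming record into its key and value:

```python
def parse_record(line):
    """Split one streaming record, e.g. 'LongValueSum:abacus\t1', into (key, value)."""
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value

# The mapper below emits records exactly in this shape
key, value = parse_record("LongValueSum:abacus\t1")
```

Any executable that reads such lines on stdin and writes them on stdout can serve as a mapper or reducer — which is exactly what the wordSplitter.py example that follows does.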
46. Step 1: mapper: wordSplitter.py
#!/usr/bin/python
import sys
import re

def main(argv):
    pattern = re.compile("[a-zA-Z][a-zA-Z0-9]*")
    for line in sys.stdin:
        for word in pattern.findall(line):
            print "LongValueSum:" + word.lower() + "\t" + "1"

if __name__ == "__main__":
    main(sys.argv)
47. Step 1: mapper: wordSplitter.py (same code as above)
Read words from StdIn line by line
48. Step 1: mapper: wordSplitter.py (same code as above)
Output to StdOut tab delimited records in the format “LongValueSum:abacus 1”
49. Step 1: reducer: aggregate
Sorts inputs and adds up totals:
“abacus 1”
“abacus 1”
“abacus 1”
becomes
“abacus 3”
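The behavior of the built-in aggregate reducer can be sketched in plain Python (illustrative only, not the actual Hadoop implementation): group the sorted `key<TAB>count` records the mapper emitted and sum the counts per key.

```python
def aggregate(records):
    """Sum tab-separated (key, count) records, such as the mapper's
    'LongValueSum:abacus\t1' lines, into per-key totals."""
    totals = {}
    for record in records:
        key, _, count = record.partition("\t")
        totals[key] = totals.get(key, 0) + int(count)
    return totals

# Three occurrences of "abacus" collapse to a single total of 3
result = aggregate(["LongValueSum:abacus\t1"] * 3)
```

Because Hadoop sorts mapper output by key before the reduce phase, all records for a given word arrive together, so a real streaming reducer can sum a running total without holding every key in memory.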
50. Step 1: input/output
The input is all the objects in the S3 bucket/prefix:
s3://eu-west-1.elasticmapreduce/samples/wordcount/input
Output is written to the following S3 bucket/prefix to be used as input for the next step in the job flow:
s3://ianmas-aws-emr/intermediate/
One output object is created for each reducer (generally one per core)
51. Job Flow: Step 2
JAR location: /home/hadoop/contrib/streaming/hadoop-streaming.jar
Arguments:
-mapper /bin/cat
-reducer org.apache.hadoop.mapred.lib.IdentityReducer
-input s3://ianmas-aws-emr/intermediate/
-output s3://ianmas-aws-emr/output
-jobconf mapred.reduce.tasks=1
/bin/cat as the mapper accepts anything and returns it as text
55. Job Flow: Step 2 (same arguments as above)
Use a single reduce task (mapred.reduce.tasks=1) to get a single output object
57. Supported Hadoop Tools
Pig: An open source analytics package that runs on top of Hadoop. Pig is operated by Pig Latin, a SQL-like language which allows users to structure, summarize, and query data. Allows processing of complex and unstructured data sources such as text documents and log files.
Hive: An open source data warehouse & analytics package that runs on top of Hadoop. Operated by Hive QL, a SQL-based language which allows users to structure, summarize, and query data.
HBase: Provides an efficient way of storing large quantities of sparse data using column-based storage. HBase provides fast lookup of data because data is stored in-memory instead of on disk. Optimized for sequential write operations, and highly efficient for batch inserts, updates, and deletes.
58. Supported Hadoop Tools
Presto: An open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Impala: A tool in the Hadoop ecosystem for interactive, ad hoc querying using SQL syntax. It uses a massively parallel processing (MPP) engine similar to that found in a traditional RDBMS, which lends Impala to interactive, low-latency analytics. You can connect to BI tools through ODBC and JDBC drivers.
Hue (New): An open source user interface for Hadoop that makes it easier to run and develop Hive queries, manage files in HDFS, run and develop Pig scripts, and manage tables.
61. Create a Cluster with Spark
AWS CLI:
$ aws emr create-cluster --name "Spark cluster" --ami-version 3.8 --applications Name=Spark --ec2-attributes KeyName=myKey --instance-type m3.xlarge --instance-count 3 --use-default-roles
$ ssh -i myKey hadoop@masternode
Invoke the Spark shell with:
$ spark-shell
or
$ pyspark
62. Working with the Spark Shell
Counting the occurrences of a string in a text file stored in Amazon S3 with pyspark:
$ pyspark
>>> sc
<pyspark.context.SparkContext object at 0x7fe7e659fa50>
>>> textfile = sc.textFile("s3://elasticmapreduce/samples/hive-ads/tables/impressions/dt=2009-04-13-08-05/ec2-0-51-75-39.amazon.com-2009-04-13-08-05.log")
>>> linesWithCartoonNetwork = textfile.filter(lambda line: "cartoonnetwork.com" in line).count()
15/06/04 17:12:22 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library from the embedded binaries
<snip>
<Spark program continues>
>>> linesWithCartoonNetwork
9
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-spark.html
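The filter-and-count logic inside that RDD pipeline can be sanity-checked locally in plain Python before paying for cluster time; the log lines below are made up purely for illustration:

```python
def count_matching(lines, needle):
    """Count lines containing the substring, mirroring filter(...).count() above."""
    return sum(1 for line in lines if needle in line)

# Hypothetical impression-log lines, not real sample data
sample_log = [
    "GET http://cartoonnetwork.com/ad 200",
    "GET http://example.com/index 200",
    "GET http://cartoonnetwork.com/img 304",
]
matches = count_matching(sample_log, "cartoonnetwork.com")
```

The Spark version does the same thing, but partitions the input across the cluster and sums the per-partition counts.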
64. CONTROL NETWORK ACCESS TO YOUR EMR CLUSTER
Using SSH local port forwarding:
ssh -i EMRKeyPair.pem -N -L 8160:ec2-52-16-143-78.eu-west-1.compute.amazonaws.com:8888 hadoop@ec2-52-16-143-78.eu-west-1.compute.amazonaws.com
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-web-interfaces.html
65. MANAGE USERS, PERMISSIONS
AND ENCRYPTION
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-access.html
66. INSTALL ADDITIONAL SOFTWARE
WITH BOOTSTRAP ACTIONS
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html
67. EFFICIENTLY COPY DATA TO
EMR FROM AMAZON S3
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
Run on a cluster master node:
$ hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar -Dmapreduce.job.reduces=30 --src s3://s3bucketname/ --dest hdfs://$HADOOP_NAMENODE_HOST:$HADOOP_NAMENODE_PORT/data/ --outputCodec 'none'
71. USE THE MAPR DISTRIBUTION
aws.amazon.com/elasticmapreduce/mapr/
72. TUNE YOUR CLUSTER FOR
COST & PERFORMANCE
Supported EC2 instance types
• General Purpose
• Compute Optimized
• Memory Optimized
• Storage Optimized - D2 instance family
D2 instances are available in four sizes with 6TB, 12TB, 24TB, and 48TB storage options.
• GPU Instances
New
73. TUNE YOUR CLUSTER FOR COST & PERFORMANCE
Scenario #1: allocate 4 on-demand instances; job flow duration 14 hours
Cost without Spot: 4 instances * 14 hrs * $0.50
Total = $28
Scenario #2: add 5 additional spot instances; job flow duration drops to 7 hours
Cost with Spot: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50%
Cost Savings: ~19%
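The arithmetic behind the on-demand vs. spot comparison can be checked in a few lines of Python (prices and durations are the slide's example figures, not current AWS pricing):

```python
# Scenario 1: 4 on-demand instances for 14 hours at $0.50/hr
cost_without_spot = 4 * 14 * 0.50                      # $28.00

# Scenario 2: the same 4 on-demand instances plus 5 spot
# instances at $0.25/hr, finishing in 7 hours
on_demand = 4 * 7 * 0.50                               # $14.00
spot = 5 * 7 * 0.25                                    # $8.75
cost_with_spot = on_demand + spot                      # $22.75

time_savings = 1 - 7 / 14.0                            # 0.50
cost_savings = 1 - cost_with_spot / cost_without_spot  # 0.1875, i.e. ~19%
```

Working the numbers through shows the cost saving is 18.75%: the extra spot capacity halves the runtime, so even with five more instances the total instance-hours bill drops.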
78. Getting Started with Amazon EMR Tutorial guide:
docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-get-started.html
Customer Case Studies for Big Data Use-Cases
aws.amazon.com/solutions/case-studies/big-data/
Amazon EMR Documentation:
aws.amazon.com/documentation/emr/