Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch four years ago, our customers have launched more than 5.5 million Hadoop clusters. In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.
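To make the "short-lived cluster backed by S3" pattern concrete, here is a minimal boto3 sketch that launches a transient EMR cluster which logs to S3 and terminates itself once its steps finish. The bucket, region, release label, and instance sizes are illustrative placeholders, not values from the talk.

```python
import boto3

# A minimal sketch of the transient-cluster pattern: data and logs live in S3,
# and the cluster terminates when there is no more work to do.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="transient-analysis-cluster",
    ReleaseLabel="emr-6.15.0",                      # placeholder release
    Applications=[{"Name": "Hive"}, {"Name": "Spark"}],
    LogUri="s3://example-bucket/emr-logs/",         # hypothetical bucket
    Instances={
        "InstanceGroups": [
            {"Name": "master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # False = transient: the cluster shuts down after its steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```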
Speakers:
Ian Meyers, AWS Solutions Architect
Ian McDonald, IT Director, SwiftKey
Learning Objectives:
- Learn the common use cases for Athena, AWS's interactive query service on S3
- Learn best practices for creating tables and partitions and performance optimizations
- Learn how Athena handles security, authorization, and authentication
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don't even need to load your data into Athena; it works directly with data stored in S3.
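As an illustration of the "no infrastructure, query data in place" idea, the following boto3 sketch registers an external table over files in S3 and checks the execution status; the database, bucket, and column names are hypothetical.

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")
results_location = "s3://example-bucket/athena-results/"   # hypothetical output bucket

# Define a table directly over data already sitting in S3 (no loading step).
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS access_logs (
    request_time string,
    status       int,
    bytes_sent   bigint
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/access-logs/'
"""

resp = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": results_location},
)

# Poll the execution status; SUCCEEDED means the table is ready to query.
status = athena.get_query_execution(QueryExecutionId=resp["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```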
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing and scale-out architecture to ensure compute resources grow with your dataset size, and columnar, direct-attached storage to dramatically reduce I/O time. Learn how top online retailer RetailMeNot moved their largest Vertica cluster on Amazon EC2 to Amazon Redshift. See how they gain insights from clickstream, location, merchant, marketing, and operational data across desktop and mobile properties.
Build Data Lakes & Analytics on AWS: Patterns & Best Practices (Amazon Web Services)
With over 90% of today's data generated in the last two years, the rate of data growth shows no sign of slowing down. In this session, we step through the challenges and best practices for capturing data, understanding what data you own, driving insights, and predicting the future using AWS services. We frame the session and demonstrations around common pitfalls of building data lakes and how to successfully drive analytics and insights from data. We also discuss architecture patterns that bring together key AWS services, including Amazon S3, AWS Glue, Amazon Athena, Amazon Kinesis, and Amazon Machine Learning. Discover the real-world application of data lakes for roles including data scientists and business users.
Stephen Moon, Sr. Solutions Architect, Amazon Web Services
James Juniper, Solution Architect for the Geo-Community Cloud, Natural Resources Canada
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
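One way the Spot pattern is typically applied is to add a task-node instance group purchased on the Spot market to an already running cluster. A minimal boto3 sketch, with a hypothetical cluster ID and instance type:

```python
import boto3

emr = boto3.client("emr")

# Add Spot task nodes to an existing cluster; task nodes hold no HDFS data,
# so losing Spot capacity does not put data stored on the cluster at risk.
emr.add_instance_groups(
    JobFlowId="j-EXAMPLECLUSTER",        # hypothetical cluster ID
    InstanceGroups=[{
        "Name": "spot-task-nodes",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 4,
        "Market": "SPOT",                # request Spot capacity for this group
    }],
)
```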
Take an in-depth look at data warehousing with Amazon Redshift and get answers to your technical questions. We will cover performance tuning techniques that take advantage of Amazon Redshift's columnar technology and massively parallel processing architecture. We will also discuss best practices for migrating from existing data warehouses, optimizing your schema, loading data efficiently, and using workload management and interleaved sorting.
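Loading data efficiently usually means a parallel COPY from S3 rather than row-by-row inserts. A sketch using the Redshift Data API, with a hypothetical cluster, table, bucket, and IAM role:

```python
import boto3

rsd = boto3.client("redshift-data")

# COPY loads the files under the S3 prefix in parallel across the cluster's slices.
copy_sql = (
    "COPY sales "
    "FROM 's3://example-bucket/sales/' "
    "IAM_ROLE 'arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole' "
    "FORMAT AS PARQUET;"
)

resp = rsd.execute_statement(
    ClusterIdentifier="example-cluster",   # hypothetical cluster
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
print(rsd.describe_statement(Id=resp["Id"])["Status"])
```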
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time on querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
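To illustrate the discovery-and-catalog step, here is a hedged boto3 sketch that creates and starts a crawler over an S3 prefix and then lists the tables it registered; the role, database, and bucket names are assumptions.

```python
import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and record the inferred schemas in the Glue Data Catalog.
glue.create_crawler(
    Name="raw-events-crawler",
    Role="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",  # hypothetical role
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw-events/"}]},
)
glue.start_crawler(Name="raw-events-crawler")

# Once the crawler finishes, the discovered tables are queryable from Athena
# or Redshift Spectrum via the shared Data Catalog.
for table in glue.get_tables(DatabaseName="analytics")["TableList"]:
    print(table["Name"])
```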
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to set up an Amazon EMR job flow to analyse application logs and perform Hive queries against it. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job.
Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service.
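For the log-analysis workflow described above, a Hive script stored in S3 can be submitted to a running cluster as a step and then monitored programmatically. A minimal sketch; the cluster ID and script path are hypothetical:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-EXAMPLECLUSTER"          # hypothetical running cluster

# Submit a Hive script from S3 as a step on the cluster.
step = {
    "Name": "hive-log-analysis",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["hive-script", "--run-hive-script", "--args",
                 "-f", "s3://example-bucket/scripts/top_errors.hql"],
    },
}
resp = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])

# Check the step status (PENDING, RUNNING, COMPLETED, FAILED, ...).
status = emr.describe_step(ClusterId=cluster_id, StepId=resp["StepIds"][0])
print(status["Step"]["Status"]["State"])
```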
See a recording of the webinar based on this presentation on YouTube here:
Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/
See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora, a MySQL-compatible, highly available relational database engine that provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
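Once a job has been defined (or its code generated) in Glue, running and monitoring it is a couple of API calls. A sketch with a hypothetical job name and argument:

```python
import boto3

glue = boto3.client("glue")

# Kick off a previously defined Glue ETL job; Arguments are passed to the script.
run = glue.start_job_run(
    JobName="orders-to-parquet",                              # hypothetical job
    Arguments={"--target_path": "s3://example-bucket/curated/orders/"},
)

# Poll the run state (STARTING, RUNNING, SUCCEEDED, FAILED, ...).
state = glue.get_job_run(JobName="orders-to-parquet", RunId=run["JobRunId"])
print(state["JobRun"]["JobRunState"])
```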
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
AWS S3 | Tutorial For Beginners | AWS S3 Bucket Tutorial | AWS Tutorial For B... (Simplilearn)
This AWS S3 presentation will help you understand what cloud storage is, the types of storage, life before Amazon S3, what S3 (Amazon Simple Storage Service) is, the benefits of S3, objects and buckets, and how Amazon S3 works, along with an explanation of the features of AWS S3. Amazon S3 is a storage service for the Internet. It is a simple storage service that offers software developers a highly scalable, reliable, and low-latency data storage infrastructure at a relatively low cost. Amazon S3 provides a simple web service interface that can be used to store and retrieve any amount of data. Using this, developers can build applications that make use of Internet storage with ease. Amazon S3 is designed to be highly flexible and scalable. Now, let's dive into this presentation and understand what Amazon S3 actually is.
Below topics are explained in this AWS S3 presentation:
1. What is Cloud storage?
2. Types of storage
3. Before Amazon S3
4. What is S3
5. Benefits of S3
6. Objects and buckets
7. How does Amazon S3 work
8. Features of S3
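For the objects-and-buckets topic above, here is a minimal boto3 sketch of the basic workflow; bucket names are globally unique, so the one below is only a placeholder.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
bucket = "example-bucket-1234"          # placeholder; bucket names are globally unique

# Buckets are containers; objects are the stored items, addressed by key.
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key="reports/2024/summary.csv", Body=b"id,total\n1,42\n")

# Read the object back by bucket + key.
obj = s3.get_object(Bucket=bucket, Key="reports/2024/summary.csv")
print(obj["Body"].read().decode())
```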
This AWS certification training is designed to help you gain an in-depth understanding of Amazon Web Services (AWS) architectural principles and services. You will learn how cloud computing is redefining the rules of IT architecture and how to design, plan, and scale AWS Cloud implementations with best practices recommended by Amazon. The AWS Cloud platform powers hundreds of thousands of businesses in 190 countries, and AWS certified solutions architects take home about $126,000 per year.
This AWS certification course will help you learn the key concepts, latest trends, and best practices for working with the AWS architecture, and become an industry-ready AWS Certified Solutions Architect, qualified for a position as a high-quality AWS professional.
The course begins with an overview of the AWS platform before diving into its individual elements: IAM, VPC, EC2, EBS, ELB, CDN, S3, EIP, KMS, Route 53, RDS, Glacier, Snowball, CloudFront, DynamoDB, Redshift, Auto Scaling, CloudWatch, ElastiCache, CloudTrail, and Security. Those who complete the course will be able to:
1. Formulate solution plans and provide guidance on AWS architectural best practices
2. Design and deploy scalable, highly available, and fault tolerant systems on AWS
3. Identify the lift and shift of an existing on-premises application to AWS
4. Decipher the ingress and egress of data to and from AWS
5. Select the appropriate AWS service based on data, compute, database, or security requirements
6. Estimate AWS costs and identify cost control mechanisms
This AWS course is recommended for professionals who want to pursue a career in Cloud computing or develop Cloud applications with AWS. You’ll become an asset to any organization, helping leverage best practices around advanced cloud-based solutions and migrate existing workloads to the cloud.
Learn more at: https://www.simplilearn.com/
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don't even need to load your data into Athena; it works directly with data stored in S3. Level 200
In this session, we will show you how easy it is to start querying your data stored in Amazon S3 with Amazon Athena. First, we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Finally, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
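Partitioning is the most common of the cost-saving strategies mentioned: Athena bills for data scanned, so registering partitions and filtering on them limits the scan. A sketch, reusing the hypothetical table and buckets from the earlier example:

```python
import boto3

athena = boto3.client("athena")
results = {"OutputLocation": "s3://example-bucket/athena-results/"}   # hypothetical bucket

# Register a single day's partition so queries can prune to just that prefix.
athena.start_query_execution(
    QueryString=("ALTER TABLE access_logs ADD IF NOT EXISTS "
                 "PARTITION (dt='2024-01-01') "
                 "LOCATION 's3://example-bucket/access-logs/dt=2024-01-01/'"),
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration=results,
)

# Filtering on the partition column means Athena scans only that day's files.
athena.start_query_execution(
    QueryString=("SELECT status, count(*) FROM access_logs "
                 "WHERE dt = '2024-01-01' GROUP BY status"),
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration=results,
)
```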
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Hadoop / Spark Conference Japan 2019)
# English version #
http://hadoop.apache.jp/hcj2019-program/
In addition to running databases in Amazon EC2, AWS customers can choose among a variety of managed database services. These services save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We explain the fundamentals of Amazon DynamoDB, a fully managed NoSQL database service; Amazon RDS, a relational database service in the cloud; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We’ll cover how each service might help support your application, how much each service costs, and how to get started.
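As a small taste of the fully managed NoSQL option, here is a hedged boto3 sketch that writes and reads one item in a hypothetical DynamoDB table keyed on session_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("sessions")       # hypothetical table with partition key "session_id"

# Write a single item; no servers or schemas to manage up front.
table.put_item(Item={"session_id": "abc123", "user": "alice", "page_views": 7})

# Read it back by key.
item = table.get_item(Key={"session_id": "abc123"}).get("Item")
print(item)
```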
Amazon Relational Database Service (Amazon RDS) is a web service that makes it easier to set up, operate, and scale a relational database in the cloud. It provides cost-efficient, re-sizable capacity for an industry-standard relational database and manages common database administration tasks
Cloud DW benchmark using TPC-DS (Snowflake vs Redshift vs EMR Hive) (SANG WON PARK)
Over the past few years, data architectures have been changing rapidly,
and cloud DWs have drawn attention as an alternative to the limitations (performance, cost, operations, and so on) of existing Hadoop-based data lakes;
many companies have already adopted one or are evaluating adoption.
This material explains cloud DWs conceptually
and compares the various cloud DW products on the market from a performance and cost perspective to determine which one fits a company's environment.
- Why are companies paying attention to cloud DWs?
- What products are on the market?
- Which product should we adopt for our business environment?
- How do cloud DW solutions perform?
- How do they perform compared with an existing data lake (EMR)?
- How do comparable cloud DWs (Snowflake vs Redshift) compare?
Going forward, the market around data will rapidly develop a new ecosystem built on cloud DWs, including ELT, data mesh, and reverse ETL,
and this will call for technical review and consideration from the perspective of data engineers and data architects.
https://blog.naver.com/freepsw/222654809552
(BDT401) Amazon Redshift Deep Dive: Tuning and Best Practices (Amazon Web Services)
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, tune your queries, and use Amazon Redshift's interleaved sorting features. Finally, learn how TripAdvisor uses these best practices to give their entire organization access to analytic insights at scale.
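The "optimized schema" part of that list usually comes down to distribution and sort keys chosen to match join and filter patterns. A sketch via the Redshift Data API, with a hypothetical cluster and columns:

```python
import boto3

rsd = boto3.client("redshift-data")

# DISTKEY co-locates rows that join on customer_id; SORTKEY speeds date-range filters.
ddl = """
CREATE TABLE sales (
    sale_id     bigint,
    customer_id bigint,
    sale_date   date,
    amount      decimal(12,2)
)
DISTKEY (customer_id)
SORTKEY (sale_date);
"""

rsd.execute_statement(
    ClusterIdentifier="example-cluster",   # hypothetical cluster
    Database="dev",
    DbUser="awsuser",
    Sql=ddl,
)
```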
A quick tour in 16 slides of Amazon's Redshift clustered, massively parallel database.
Find out what differentiates it from the other database products Amazon has, including SimpleDB, DynamoDB and RDS (MySQL, SQL Server and Oracle).
Learn how it stores data on disk in a columnar format and how this relates to performance and interesting compression techniques.
Contrast the difference between Redshift and a MySQL instance and discover how the clustered architecture may help to dramatically reduce query time.
AWS offers a variety of analytics services for big data analysis and processing. In this session, we look at the internals of AWS Glue, which is used to build a data lake catalog or to run ETL for data analysis and processing workloads that keep growing over time, and we introduce ways to use it efficiently.
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ... (Amazon Web Services)
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, tune your queries, and use Amazon Redshift's interleaved sorting features. You'll then hear from a customer who has leveraged Redshift in their industry and how they have adopted many of the best practices. Learn More: https://aws.amazon.com/government-education/
In this session, you get an overview of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service. We'll cover how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also discuss new features, architecture best practices, and share how customers are using Amazon Redshift for their Big Data workloads.
For more training on AWS, visit: https://www.qa.com/amazon
AWS Loft | London - Amazon Virtual Private Cloud by Andrew Kane, Solution Architect
April 18, 2016
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a logically isolated section of the AWS cloud where you can launch AWS resources in a virtual network that you define. In this talk, we discuss advanced tasks in Amazon VPC, including the implementation of Amazon VPC peering, the creation of multiple network zones, the establishment of private connections, and the use of multiple routing tables. We also provide information for current Amazon EC2-Classic network customers and help you prepare to adopt Amazon VPC.
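To make the peering and routing tasks concrete, here is a minimal boto3 sketch that requests a peering connection between two VPCs, accepts it, and adds a route through it; all IDs and CIDR ranges are placeholders.

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Request a peering connection between two VPCs (placeholder IDs).
peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-0aaaa1111", PeerVpcId="vpc-0bbbb2222"
)
pcx_id = peering["VpcPeeringConnection"]["VpcPeeringConnectionId"]

# Accept the request (same account and region in this sketch), then route
# traffic destined for the peer's CIDR block through the peering connection.
ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)
ec2.create_route(
    RouteTableId="rtb-0cccc3333",
    DestinationCidrBlock="10.1.0.0/16",
    VpcPeeringConnectionId=pcx_id,
)
```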
Speakers:
Steve Seymour, AWS Solutions Architect
Eamonn O'Neill, Director, Lemongrass Consulting
Jackie Wong, Head of Networks, Financial Times
For more training on AWS, visit: https://www.qa.com/amazon
AWS Loft | London - Deep Dive: Amazon RDS by Toby Knight, Manager Solutions Architecture, 18 April 2016
AWS re:Invent 2016: Deep Dive on Amazon Relational Database Service (DAT305) (Amazon Web Services)
Amazon RDS allows customers to launch an optimally configured, secure, and highly available database with just a few clicks. It provides cost-efficient and resizable capacity while managing time-consuming database administration tasks, freeing you up to focus on your applications and business. Amazon RDS provides six database engines to choose from, including Amazon Aurora, Oracle, Microsoft SQL Server, PostgreSQL, MySQL, and MariaDB. In this session, we take a closer look at the capabilities of RDS and all the different options available. We do a deep dive into how RDS works and the best practices to achieve optimal performance, flexibility, and cost savings for your databases.
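The "few clicks" launch can equally be a single API call. A hedged boto3 sketch that creates a Multi-AZ PostgreSQL instance and waits for it to become available; every identifier and credential below is a placeholder.

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Launch a Multi-AZ PostgreSQL instance (all values are placeholders).
rds.create_db_instance(
    DBInstanceIdentifier="example-app-db",
    Engine="postgres",
    DBInstanceClass="db.t3.medium",
    AllocatedStorage=50,
    MasterUsername="app_admin",
    MasterUserPassword="change-me-please",   # prefer Secrets Manager in real use
    MultiAZ=True,                            # standby replica in another AZ
)

# Block until the instance is ready to accept connections.
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="example-app-db")
```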
AWS Summit London 2014 | Amazon Elastic MapReduce Deep Dive and Best Practice... (Amazon Web Services)
Join this advanced technical session on Amazon Elastic MapReduce (EMR) for an introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, how you can take advantage of both long and short-lived clusters as well as other Amazon EMR architectural patterns. Learn how to scale your cluster up or down dynamically and about ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices (Amazon Web Services)
Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters inside of EMR. In this webinar, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.
Big Data Analytics using Amazon Elastic MapReduce and Amazon Redshift (IndicThreads)
The talk will focus on the various runtime challenges that we experienced while dealing with big data and the scalable solution that we built using various Amazon Cloud services such as EMR, Redshift, and Amazon S3.
The talk will introduce the audience to Amazon Elastic MapReduce (EMR), a fully managed, hosted Hadoop framework on top of Amazon Elastic Compute Cloud (EC2), and Amazon Redshift, a fast, fully managed, cost-effective, petabyte-scale data warehouse service. It will cover how to deal with data processed using Hadoop when the processed data is so large that it creates a bottleneck for traditional relational databases like MySQL and Oracle. We will analyse the solution to this problem using Amazon Redshift. In addition, we will discuss the cost effectiveness of Amazon EMR and Redshift when dealing with big data of a few hundred gigabytes to a petabyte in size.
Our ad server clients generate 5 TB of user logs daily (logs of requests, impressions, clicks, etc.). We process these logs using EMR and store the processed output in an Amazon Redshift cluster. Our Redshift cluster currently holds around 10 TB of processed data, which is available for various end-user reports.
Session at the IndicThreads.com Conference held in Pune, India on 27-28 Feb 2015
http://www.indicthreads.com
http://pune15.indicthreads.com
If you are interested in learning more about the AWS Chicago Summit, please use the following to register: http://amzn.to/1RooPPL
Many AWS customers store vast amounts of data in Amazon S3, a low cost, scalable, and durable object store; Amazon DynamoDB, a NoSQL database; or Amazon Kinesis, a real time data stream processing service. With large datasets in various AWS services, how do you derive value from this information in a cost-effective way? Using Amazon Elastic MapReduce (Amazon EMR) with applications in the Apache Hadoop ecosystem, you can directly interact with data in each of these storage services for scalable analytics workloads or ad hoc queries. You can quickly and easily launch an Amazon EMR cluster from the AWS Management Console, and scale your cluster to match the compute and memory resources needed for your workflow, independent from the storage capacity used in your AWS storage services. The webinar will accelerate your use of Amazon EMR by showing you how to create and monitor Amazon EMR clusters, and provide several use cases and architectures for using Amazon EMR with different AWS data stores.
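Because the data stays in S3, DynamoDB, or Kinesis, resizing the compute side of the cluster is a lightweight operation. A sketch that grows the core group of a hypothetical running cluster:

```python
import boto3

emr = boto3.client("emr")
cluster_id = "j-EXAMPLECLUSTER"          # hypothetical running cluster

# Find the core instance group and grow it; data kept in S3 is unaffected.
groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
core = next(g for g in groups if g["InstanceGroupType"] == "CORE")

emr.modify_instance_groups(
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 6}]
)
```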
Learning Objectives:
- Recognize when to use Amazon EMR
- Understand the steps required to set up and monitor an Amazon EMR cluster
- Architect applications that effectively use Amazon EMR
- Understand how to use HUE for ad hoc query of data in Amazon S3
Who Should Attend:
- Developers, LOB owners, Continuous Integration & Continuous Delivery (CICD) practitioners
Amazon Elastic MapReduce (Amazon EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, and how to work with other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
Amazon EMR provides a managed framework which makes it easy, cost-effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. You also learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto, and other supported Hadoop applications on Amazon EMR; how to use Amazon S3 as a persistent data store and process data directly from Amazon S3; deployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot Instances to scale your transient infrastructure effectively.
Amazon Elastic MapReduce (EMR) is a web service that allows you to easily and securely provision and manage your Hadoop clusters. In this talk, we will introduce you to Amazon EMR design patterns, such as using various data stores like Amazon S3, how to take advantage of both transient and active clusters, as well as other Amazon EMR architectural patterns. We will dive deep on how to dynamically scale your cluster and address the ways you can fine-tune your cluster. We will discuss bootstrapping Hadoop applications from our partner ecosystem that you can use natively with Amazon EMR. Lastly, we will share best practices on how to keep your Amazon EMR cluster cost-effective.
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level... (Amazon Web Services)
Learn how to set up a highly scalable, robust, and secure Hadoop platform using Amazon EMR. We'll perform a demonstration using a 100-node Amazon EMR cluster and take you through the best practices and performance tuning required for different workloads to ensure they are production ready.
Speaker: Amo Abeyaratne, Big Data Consultant, Amazon Web Services
Featured Customer - Ambidata
Amazon EMR is one of the largest Hadoop operators in the world. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We will also share best practices to keep your Amazon EMR cluster cost-efficient. Finally, we dive into some of our recent launches to keep you current on our latest features.
(BDT208) A Technical Introduction to Amazon Elastic MapReduce (Amazon Web Services)
"Amazon EMR provides a managed framework which makes it easy, cost effective, and secure to run data processing frameworks such as Apache Hadoop, Apache Spark, and Presto on AWS. In this session, you learn the key design principles behind running these frameworks on the cloud and the feature set that Amazon EMR offers. We discuss the benefits of decoupling compute and storage and strategies to take advantage of the scale and the parallelism that the cloud offers, while lowering costs. Additionally, you hear from AOL’s Senior Software Engineer on how they used these strategies to migrate their Hadoop workloads to the AWS cloud and lessons learned along the way.
In this session, you learn the benefits of decoupling storage and compute and allowing them to scale independently; how to run Hadoop, Spark, Presto and other supported Hadoop Applications on Amazon EMR; how to use Amazon S3 as a persistent data-store and process data directly from Amazon S3; dDeployment strategies and how to avoid common mistakes when deploying at scale; and how to use Spot instances to scale your transient infrastructure effectively."
Scaling your Analytics with Amazon Elastic MapReduce (BDT301) | AWS re:Invent... (Amazon Web Services)
Big data technologies let you work with any velocity, volume, or variety of data in a highly productive environment. Join the General Manager of Amazon EMR, Peter Sirota, to learn how to scale your analytics, use Hadoop with Amazon EMR, write queries with Hive, develop real world data flows with Pig, and understand the operational needs of a production data platform.
by Dave Stein, Business Development, AWS
Discover how EBS can take your application deployments on EC2 to the next level. You will learn service features and benefits, how to identify applications that are appropriate for use with EBS, best practices, and details about its performance and volume types.
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania, with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for the Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
BDA302 Deep Dive on Migrating Big Data Workloads to Amazon EMR (Amazon Web Services)
Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to Amazon EMR in order to save costs, increase availability, and improve performance. Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of over 15 open-source frameworks in the Apache Hadoop and Spark ecosystems. This session will focus on identifying the components and workflows in your current environment and providing the best practices to migrate these workloads to Amazon EMR. We will explain how to move from HDFS to Amazon S3 as a durable storage layer, and how to lower costs with Amazon EC2 Spot Instances and Auto Scaling. Additionally, we will go over common security recommendations and tuning tips to accelerate the time to production.
Data processing and analysis is where big data is most often consumed, driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics. AWS services to be covered include Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Similar to Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Invent 2013 (20)
How to build forecasting services using ML and deep learn... (Amazon Web Services)
Forecasting is an important process for a great many companies and is used in many areas to try to accurately predict the growth and distribution of a product, the resources needed on production lines, financial presentations, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session, we illustrate how to pre-process data that contains a time component and then use an algorithm that, starting from the type of data analyzed, produces an accurate forecast.
Big Data for Startups: how to create Big Data applications in server... (Amazon Web Services)
The variety and quantity of data created every day is growing ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can appear complex: building large-scale Big Data clusters seems like an investment accessible only to established companies. But the elasticity of the cloud and, in particular, serverless services allow us to break through these limits.
We will see how it is possible to develop Big Data applications quickly, without worrying about infrastructure, dedicating all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session, we present the main features of the service and how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing the pace of innovation. Over that period we learned how changing our approach to application development allowed us to greatly increase agility and release speed and, ultimately, to build more reliable and scalable applications. In this session, we explain how we define modern applications and how building modern apps affects not only the application architecture but also the organizational structure, development release pipelines, and even the operating model. We also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to spend up to 90% less with containers and Spot Instances (Amazon Web Services)
The use of containers keeps growing.
When designed correctly, container-based applications are very often stateless and flexible.
The AWS services ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, leading to an average saving of 70% compared with On-Demand Instances. In this session, we explore the characteristics of Spot Instances and how they can easily be used on AWS. We also learn how Spreaker uses Spot Instances to run applications of different kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetise Open APIs, simplify fintech integrations, and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's market offering unique with Machine Lea... services (Amazon Web Services)
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative components built ad hoc.
AWS provides ready-to-use services and, at the same time, lets you customize and create the differentiating elements of your offering.
Focusing on machine learning technologies, we will see how to select the artificial intelligence services offered by AWS and, also through a demo, how to build custom machine learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployment of... (Amazon Web Services)
With the traditional approach to IT, implementing DevOps techniques was difficult for many years; they often involved manual activities, occasionally leading to application downtime that interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach at low cost for any kind of workload, guaranteeing greater system reliability and resulting in significant improvements to business continuity.
AWS provides AWS OpsWorks as a configuration management tool that aims to automate and simplify the management and deployment of EC2 instances through Chef and Puppet workloads.
Find out how to use AWS OpsWorks to guarantee the reliability of your application running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows Workloads (Amazon Web Services)
Want to know the options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support group policy management, authentication, and authorization. In this session, we discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and running Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis using artificial intelligence techniques is evolving and being refined at a rapid pace. In this webinar, we explore the possibilities offered by AWS services for applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are holding a free virtual event on Wednesday, October 14th from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a wide range of AWS services, taking full advantage of the AWS cloud while protecting existing VMware investments.
Many organizations take advantage of the cloud by migrating their Oracle workloads, securing significant benefits in agility and cost efficiency.
Migrating these workloads can create complexity when modernizing and refactoring applications, and performance risks can be introduced when moving applications out of on-premises data centers.
Create your first serverless ledger-based app with QLDB and NodeJS (Amazon Web Services)
Many companies today build applications with ledger-style functionality, for example to verify the history of credits and debits in banking transactions or to track the supply chain flow of their products.
At the heart of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but they are complex and costly tools to manage.
Amazon QLDB removes the need to build custom, complex systems by providing a fully managed, serverless ledger database.
In this session, we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for giving end users a great user experience. In this session, we learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We dig into several scenarios, understanding how AppSync can help solve these use cases by creating modern APIs with real-time and offline data-update capabilities.
We also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle databases and VMware Cloud™ on AWS: myths to debunk (Amazon Web Services)
Many organizations take advantage of the cloud by migrating their Oracle workloads, securing significant benefits in agility and cost efficiency.
Migrating these workloads can create complexity when modernizing and refactoring applications, and performance risks can be introduced when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to facilitate and simplify the migration of Oracle workloads while accelerating the transformation to the cloud; they dig into the architecture and show how to take full advantage of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies the management of Docker containers through an orchestration layer that controls deployment and lifecycle. In this session, we present the main features of the service, reference architectures for different workloads, and the simple steps needed to quickly migrate one or more of your containers.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
- See how to accelerate model training and optimize model performance with active learning
- Learn about the latest enhancements to out-of-the-box document processing, with little to no training required
- Get an exclusive demo of the new family of UiPath LLMs: GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence gathering facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring & observability to the purview of ops, infra, and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and will share these foundational concepts to build on:
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Welcome to ViralQR, your best QR code generator.ViralQR
Welcome to ViralQR, your best QR code generator available on the market!
At ViralQR, we design static and dynamic QR codes. Our mission is to make business operations easier and customer engagement more powerful through the use of QR technology. Be it a small-scale business or a huge enterprise, our easy-to-use platform provides multiple choices that can be tailored according to your company's branding and marketing strategies.
Our Vision
We are here to make the process of creating QR codes easy and smooth, thus enhancing customer interaction and making business more fluid. We very strongly believe in the ability of QR codes to change the world for businesses in their interaction with customers and are set on making that technology accessible and usable far and wide.
Our Achievements
Ever since our inception, we have served many clients by providing QR codes for marketing, service delivery, and feedback collection across various industries. Our platform has been recognized for its ease of use and rich features, which help businesses create QR codes with ease.
Our Services
At ViralQR, we offer a comprehensive suite of services that caters to your needs:
Static QR Codes: Create static QR codes for free. These codes can store information such as URLs, vCards, plain text, emails and SMS, Wi-Fi credentials, and Bitcoin addresses.
Dynamic QR codes: These also have all the advanced features but are subscription-based. They can directly link to PDF files, images, micro-landing pages, social accounts, review forms, business pages, and applications. In addition, they can be branded with CTAs, frames, patterns, colors, and logos to enhance your branding.
Pricing and Packages
Additionally, ViralQR offers a 14-day free trial, an excellent opportunity for new users to get a feel for the platform. From there you can easily subscribe and experience the full power of dynamic QR codes. The subscription plans are flexibly priced so that businesses of every size can afford to benefit from our service.
Why choose us?
ViralQR serves marketing, advertising, catering, retail, and similar industries. QR codes can be placed on flyers, packaging, merchandise, and banners, or used in place of cash and cards in restaurants and coffee shops. With QR codes integrated into your business, you can improve customer engagement and streamline operations.
Comprehensive Analytics
Subscribers to ViralQR receive detailed analytics and tracking tools that give a clear view of QR code performance. Our analytics dashboard shows aggregate and unique views, as well as detailed information about each impression, including time, device, browser, and estimated location by city and country.
Thank you for choosing ViralQR; we offer nothing but the best QR code services to meet the needs of every kind of business!
10. Amazon EMR Introduction
• Launch clusters of any size in a matter of minutes
• Use a variety of instance types and sizes to match your workload
11. Amazon EMR Introduction
• Don’t get stuck with hardware
• Don’t deal with capacity planning
• Run multiple clusters with different sizes, specs and node types
16. Pattern #1: Transient Clusters
• Cluster lives for the duration of the job
• Shut down the cluster when the job is done
• Data persists on Amazon S3
• Input & output data on Amazon S3
17. Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
   • Cluster goes away when job is done
3. Practice cloud architecture
   • Pay for what you use
   • Data processing as a workflow
18. When to use a transient cluster?
If (Data Load Time + Processing Time) * Number of Jobs < 24 hours
   Use transient clusters
Else
   Use alive clusters
19. When to use a transient cluster?
(20 min data load + 1 hour processing time) * 10 jobs ≈ 13 hours < 24 hours → use transient clusters
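As a quick illustration (not from the deck itself), the rule of thumb above can be written as a small Python helper; the function name and the numbers below are ours:

# Minimal sketch of the transient-vs-alive rule of thumb from the slides.
def recommend_cluster_type(load_hours, processing_hours, jobs_per_day):
    """Return 'transient' if the daily workload fits in under 24 hours."""
    busy_hours = (load_hours + processing_hours) * jobs_per_day
    return "transient" if busy_hours < 24 else "alive"

# Example from the slide: 20 min load + 1 hour processing, 10 jobs per day
print(recommend_cluster_type(20 / 60, 1.0, 10))   # ~13.3 hours -> 'transient'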
20. Alive Clusters
• Very similar to traditional Hadoop deployments
• Cluster stays around after the job is done
• Data persistence models:
   • Amazon S3
   • Amazon S3 copied to HDFS
   • HDFS, with Amazon S3 as backup
21. Alive Clusters
• Always keep data safe on Amazon S3, even if you’re using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
   • Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
22. Benefits of Alive Clusters
• Ability to share data between multiple jobs
[Diagram: with transient clusters, each EMR cluster exchanges data through Amazon S3; with long-running clusters, jobs share data directly through the cluster’s HDFS]
23. Benefits of Alive Clusters
• Cost effective for repetitive jobs
[Diagram: jobs repeated throughout the day reuse a single long-running EMR cluster instead of launching a new cluster per job]
24. When to use an alive cluster?
If (Data Load Time + Processing Time) * Number of Jobs > 24 hours
   Use alive clusters
Else
   Use transient clusters
25. When to use an alive cluster?
(20 min data load + 1 hour processing time) * 20 jobs ≈ 27 hours > 24 hours → use alive clusters
27. Core Nodes
[Diagram: an Amazon EMR cluster with a master instance group and a core instance group; core nodes run TaskTrackers (compute) and DataNodes (HDFS)]
28. Core Nodes
[Diagram: core nodes can be added to the core instance group of a running cluster]
29. Core Nodes
[Diagram: adding core nodes gives the cluster more HDFS space and more CPU/memory]
30. Core Nodes
[Diagram: core nodes cannot be removed from a running cluster because they hold HDFS data]
31. Amazon EMR Task Nodes
[Diagram: nodes in the task instance group run TaskTrackers only; they have no HDFS and read from the core nodes’ HDFS]
32. Amazon EMR Task Nodes
[Diagram: task nodes can be added to a running cluster]
33. Amazon EMR Task Nodes
[Diagram: adding task nodes gives the cluster more CPU power and more memory]
34. Amazon EMR Task Nodes
[Diagram: task nodes can be removed from a running cluster]
35. Amazon EMR Task Nodes
[Diagram: the cluster after task nodes are removed; HDFS on the core nodes is unaffected]
36. Task Node Use Case #1
• Speed up job processing using the Spot market
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interrupting your cluster
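For reference, a minimal boto3 sketch of adding a Spot task instance group to a running cluster might look like the following; the cluster ID, bid price, instance type, and count are placeholder values, and the deck itself does not prescribe a specific SDK:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Hypothetical cluster ID and bid price -- replace with your own values.
response = emr.add_instance_groups(
    JobFlowId="j-3GY8JC4179IOK",
    InstanceGroups=[
        {
            "Name": "Spot task nodes",
            "Market": "SPOT",          # run the task nodes on the Spot market
            "InstanceRole": "TASK",    # no HDFS, so these nodes are safe to lose
            "BidPrice": "0.10",        # example bid in USD per instance-hour
            "InstanceType": "m1.xlarge",
            "InstanceCount": 4,
        }
    ],
)
print(response["InstanceGroupIds"])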
37. Task Node Use Case #2
• When you need extra horsepower for a short amount of time
• Example: you need to pull a large amount of data from Amazon S3
42. Amazon S3 as HDFS
• Use Amazon S3 as your permanent data store
• Use HDFS for temporary storage of data between jobs
• No additional step to copy data to HDFS
[Diagram: an Amazon EMR cluster (core and task instance groups with HDFS) reading from and writing to Amazon S3]
43. Benefits: Amazon S3 as HDFS
• Ability to shut down your cluster (HUGE benefit!)
• Use Amazon S3 as your durable storage: 11 9s of durability
44. Benefits: Amazon S3 as HDFS
• No need to scale HDFS
   • Capacity
   • Replication for durability
• Amazon S3 scales with your data
   • Both in IOPS and data storage
45. Benefits: Amazon S3 as HDFS
• Ability to share data between multiple clusters
   • Hard to do with HDFS
[Diagram: two EMR clusters sharing data through Amazon S3]
46. Benefits: Amazon S3 as HDFS
• Take advantage of Amazon S3 features
   • Amazon S3 server-side encryption
   • Amazon S3 lifecycle policies
   • Amazon S3 versioning to protect against corruption
• Build elastic clusters
   • Add nodes to read from Amazon S3
   • Remove nodes with data safe on Amazon S3
47. What About Data Locality?
• Run your job in the same region as your Amazon S3 bucket
• Amazon EMR nodes have high-speed connectivity to Amazon S3
• If your job is CPU/memory bound, data locality doesn’t make a difference
48. Anti-Pattern: Amazon S3 as HDFS
• Iterative workloads
   – If you’re processing the same dataset more than once
• Disk I/O intensive workloads
60. Amazon EMR Elastic Cluster (a)
3. Receive an HTTP Amazon SNS notification in a simple app deployed on Elastic Beanstalk
61. Amazon EMR Elastic Cluster (a)
4. Your app calls the API to add nodes to your cluster
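A sketch of what that API call might look like with boto3; the instance group ID and target count are hypothetical, and the deck does not prescribe a specific SDK:

import boto3

emr = boto3.client("emr")

# Hypothetical instance group ID; in practice the app would look it up with
# list_instance_groups() after receiving the SNS notification.
emr.modify_instance_groups(
    InstanceGroups=[
        {"InstanceGroupId": "ig-EXAMPLE12345", "InstanceCount": 10}
    ]
)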
62. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
63. Amazon EMR Nodes and Sizes
• Use m1.small, m1.large, or c1.medium for functional testing
• Use m1.xlarge and larger nodes for production workloads
64. Amazon EMR Nodes and Sizes
• Use CC2 instances for memory- and CPU-intensive jobs
• Use CC2/c1.xlarge for CPU-intensive jobs
• Use hs1 instances for HDFS workloads
65. Amazon EMR Nodes and Sizes
• Use hi1 and hs1 instances for disk I/O-intensive workloads
• CC2 instances are more cost effective than m2.4xlarge
• Prefer a smaller cluster of larger nodes over a larger cluster of smaller nodes
74. Introduction to Hadoop Splits
• Data mappers > cluster mapper capacity = mappers wait for capacity = processing delay
[Diagram: queue of waiting mappers]
75. Introduction to Hadoop Splits
• More nodes = reduced queue size = faster processing
[Diagram: the queue shrinks as nodes are added]
76. Calculating the Number of Splits for Your Job
Uncompressed files: Hadoop splits a single file into multiple splits.
Example: 128 MB = 2 splits based on a 64 MB split size
77. Calculating the Number of Splits for Your Job
Compressed files:
1. Splittable compression (e.g., BZIP2): same logic as uncompressed files
Example: a 64 MB BZIP file = 1 split; a 128 MB BZIP file = 2 splits
78. Calculating the Number of Splits for Your Job
Compressed files:
2. Unsplittable compression (e.g., GZIP): the entire file is a single split.
Example: each 128 MB GZ file = 1 split
79. Calculating the Number of Splits for Your Job
Number of splits:
If data files have unsplittable compression, # of splits = number of files
Example: 10 GZ files = 10 mappers
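A small Python sketch of the split-counting logic described on these slides, assuming the 64 MB split size from the earlier example; the function and its codec list are ours, for illustration only:

# Sketch of the per-file split count (64 MB split size assumed).
SPLIT_SIZE_MB = 64

def count_splits(file_size_mb, compression=None):
    """Return the number of Hadoop splits for a single input file."""
    if compression in ("gz", "snappy"):   # unsplittable codecs: one split per file
        return 1
    # uncompressed or splittable (e.g., bzip2): split on the block boundary
    return max(1, -(-file_size_mb // SPLIT_SIZE_MB))   # ceiling division

print(count_splits(128))            # 2 splits
print(count_splits(128, "bzip2"))   # 2 splits (splittable)
print(count_splits(128, "gz"))      # 1 split (unsplittable)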
82. Cluster Sizing Calculation
2. Pick an instance type and note down the number of mappers it can run in parallel
m1.xlarge = 8 mappers in parallel
83. Cluster Sizing Calculation
3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
84. Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
85. Cluster Sizing Calculation
Estimated Number of Nodes =
(Total Mappers * Time to Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
86. Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job requires: 150
2. Pick an instance type and note down the number of mappers it can run in parallel: m1.xlarge with 8 mapper capacity per instance
87. Example: Cluster Sizing Calculation
3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2: 8 files selected for our sample test
88. Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files: 3 min to process 8 files
89. Cluster Sizing Calculation
Estimated number of nodes =
(Total Mappers for Your Job * Time to Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time)
= (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge nodes
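The same formula as a runnable Python sketch, using the numbers from this example; the function name is ours:

# Sketch of the cluster-sizing formula above.
def estimate_nodes(total_mappers, sample_minutes,
                   mappers_per_instance, desired_minutes):
    nodes = (total_mappers * sample_minutes) / (mappers_per_instance * desired_minutes)
    return max(1, round(nodes))

# 150 mappers, 3 min to process the 8 sample files on one m1.xlarge (8 mapper slots),
# target finish time of 5 minutes:
print(estimate_nodes(150, 3, 8, 5))   # ~11 m1.xlarge core nodes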
90. File Size Best Practices
• Avoid small files at all costs (anything smaller than 100 MB)
• Each mapper is a single JVM
• CPU time is required to spawn JVMs/mappers
91. File Size Best Practices
Mappers take about 2 seconds to spawn and be ready for processing
10 TB in 100 MB files = 100,000 mappers * 2 sec = about 55 hours of mapper setup time
92. File Size Best Practices
Mappers take about 2 seconds to spawn and be ready for processing
10 TB in 1,000 MB files = 10,000 mappers * 2 sec = about 5.5 hours of mapper setup time
93. File Size on Amazon S3: Best Practices
• What’s the best Amazon S3 file size for Hadoop? About 1-2 GB
• Why?
94. File Size on Amazon S3: Best Practices
• The life of a mapper should not be less than 60 seconds
• A single mapper can get 10-15 MB/s of throughput to Amazon S3
60 sec * 15 MB/s ≈ 1 GB
96. Dealing with Small Files
• Use S3DistCp to combine smaller files together
• S3DistCp takes a grouping pattern and a target file size to combine smaller input files into larger ones
97. Dealing with Small Files
Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,
--targetSize,128'
98. Compressions
• Compress as much as you can
• Compress Amazon S3 input data files
   – Reduces cost
   – Speeds up Amazon S3-to-mapper data transfer time
99. Compressions
• Always compress data files on Amazon S3
   • Reduces storage cost
   • Reduces bandwidth between Amazon S3 and Amazon EMR
   • Speeds up your job
101. Compressions
• Compression types:
   – Some are fast BUT offer less space reduction
   – Some are space efficient BUT slower
   – Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s
102. Compressions
• If you are time sensitive, faster compressions are a better choice
• If you have a large amount of data, use space-efficient compressions
• If you don’t care, pick GZIP
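If you want a job to write compressed output, the standard Hadoop 1.x properties below do that; they are shown as a Python dict purely for illustration (how you pass them, e.g. per-job -D flags or cluster-wide configuration, depends on your setup, and Snappy is just one example codec):

# Hadoop 1.x properties for compressing map output and final job output with Snappy.
compression_conf = {
    "mapred.compress.map.output": "true",
    "mapred.map.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
    "mapred.output.compress": "true",
    "mapred.output.compression.codec":
        "org.apache.hadoop.io.compress.SnappyCodec",
}

# Print the properties in the form they would be passed on a job command line.
for name, value in compression_conf.items():
    print(f"-D {name}={value}")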
103. Change Compression Type
• You may decide to change compression type
• Use S3DistCp to change the compression type of your files
• Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar /home/hadoop/lib/emr-s3distcp-1.0.jar
--args '--src,s3://myawsbucket/cf,
--dest,hdfs:///local,
--outputCodec,lzo'
104. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
105. Architecting for Cost
• AWS pricing models:
   – On-Demand: pay-as-you-go model
   – Spot: marketplace; bid for instances and get a discount
   – Reserved Instances: upfront payment (for a 1- or 3-year term) in exchange for a reduction in the overall monthly payment
109. Outline
Introduction to Amazon EMR
Amazon EMR Design Patterns
Amazon EMR Best Practices
Controlling Cost with Amazon EMR
Advanced Optimizations
110. Adv. Optimizations (Stage 1)
• The best optimization is to structure your data (i.e., smart data partitioning)
• Efficient data structuring limits the amount of data Hadoop has to process, which means faster jobs
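One common way to apply this is to partition data on Amazon S3 by date so that a job reads only the partitions it needs; the bucket name and prefix scheme below are illustrative, not from the deck:

from datetime import date

# Hypothetical bucket and prefix layout: partition input data by date so a job
# only reads the days it actually needs instead of the whole dataset.
BUCKET = "s3://my-example-bucket/clickstream"

def partition_prefix(day: date) -> str:
    return f"{BUCKET}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"

# A daily job would point its input at a single partition:
print(partition_prefix(date(2013, 11, 1)))
# s3://my-example-bucket/clickstream/year=2013/month=11/day=01/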
111. Adv. Optimizations (Stage 1)
• Hadoop is a batch processing framework
• Data processing time = an hour to days
• Not a great use case for shorter jobs
• Other frameworks may be a better fit:
   – Twitter Storm
   – Spark
   – Amazon Redshift, etc.
112. Adv. Optimizations (Stage 1)
• The Amazon EMR team has done a great deal of optimization already
• For smaller clusters, Amazon EMR configuration optimization won’t buy you much
   – Remember you’re paying for the full hourly cost of an instance
120. Network I/O
• The most important metric to watch if you are using Amazon S3 for storage
• Goal: drive as much network I/O as possible from a single instance
121. Network I/O
• Larger instances can drive > 600 Mbps
• Cluster Compute instances can drive 1-2 Gbps
• Optimize to get more out of your instance throughput
   – Add more mappers?
122. Network I/O
• If you’re using Amazon S3 with Amazon EMR, monitor Ganglia and watch network throughput
• Your goal is to maximize your NIC throughput by having enough mappers per node
123. Network I/O, Example
Low network utilization → increase the number of mappers if possible to drive more traffic
124. CPU
• Watch the CPU utilization of your clusters
• If > 50% idle, increase the number of mappers/reducers per instance
   – Reduces the number of nodes and reduces cost
127. Disk I/O
• Limit the amount of disk I/O
• You can increase mapper/reducer memory
• Compress data anywhere you can
• Monitor the cluster and pay attention to the HDFS bytes written metric
• One place to pay attention to is mapper/reducer disk spill
128. Disk I/O
• Mapper has an in-memory buffer
[Diagram: mapper with its memory buffer]
129. Disk I/O
• When memory gets full, data spills to disk
[Diagram: mapper memory buffer spilling data to disk]
131. Disk I/O
• If you see mappers/reducers spilling excessively to disk, increase the buffer memory per mapper
• Spill is excessive when the ratio of MAPPER_SPILLED_RECORDS to MAPPER_OUTPUT_RECORDS is more than 1
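A tiny Python sketch of that check, using illustrative counter values rather than real job output; in practice the two counters would come from the job's counter report:

# Flag excessive spill using the two counters named above.
def excessive_spill(spilled_records, output_records):
    """Spill ratio > 1 means records were written to disk more than once."""
    return (spilled_records / output_records) > 1

print(excessive_spill(spilled_records=2_500_000, output_records=1_000_000))  # True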
133. Disk I/O
• Increase mapper buffer memory by increasing “io.sort.mb”
<property><name>io.sort.mb</name><value>200</value></property>
• The same logic applies to reducers
134. Disk I/O
• Monitor disk I/O using Ganglia
• Look out for disk I/O wait