Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time-consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time on querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
AWS delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. In this session, we will show you how you can quickly build a data lake on AWS that ingests, catalogs, and processes incoming data and makes it ready for analysis. Using a live demo, we demonstrate the capabilities of AWS-provided analytical services such as AWS Glue, Amazon Athena, and Amazon EMR, and how to build a data lake on AWS step by step.
AWS Cost Management Workshop at the San Francisco Loft
AWS offers a number of products that allow you to access, organize, understand, optimize, and control your AWS costs and usage. This workshop will help you get started using AWS Cost Explorer to visualize your usage patterns and identify your underlying cost drivers. From there, you can take action on your insights by learning how to set custom cost and usage budgets and receive alerts via email or Amazon SNS topic using AWS Budgets.
Organizations need to gain insight and knowledge from a growing number of data sources, including Internet of Things (IoT) devices, APIs, clickstreams, and unstructured and log data. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system and launched it in production in less than a week using AWS Glue.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
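As a rough illustration of "pointing AWS Glue to your data", the sketch below assembles the request for a crawler that scans an S3 path and writes table metadata to the Data Catalog. It only builds the keyword arguments; the crawler name, IAM role ARN, database name, and bucket path are hypothetical placeholders, and the boto3 calls are left as comments because they need AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Return keyword arguments for the boto3 Glue `create_crawler` call."""
    return {
        "Name": name,                 # crawler name (hypothetical)
        "Role": role_arn,             # IAM role the crawler assumes (hypothetical)
        "DatabaseName": database,     # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # data location to crawl
    }


if __name__ == "__main__":
    request = build_crawler_request(
        name="demo-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="demo_db",
        s3_path="s3://example-bucket/raw/",
    )
    print(request)
    # With AWS credentials configured, the request could be submitted as:
    #   import boto3
    #   glue = boto3.client("glue")
    #   glue.create_crawler(**request)
    #   glue.start_crawler(Name=request["Name"])
```

Keeping the request as a plain dictionary makes it easy to review or unit test before anything is created in an AWS account.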
Today, organizations find themselves in a data-rich world with a growing need for increased agility and accessibility of all this data for analysis and for deriving keen insights to drive strategic decisions. Creating a data lake helps you to manage all the disparate sources of data you are collecting (in their original format) and extract value. In this session, learn how to architect and implement a data lake in the AWS Cloud. Learn about best practices as we walk through architectural blueprints.
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3.
Amazon Elastic MapReduce is one of the largest Hadoop operators in the world. Since its launch five years ago, AWS customers have launched more than 5.5 million Hadoop clusters.
In this talk, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS and taking advantage of both long- and short-lived clusters, along with other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost-efficient.
Speakers:
Ian Meyers, AWS Solutions Architect
Ian McDonald, IT Director, SwiftKey
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3 using standard SQL. With Athena, there is no infrastructure to set up or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena; it works directly with data stored in S3. Level 200
In this session, we will show you how easy it is to start querying your data stored in Amazon S3, with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
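To make "creating the schema for data already in S3" more concrete, here is a minimal sketch that generates an Athena `CREATE EXTERNAL TABLE` statement for delimited text files. The table name, columns, and bucket path are hypothetical, and the statement assumes comma-delimited files; it is an illustration, not the session's own demo.

```python
def build_external_table_ddl(table, columns, s3_location, delimiter=","):
    """Build DDL that defines a table over delimited files already stored in S3."""
    cols = ",\n  ".join(f"`{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{s3_location}'"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket, used only for illustration.
    ddl = build_external_table_ddl(
        table="sales",
        columns=[("order_id", "string"), ("amount", "double")],
        s3_location="s3://example-bucket/sales/",
    )
    print(ddl)
    # The statement can be pasted into the Athena query editor; an interactive
    # query such as `SELECT COUNT(*) FROM sales` can then run against the files.
```

Because the table is external, the DDL only registers metadata; the files themselves stay where they are in S3.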
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline by Amazon Web Services
Many organizations have adopted or are in the process of adopting DevOps methodologies in their quest to accelerate the delivery of software capabilities, features, and functionalities to support their organizational objectives. By applying the same practices, DataOps aims to provide the same level of agility in delivering data and information to the organization. AWS Lake Formation, in coordination with other AWS Services, enables DevOps methodologies to be realized through the Data Supply Chain Pipeline.
An Overview of Best Practices for Large Scale Migrations - AWS Transformation... by Amazon Web Services
Whether you are moving a small application or entire datacenters, migrating to the cloud can be a complex process. In this session, we will share some of the common challenges that our customers face on their journey to the cloud and discuss how these challenges can be overcome. We will outline the patterns of success that we have observed from partnering with hundreds of customers on their large-scale migrations as well as highlight the mechanisms we have created to help our customers migrate faster.
About the Event:
AWS Transformation Day is designed for enterprise organizations migrating to the cloud to become more responsive, agile and innovative, while staying secure and compliant. Join us for this one-day event and we’ll share our experiences of helping enterprise customers accelerate the pace of migration and adoption of strategic services.
Who should attend?
This event is recommended for IT and business leaders who are looking to create sustainable benefits and a competitive advantage by using the AWS Cloud. CIOs, CTOs, CISOs, CDOs, CFOs, IT leaders and IT professionals, enterprise developers, business decision makers, and finance executives.
AWS offers storage, networking, and data transfer services so you can build and deploy solutions to extend backup and archive targets to the AWS Cloud, increasing scalability, durability, security, and compliance.
With AWS, you can choose the right storage service for the right use case. This session shows the range of AWS choices - object storage to block storage - that is available to you. We include specifics about real-world deployments from customers who are using Amazon S3, Amazon EBS, Amazon Glacier, and AWS Storage Gateway.
Moving from an on-premises environment into AWS is just the start of the journey towards cost optimisation. In this session we’ll look at a range of ways in which our customers can understand their costs and increase their return-on-investment: building the business case; selecting the right models for the right workloads; benefiting from tiered pricing aggregation; using data to drive the choice of AWS services; implementation of intelligent auto-scaling; and, where appropriate, re-platforming to make use of new architectural patterns such as Serverless.
Build Data Lakes & Analytics on AWS: Patterns & Best Practices by Amazon Web Services
With over 90% of today’s data generated in the last two years, the rate of data growth is showing no sign of slowing down. In this session, we step through the challenges and best practices for capturing data, understanding what data you own, driving insights, and predicting the future using AWS services. We frame the session and demonstrations around common pitfalls of building data lakes and how to successfully drive analytics and insights from data. We also discuss the architecture patterns that bring together key AWS services, including Amazon S3, AWS Glue, Amazon Athena, Amazon Kinesis, and Amazon Machine Learning. Discover the real-world application of data lakes for roles including data scientists and business users.
Stephen Moon, Sr. Solutions Architect, Amazon Web Services
James Juniper, Solution Architect for the Geo-Community Cloud, Natural Resources Canada
Database Migration Using AWS DMS and AWS SCT (GPSCT307) - AWS re:Invent 2018 by Amazon Web Services
Database migrations are an important step in any journey to AWS. In this session, we show you how to get started with AWS Database Migration Service (AWS DMS) and AWS Schema Conversion Tool (AWS SCT) to quickly and securely migrate your databases to AWS. Learn how to simplify your database migrations by using these services to migrate your data to and from commercial and open-source databases. We also explain how you can perform homogeneous migrations, such as MySQL to MySQL, as well as heterogeneous migrations between different database platforms, such as Oracle to Amazon Aurora.
Today’s organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is a new and increasingly popular way to store all of your data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know beforehand what questions you want to ask of your data.
In this webinar, you will discover how AWS gives you fast access to flexible and low-cost IT resources, so you can rapidly scale and build your data lake that can power any kind of analytics such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity and variety of data.
Learning Objectives:
• Discover how you can rapidly scale and build your data lake with AWS.
• Explore the key pillars behind a successful data lake implementation.
• Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake.
• Learn about the recently launched AWS services, Amazon Athena and Amazon Redshift Spectrum, that help customers directly query that data lake.
Building Data Lakes with Apache Airflow by Gary Stafford
Build a simple Data Lake on AWS using a combination of services, including Amazon Managed Workflows for Apache Airflow (Amazon MWAA), AWS Glue, AWS Glue Studio, Amazon Athena, and Amazon S3.
Blog post and link to the video: https://garystafford.medium.com/building-a-data-lake-with-apache-airflow-b48bd953c2b
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V... by Sungmin Kim
How to build a Business Intelligence system from scratch on AWS (Day1, Day2)
------------------------------------------------------------------------------------------
These are the complete presentation materials from the Online AWS Analytics Immersion Day, held online over two days, 2020-03-18 (Wed) to 2020-03-19 (Thu).
The materials were prepared to explain how AWS Analytics services can be used when designing a Business Intelligence (BI) system.
Target Audience
-------------------
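The Online Analytics Immersion Day is intended for the following audiences:
- Those who know the basic concepts of AWS Analytics services (e.g. Kinesis, Athena, Redshift, EMR) but want to learn how to use these services and how to build a data analytics system
- Those who have experience building a data analytics system but want to learn how to evaluate and validate their system from an architectural perspective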
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303) by Amazon Web Services
For discovery-phase research, life sciences companies have to support infrastructure that processes millions to billions of transactions. The data lake is proving to be a stable and productive data platform pattern for this task. We discuss how to build a data lake on AWS, using services and techniques such as AWS CloudFormation, Amazon EC2, Amazon S3, IAM, and AWS Lambda. We also review a reference architecture from Amgen that uses a data lake to aid in their life science research.
Understanding AWS Managed Database and Analytics Services | AWS Public Sector... by Amazon Web Services
The world is creating more data in more ways than ever before. The average internet user in 2017 generates 1.5GB of data per day, with the rate doubling every 18 months. A single autonomous vehicle can generate 4TB per day. Each smart manufacturing plant generates 1PB per day. Storing, managing, and analyzing this data requires integrated database and analytic services that provide reliability and security at scale. AWS offers a range of managed data services that let customers focus on making data useful, including Amazon Aurora, RDS, DynamoDB, Redshift, Spectrum, ElastiCache, Kinesis, EMR, Elasticsearch Service, and Glue. In this session, we discuss these services, share our vision for innovation, and show how our customers use these services today. Learn More: https://aws.amazon.com/government-education/
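The growth figure quoted above (1.5 GB per user per day in 2017, doubling every 18 months) can be turned into a quick projection. The sketch below assumes smooth exponential growth from that starting point; the function and parameter names are illustrative only, and the result is arithmetic on the quoted numbers, not a forecast from the session.

```python
def projected_daily_gb(months_elapsed, start_gb=1.5, doubling_months=18):
    """Project daily data volume, assuming it doubles every `doubling_months`."""
    return start_gb * 2 ** (months_elapsed / doubling_months)


if __name__ == "__main__":
    # Months counted from the 2017 baseline quoted in the abstract.
    for months in (0, 18, 36, 54):
        print(f"{months:>2} months: {projected_daily_gb(months):.1f} GB/day")
```

Under this assumption the daily volume quadruples every three years, which is the kind of compounding that motivates the managed storage and analytics services described in the abstract.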
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
By using a data lake, you no longer need to worry about structuring or transforming data before storing it. A data lake on AWS enables your organization to analyze data more rapidly, helping you quickly discover new business insights. Join us for our webinar to learn about the benefits of building a data lake on AWS and how your organization can begin reaping the rewards. In this session, we will share a methodology for implementing a data lake on AWS and best practices for getting the most from your data lake.
Speaker: Russell Nash, APAC Solution Architect, DW, AWS APAC
From Data Collection to Actionable Insights in 60 Seconds: AWS Developer Workshop - Web Summit 2018 by Amazon Web Services
Columnar data formats such as Parquet and ORC are designed to optimize both query performance and costs for analytics scenarios. On the other hand, serverless computing platforms such as AWS Lambda allow you to run highly scalable applications without provisioning or managing servers. The combination of columnar storage and serverless computing can drastically simplify many of the pain points related to big data analytics, data collection, data exploration, and ETL orchestration, while at the same time reducing the total cost of ownership.
Speaker: Alex Casalboni - Technical Evangelist, AWS
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora, a MySQL-compatible, highly available relational database engine that provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak, Sr. Manager of Software Development
Your data has value for multiple business functions in your organization. Shorten your time to analytics and make faster, better decisions based on data.
In this session, you will learn how you can access your data from a myriad of tools, such as multiple EMR clusters, Athena, and Redshift.
Build Data Lakes and Analytics on AWS: Patterns & Best Practices - BDA305 - A... - Amazon Web Services
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
Stream Processing with Apache Spark, Kafka, Avro, and Apicurio Registry on AW... - Gary Stafford
We will read and write messages to and from Amazon MSK in Apache Avro format. We will store the Avro-format Kafka message’s key and value schemas in Apicurio Registry and retrieve the schemas instead of hard-coding the schemas in the PySpark scripts. We will also use the registry to store schemas for CSV-format data files.
Link to the blog post and video: https://itnext.io/stream-processing-with-apache-spark-kafka-avro-and-apicurio-registry-on-amazon-emr-and-amazon-13080defa3be
Building Open Data Lakes on AWS with Debezium and Apache Hudi - Gary Stafford
Build a simple open data lake on AWS using a combination of open-source software (OSS), including Red Hat’s Debezium, Apache Kafka, and Kafka Connect for change data capture (CDC), and Apache Hive, Apache Spark, Apache Hudi, and Hudi’s DeltaStreamer for managing our data lake. We will use fully-managed AWS services to host the open data lake components, including Amazon RDS, Amazon MSK, Amazon EKS, and EMR.
Link to the blog post and video: https://garystafford.medium.com/building-open-data-lakes-with-debezium-and-apache-hudi-c3370d3f86fb
Version 2 of the IaC Maturity Model Presentation
What helps leading technology companies like Facebook, Amazon, Netflix, and Etsy increase their speed to market while lowering overall IT costs and increasing customer satisfaction? Examine how to apply the principles from Humble and Farley’s Continuous Delivery Maturity Model to the concepts found in Morris’ Infrastructure as Code, using the new Infrastructure as Code Maturity Model.
Link to v2.1 of the IaC Maturity Model: https://github.com/garystafford/cd-maturity-model/raw/requirejs/images/IaC_Maturity_Model%20v2_1.pdf
Infrastructure as Code Maturity Model v1 - Gary Stafford
Systematically Evolving an Organization’s Infrastructure. The original version of the IaC Maturity Model. See the latest version here: https://www.slideshare.net/garystafford/how-mature-is-your-infrastructure.
From Zurich to the Cosmos, by Artist Steve Carpenter - Gary Stafford
Presentation from Steve Carpenter’s gallery opening of his new show "From Zurich to the Cosmos", held December 4, 2009. Photographed and produced by Gary Stafford. Fine art prints by Lazer Incorporated.
4. Agenda
What is a Data Lake?
Dataset
Source Code
Architecture
Demonstration
5. What is a Data Lake?
“A data lake is a central location that holds a large amount of data in its native, raw format. Compared to a hierarchical data warehouse, which stores data in files or folders, a data lake uses a flat architecture and object storage to store the data.” - Databricks
“A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.” - AWS
8. Dataset
TICKIT database
E-commerce platform
Bringing together buyers and sellers of tickets to entertainment events
Designed for demonstrating Amazon Redshift
Small database consisting of seven tables: two fact tables and five dimension tables
Tables: Categories, Events, Venues, Users, Listings, Sales, Dates
https://docs.aws.amazon.com/redshift/latest/dg/c_sampledb.html
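To make the fact-and-dimension structure concrete, the following is a minimal sketch using Python's built-in sqlite3 module. It assumes a heavily simplified subset of the TICKIT tables (one fact table, two dimensions, a few columns each), and the sample rows are invented for illustration. The demo itself does not use SQLite.

```python
import sqlite3


def build_tickit_sample() -> sqlite3.Connection:
    """Create an in-memory database with a simplified, illustrative TICKIT subset."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension tables (simplified column subsets)
        CREATE TABLE category (catid INTEGER PRIMARY KEY, catname TEXT);
        CREATE TABLE event (
            eventid INTEGER PRIMARY KEY,
            catid INTEGER REFERENCES category(catid),
            eventname TEXT
        );
        -- Fact table: one row per sale, pointing at the event dimension
        CREATE TABLE sales (
            salesid INTEGER PRIMARY KEY,
            eventid INTEGER REFERENCES event(eventid),
            qtysold INTEGER,
            pricepaid REAL
        );
        """
    )
    # Invented sample rows, not taken from the real dataset
    conn.executemany("INSERT INTO category VALUES (?, ?)",
                     [(1, "Musicals"), (2, "Pop")])
    conn.executemany("INSERT INTO event VALUES (?, ?, ?)",
                     [(10, 1, "Show A"), (11, 2, "Concert B")])
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                     [(100, 10, 2, 150.0), (101, 10, 1, 75.0), (102, 11, 4, 400.0)])
    conn.commit()
    return conn


def revenue_by_category(conn: sqlite3.Connection) -> list:
    """Join the sales fact table to its dimensions and aggregate per category."""
    return conn.execute(
        """
        SELECT c.catname, SUM(s.qtysold), SUM(s.pricepaid)
        FROM sales AS s
        JOIN event AS e ON s.eventid = e.eventid
        JOIN category AS c ON e.catid = c.catid
        GROUP BY c.catname
        ORDER BY c.catname
        """
    ).fetchall()


if __name__ == "__main__":
    for row in revenue_by_category(build_tickit_sample()):
        print(row)
```

The same join-then-aggregate query shape is what an analyst would typically run against the cataloged tables once they land in the data lake.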
10. Dataset
Table    | Simulated Data Source                  | Demo Data Source
Category | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Event    | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Venue    | Software as a Service (SaaS)           | Amazon RDS for PostgreSQL
Listing  | Ecommerce Platform                     | Amazon RDS for MySQL
Sales    | Ecommerce Platform                     | Amazon RDS for MySQL
Date     | Ecommerce Platform                     | Amazon RDS for MySQL
Users    | Customer Relationship Management (CRM) | Microsoft SQL Server
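The table-to-source mapping above can also be written down as data, which may be handy when deciding which tables each database connection needs to cover. This is only an illustrative sketch: the dictionary name, the lowercase table keys, and the helper function are hypothetical and do not come from the demo's source code.

```python
from collections import defaultdict

# Table -> (simulated data source, demo data source), transcribed from the slide.
# The lowercase keys are an assumption made for this sketch.
TABLE_SOURCES = {
    "category": ("Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    "event": ("Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    "venue": ("Software as a Service (SaaS)", "Amazon RDS for PostgreSQL"),
    "listing": ("Ecommerce Platform", "Amazon RDS for MySQL"),
    "sales": ("Ecommerce Platform", "Amazon RDS for MySQL"),
    "date": ("Ecommerce Platform", "Amazon RDS for MySQL"),
    "users": ("Customer Relationship Management (CRM)", "Microsoft SQL Server"),
}


def tables_by_demo_source(table_sources: dict) -> dict:
    """Group table names by the database engine that hosts them in the demo."""
    grouped = defaultdict(list)
    for table, (_simulated, demo) in table_sources.items():
        grouped[demo].append(table)
    # Sort for a stable, readable result
    return {source: sorted(tables) for source, tables in grouped.items()}


if __name__ == "__main__":
    for source, tables in tables_by_demo_source(TABLE_SOURCES).items():
        print(f"{source}: {', '.join(tables)}")
```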
15. Architecture: AWS Services Used
AWS Glue Studio (alt. AWS Glue DataBrew)
AWS Glue Data Catalog (alt. Apache Hive on EMR)
AWS Glue Crawlers (alt. CDC with Kafka Connect or DMS)
AWS Glue Jobs (alt. AWS Glue DataBrew, or Spark or Presto on EMR)
Amazon Athena (alt. Presto on EMR)
Amazon S3
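As one example of how these services fit together, the sketch below prepares an Amazon Athena query against a table registered in the AWS Glue Data Catalog. It assumes the third-party boto3 SDK is installed and that AWS credentials are configured. The database name, table name, and S3 results location are hypothetical placeholders, not values from the demo.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Build the keyword arguments for Athena's StartQueryExecution call."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena query results must be written to an s3:// location")
    return {
        "QueryString": sql,
        # The Glue Data Catalog database that holds the table definitions
        "QueryExecutionContext": {"Database": database},
        # Where Athena writes the result files
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    request = build_athena_request(
        sql="SELECT COUNT(*) FROM sales",
        database="tickit_demo",
        output_location="s3://example-bucket/athena-results/",
    )
    print(request)
```

Keeping request construction separate from the network call makes the parameters easy to inspect and unit test before anything is sent to AWS.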
17. Architecture: Out of Scope (but critically important)
Change Data Capture (CDC): handling changes to systems of record
Transactional Storage Layer: Apache Hudi, Apache Iceberg, Delta Lake
Streaming Data: Spark Structured Streaming, Kinesis, Flink
Fine-grained Authorization: database-, table-, column-, and row-level access
Data Lineage: Tracking data as it flows from sources to consumption
18. Architecture: Out of Scope (but critically important)
Data Inspection: Scanning incoming data for sensitive info such as PII
DevOps/DataOps: Automating testing, deployment, job execution
Data Warehouse / Lake House Architecture
Data Lake Storage Tiering
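Although change data capture is out of scope here, the core idea of handling changes to a system of record can be sketched in a few lines. The example below is purely conceptual and assumes an invented event shape (`op`, `key`, `row`); it is not the format emitted by Debezium, AWS DMS, or any other tool. In practice a transactional storage layer such as Apache Hudi, Apache Iceberg, or Delta Lake would take care of this merging.

```python
def apply_changes(snapshot: dict, changes: list) -> dict:
    """Apply ordered insert/update/delete events to a keyed snapshot of a table.

    The event shape {"op", "key", "row"} is a hypothetical format for this sketch.
    """
    result = dict(snapshot)  # leave the caller's snapshot untouched
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Upsert: the latest version of the row wins
            result[key] = change["row"]
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"Unknown operation: {op!r}")
    return result


if __name__ == "__main__":
    snapshot = {1: {"venuename": "Old Name"}, 2: {"venuename": "Second Venue"}}
    changes = [
        {"op": "update", "key": 1, "row": {"venuename": "New Name"}},
        {"op": "delete", "key": 2},
        {"op": "insert", "key": 3, "row": {"venuename": "Third Venue"}},
    ]
    print(apply_changes(snapshot, changes))
```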