This document discusses a presentation about big data and analytics on AWS. It describes what big data is, provides examples of AWS services for ingesting, storing, processing, analyzing and visualizing big data. It also provides examples of industries using AWS for data analysis and discusses Amazon Kinesis for real-time processing of streaming data. Finally, it discusses putting the various AWS services together in an end-to-end big data workflow.
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
https://aws.amazon.com/webinars/anz-webinar-series/
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data and analytics application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
In this one-hour webinar, we will look at the portfolio of AWS Big Data services and how they can be used to build a modern data architecture.
We will cover:
Using different SQL engines to analyze large amounts of structured data
Analysing streaming data in near-real time
Architectures for batch processing
Best practices for Data Lake architectures
This session is suited for:
Solution and enterprise architects
Data architects/ Data warehouse owners
IT & Innovation team members
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
Microsoft Data Platform - What's includedJames Serra
The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together and there proper use case, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce SPICE - a new Super-fast, Parallel, In-memory, Calculation Engine in Amazon QuickSight, which performs advanced calculations and render visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, to ensure the most suitable and appropriate visualization to conduct your analysis, and how to share these visualization stories using the built-in collaboration tools.
Presented by: Matthew McClean, AWS Partner Solutions Architect, Amazon Web Services
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
https://aws.amazon.com/webinars/anz-webinar-series/
Amazon Web Services gives you fast access to flexible and low cost IT resources, so you can rapidly scale and build virtually any big data and analytics application including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing regardless of volume, velocity, and variety of data.
In this one-hour webinar, we will look at the portfolio of AWS Big Data services and how they can be used to build a modern data architecture.
We will cover:
Using different SQL engines to analyze large amounts of structured data
Analysing streaming data in near-real time
Architectures for batch processing
Best practices for Data Lake architectures
This session is suited for:
Solution and enterprise architects
Data architects/ Data warehouse owners
IT & Innovation team members
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. You simply point AWS Glue to your data stored on AWS, and AWS Glue discovers your data and stores the associated metadata (e.g. table definition and schema) in the AWS Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the code to execute your data transformations and data loading processes.
Level: Intermediate
Speakers:
Ryan Malecky - Solutions Architect, EdTech, AWS
Rajakumar Sampathkumar - Sr. Technical Account Manager, AWS
Microsoft Data Platform - What's includedJames Serra
The pace of Microsoft product innovation is so fast that even though I spend half my days learning, I struggle to keep up. And as I work with customers I find they are often in the dark about many of the products that we have since they are focused on just keeping what they have running and putting out fires. So, let me cover what products you might have missed in the Microsoft data platform world. Be prepared to discover all the various Microsoft technologies and products for collecting data, transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together and there proper use case, allowing you to build the appropriate solution that can incorporate any data in the future no matter the size, frequency, or type. Along the way we will touch on technologies covering NoSQL, Hadoop, and open source.
Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. In this session, we demonstrate how you can point Amazon QuickSight to AWS data stores, flat files, or other third-party data sources and begin visualizing your data in minutes. We also introduce SPICE - a new Super-fast, Parallel, In-memory, Calculation Engine in Amazon QuickSight, which performs advanced calculations and render visualizations rapidly without requiring any additional infrastructure, SQL programming, or dimensional modeling, so you can seamlessly scale to hundreds of thousands of users and petabytes of data. Lastly, you will see how Amazon QuickSight provides you with smart visualizations and graphs that are optimized for your different data types, to ensure the most suitable and appropriate visualization to conduct your analysis, and how to share these visualization stories using the built-in collaboration tools.
Presented by: Matthew McClean, AWS Partner Solutions Architect, Amazon Web Services
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles like architecture, application logic and user experience. We will look how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters as well as how using Delta tables simplified our application logic and data operations.
As the volume and types of data continues to grow, customers often have valuable data that is not easily discoverable and available for analytics. A common challenge for data engineering teams is architecting a data lake that can cater to the needs of diverse users - from developers to business analysts to data scientists. In this session, dive deep into building a data lake using Amazon S3, Amazon Kinesis, Amazon Athena and AWS Glue. Learn how AWS Glue crawlers can automatically discover your data, extracting and cataloguing relevant metadata to reduce operations in preparing your data for downstream consumers.
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3, using standard SQL. With Athena, there is no infrastructure to setup or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena, it works directly with data stored in S3. Level 200
In this session, we will show you how easy it is to start querying your data stored in Amazon S3, with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
AWS delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. In this session, we will show you how you can quickly build a data lake on AWS that ingests, catalogs and processes incoming data and makes it ready for analysis. Using a live demo, we demonstrate the capabilities of AWS provided analytical services such as AWS Glue, Amazon Athena and Amazon EMR and how to build a Data Lake on AWS step-by-step.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and Interactive analytics. AWS services to be covered include: Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech TalksAmazon Web Services
Learning Objectives:
- Discover dark data that you are currently not analyzing.
- Analyze dark data without moving it into your data warehouse.
- Visualize the results of your dark data analytics.
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to setup an Amazon EMR job flow to analyse application logs, and perform Hive queries against it. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job.
Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service.
See a recording of the webinar based on this presentation on YouTube here:
Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/
See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
The volume of data businesses create and process is growing every day. To get the most value out of this data, companies often invest in traditional BI tools. These tools however require investments in costly on-premises hardware and software. It takes weeks or months of data engineering time to build complex data models; not to mention the additional infrastructure needed to maintain fast query performance as data sets grow. Amazon QuickSight is built from the ground up to solve these problems by bringing the scale and flexibility of the AWS Cloud and by providing a business user focused experience to business analytics. This session will provide you with the relevant capabilities, benefits and use cases for AWS Quicksight.
Join us to learn about the state of serverless computing from Dr. Tim Wagner, General Manager of AWS Lambda. Dr. Wagner discusses the latest developments from AWS Lambda and the serverless computing ecosystem. He talks about how serverless computing is becoming a core component in how companies build and run their applications and services, and he also discusses how serverless computing will continue to evolve.
The introductory morning session will discuss big data challenges and provide an overview of the AWS Big Data Platform. We will also cover:
• How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
• Reference architectures for popular use cases, including: connected devices (IoT), log streaming, real-time intelligence, and analytics.
• The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR) and Redshift.
• The latest relational database engine, Amazon Aurora - a MySQL-compatible, highly-available relational database engine which provides up to five times better performance than MySQL at a price one-tenth the cost of a commercial database.
• Amazon Machine Learning – the latest big data service from AWS provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
(BDT310) Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Want to see a high-level overview of the products in the Microsoft data platform portfolio in Azure? I’ll cover products in the categories of OLTP, OLAP, data warehouse, storage, data transport, data prep, data lake, IaaS, PaaS, SMP/MPP, NoSQL, Hadoop, open source, reporting, machine learning, and AI. It’s a lot to digest but I’ll categorize the products and discuss their use cases to help you narrow down the best products for the solution you want to build.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles like architecture, application logic and user experience. We will look how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters as well as how using Delta tables simplified our application logic and data operations.
As the volume and types of data continues to grow, customers often have valuable data that is not easily discoverable and available for analytics. A common challenge for data engineering teams is architecting a data lake that can cater to the needs of diverse users - from developers to business analysts to data scientists. In this session, dive deep into building a data lake using Amazon S3, Amazon Kinesis, Amazon Athena and AWS Glue. Learn how AWS Glue crawlers can automatically discover your data, extracting and cataloguing relevant metadata to reduce operations in preparing your data for downstream consumers.
by Joyjeet Banerjee, Solutions Architect, AWS
Amazon Athena is a new serverless query service that makes it easy to analyze data in Amazon S3, using standard SQL. With Athena, there is no infrastructure to setup or manage, and you can start analyzing your data immediately. You don’t even need to load your data into Athena, it works directly with data stored in S3. Level 200
In this session, we will show you how easy it is to start querying your data stored in Amazon S3, with Amazon Athena. First we will use Athena to create the schema for data already in S3. Then, we will demonstrate how you can run interactive queries through the built-in query editor. We will provide best practices and use cases for Athena. Then, we will talk about supported queries, data formats, and strategies to save costs when querying data with Athena.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
AWS delivers an integrated suite of services that provide everything needed to quickly and easily build and manage a data lake for analytics. AWS-powered data lakes can handle the scale, agility, and flexibility required to combine different types of data and analytics approaches to gain deeper insights, in ways that traditional data silos and data warehouses cannot. In this session, we will show you how you can quickly build a data lake on AWS that ingests, catalogs and processes incoming data and makes it ready for analysis. Using a live demo, we demonstrate the capabilities of AWS provided analytical services such as AWS Glue, Amazon Athena and Amazon EMR and how to build a Data Lake on AWS step-by-step.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and Interactive analytics. AWS services to be covered include: Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
AWS Glue is a fully managed, serverless extract, transform, and load (ETL) service that makes it easy to move data between data stores. AWS Glue simplifies and automates the difficult and time consuming tasks of data discovery, conversion mapping, and job scheduling so you can focus more of your time querying and analyzing your data using Amazon Redshift Spectrum and Amazon Athena. In this session, we introduce AWS Glue, provide an overview of its components, and share how you can use AWS Glue to automate discovering your data, cataloging it, and preparing it for analysis.
Tackle Your Dark Data Challenge with AWS Glue - AWS Online Tech TalksAmazon Web Services
Learning Objectives:
- Discover dark data that you are currently not analyzing.
- Analyze dark data without moving it into your data warehouse.
- Visualize the results of your dark data analytics.
Amazon EMR enables fast processing of large structured or unstructured datasets, and in this presentation we'll show you how to setup an Amazon EMR job flow to analyse application logs, and perform Hive queries against it. We also review best practices around data file organisation on Amazon Simple Storage Service (S3), how clusters can be started from the AWS web console and command line, and how to monitor the status of a Map/Reduce job.
Finally we take a look at Hadoop ecosystem tools you can use with Amazon EMR and the additional features of the service.
See a recording of the webinar based on this presentation on YouTube here:
Check out the rest of the Masterclass webinars for 2015 here: http://aws.amazon.com/campaigns/emea/masterclass/
See the Journey Through the Cloud webinar series here: http://aws.amazon.com/campaigns/emea/journey/
The volume of data businesses create and process is growing every day. To get the most value out of this data, companies often invest in traditional BI tools. These tools however require investments in costly on-premises hardware and software. It takes weeks or months of data engineering time to build complex data models; not to mention the additional infrastructure needed to maintain fast query performance as data sets grow. Amazon QuickSight is built from the ground up to solve these problems by bringing the scale and flexibility of the AWS Cloud and by providing a business user focused experience to business analytics. This session will provide you with the relevant capabilities, benefits and use cases for AWS Quicksight.
Join us to learn about the state of serverless computing from Dr. Tim Wagner, General Manager of AWS Lambda. Dr. Wagner discusses the latest developments from AWS Lambda and the serverless computing ecosystem. He talks about how serverless computing is becoming a core component in how companies build and run their applications and services, and he also discusses how serverless computing will continue to evolve.
The introductory morning session will discuss big data challenges and provide an overview of the AWS Big Data Platform. We will also cover:
• How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
• Reference architectures for popular use cases, including: connected devices (IoT), log streaming, real-time intelligence, and analytics.
• The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR) and Redshift.
• The latest relational database engine, Amazon Aurora - a MySQL-compatible, highly-available relational database engine which provides up to five times better performance than MySQL at a price one-tenth the cost of a commercial database.
• Amazon Machine Learning – the latest big data service from AWS provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.
(BDT310) Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
The AWS cloud computing platform has disrupted big data. Managing big data applications used to be for only well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, but it is self-service, on-demand, and featuring innovative technology and flexible pricing models at low cost with no commitments. Learn from customer success stories, as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion around the tutorials, public datasets, test drives, and our grants program - all of the resources needed to get you started quickly.
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
Best Practices for Building a Data Lake with Amazon S3 - August 2016 Monthly ...Amazon Web Services
Uncovering new, valuable insights from big data requires organizations to collect, store, and analyze increasing volumes of data from multiple, often disparate sources at disparate points in time. This makes it difficult to handle big data with data warehouses or relational database management systems alone. A Data Lake allows you to store massive amounts of data in its original form, without the need to enforce a predefined schema, enabling a far more agile and flexible architecture, which makes it easier to gain new types of analytical insights from your data.
Learning Objectives:
• Introduce key architectural concepts to build a Data Lake using Amazon S3 as the storage layer
• Explore storage options and best practices to build your Data Lake on AWS
• Learn how AWS can help enable a Data Lake architecture
• Understand some of the key architectural considerations when building a Data Lake
• Hear some important Data Lake implementation considerations when using Amazon S3 as your Data Lake
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...Amazon Web Services
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including, connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including, Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora— a MySQL-compatible, highly-available relational database engine, which provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
"Conceptually, a data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created “on demand”, providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we will introduce key concepts for a data lake and present aspects related to its implementation. We will discuss critical success factors, pitfalls to avoid as well as operational aspects such as security, governance, search, indexing and metadata management. We will also provide insight on how AWS enables a data lake architecture.
A data lake is a flat data store to collect data in its original form, without the need to enforce a predefined schema. Instead, new schemas or views are created ""on demand"", providing a far more agile and flexible architecture while enabling new types of analytical insights. AWS provides many of the building blocks required to help organizations implement a data lake. In this session, we introduce key concepts for a data lake and present aspects related to its implementation. We discuss critical success factors and pitfalls to avoid, as well as operational aspects such as security, governance, search, indexing, and metadata management. We also provide insight on how AWS enables a data lake architecture. Attendees get practical tips and recommendations to get started with their data lake implementations on AWS."
Big Data and Analytics on Amazon Web Services: Building A Business-Friendly P...Amazon Web Services
If you are crafting a better customer experience, automating your business, or modernizing your systems, you are likely finding that your data and analytics platform is absolutely critical to your success. In this session, we will look at how customers are building on the managed services from Amazon Web Services to meet the needs of the business. Patterns we see gaining popularity are near-real time engagement with customers over mobile, also combining and analyzing unstructured consumer behavior with structured transactional data, as well as managing spiky data workloads. See how our customers use our managed, elastic, secure, and highly available services to change what is possible.
Craig Stires, Head of Big Data and Analytics, Amazon Web Services, APAC
Big data architectures and the data lakeJames Serra
With so many new technologies it can get confusing on the best approach to building a big data architecture. The data lake is a great new concept, usually built in Hadoop, but what exactly is it and how does it fit in? In this presentation I'll discuss the four most common patterns in big data production implementations, the top-down vs bottoms-up approach to analytics, and how you can use a data lake and a RDBMS data warehouse together. We will go into detail on the characteristics of a data lake and its benefits, and how you still need to perform the same data governance tasks in a data lake as you do in a data warehouse. Come to this presentation to make sure your data lake does not turn into a data swamp!
Big Data Pipeline for Analytics at Scale @ FIT CVUT 2014Jaroslav Gergic
The recent boom in big data processing and democratization of the big data space has been enabled by the fact that most of the concepts originated in the research labs of companies such as Google, Amazon, Yahoo and Facebook are now available as open source. Technologies such as Hadoop, Cassandra let businesses around the world to become more data driven and tap into their massive data feeds to mine valuable insights.
At the same time, we are still at a certain stage of the maturity curve of these new big data technologies and of the entire big data technology stack. Many of the technologies originated from a particular use case and attempts to apply them in a more generic fashion are hitting the limits of their technological foundations. In some areas, there are several competing technologies for the same set of use cases, which increases risks and costs of big data implementations.
We will show how GoodData solves the entire big data pipeline today, starting from raw data feeds all the way up to actionable business insights. All this provided as a hosted multi-tenant environment letting its customers to solve their particular analytical use case or many analytical use cases for thousands of their customers all using the same platform and tools while abstracting them away from the technological details of the big data stack.
Log Analytics with Amazon Elasticsearch Service and Amazon Kinesis - March 20...Amazon Web Services
Log analytics is a common big data use case that allows you to analyze log data from websites, mobile devices, servers, sensors, and more for a wide variety of applications including digital marketing, application monitoring, fraud detection, ad tech, gaming, and IoT. In this tech talk, we will walk you step-by-step through the process of building an end-to-end analytics solution that ingests, transforms, and loads streaming data using Amazon Kinesis Firehose, Amazon Kinesis Analytics and AWS Lambda. The processed data will be saved to an Amazon Elasticsearch Service cluster, and we will use Kibana to visualize the data in near real-time.
Learning Objectives:
1. Reference architecture for building a complete log analytics solution
2. Overview of the services used and how they fit together
3. Best practices for log analytics implementation
Data collection and storage is a primary challenge for any big data architecture. In this session, we will describe the different types of data that customers are handling to drive high-scale workloads on AWS, and help you choose the best approach for your workload. We will cover optimization techniques that improve performance and reduce the cost of data ingestion.AWS services to be covered include: Amazon S3, DynamoDB, and Kinesis.
We will introduce key concepts for a data lake and present aspects related to its implementation. Also discussing critical success factors, pitfalls to avoid operational aspects, and insights on how AWS enables a server-less data lake architecture.
Speaker: Sebastien Menant, Solutions Architect, Amazon Web Services
面對日新月異的大數據工具,有時候很難跟上這節奏。有鑑於此,Amazon Web Services提供了廣泛而完善的雲端運算服務組合,幫助您構建、維護和部署大數據應用程式。
這場線上研討會,將為各位深入淺出介紹AWS 雲端平台提供的各種大數據選項,包括現正流行的大數據框架,如Hadoop、Spark、NoSQL數據庫等,同時透過使用案例來瞭解最佳實踐方式。最後,您將了解如何應用這些工具服務,將大數據導入您的現實應用程式中。
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
AWS as a Data Platform - AWS Symposium 2014 - Washington D.C. Amazon Web Services
Come hear about the services that AWS provides to manage data and when to use which tools to manage data appropriately. You will learn about both data movement and coordination, as well as data storage and analysis, including when to use relational and NoSQL approaches, Hadoop, and data warehousing. This session will highlight how AWS data services have helped real-world customers.
Big Data on AWS is a deep dive into Cloud-based big data solutions using Amazon Elastic MapReduce (EMR) and Amazon Redshift. In this session, you will learn how to create big data environments and leverage best practices to design big data environments for security and cost-effectiveness. Demonstrations will include using Amazon EMR to process log data and the ease of provisioning a Redshift data warehouse.
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
Il Forecasting è un processo importante per tantissime aziende e viene utilizzato in vari ambiti per cercare di prevedere in modo accurato la crescita e distribuzione di un prodotto, l’utilizzo delle risorse necessarie nelle linee produttive, presentazioni finanziarie e tanto altro. Amazon utilizza delle tecniche avanzate di forecasting, in parte questi servizi sono stati messi a disposizione di tutti i clienti AWS.
In questa sessione illustreremo come pre-processare i dati che contengono una componente temporale e successivamente utilizzare un algoritmo che a partire dal tipo di dato analizzato produce un forecasting accurato.
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
La varietà e la quantità di dati che si crea ogni giorno accelera sempre più velocemente e rappresenta una opportunità irripetibile per innovare e creare nuove startup.
Tuttavia gestire grandi quantità di dati può apparire complesso: creare cluster Big Data su larga scala sembra essere un investimento accessibile solo ad aziende consolidate. Ma l’elasticità del Cloud e, in particolare, i servizi Serverless ci permettono di rompere questi limiti.
Vediamo quindi come è possibile sviluppare applicazioni Big Data rapidamente, senza preoccuparci dell’infrastruttura, ma dedicando tutte le risorse allo sviluppo delle nostre le nostre idee per creare prodotti innovativi.
Ora puoi utilizzare Amazon Elastic Kubernetes Service (EKS) per eseguire pod Kubernetes su AWS Fargate, il motore di elaborazione serverless creato per container su AWS. Questo rende più semplice che mai costruire ed eseguire le tue applicazioni Kubernetes nel cloud AWS.In questa sessione presenteremo le caratteristiche principali del servizio e come distribuire la tua applicazione in pochi passaggi
Vent'anni fa Amazon ha attraversato una trasformazione radicale con l'obiettivo di aumentare il ritmo dell'innovazione. In questo periodo abbiamo imparato come cambiare il nostro approccio allo sviluppo delle applicazioni ci ha permesso di aumentare notevolmente l'agilità, la velocità di rilascio e, in definitiva, ci ha consentito di creare applicazioni più affidabili e scalabili. In questa sessione illustreremo come definiamo le applicazioni moderne e come la creazione di app moderne influisce non solo sull'architettura dell'applicazione, ma sulla struttura organizzativa, sulle pipeline di rilascio dello sviluppo e persino sul modello operativo. Descriveremo anche approcci comuni alla modernizzazione, compreso l'approccio utilizzato dalla stessa Amazon.com.
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
L’utilizzo dei container è in continua crescita.
Se correttamente disegnate, le applicazioni basate su Container sono molto spesso stateless e flessibili.
I servizi AWS ECS, EKS e Kubernetes su EC2 possono sfruttare le istanze Spot, portando ad un risparmio medio del 70% rispetto alle istanze On Demand. In questa sessione scopriremo insieme quali sono le caratteristiche delle istanze Spot e come possono essere utilizzate facilmente su AWS. Impareremo inoltre come Spreaker sfrutta le istanze spot per eseguire applicazioni di diverso tipo, in produzione, ad una frazione del costo on-demand!
In recent months, many customers have been asking us the question – how to monetise Open APIs, simplify Fintech integrations and accelerate adoption of various Open Banking business models. Therefore, AWS and FinConecta would like to invite you to Open Finance marketplace presentation on October 20th.
Event Agenda :
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
Per creare valore e costruire una propria offerta differenziante e riconoscibile, le startup di successo sanno come combinare tecnologie consolidate con componenti innovativi creati ad hoc.
AWS fornisce servizi pronti all'utilizzo e, allo stesso tempo, permette di personalizzare e creare gli elementi differenzianti della propria offerta.
Concentrandoci sulle tecnologie di Machine Learning, vedremo come selezionare i servizi di intelligenza artificiale offerti da AWS e, anche attraverso una demo, come costruire modelli di Machine Learning personalizzati utilizzando SageMaker Studio.
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
Con l'approccio tradizionale al mondo IT per molti anni è stato difficile implementare tecniche di DevOps, che finora spesso hanno previsto attività manuali portando di tanto in tanto a dei downtime degli applicativi interrompendo l'operatività dell'utente. Con l'avvento del cloud, le tecniche di DevOps sono ormai a portata di tutti a basso costo per qualsiasi genere di workload, garantendo maggiore affidabilità del sistema e risultando in dei significativi miglioramenti della business continuity.
AWS mette a disposizione AWS OpsWork come strumento di Configuration Management che mira ad automatizzare e semplificare la gestione e i deployment delle istanze EC2 per mezzo di workload Chef e Puppet.
Scopri come sfruttare AWS OpsWork a garanzia e affidabilità del tuo applicativo installato su Instanze EC2.
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
Vuoi conoscere le opzioni per eseguire Microsoft Active Directory su AWS? Quando si spostano carichi di lavoro Microsoft in AWS, è importante considerare come distribuire Microsoft Active Directory per supportare la gestione, l'autenticazione e l'autorizzazione dei criteri di gruppo. In questa sessione, discuteremo le opzioni per la distribuzione di Microsoft Active Directory su AWS, incluso AWS Directory Service per Microsoft Active Directory e la distribuzione di Active Directory su Windows su Amazon Elastic Compute Cloud (Amazon EC2). Trattiamo argomenti quali l'integrazione del tuo ambiente Microsoft Active Directory locale nel cloud e l'utilizzo di applicazioni SaaS, come Office 365, con AWS Single Sign-On.
Dal riconoscimento facciale al riconoscimento di frodi o difetti di fabbricazione, l'analisi di immagini e video che sfruttano tecniche di intelligenza artificiale, si stanno evolvendo e raffinando a ritmi elevati. In questo webinar esploreremo le possibilità messe a disposizione dai servizi AWS per applicare lo stato dell'arte delle tecniche di computer vision a scenari reali.
Amazon Web Services e VMware organizzano un evento virtuale gratuito il prossimo mercoledì 14 Ottobre dalle 12:00 alle 13:00 dedicato a VMware Cloud ™ on AWS, il servizio on demand che consente di eseguire applicazioni in ambienti cloud basati su VMware vSphere® e di accedere ad una vasta gamma di servizi AWS, sfruttando a pieno le potenzialità del cloud AWS e tutelando gli investimenti VMware esistenti.
Molte organizzazioni sfruttano i vantaggi del cloud migrando i propri carichi di lavoro Oracle e assicurandosi notevoli vantaggi in termini di agilità ed efficienza dei costi.
La migrazione di questi carichi di lavoro, può creare complessità durante la modernizzazione e il refactoring delle applicazioni e a questo si possono aggiungere rischi di prestazione che possono essere introdotti quando si spostano le applicazioni dai data center locali.
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
Molte aziende oggi, costruiscono applicazioni con funzionalità di tipo ledger ad esempio per verificare lo storico di accrediti o addebiti nelle transazioni bancarie o ancora per tenere traccia del flusso supply chain dei propri prodotti.
Alla base di queste soluzioni ci sono i database ledger che permettono di avere un log delle transazioni trasparente, immutabile e crittograficamente verificabile, ma sono strumenti complessi e onerosi da gestire.
Amazon QLDB elimina la necessità di costruire sistemi personalizzati e complessi fornendo un database ledger serverless completamente gestito.
In questa sessione scopriremo come realizzare un'applicazione serverless completa che utilizzi le funzionalità di QLDB.
Con l’ascesa delle architetture di microservizi e delle ricche applicazioni mobili e Web, le API sono più importanti che mai per offrire agli utenti finali una user experience eccezionale. In questa sessione impareremo come affrontare le moderne sfide di progettazione delle API con GraphQL, un linguaggio di query API open source utilizzato da Facebook, Amazon e altro e come utilizzare AWS AppSync, un servizio GraphQL serverless gestito su AWS. Approfondiremo diversi scenari, comprendendo come AppSync può aiutare a risolvere questi casi d’uso creando API moderne con funzionalità di aggiornamento dati in tempo reale e offline.
Inoltre, impareremo come Sky Italia utilizza AWS AppSync per fornire aggiornamenti sportivi in tempo reale agli utenti del proprio portale web.
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
Molte organizzazioni sfruttano i vantaggi del cloud migrando i propri carichi di lavoro Oracle e assicurandosi notevoli vantaggi in termini di agilità ed efficienza dei costi.
La migrazione di questi carichi di lavoro, può creare complessità durante la modernizzazione e il refactoring delle applicazioni e a questo si possono aggiungere rischi di prestazione che possono essere introdotti quando si spostano le applicazioni dai data center locali.
In queste slide, gli esperti AWS e VMware presentano semplici e pratici accorgimenti per facilitare e semplificare la migrazione dei carichi di lavoro Oracle accelerando la trasformazione verso il cloud, approfondiranno l’architettura e dimostreranno come sfruttare a pieno le potenzialità di VMware Cloud ™ on AWS.
Amazon Elastic Container Service (Amazon ECS) è un servizio di gestione dei container altamente scalabile, che semplifica la gestione dei contenitori Docker attraverso un layer di orchestrazione per il controllo del deployment e del relativo lifecycle. In questa sessione presenteremo le principali caratteristiche del servizio, le architetture di riferimento per i differenti carichi di lavoro e i semplici passi necessari per poter velocemente migrare uno o più dei tuo container.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
2. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
What is big data?
When your data sets become so large that you have to start
innovating around how to collect, store, organize, analyze, and
share it
• Velocity
– Rate of data flow in
• Latency
– High or Low
• Volume
– High or Low
• Variety
– Diversity of source data
• Item Size
– KB or MB
• Request Rate
– Access patterns
• Change Rate
– How much is the data changing?
• Processing Requirements
– How much computation?
• Durability
– Preservation of source data?
• Availability
– Tolerance for downtime?
• Growth Rate
– Rate of data growth?
• Views
– The diversity of consumers?
3. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Plethora of tools
Amazon
Glacier
Amazon
S3
Amazon
DynamoDB
Amazon
RDS
Amazon
EMR
Amazon
Redshift
AWS
Data Pipeline
Amazon
Kinesis
Cassandra
Amazon
CloudSearch
4. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Ingest Store Analyze Visualize
Data Answers
Time
Multiple stages
Storage decoupled from processing
Simplify data analytics flow
5. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Amazon
GlacierS3
DynamoDB
RDS
Amazon Kinesis
Spark
Streaming
EMR
Ingest Store Process/Analyze Visualize
Data Pipeline
Storm
Kafka
Amazon
Redshift
Cassandra
Amazon
CloudSearch
Amazon
Kinesis
Connector
Kinesis
enabled app
App Server
Web Server
Devices
6. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Collect / Ingest
Amazon Kinesis
Process / Analyze
Amazon EMR Amazon EC2
Amazon
Redshift
AWS
Data Pipeline
Visualize / ReportStore
Amazon Glacier
Amazon S3
Amazon DynamoDB
Amazon RDS
AWS Import/Export
AWS Direct Connect
Amazon SQS
AWS big data portfolio
7. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Mobile / Cable
Telecom
Oil and Gas
Industrial
Manufacturing
Retail/Consumer
Entertainment
Hospitality
Life Sciences
Scientific
Exploration
Financial
Services
Publishing Media
Advertising
Online Media
Social Network
Gaming
Industries using AWS for data analysis
8. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Ingest: The act of collecting and storing data
9. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Types of data ingest
• Transactional
– Database reads/writes
• File
– Media files; log files
• Stream
– Click-stream logs (sets of
events)
Database
Cloud
Storage
Stream
Storage
LoggingFrameworksDevicesApps
10. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Real-time processing of streaming data
High throughput
Elastic
Easy to use
Connectors for EMR, S3, Amazon Redshift,
DynamoDB
Amazon
Kinesis
11. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Sending and reading data from Amazon
Kinesis streams
HTTP Post
AWS SDK
LOG4J
Flume
Fluentd
Get* APIs
Kinesis Client Library
+
Connector Library
Apache
Storm
Amazon Elastic
MapReduce
Sending Reading
Write Read
12. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Hparser, Big Data Edition
Flume, Sqoop
AWS Partners for data ingest, load, and
transformation
14. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
App/Web Tier
Client Tier
Database & Storage Tier
Cloud database and storage tier anti-pattern
15. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
App/Web Tier
Client Tier
Data TierDatabase & Storage Tier
Search
Hadoop/HDFS
Cache
Blob Store
SQL NoSQL
Cloud database and storage tier — use the right
tool for the job!
16. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Database & Storage Tier
Amazon RDSAmazon
DynamoDB
Amazon
ElastiCache
Amazon S3
Amazon
Glacier
Amazon
CloudSearch
HDFS on Amazon EMR
Cloud database and storage tier — use the right
tool for the job!
17. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Store anything
Object storage
Scalable
Designed for 99.999999999% durability
Amazon
S3
18. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Aggregate all data in S3 surrounded by a collection of the right tools
Amazon
EMR
Amazon
Kinesis
Amazon
Redshift
Amazon
DynamoDB
Amazon RDS
AWS
Data Pipeline
Spark
Streaming
Cassandra Storm
Amazon S3
• No limit on the number of objects
• Object size up to 5 TB
• Central data storage for all
systems
• High bandwidth
• 99.999999999% durability
• Versioning; lifecycle policies
• Amazon Glacier integration
19. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Fully managed NoSQL database service
Built on solid-state drives (SSDs)
Consistent low-latency performance
Any throughput rate
No storage limits
Amazon
DynamoDB
20. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
• Scaling without downtime
• Automatic sharding
• Security inspections, patches,
upgrades
• Automatic hardware failover
• Multi-AZ replication
• Hardware configuration
designed specifically for
DynamoDB
• Performance tuning
DynamoDB: managed high availability and
durability
21. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Relational databases
Fully managed; zero admin
MySQL, PostgreSQL, Oracle, SQL Server
Aurora
Amazon
RDS
22. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Process and analyze
23. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Processing frameworks
• Batch processing
– Take large amount (>100 TB) of cold data and ask questions
– Takes minutes or hours to get answers back
– Example: Generating hourly, daily, weekly reports
• Stream processing (real-time)
– Take small amount of hot data and ask questions
– Takes short amount of time to get your answer back
– Example: 1 min metrics
24. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Processing frameworks
• Batch processing/analytic
– Amazon Redshift
– Amazon EMR (Hadoop)
– Spark, Hive/Tez, Pig, Impala, Presto, ….
• Stream processing
– Amazon Kinesis client and connector library
– Spark Streaming
– Storm (+Trident)
MPPMPPHadoopStreamProcessing
25. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Columnar data warehouse
ANSI SQL compatible
Massively parallel
Petabyte scale
Fully managed
Very cost-effective
Amazon
Redshift
26. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Amazon Redshift architecture
• Leader Node
– SQL endpoint
– Stores metadata
– Coordinates query execution
• Compute Nodes
– Local, columnar storage
– Execute queries in parallel
– Load, backup, restore via
Amazon S3
– Parallel load from Amazon DynamoDB
• Hardware optimized for data processing
• Two hardware platforms
– DS2 (dense storage): HDD; scale to 1.6PB
– DC1 (dense compute): SSD; scale to 256TB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
27. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Hadoop/HDFS clusters
Hive, Pig, Impala, HBase
Easy to use; fully managed
On-demand and spot pricing
Tight integration with S3,
DynamoDB, and Amazon Kinesis
Amazon
Elastic
MapReduce
28. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
EMR Cluster
S3
1. Put the data
into S3
2. Choose: Hadoop distribution, #
of nodes, types of nodes, Hadoop
apps like Hive/Pig/HBase
4. Get the output
from S3
3. Launch the cluster using
the EMR console, CLI, SDK, or
APIs
How does EMR work?
29. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
The Hadoop ecosystem works with EMR
30. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Partners – advanced analytics
35. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Amazon EMR
as ETL Grid
and Analysis
Amazon Redshift –
Production DWH
VisualizationLogs
Traffic Statistics
Demo
37. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
ICAO and Hadoop
Marco Merens
Chief (Acting) Integrated Analysis
International Civil Aviation Organization
39. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Cloudability principles at ICAO
1. What comes from the cloud, can stay in the
cloud
2. What comes from in-house
A. should stay in-house if private, or
B. can be synced with the cloud if public
40. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Data sync
Data
Basic UI
Create
Read
Update
Delete
sync Data
FancyUI
Read
Metrics
41. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Collect Map Reduce Publish
Key Priority
EMR example: blended accident list
42. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Input format
XML
<?xml version="1.0" encoding="utf-8"?>
<root>
<ADREP><FilingInformation
State="XX"><ReportingOrganization>Ascend</ReportingOrganization>
<StateFileNumber>S1982045</StateFileNumber>
<Headline>MU-2, Collision with high ground, (near) Kelowna</Headline>
</FilingInformation>
….
</root>
CSV
|26/12/2001|Germany|Germany|"ICE:Icing"|Accident|Fatal|8|Germany|Bremerhav
en|D-IAAI|"BRITTEN NORMAN"||||"2 251 to 5 700 Kg"|Scheduled|Airplane|Take-
off||
…
43. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
#!/bin/sh
wget "http://somexml" -qO- | tr -d "n" | tr -d "r" |
sed "s#<Accident>#n<Accident>#g" > tmp
aws s3 put tmp s3://accidents/input/source1
…….
Amazon S3
Amazon EC2
Use linux
crontab
to schedule
Make One XML
element per line for
EMR
Collect
44. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
EMR command line
elastic-mapreduce
--create
--bootstrap-action
s3://elasticmapreduce/samples/node/install-node-bin-
x86.sh
--instance-type m1.small --instance-count 3
--json job.json
--put /home/ec2-user/key/newtest.pem
--to /home/hadoop
--enable-debugging
Put ssh key to hadoop
if you need to remote
sh
45. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
EMR json config file
[{
"Name": "Make accident map",
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"/home/hadoop/contrib/streaming/hadoop-streaming.jar",
"Args": [
"-input",
"s3://accidentstats/input/*", …
]},{
"Name": "Store in mongo",
"ActionOnFailure": "CANCEL_AND_WAIT",
"HadoopJarStep": {
"Jar":"s3://elasticmapreduce/libs/script-runner/script-runner.jar" ,
"Args": [
"s3://edmscripts/uploadtomongo.sh",
"accidentstats/output",
"NEWACCIDENTLIST"
]}}
Move the results
from S3 to
somewhere else
46. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Map
sourceX
#!/usr/bin/env node
function treatline(line) {
If (line.indexOf(“<ADREP>”))
{
source1(line)
}
……
Function source1(line)
{
Var data=xml2json(line)
data.records.forEach(function(v){
Var el={ Date:v.Date,
Registration:v.Registration,
Model:v.Model,
Source:”Source1”,
Priority:1
}
var key=el.Date+”#”+el.Registration
process.stdout.write(key+”/t”+JSON.stringify(el)
)
})
}
mapped
Amazon Elastic
MapReduce
Amazon S3
47. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Reduce
Mapped and
sorted
#!/usr/bin/env node
Var oldkey,key,array=[]
function treatline(line) {
key=line.split(“/t”)[0]
data=JSON.parse(line.split(“/t”)[1])
If ((key==oldkey) || !oldkey)
{
array.push(data)}
Else {
treat(array)
array=[]}
oldkey=key
……}
Function treat(array)
{
el={}
array=array.sort(prioritysort)
array.forEach(function(v){
el=updateresult(el,v)
})
process.stdout.write(JSON.stringify(el)+”n”)
}
Reduced
Amazon Elastic
MapReduce
Amazon S3
48. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Real-time statistics
Amazon Elastic
MapReduce
49. AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Thank You.
This presentation will be loaded to SlideShare the week following the Symposium.
http://www.slideshare.net/AmazonWebServices
AWS Government, Education, and Nonprofit Symposium
Washington, DC I June 25-26, 2015
Editor's Notes
The world is producing an ever increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems.
But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages and we dig deeper into AWS services for these different stages
AWS Big Data Portfolio
Customers can of course can use compute, storage and networking building blocks + open source tools. But managed services take care of undifferentiated heavy lifting of setting up, patching and scaling allowing you to focus on the mission.
Visualization we rely on our partners – really good at it and what our customers are using.
Amazon Kinesis is a fully managed service for real-time data processing over large, distributed data streams. Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, sensor data, IoT, location-tracking events.
With Amazon Kinesis Client Library (KCL), you can build Amazon Kinesis Applications and use streaming data to power real-time dashboards, generate alerts, implement dynamic pricing and advertising, and more. You can also emit data from Amazon Kinesis to other AWS services such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elastic Map Reduce (Amazon EMR), and AWS Lambda.
A shard is the base throughput unit of an Amazon Kinesis stream. One shard provides a capacity of 1MB/sec data input and 2MB/sec data output. One shard can support up to 1000 PUT records per second. You will specify the number of shards needed when you create a stream.
Amazon Kinesis Client Library (KCL) is a pre-built library that helps you easily build Amazon Kinesis Applications for reading and processing data from an Amazon Kinesis stream.
It handles complex issues such as adapting to changes in stream volume, load-balancing streaming data, coordinating distributed services, and processing data with fault-tolerance.
Amazon Kinesis Connector Library is a pre-built library that helps you easily integrate Amazon Kinesis with other AWS services and third-party tools.
The current version of this library provides connectors to Amazon DynamoDB, Amazon Redshift, Amazon S3, and Elasticsearch. The library also includes sample connectors of each type, plus Apache Ant build files for running the samples.
Data structure
Query complexity
Data characteristics: hot, warm, cold
2 x 2 Matrix
Structured
Level of query (from none to complex)
Amazon S3 is object storage service which is highly-scalable, reliable, low-latency and low cost.
Designed for 11 9’s of durability.
Amazon S3 stores data as objects within resources called "buckets." You can store as many objects as you want within a bucket, and write, read, and delete objects in your bucket. Objects can be up to 5 terabytes in size.
You can control access to the bucket (who can create, delete, and retrieve objects in the bucket for example), view access logs for the bucket and its objects, and choose the AWS region where a bucket is stored to optimize for latency, minimize costs, or address regulatory requirements.
Integrates well with other AWS services and a lot of tools from ISVs and in the open source community
Acts as data lake in a large majority of big data solutions
Features include versioning and life cycle management. Glacier integration for archival of data
It protects your data by offering encryption of data at rest and in-flight and provides security and access management features for fine grained control on who can access the data
Several other features including Event Notifications that can be delivered using Amazon SQS or Amazon SNS, or sent directly to AWS Lambda, enabling you to trigger workflows, alerts, or other processing.
Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. Amazon DynamoDB enables customers to offload the administrative burdens of operating and scaling distributed databases to AWS, so they don’t have to worry about hardware provisioning, setup and configuration, replication, software patching, or cluster scaling.
DynamoDB supports key-value and document data structures.
A key-value store is a database service that provides support for storing, querying and updating collections of objects that are identified using a key and values that contain the actual content being stored.
A document store provides support for storing, querying and updating items in a document format such as JSON, XML, and HTML.
Table: A table is a collection of data items – just like a table in a relational database is a collection of rows. Each table can have an infinite number of data items. Amazon DynamoDB is schema-less, in that the data items in a table need not have the same attributes or even the same number of attributes. Each table must have a primary key.
Item: An Item is composed of a primary or composite key and a flexible number of attributes. There is no explicit limitation on the number of attributes associated with an individual item, but the aggregate size of an item, including all the attribute names and attribute values, is 400K
Attribute: Each attribute associated with a data item is composed of an attribute name (e.g. “Color”) and a value or set of values (e.g. “Red” or “Red, Yellow, Green”). Individual attributes have no explicit size limit, but the total value of an item (including all attribute names and values) cannot exceed 400KB.
Amazon DynamoDB supports GET/PUT operations using a user-defined primary key. The primary key is the only required attribute for items in a table and it uniquely identifies each item. You specify the primary key when you create a table. In addition to that DynamoDB provides flexible querying by letting query on non-primary key attributes using Global Secondary Indexes and Local Secondary Indexes.
Transition Statement – RDBMS is still a viable and important component in Big Data Architecture
Amazon Relational Database Service (Amazon RDS) is a managed service that makes it easy to set up, operate, and scale a relational database in the cloud.
Amazon RDS gives you access to the capabilities of a familiar MySQL, Oracle, SQL Server, or PostgreSQL database. This means that the code, applications, and tools you already use today with your existing databases should work seamlessly with Amazon RDS. Amazon RDS automatically patches the database software and backs up your database, storing the backups for a user-defined retention period.
For optional Multi-AZ deployments, Amazon RDS also manages synchronous data replication across Availability Zones and automatic failover.
Amazon Aurora is a MySQL-compatible, relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases.
Generally come in two major types
Batch
Streaming
Examples
Query Speed
Redshift – Extremely fast SQL queries
Spark, Impala – Extremely Fast to Fast Hive QL
Hive, Tez – Moderately Fast to Slow Hive QL
Data Volume?
UDFs?
Manageability?
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
https://amplab.cs.berkeley.edu/benchmark/
Add connector
Direct Acyclic Graphs?
Exactly once processing & DAG? – how do you do this??
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. Customers can start small for just $0.25 per hour with no commitments or upfront costs and scale to a petabyte or more for $1,000 per terabyte per year, less than a tenth of most other data warehousing solutions.
Traditional data warehouses require significant time and resource to administer, especially for large datasets. And they are costly.
Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.
Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads:
Columnar Data Storage: Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. Unlike row-based systems, which are ideal for transaction processing, column-based systems are ideal for data warehousing and analytics, where queries often involve aggregates performed over large data sets. Since only the columns involved in the queries are processed and columnar data is stored sequentially on the storage media, column-based systems require far fewer I/Os, greatly improving query performance.
Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
Massively Parallel Processing (MPP): Amazon Redshift automatically distributes data and query load across all nodes. Amazon Redshift makes it easy to add nodes to your data warehouse and enables you to maintain fast query performance as your data warehouse grows.
Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources.
Easy to scale
Amazon Redshift automatically patches and backs up your data warehouse, storing the backups for a user-defined retention period.
You can create a cluster using either Dense Storage (DS) nodes or Dense Compute nodes (DC). Dense Storage nodes allow you to create very large data warehouses using hard disk drives (HDDs) for a very low price point. Dense Compute nodes allow you to create very high performance data warehouses using fast CPUs, large amounts of RAM and solid-state disks (SSDs).
An Amazon Redshift data warehouse cluster can contain from 1-128 compute nodes, depending on the node type
Amazon EMR enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
Amazon EMR uses Apache Hadoop as its distributed data processing engine. Hadoop is an open source, Java software framework that supports data-intensive distributed applications running on large clusters of commodity hardware. Hadoop implements a programming model named “MapReduce,” where the data is divided into many small fragments of work, each of which may be executed on any node in the cluster.
EMR uses EC2 and S3 to set up hadoop clusters.
Regular Hadoop/HDFS
Support for popular add-ons
Fully managed and easy to use
On demand and SPOT pricing
Integrated with other AWS services
S3
DDB
Kinesis
Bootstrap capabilities have most flexibility at the layer above core Hadoop/HDFS
Popular pattern
1-Customer puts data into S3
2-Make some decisions about what to run (type, number and other technologies to install)
3-Use CLI, SDK, Console or API to launch
4-Output is sent to S3
Easy to resize cluster
Use spot instances to save money
Time to resize is going to be a combination of EC2/AMI boot time + the bootstrap options.
Task nodes - Additional nodes to a running cluster that are SPOT
S3DistCp to load/unload from HDFS
Shutdown the cluster (stop being charged except
Core Hadoop is:
Map Reduce – Computational Model
HDFS – Hadoop Distributed File System
Additional Tools have entered the eco system
Tools to help get data into Hadoop
Tools to connect to Relational Systems
Monitoring
Machine Learning
This slide is a small slice
Scientific, algorithmic, predictive, etc
Real time / stream processing: kinesis and dynamo (First two boxes in first row)
Batch processing: last two boxes, hdfs and s3
This is a summary of all six design patterns together. This summarizes all of the solutions available in the context of the temperature of the data and the data processing latency requirements.
Hive – 1 year worth of click stream data
Spark – 1 year of click stream data – what people are buying frequently together
Redshift – reporting, enterprise reporting tool – SQL Heavy
Impala – same as redshift
Preseto same league as Impala presto – Interactive SQL analytics – have a Hadoop installed base….
NoSQL – Analytics on NoSQL
Contains several months of hourly pageview statistics for all articles in Wikipedia
Data copied into EMR from S3
emr does data transformation using hive
raw data doesn’t have date and time
processing file name to get that info, hive is doing it
Contains several months of hourly pageview statistics for all articles in Wikipedia
Data copied into EMR from S3
emr does data transformation using hive
raw data doesn’t have date and time
processing file name to get that info, hive is doing it