This document discusses modern data architectures for business insights at scale. It begins by explaining how businesses can gain insights from analyzing customer data and logs. It then discusses the challenges posed by big data in terms of increasing volume, velocity, and variety of data. The document outlines several AWS services that can be used to ingest, store, process, and analyze data at different speeds (batch, real-time, interactive). It provides examples of how companies like Redfin, Nordstrom, and Euclid leverage AWS to gain insights from customer data. The document emphasizes experimenting with available data and AWS services to deliver business outcomes and continuous differentiation.
Delta Lake OSS: Create a reliable and performant Data Lake, by Quentin Ambard (Paris Data Engineers!)
Delta Lake is an open source framework that sits on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format.
We'll see all the good Delta Lake can do for your data: ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
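A minimal PySpark sketch of what those transactional writes look like in practice (assuming the delta-spark package is available; paths and data are illustrative):

```python
# Minimal sketch: writing and reading a Delta table with PySpark.
# Assumes the delta-spark package is installed; the path and rows are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    # These two settings register Delta's SQL extensions and catalog.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

events = spark.createDataFrame(
    [(1, "view"), (2, "purchase")], ["user_id", "action"]
)

# Each write is an ACID transaction; schema enforcement rejects mismatched columns.
events.write.format("delta").mode("append").save("/tmp/delta/events")

# Readers always see a consistent snapshot, even while writers append.
spark.read.format("delta").load("/tmp/delta/events").show()
```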
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources and to capture and replicate those changes to other systems. Companies use CDC to sync data across systems, support cloud migrations, or even feed stream processing, among other uses.
In this presentation we'll review CDC patterns, see how to implement them with Apache Kafka, and run a live demo!
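As a taste of the Kafka side, here is a hedged Python sketch of a consumer reading Debezium-style change events; the topic name, broker address, and payload fields are assumptions, not part of the talk:

```python
# Illustrative sketch: consuming Debezium-style CDC change events from a Kafka topic.
# Assumes the confluent-kafka package; broker and topic names are hypothetical.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "cdc-demo",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["dbserver1.inventory.customers"])  # hypothetical CDC topic

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        change = json.loads(msg.value())
        # Debezium events typically carry "op" (c/u/d) plus "before"/"after" images.
        payload = change.get("payload", change)
        print(payload.get("op"), payload.get("after"))
finally:
    consumer.close()
```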
https://www.meetup.com/Mexico-Kafka/events/277309497/
Eugene Kim gives us a detailed overview of the AWS Cloud and how SAP ERP workloads can be implemented on it. He discusses instance sizing in terms of SAPS, as well as High Availability and Disaster Recovery scenarios. SAP HANA and certified solutions are presented as well.
Protecting Data in AWS Environments with Dell Technologies Data Protection Solutions - Jinhwan Jung (정진환), Director, Dell EMC (Amazon Web Services Korea)
Sponsored session | Protecting data in AWS environments with Dell Technologies data protection solutions
Jinhwan Jung (정진환), Director, Dell EMC
This session covers how to protect the data of key services running in AWS using Dell Technologies data protection solutions, and how to easily build a DR environment in AWS for the major virtualization systems you run on premises. It also looks at solutions for cost-effectively retaining customers' long-term archive data in AWS.
Speaker: Ivan Cheng, Solution Architect, AWS
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
Data Analytics Meetup: Introduction to Azure Data Lake Storage (CCG)
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage with Microsoft Data Platform Specialist Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
How to go from zero to data lakes in days - ADB202 - New York AWS Summit (Amazon Web Services)
AWS provides the most comprehensive, secure, scalable, and cost-effective portfolio of services for building and managing data lakes. Now with AWS Lake Formation, you can build a secure data lake in days. In this session, learn how Lake Formation makes it simple to discover, catalog, clean, and load your data into a new data lake. Discover how you can easily secure access to that data and analyze it with services like Amazon Athena, Amazon Redshift, and Amazon EMR. Hear about Alcon’s data lake journey to the AWS Cloud and the challenges it overcame for a successful and productive data lake implementation.
Ditching the overhead - Moving Apache Kafka workloads into Amazon MSK - ADB30... (Amazon Web Services)
Apache Kafka is a popular stream-processing platform, but it’s no secret that it can be tough to set up, manage, and scale. Amazon Managed Streaming for Kafka (Amazon MSK) can help remove some of that toil for you. In this session, you learn about new Amazon MSK features and capabilities. You also get a glimpse under the hood, giving you a better understanding of how Amazon MSK operationalizes Apache Kafka so you don't have to. We compare and contrast Amazon Kinesis Data Streams and Apache Kafka (with/without MSK) and show how to lift-and-shift your workload into Amazon MSK with minimal downtime.
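For readers wondering what the "lift-and-shift" looks like from the client side, here is a small, assumed-configuration sketch: because Amazon MSK speaks the Kafka protocol, an existing producer usually only needs the MSK bootstrap brokers and TLS settings (the broker string and topic below are placeholders):

```python
# Minimal sketch: pointing an existing Kafka producer at an Amazon MSK cluster.
# The bootstrap string is a placeholder; retrieve the real one from the MSK console/API.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "b-1.my-msk-cluster.abc123.kafka.us-east-1.amazonaws.com:9094",
    "security.protocol": "SSL",  # MSK brokers accept TLS connections
})

producer.produce("clickstream", key="user-42", value='{"page": "/home"}')
producer.flush()
```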
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
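As an illustration of querying across the warehouse and the S3 data lake, here is a hedged Python sketch using the Redshift Data API; the cluster, database, user, and the external "spectrum_lake" schema are assumptions made for the example:

```python
# Hedged sketch: a query that joins a local warehouse table with S3 data exposed
# through an external (Redshift Spectrum) schema, run via the Redshift Data API.
import boto3

redshift_data = boto3.client("redshift-data")

sql = """
    SELECT o.order_id, o.amount, c.segment
    FROM sales.orders AS o                 -- local warehouse table
    JOIN spectrum_lake.customers AS c      -- external table over S3 (open file formats)
      ON o.customer_id = c.customer_id
    LIMIT 100;
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="my-redshift-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql=sql,
)
print("Statement id:", resp["Id"])
```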
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Amazon Web Services gives you fast access to flexible and low-cost IT resources, so you can rapidly scale and build virtually any big data application, including data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and internet-of-things processing, regardless of the volume, velocity, and variety of your data.
https://aws.amazon.com/webinars/anz-webinar-series/
Build Real-Time Applications with Databricks Streaming (Databricks)
In this presentation, we will study a use case we implemented recently. In this use case we are working with a large, metropolitan fire department. Our company has already created a complete analytics architecture for the department based upon Azure Data Factory, Databricks, Delta Lake, Azure SQL and SQL Server Analysis Services (SSAS). While this architecture works very well for the department, they would like to add a real-time channel to their reporting infrastructure.
This channel should serve up the following information:
• The most up-to-date locations and status of equipment (fire trucks, ambulances, ladders, etc.)
• The current locations and status of firefighters, EMT personnel and other relevant fire department employees
• The current list of active incidents within the city
The above information should be visualized through an automatically updating dashboard. The central component of the dashboard will be a map which automatically updates with the locations and incidents. This view should be as real-time as possible and will be used by the fire chiefs to assist with real-time decision-making on resource and equipment deployments.
In this presentation, we will leverage Databricks, Spark Structured Streaming, Delta Lake and the Azure platform to create this real-time delivery channel.
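A minimal sketch of the streaming leg of such a channel, assuming Spark Structured Streaming writing to a Delta table that the dashboard reads (the paths and the event schema are invented for illustration):

```python
# Minimal sketch: read incoming unit-status events as a stream and keep a Delta table
# continuously up to date for the dashboard. Paths and schema are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("fire-dept-realtime").getOrCreate()

schema = (
    StructType()
    .add("unit_id", StringType())
    .add("status", StringType())
    .add("lat", DoubleType())
    .add("lon", DoubleType())
    .add("event_time", TimestampType())
)

updates = (
    spark.readStream.schema(schema)
    .json("/mnt/landing/unit-status/")      # micro-batches of JSON events
)

query = (
    updates.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/mnt/checkpoints/unit-status")
    .start("/mnt/delta/unit_status")        # the dashboard reads this Delta table
)
```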
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Alongside the Hive Metastore, these table formats try to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake (Databricks)
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the data change log (binlog) of a relational database (OLTP) and replays these change logs in a timely manner to an external store for real-time OLAP, such as Delta or Kudu. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle OLTP source schema changes, and whether it is easy to build for a variety of databases with little code.
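One common way to apply such a changelog is a MERGE into a Delta table per micro-batch; the sketch below assumes a Delta-enabled SparkSession, and the table path, key column, and "op" flag are illustrative:

```python
# Hedged sketch: applying a micro-batch of CDC rows to a Delta table with MERGE,
# turning an insert/update/delete changelog into an up-to-date table.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # assumes a Delta-enabled session

def apply_cdc_batch(micro_batch_df, batch_id):
    target = DeltaTable.forPath(spark, "/mnt/delta/customers")
    (
        target.alias("t")
        .merge(micro_batch_df.alias("s"), "t.customer_id = s.customer_id")
        .whenMatchedDelete(condition="s.op = 'd'")      # drop rows marked deleted
        .whenMatchedUpdateAll(condition="s.op <> 'd'")  # update existing rows
        .whenNotMatchedInsertAll(condition="s.op <> 'd'")
        .execute()
    )

# Typically wired into a stream with: writeStream.foreachBatch(apply_cdc_batch)
```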
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... (Databricks)
Uber has real needs to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture and use cases of the second generation of 'Hudi', a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage, while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
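For orientation, a hedged sketch of a Hudi upsert through the Spark DataSource API; the option keys below are standard Hudi write configs, but the table name, key fields, and path are assumptions:

```python
# Illustrative sketch of a Hudi upsert via the Spark DataSource API.
# Table name, fields, and the S3 path are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

trips_df = spark.createDataFrame(
    [("t-001", "2019-10-01 12:00:00", 14.5)], ["trip_id", "event_ts", "fare"]
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.precombine.field": "event_ts",
    "hoodie.datasource.write.operation": "upsert",
}

(
    trips_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/trips/")
)
```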
Making Apache Spark Better with Delta Lake (Databricks)
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Introduction to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of Spot EC2 instances to reduce costs, and other Amazon EMR architectural best practices.
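A minimal PySpark sketch of the "S3 instead of HDFS" pattern, reading raw data from S3 and writing curated results back so the EMR cluster itself stays stateless and can run on cheaper Spot capacity (bucket names are placeholders):

```python
# Minimal sketch of the "S3 instead of HDFS" EMR pattern: read raw events from S3,
# aggregate, and write curated results back to S3. Bucket names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("emr-s3-pattern").getOrCreate()

clicks = spark.read.json("s3://my-raw-bucket/clickstream/2017/05/")

daily = (
    clicks.groupBy(F.to_date("timestamp").alias("day"))
    .agg(F.countDistinct("user_id").alias("unique_users"))
)

daily.write.mode("overwrite").parquet("s3://my-curated-bucket/daily-unique-users/")
```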
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Building Data Lakes and Analytics on AWS: Patterns and Best Practices - BDA30... (Amazon Web Services)
In this session, we show you how to understand what data you have, how to drive insights, and how to make predictions using purpose-built AWS services. Learn about the common pitfalls of building data lakes, and discover how to successfully drive analytics and insights from your data. Also learn how services such as Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon ML services work together to build a successful data lake for various roles, including data scientists and business users.
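To make the query side concrete, here is a hedged boto3 sketch that runs an Amazon Athena query against a Glue-cataloged data lake table; the database, table, and results bucket are assumptions:

```python
# Hedged sketch: kick off an Athena query against a data lake table from Python.
# Database, table, and output bucket names are illustrative.
import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM clickstream GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "my_datalake_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print("Query execution id:", resp["QueryExecutionId"])
```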
Presentation given at Coolblue B.V. demonstrating Apache Airflow (incubating), what we learned from its underlying design principles, and how an implementation of these principles reduces the amount of ETL effort. Why choose Airflow? Because it makes your engineering life easier and more people can contribute to how data flows through the organization, so that you can spend more time applying your brain to more difficult problems like Machine Learning, Deep Learning and higher-level analysis.
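A minimal Airflow DAG sketch in the spirit of the talk: two Python tasks chained so Airflow handles scheduling and dependencies (the task bodies are placeholders; the import path assumes Airflow 2.x):

```python
# Minimal Airflow DAG sketch: extract then load, scheduled daily.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source system")  # placeholder

def load():
    print("write data to the warehouse")  # placeholder

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # load runs only after extract succeeds
```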
Apache Kafka is the de facto standard for data streaming and processing data in motion. With its significant adoption growth across all industries, I get a very valid question every week: when NOT to use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How do you qualify Kafka out when it is not the right tool for the job?
This session explores the DOs and DON'Ts. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.
No matter if you think about open source Apache Kafka, a cloud service like Confluent Cloud, or another technology using the Kafka protocol like Redpanda or Pulsar, check out this slide deck.
A detailed article about this topic:
https://www.kai-waehner.de/blog/2022/01/04/when-not-to-use-apache-kafka/
Using Amazon CloudSearch With Databases - CloudSearch Meetup 061913 (Michael Bohlig)
Presentation on using Amazon CloudSearch with databases. What to use when? How can you use CloudSearch with a database? Tom Hill, Solutions Architect, Amazon CloudSearch
Driving Business Insights with a Modern Data Architecture - AWS Summit SG 2017 (Amazon Web Services)
Your customers probably want a better experience with your brand. Your different business teams want and need better insights in their decision making. Almost certainly, your finance and operations teams require this to happen at a fraction of the cost of traditional on-premises options. Modern data architectures on AWS help many of our best customers realize all of those goals. Your business data contains critical information about customer behaviors, operational decisions, and many factors that have financial impact on your organization. Increasingly, this data sits beyond your transactional systems, and is too big, too fast, and too complex for existing systems to handle. AWS Data and Analytics services are designed from our customers' requirements to ingest, store, analyze, and consume information at record-breaking scale. In this session you will learn how these services work together to deliver business automation, enhance customer engagement and intelligence.
Amazon Machine Learning is a service that makes it easy for developers of all skill levels to use machine learning technology. In this session, we will introduce how to use S3 as a Data Lake to collect device information via AWS IoT, and then generate predictions for your application.
(BDT403) Best Practices for Building Real-time Streaming Applications with Am... (Amazon Web Services)
Amazon Kinesis is a fully managed, cloud-based service for real-time data processing over large, distributed data streams. Customers who use Amazon Kinesis can continuously capture and process real-time data such as website clickstreams, financial transactions, social media feeds, IT logs, location-tracking events, and more. In this session, we first focus on building a scalable, durable streaming data ingest workflow, from data producers like mobile devices, servers, or even a web browser, using the right tool for the right job. Then, we cover code design that minimizes duplicates and achieves exactly-once processing semantics in your elastic stream-processing application, built with the Kinesis Client Library. Attend this session to learn best practices for building a real-time streaming data architecture with Amazon Kinesis, and get answers to technical questions frequently asked by those starting to process streaming events.
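A minimal producer sketch for the ingest side described above, assuming boto3 and a pre-created stream; using the user id as the partition key keeps one user's events ordered within a shard:

```python
# Minimal sketch: put clickstream events onto a Kinesis stream.
# The stream name is a placeholder; the stream must already exist.
import json
import boto3

kinesis = boto3.client("kinesis")

def send_click(user_id, page):
    kinesis.put_record(
        StreamName="clickstream",
        Data=json.dumps({"user_id": user_id, "page": page}).encode("utf-8"),
        PartitionKey=user_id,  # events for one user land on the same shard, in order
    )

send_click("user-42", "/checkout")
```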
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who are looking to leverage the AWS Cloud to build and transform their business, are new to the AWS Cloud or looking to further expand their skills and expertise. In this series, we will cover : 'Modern Data Architectures for Business Insights at Scale'.
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
How to use big data to improve an e-commerce (EC) platform is a hot topic. In this session, we will discuss some big data case studies in retail and e-commerce, and introduce how to create a recommendation service with Amazon Machine Learning.
Driving Business Outcomes with a Modern Data Architecture - Level 100 (Amazon Web Services)
Your business data contains critical information about customer behaviors, operational decisions, and many factors that have financial impact on your organisation. Increasingly though, this data is too big, too fast, and too complex for existing systems to handle. AWS Data and Analytics services are designed to ingest, store, analyse, and consume information at record-breaking scale. In this session you will learn how these services work together to deliver business automation, enhance customer engagement and intelligence.
Speaker: Craig Stires, APAC Business Development - Big Data & Analytics, Amazon Web Services
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and... (Amazon Web Services)
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
A Data Lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know what questions you want to ask of your data beforehand. In this session we will explore the architecture of a Data Lake on AWS and cover topics such as storage, processing and security.
Using AWS to design and build your data architecture has never been easier to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- The latest relational database engine, Amazon Aurora, a MySQL-compatible, highly available relational database engine that provides up to five times better performance than MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak, Sr. Manager of Software Development
Euronext, the leading European stock exchange with €3.7 trillion in market cap, built a governed data lake on AWS to analyze data from one of the largest databases in Europe, enriched with 1.5 billion new messages every day. Euronext uses Talend and AWS services (Amazon S3, Amazon Redshift and Amazon EMR) for better agility, elasticity, breadth of functionality and cost savings compared to the previous Netezza-based solution, while guaranteeing data governance and regulatory compliance.
Using AWS to design and build your data architecture has never been easier to gain insights and uncover new opportunities to scale and grow your business. Join this workshop to learn how you can gain insights at scale with the right big data applications.
Big Data and Analytics on Amazon Web Services: Building A Business-Friendly P... (Amazon Web Services)
If you are crafting a better customer experience, automating your business, or modernizing your systems, you are likely finding that your data and analytics platform is absolutely critical to your success. In this session, we will look at how customers are building on the managed services from Amazon Web Services to meet the needs of the business. Patterns we see gaining popularity include near-real-time engagement with customers over mobile, combining and analyzing unstructured consumer behavior with structured transactional data, and managing spiky data workloads. See how our customers use our managed, elastic, secure, and highly available services to change what is possible.
Craig Stires, Head of Big Data and Analytics, Amazon Web Services, APAC
APAC Principal Solutions Architect, Johnathon Meichtry will run through the highlights of 2015 showcasing the biggest announcements and how customers are using these new features. This session will cover the entire breadth of the AWS platform, and is a chance to get a high level overview of all of the announcements, feature updates and new services that AWS has launched in 2015.
2. Data analysis for a better customer experience
• Your business creates and stores data and logs all the time
• Data points and logs allow you to understand individual customer experience and improve it
• Analysis of logs and trails helps gain insights
4. Big Data: Unconstrained data growth
• 95% of the 1.2 zettabytes of data in the digital universe is unstructured
• 70% of this is user-generated content
• Unstructured data growth is explosive, with estimates of compound annual growth (CAGR) at 62% from 2008 to 2012
Source: IDC
[Chart: data volumes growing from GB and TB to PB, EB, and ZB]
5. The data volume gap: data generated is growing far faster than data available for analysis (1990-2020)
Sources: Gartner, User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC, Worldwide Business Analytics Software 2012-2016 Forecast and 2011 Vendor Shares
7. Plethora of Tools
Amazon Glacier, Amazon S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Kinesis-enabled apps, Lambda, ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service
8. Big Data Challenges
• Is there a reference architecture?
• What tools should I use?
• How?
• Why?
9. Driving Business Outcomes via Data Analytics
• Outcome 1: Modernize and consolidate - insights to enhance business applications and create new digital services
• Outcome 2: Innovate for new revenues - personalization, demand forecasting, risk analysis
• Outcome 3: Real-time engagement - interactive customer experience, event-driven automation, fraud detection
• Outcome 4: Automate for expansive reach - automation of business processes and physical infrastructure
12. Redfin: a full-service residential real estate brokerage
• Redfin manages data on hundreds of millions of properties and millions of customers
• The Hot Homes algorithm automatically calculates the likelihood that a home will sell quickly by analyzing more than 500 attributes of each home
• Has been fully AWS-native since day one
https://aws.amazon.com/solutions/case-studies/redfin/
13. Hot Homes
There's an 80% chance this home will sell in the next 11 days – go tour it soon.
14. Redfin reference architecture (ingest/collect, store, process/analyze, consume/visualize): data about users, properties, and agents is ingested with Amazon Kinesis, stored in an Amazon S3 data lake and Amazon DynamoDB, processed and analyzed with Amazon EMR and Amazon Redshift, and consumed as answers and insights: user profiles, recommendations, Hot Homes, similar homes, agent follow-up, agent scorecards, marketing, A/B testing, real-time data, and BI/reporting.
15. Redfin Manages Data on Hundreds of Millions of Properties Using AWS
"Once we solved the infrastructure problem, we could dream a little bigger. Now we can deliver results without worrying about how to scale."
Yong Huang, Director, Big Data and Analytics
• Zero on-premises infrastructure
• Using Spot pricing for EC2, Redfin saved 90% compared to running On-Demand
• Using AWS, Redfin maintains a small technical team, allowing much simplified server management and enabling the transition to DevOps
• Redfin is able to launch products like Hot Homes that greatly improve the buyer experience, by leveraging the agility and scale of AWS
17. Nordstrom: American upscale fashion retailer
• Nordstrom has 323 stores operating in 38 U.S. states and in Canada; it has the largest store count and geographic footprint of its retail competitors
• A fashion retailer that sells clothing, shoes, cosmetics, and accessories
• Nordstrom is going all in on AWS
https://aws.amazon.com/solutions/case-studies/nordstrom/
18.
19. Nordstrom reference architecture (ingest/collect, store, process/analyze, consume/visualize): events from mobile and desktop users are ingested with Amazon Kinesis, processed by AWS Lambda, stored in Amazon DynamoDB and Amazon S3, analyzed with Amazon Redshift and analytics tools, and consumed by the Online Stylist. Outcomes and insights: personalized recommendations within seconds (down from 15-20 minutes), the expertise of stylists scaled to all shoppers, and costs reduced by a 2X order of magnitude.
20. Nordstrom gives personalized style recommendations in seconds
"Alert me when the internet is down ..."
Keith Homewood, Cloud Product Owner, Nordstrom
• Nordstrom Recommendation is the online version of a stylist: it can analyze and deliver personalized recommendations in seconds
• Going all-in on AWS has resulted in reducing costs by 2X
• Continuous delivery allows Nordstrom to deliver multiple production launches a day in a single application
• It can now create a personalized recommendation in seconds, where it used to take 15-20 minutes of processing
• Nordstrom's Cloud Product Owner finds the reliability and availability of AWS so dependable that as long as the internet is working, Nordstrom Recommendation is working
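As a rough illustration of the Kinesis-to-Lambda-to-DynamoDB leg in the architecture above, here is a hedged Python Lambda handler; the table and field names are invented for the example, not Nordstrom's actual schema:

```python
# Hedged sketch: a Lambda handler that decodes Kinesis records and stores a
# recommendation-ready item in DynamoDB. Table and field names are illustrative.
import base64
import json
import boto3

table = boto3.resource("dynamodb").Table("style-events")

def handler(event, context):
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded inside the event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "customer_id": payload["customer_id"],
            "event_time": payload["event_time"],
            "item_viewed": payload.get("item_id", "unknown"),
        })
    return {"processed": len(event["Records"])}
```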
22. Euclid: technology that helps brick-and-mortar retailers optimize performance
• Trusted by over 500 global brands in 45 countries worldwide, and counting
• Euclid analyzes customer movement data to correlate traffic with marketing campaigns and to help retailers optimize hours for peak traffic
• Has been fully AWS-native since day one
https://aws.amazon.com/solutions/case-studies/euclid/
23.
24. Euclid reference architecture (ingest/collect, store, process/analyze, consume/visualize): campaign data, WiFi foot-traffic data, and transactions are collected and stored in an Amazon S3 data lake and Amazon RDS for MySQL, processed and analyzed with Amazon EMR, Amazon Redshift, and Amazon EC2 (behind AWS Elastic Beanstalk and Elastic Load Balancing), and consumed through Euclid Analytics and Euclid EventIQ. Answers and insights include walk-bys, new and return visitors, visit duration, engagement rate, bounce rate, storefront potential and conversion, customer segmentation and loyalty assessment, regional and categorical roll-up reporting, and zoning for large-format locations.
25. Euclid processes point-of-sale analytics for 600 global brands in hours
“We were totally amazed at the speed - a simple count of rows that would take 5½ hours using MySQL only took 30 seconds with Amazon Redshift.”
Dexin Wang, Director of Platform Engineering, Euclid
• Processes tens of TB in hours vs. 2 weeks
• 80-90% reduction in costs
• Euclid has a network of traffic-counting sensors in nearly 400 shopping centers, malls, and street locations
• Euclid analyzes 10+ billion events monthly and 300 million shopping sessions yearly
• “We might have to re-compute up to 18 months of customer data. That requires a lot of computational power, which spikes traffic. We need resources that can scale up on demand and scale down when we don’t need it.”
26. Experiment and scale based on your business needs
SHORT LIST BUSINESS CASES: Modernization, Automation
(Pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights)
27. Experiment and scale based on your business needs
MATCH AVAILABLE DATA: Metrics and Monitoring, Workflow Logs, ERP, Transactions
(Pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights)
28. Experiment and scale based on your business needs
CHOOSE BEST FIT: AWS Import/Export and Amazon Kinesis (ingest/collect), Amazon S3 (store), Amazon EMR and Amazon Redshift (process/analyze), Amazon QuickSight and Amazon SQS (consume/visualize)
(Pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights)
29. (Recap) Redfin data pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights
Data sources: Users, Properties, Agents
AWS services: Amazon Kinesis, Amazon S3 data lake, Amazon EMR, Amazon Redshift, Amazon DynamoDB
Answers & Insights: User Profile, Recommendation, Hot Homes, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, Real Time Data, BI / Reporting, …
30. A platform to build business outcomes from data
Ingest/Collect → Store → Process/Analyze → Consume/Visualize
36. Amazon Kinesis Firehose
• Fully managed streaming service that ingests and loads data into your storage or data warehouse
• Ability to batch, compress, or encrypt streaming data
• Elastic: scales to any throughput (no more sharding)
• Charged only per GB processed ($0.035 per GB)
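As an illustration of how little plumbing Firehose needs, here is a minimal Python (boto3) sketch that pushes JSON events into an assumed delivery stream named "clickstream-to-s3"; the stream name, region, and event shape are illustrative, not from the deck.

import json
import boto3

# Minimal sketch: push JSON events into an existing Firehose delivery stream.
# The stream name "clickstream-to-s3" and the event fields are assumptions.
firehose = boto3.client("firehose", region_name="us-east-1")

def send_event(event: dict) -> None:
    # Firehose batches (and optionally compresses/encrypts) records and delivers
    # them to the configured destination (e.g. S3 or Redshift); no shard management.
    firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

if __name__ == "__main__":
    send_event({"user_id": 42, "action": "view", "page": "/hot-homes"})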
37. What Stream Storage should I use?
(Columns: Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS)
AWS managed service: Yes | Yes | Yes | No | Yes
Guaranteed ordering: Yes | Yes | Yes | Yes | No
Delivery: exactly-once | at-least-once | exactly-once | at-least-once | at-least-once
Data retention period: 24 hours | 7 days | N/A | Configurable | 14 days
Availability: 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ
Scale / throughput: No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limit / automatic
Parallel clients: Yes | Yes | No | Yes | No
Stream MapReduce: Yes | Yes | N/A | Yes | N/A
Record/object size: 400 KB | 1 MB | Amazon Redshift row size | Configurable | 256 KB
Cost: Higher (table cost) | Low | Low | Low (+admin) | Low-medium
(Hot ↔ Warm)
38. COLLECT → STORE
Collect (sources): Mobile apps, Web apps, Data centers (AWS Direct Connect), AWS Import/Export Snowball, Logging (Amazon CloudWatch, AWS CloudTrail), Messaging, Devices / Sensors & IoT platforms (AWS IoT)
Data types: RECORDS (database), DOCUMENTS and FILES (search, file storage), MESSAGES (messaging), STREAMS
Store – stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
Store – message: Amazon SQS
Store – file: Amazon S3
39. Amazon S3
• Highly available object storage
• Designed for 99.999999999% annual
data durability
• Replicated across 3 facilities
• Virtually unlimited scale
• Pay only for what you use, you don’t
need to pre-provision
• Allows event notifications to trigger
further action
• Native support by big data frameworks
Amazon S3
40. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase
my team’s use of Amazon S3. Hoping you could answer
some questions. The current iteration of the design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
Request rate (writes/sec): 300
Object size (bytes): 2,048
Total size (GB/month): 1,483
Objects per month: 777,600,000
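For reference, here is the same sizing arithmetic worked in a short Python sketch (assuming a 30-day month and GB counted as 1024^3 bytes, which reproduces the figures above).

# Minimal sketch of the sizing arithmetic above (assumed 30-day month,
# GB computed as 1024**3 bytes, matching the slide's figures).
writes_per_sec = 300
object_size_bytes = 2048

objects_per_month = writes_per_sec * 86_400 * 30                       # 777,600,000 objects
total_gb_per_month = objects_per_month * object_size_bytes / 1024**3   # ~1,483 GB

print(objects_per_month, round(total_gb_per_month))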
43. Amazon Athena
Interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
• No need to move data
• Query S3 directly & right away
• No infrastructure to set up & manage
• Fast results within seconds
• Pay for just the queries you run
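A hedged boto3 sketch of what "query S3 directly with standard SQL" looks like in practice; the database, table, and results bucket are assumed names, not part of the deck.

import time
import boto3

# Minimal sketch: run a standard SQL query against data in S3 with Athena.
# Database "weblogs", table "clickstream", and the results bucket are assumptions.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS views FROM clickstream GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])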
44. What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot)
data
• Use Amazon S3 Standard for frequently
accessed data
• Use Amazon S3 Standard – IA for infrequently
accessed data
• Use Amazon Glacier for archiving cold data
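One way to apply this tiering guidance is an S3 lifecycle rule; the boto3 sketch below is illustrative, with an assumed bucket, prefix, and 30/90-day thresholds rather than values from the deck.

import boto3

# Minimal sketch: encode the tiering guidance above as an S3 lifecycle rule.
# Bucket name, prefix, and the 30/90-day thresholds are assumptions.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cold-data",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequently accessed
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold archive
                ],
            }
        ]
    },
)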
45. Cache, database, search (COLLECT → STORE)
Collect (sources): Mobile apps, Web apps, Data centers (AWS Direct Connect), AWS Import/Export Snowball, Logging (Amazon CloudWatch, AWS CloudTrail), Messaging, Devices / Sensors & IoT platforms (AWS IoT)
Data types: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
Store – stream storage (hot): Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
Store – message: Amazon SQS
Store – file: Amazon S3
Store – search / SQL / NoSQL / cache (warm): Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache
47. Best Practice - Use the Right Tool for the Job
Data tier / database tier options:
• Search: Amazon Elasticsearch Service
• Cache: Amazon ElastiCache (Redis, Memcached)
• SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server
• NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
52. Amazon EMR
• Amazon EMR is a fully managed
Hadoop cluster
• Transient and long running clusters
• Direct integration into Amazon S3
• Easy to scale and enable burstable
capacity
• Integration with AWS Spot Market
53. Amazon EMR
• Amazon EMR supports all common Hadoop frameworks, such as:
• Spark, Pig, Hive, Hue, Oozie …
• HBase, Presto, Impala …
• Decouples storage from compute
• Allows independent scaling
• Direct integration with DynamoDB and S3
Amazon EMR, Amazon S3, Amazon DynamoDB
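To make the transient-cluster idea concrete, here is a hedged boto3 sketch that launches an EMR cluster, runs one Spark step against data in S3, and terminates itself; the release label, instance types, and S3 paths are assumptions.

import boto3

# Minimal sketch: a transient EMR cluster that runs one Spark step and shuts down.
# Release label, instance types, bucket names, and the job script are assumptions.
emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="nightly-aggregation",
    ReleaseLabel="emr-5.36.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,   # transient: terminate when the step finishes
    },
    Steps=[
        {
            "Name": "aggregate-clickstream",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-jobs/aggregate.py",
                         "s3://my-data/raw/", "s3://my-data/agg/"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)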
54. 1 instance x 100 hours = 100 instances x 1 hour
(and with Spot Pricing not only faster but also cheaper)
55. Amazon Redshift
• Fully managed petabyte-scale data
warehouse
• Scalable amount of cluster nodes
• ODBC/JDBC connector for BI tools
using SQL
• Supports Amazon DynamoDB and
Amazon S3 to load data
• Less than a 10th of the cost of traditional solutions
Amazon Redshift
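A minimal sketch of loading Redshift from S3 and querying it over its PostgreSQL-compatible connection (here via psycopg2, alongside the ODBC/JDBC option mentioned above); the cluster endpoint, credentials, table, and IAM role are assumptions.

import psycopg2  # Redshift accepts PostgreSQL-protocol connections (like ODBC/JDBC)

# Minimal sketch: COPY data from S3 into Redshift, then run a SQL query.
# Endpoint, credentials, table, bucket, and IAM role are all assumptions.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY page_views
        FROM 's3://my-data/agg/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS PARQUET;
    """)
    cur.execute("SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY 2 DESC LIMIT 10;")
    for row in cur.fetchall():
        print(row)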
56. Intel® Processor Technologies
Intel® AVX – Dramatically increases performance for highly parallel HPC workloads
such as life science engineering, data mining, financial analysis, media processing
Intel® AES-NI – Enhances security with new encryption instructions that reduce the
performance penalty associated with encrypting/decrypting data
Intel® Turbo Boost Technology – Increases computing power with performance that
adapts to spikes in workloads
Intel® Transactional Synchronization Extensions (TSX) – Enables independent transactions to execute concurrently, accelerating throughput
P-state & C-state control – Provides granular tuning of core performance and sleep states to improve overall application performance
57. New X1 Instance - Tons of Memory
• Designed for large-scale, in-memory
applications in the cloud
• Ideal for in-memory databases like SAP
HANA and big data processing apps like
Spark and Presto
• Powered by Intel® Xeon® E7 8880 v3
Haswell processors
• Features up to 2TB of memory and up to
128 vCPUs per instance
• 8X the memory offered by any other Amazon EC2
instance
58. 3. Affordable Petabyte-scale Analytics
AWS helps customers maximize the value of Big Data investments while reducing overall IT costs
• Secure, highly durable storage (Amazon S3): $28.16 / TB / month
• Data archiving (Amazon Glacier): $7.16 / TB / month
• Real-time streaming data load (Amazon Kinesis): $0.035 / GB
• 10-node Spark cluster (Amazon EMR): $0.15 / hr
• Petabyte-scale data warehouse (Amazon Redshift): $0.25 / hr
60. Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly
programmed
Machine learning algorithms:
• Supervised learning ← “teach” program
- Classification ← Is this transaction fraud? (yes / no)
- Regression ← Customer life-time value?
• Unsupervised learning ← Let it learn by itself
- Clustering ← Market segmentation
61. Amazon Machine Learning
• Easy to use, managed machine
learning service built for developers
• Machine learning technology based
on Amazon’s internal systems
• Create models using data stored in
Amazon S3, Amazon RDS or Amazon
Redshift
• Request predictions in batch or in real time
Amazon Machine
Learning
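A hedged sketch of a real-time prediction call against an existing Amazon Machine Learning model using boto3; the model ID, endpoint URL, and feature names are assumptions.

import boto3

# Minimal sketch: request a real-time prediction from an existing
# Amazon Machine Learning model (model ID, endpoint, and the feature
# names in Record are assumptions).
ml = boto3.client("machinelearning", region_name="us-east-1")

result = ml.predict(
    MLModelId="ml-abc123ExampleModel",
    Record={"days_since_last_order": "42", "total_orders": "7", "channel": "mobile"},
    PredictEndpoint="https://realtime.machinelearning.us-east-1.amazonaws.com",
)
print(result["Prediction"])   # predicted label/value plus scores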
62. Machine Learning Algorithms
• Classification: sentiment analysis (Do people like my new product?)
• Linear regression: trend prediction (How much revenue next month?)
• Clustering: recommendation (Other people bought this!)
• Association: market basket analysis (bundled products)
• Neural networks: pattern recognition (speech recognition)
Tools: Amazon Machine Learning, Amazon EMR + Spark MLlib, GPU-optimized EC2 instances
63. Amazon Rekognition
Image recognition and analysis powered by deep learning that lets you search, verify, and organize millions of images
• Easy to use
• Batch analysis
• Real-time analysis
• Continually improving
• Low cost
66. Serverless Rekognition Demo
Serverless website that uses Rekognition to identify
faces and classify pictures
Amazon S3
AWS Lambda
Amazon API
Gateway
Amazon
DynamoDB
Amazon
Rekognition
Mobile
CodeFor.Cloud/image
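A minimal sketch of the Lambda function at the heart of this kind of demo: an S3 upload event triggers Rekognition label detection, and the result is stored in DynamoDB for the website to read; the table name and confidence threshold are assumptions.

import boto3

# Minimal sketch: S3 ObjectCreated event -> Rekognition labels -> DynamoDB item.
# The "image-labels" table and the MinConfidence/MaxLabels values are assumptions.
rekognition = boto3.client("rekognition")
table = boto3.resource("dynamodb").Table("image-labels")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        labels = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=10,
            MinConfidence=80.0,
        )["Labels"]

        table.put_item(Item={
            "image_key": key,
            "labels": [label["Name"] for label in labels],
        })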
67.
68. Amazon Polly
Turn text into lifelike speech using deep learning technologies to synthesize speech that sounds like a human voice
• Unlimited replays
• Returns an MP3 or audio stream
• Lightning-fast response
• Fully managed and low cost
69. Amazon Polly: Text In, Life-like Speech Out
Input text: “The temperature in WA is 75°F”
Spoken output: “The temperature in Washington is 75 degrees Fahrenheit”
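A minimal boto3 sketch of the same flow, text in and MP3 out; the voice and output file name are assumptions.

import boto3

# Minimal sketch: synthesize the example sentence above to an MP3 file.
# Voice choice and file name are assumptions.
polly = boto3.client("polly", region_name="us-east-1")

response = polly.synthesize_speech(
    Text="The temperature in WA is 75°F",   # Polly speaks the expanded form shown above
    OutputFormat="mp3",
    VoiceId="Joanna",
)
with open("forecast.mp3", "wb") as f:
    f.write(response["AudioStream"].read())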
70. Amazon Lex
Conversational interfaces for your applications, powered by the same Natural Language Understanding (NLU) & Automatic Speech Recognition (ASR) models as Alexa
• Integrated development in the AWS console
• Trigger AWS Lambda functions
• Multi-step conversations
• Continually improving ASR & NLU models
• Enterprise connectors
• Fully managed
71. Intents: a particular goal that the user wants to achieve
Utterances: spoken or typed phrases that invoke your intent
Slots: data the user must provide to fulfill the intent
Prompts: questions that ask the user to input data
Fulfillment: the business logic required to fulfill the user’s intent
Example intent: BookHotel (see the sketch below)
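To see how intents, utterances, slots, and prompts fit together at runtime, here is a hedged boto3 sketch that drives an assumed BookHotel bot through a short conversation with the Lex runtime API; the bot name, alias, and slot names are assumptions.

import boto3

# Minimal sketch: a multi-step conversation with an (assumed) BookHotel bot.
# Lex elicits any missing slots via its configured prompts.
lex = boto3.client("lex-runtime", region_name="us-east-1")

def say(text: str) -> dict:
    return lex.post_text(
        botName="BookHotel",
        botAlias="prod",
        userId="demo-user-1",
        inputText=text,
    )

reply = say("Book a hotel in Seattle")         # utterance invokes the BookHotel intent
print(reply["dialogState"], reply["message"])  # e.g. ElicitSlot: "What day do you check in?"

reply = say("next Friday, for two nights")     # fills the remaining slots
print(reply.get("slots"), reply["message"])    # fulfillment runs once all slots are filled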
74. AWS Glue
Easily understand your data sources,
prepare the data, and load it reliably to
data stores and your analytics pipeline
Integrated with:
S3, RDS, Redshift & any JDBC-
compliant data store
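A hedged sketch of what a Glue ETL script can look like (PySpark with the awsglue library): read a table from the Glue Data Catalog, drop a field, and write Parquet back to S3; the database, table, and output path are assumptions.

# Minimal sketch of a Glue ETL job (PySpark). Database "sales", table
# "raw_orders", and the output path are assumptions.
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table via the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales", table_name="raw_orders"
)
trimmed = orders.drop_fields(["internal_notes"])

# Write curated Parquet back to S3 for the analytics pipeline.
glue_context.write_dynamic_frame.from_options(
    frame=trimmed,
    connection_type="s3",
    connection_options={"path": "s3://my-data/curated/orders/"},
    format="parquet",
)
job.commit()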
79. COLLECT → ETL → STORE → PROCESS / ANALYZE → CONSUME
Consume options: analysis and visualization with Amazon QuickSight (business users); notebooks and IDE (data scientists, developers); applications & API (apps & services)
80. Amazon QuickSight
• Fast, cloud-powered BI service that makes it easy to build visualizations, perform ad-hoc analysis, and get insights from data.
• Connectors for files, third party platforms,
AWS services and other partner BI tools
• In-memory calculation engine (SPICE)
to accelerate analysis and visualization
• $9 per user per month
81.
82. Athena & Quicksight Demo
Amazon
S3
Amazon
Athena
Amazon
Quicksight
Analyze past flight performance data stored in S3
Bureau of Transportation Flight Data Statistics
www.transtats.bts.gov
Create visualizations from S3 with Athena & Quicksight
86. Suncorp is moving "all-in" on cloud.
Project Ignite will extract benefits of $170 million
- Group CEO Patrick Snowball
Insurance Policy Insurance Claim Core Banking Life Admin
92. AdRoll: AWS Lambda for log files
Valentino Volonghi
CTO, AdRoll
“Polling is not a scalable strategy to
figure out when new files are added to S3,
especially when you add 17M of them per
month. So we moved Lambda in front of
S3.”
• Cross-platform, cross-device
advertising platform
• Offers retargeting based on
clickstream data
300 TB new data/month
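A minimal sketch of the "Lambda in front of S3" pattern AdRoll describes: an ObjectCreated notification invokes the handler for each new log file, so nothing has to poll the bucket; the log format and downstream handling are placeholders.

import gzip
import boto3

# Minimal sketch: S3 ObjectCreated notifications trigger this handler for each
# new log file, replacing polling. Log format and downstream handling are placeholders.
s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        text = gzip.decompress(body).decode("utf-8") if key.endswith(".gz") else body.decode("utf-8")
        lines = text.splitlines()

        # Placeholder: parse/aggregate the log lines, or forward them downstream.
        print(f"{key}: {len(lines)} log lines")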
93. Rethink how to become a data-driven business
• Business outcomes - start with the insights and actions you
want to drive, then work backwards to a streamlined design
• Experimentation - start small, test many ideas, keep the
good ones and scale those up, paying only for what you
consume
• Agile and timely - deploy data processing infrastructure in minutes, not months. Take advantage of a rich platform of services to respond quickly to changing business needs
Volume – 100 to 150 TB a day
Velocity – 1 million reads and writes per second is becoming the norm
Variety ->
IoT / log data / streaming data
Transactional data
File data
Fixed schema
CSV
Parquet
Avro
Schema-free
JSON
Key-value
Small files, large files,
Hourly server logs: were your systems misbehaving 1hr ago
Weekly / Monthly Bill: what you spent this billing cycle
Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time
Daily fraud reports: was there fraud yesterday
Real-time alerts: what went wrong now
Real-time spending caps: prevent overspending now
Real-time analysis: what to offer the current customer now
Real-time detection: block fraudulent use now
I need to harness big data, fast
I want more happy customers
I want to save/make more money
Is there a reference architecture? What tools should I use? How? Why?
Primary Drivers:
Maximize revenue by delivering consistent and personalized marketing and multichannel shopping experiences and keeping a fresh assortment of merchandise in stock
Streamline supply chain operations by analyzing wholesale, inventory, RFID, and POS retail data in real time, automating data exchange with small suppliers, and leveraging consistent supplier data
Secondary Drivers:
Optimize store operations
Boost performance and increase operational efficiency by archiving inactive data from key retail applications
Empower customer service reps to manage issues effectively and be active in social media
And now if you look at the retail market drivers, no surprise here to find personalisation, and the impact of personalisation down the chain, as the number one priority.
This is the number one priority for all retailers, but again we need to qualify a personalisation exercise. We need to understand the perimeter of personalisation, we need to be strategic on the different engagements we want to propose to our customers, and we need to be specific on the benefits the customer is expecting.
The benefits could be related to basket transformation ratios, churn prevention strategies through tailored landing pages based on a simple analysis of search terms, or wider conversion-rate optimisation exercises.
Concept: Qualify the perimeter and the success criteria. Start small, prove your point and have a clear road map of what good looks like by qualifying difficulty/effort vs return. It’s great to talk about omnichannel but you will go nowhere if you don’t have a clear road map of events and if you cannot demonstrate value.
The second priority is the bottom line – supply chain – logistics – stock management.
A more complex area due to the tools that are currently being used and the legacy aspects of these tools. However, the market is changing: supply chain is becoming a commodity, and most supply chain tools are moving to a SaaS model.
Also, it’s fair to say that there is a close link between the primary drivers: product assortment has an impact on supply chain, and providing a brand experience requires adapting the entire operation from both sales and support angles.
AWS will definitely play a part in this market driver and we are currently helping organisations in a few of these areas, for example NISA developing a mobile OCS, Kelloggs using analytics to optimize trade spend and avoid waste, or Unilever decreasing the time to market of campaigns.
The key here, is to understand the benefits we are bringing to the organisations.
Sifting through data is challenging. Businesses need a solution to store and process it and translate it into knowledge and insights.
Matchmaking millions of users with 100 million properties and thousands of agents.
Users:
Clickstream (View, Search, )
Contacts, Tours, Open Houses, Offers...
Properties:
Property facts & history
Neighborhood & POI
Agents:
Availability
Performance, Survey…
"Redfin Hot Homes gives my clients the ultimate insider information," said Keith Thomas, a Redfin real estate agent in Orange County. "Now we know which homes we need to see today, and which ones can wait until next week."
Talk about the services. AWS Lambda is a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information. AWS Lambda starts running your code within milliseconds of an event such as an image upload, in-app activity, website click, or output from a connected device. You can also use AWS Lambda to create new back-end services where compute resources are automatically triggered based on custom requests. With AWS Lambda you pay only for the requests served and the compute time required to run your code. Billing is metered in increments of 100 milliseconds, making it cost-effective and easy to scale automatically from a few requests per day to thousands per second.
Amazon DynamoDB
Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale.
Product owner shares journey to all in
https://www.youtube.com/watch?v=TXmkj2a0fRE
Euclid, a fast-growing technology start-up, helps brick-and-mortar retailers optimize marketing, merchandising, and operations performance by measuring foot traffic, store visits, walk-by conversion, bounce rate, visit duration, and customer loyalty. Euclid analyzes customer movement data to correlate traffic with marketing campaigns and to help retailers optimize hours for peak traffic, among other activities. Euclid stores up to 30 GB of uncompressed data per day in Amazon S3.
Dexin Wang, Director of Platform Engineering reports, “Amazon Redshift is very easy to scale with minimal management requirements,” he comments. “It’s also cost effective. We saw a 90 percent cost reduction moving from our previous database system to Amazon Redshift.”
Euclid stores information on Amazon Simple Storage Service (Amazon S3), and processes data in parallel with Amazon Elastic MapReduce (Amazon EMR).
Initially, the company ran its data store on MySQL but moved to Amazon Redshift to improve performance for analytic workloads. “Using Amazon Redshift, our analysts can work with large data sets and run SQL-based queries to our stack quickly,” Dexin Wang, Director of Platform Engineering reports. “We were totally amazed at the speed—a simple count of rows that would take 5 1/2 hours using MySQL only took 30 seconds with Amazon Redshift.” Wang estimates that it only took a few days to port production data over to Amazon Redshift and start running analysis on it. “Amazon Redshift is very easy to scale with minimal management requirements,” he comments. “It’s also cost effective. We saw a 90 percent cost reduction moving from our previous database system to Amazon Redshift.”
The analytics team leverages Amazon EMR and Hadoop to aggregate and analyze data. “Amazon EMR does most of the heavy lifting,” says Leung. “I used Hadoop in my previous work and we had to spend time installing and managing the cluster. We don’t have to do that with AWS. We only use the service when we need it, which is a great cost savings.” Figure 1 below demonstrates Euclid’s environment on AWS.
As the company continues to grow, it takes advantage of Amazon Redshift and Amazon EMR to run complex queries on large and growing data sets with improved performance. “We’ve collected 1 to 30 GB of data per day over the last three years,” notes Leung. “By running on AWS and taking advantage of Amazon Redshift, we can scale to provide the computational power to complete a task on our entire data set, tens of terabytes, in a couple of hours—a task that used to take two weeks. Overall, compared to what we would have to spend to build an infrastructure capable of meeting our peak compute load requirements, we’re saving 80 to 90 percent using AWS.”
Wang adds, “We didn’t want to worry about infrastructure or scaling. We just want to be able to ask questions and get answers. AWS helps us get answers quickly.”
Turn on Euclid Express today and get key insights you never had before:
Walk-Bys
New & Return Visitors
Visit Duration
Engagement Rate
Bounce Rate
Storefront Potential & Conversion
New insights across all your locations:
Identify leaders and laggards across your chain by KPI
Quickly replicate best practices from your top-performing stores
Pinpoint key trends and regional differences
Get a clearer overall picture by integrating existing systems
Comprehensive features. Powerful insights:
Customer segmentation and loyalty assessment
Regional and categorical roll-up reporting
Zoning for large-format locations
Labor optimization and staffing schedules
Automated insights and predictive analysis
Analyze events, resets and promotions with Euclid EventIQ
Industry, segment, and geographic benchmarking
So, as you get started, you'll want to first shortlist the business cases that you are looking to address, and then work backwards from there. Once you've gotten down to the one or two things that you believe you can change with the right insights, then this is the launch point.
I'll share with you a starting point is for some customers, which is a combination of automating part of the business, and modernizing systems in the process. An example would be improving response times in a call center. You might have different response approaches and times, depending on the channel that a customer communicates with you. But, you may want to standardize your first response to a customer to happen within 1 hour, regardless of channel, whether that is call center, complaining on twitter, or leaving feedback in an app store.
You would identify that a starting place will be to modernize the event and data capture systems, and focus on those that are used by your most valued customers. The key action is to automate the capture of events and data. Then, we will eventually look to automate the response that gets triggered to the customer.
With that as a starting point, you would then look back to what sources of data would be usable to automate that response. In this case, it may be using existing metrics and monitoring systems to establish a baseline and see where the current responses are happening and not. The next source could be the actual workflow logs, let's say in your call center. This could give you an indication of keywords to look for, and how to start automating detection and response. Finally, there could be valuable information available within your ERP systems. This could be information like orders, returns, customer feedback, etc. The important thing is that you take a targeted set of data, and for a limited time scope to begin with
Once you have found the specific data that you expect will be able to provide you insights, then you start thinking about lean design. You would ask, "what's the least amount of infrastructure that I need to turn this data into insights?" This is the key to unlocking a new world of solving challenges in an agile way. Start with a small, fully decoupled design. Each part of the system can scale up as you add more users, more data, and more use cases. It's maybe the first time you've had the chance to only pay for what is giving you benefit.
In the scenario of improving response time to customer feedback, we would look at using AWS Import/Export to load data in from the ERP and workflow systems. We may also be using our monitoring systems to detect customer feedback through other channels, and we would use Amazon Kinesis to capture that in real time. Then, all data would be put into S3 for very inexpensive and durable storage and staging. Then, if you remember the patterns we saw earlier, we would use Redshift as a flexible, purpose-built data warehouse. The data that arrives unstructured and needs context added would first be processed in EMR, the fully managed Hadoop service, keeping the data in S3 and not having to lock yourself to a large persistent cluster. That processed data can then be moved straight into Redshift. And, for the users who will create and consume these insights? The key system requirement here is to automate the response. So, the Amazon Simple Queue Service, or SQS, would be used to connect to your existing applications which service your customers (see the sketch below). For measuring performance and customer satisfaction, you could then use our cloud-native business intelligence and data visualization tool, which will be available later this year. Amazon QuickSight will give business users BI capabilities against sources like Redshift for just $9 per month per user.
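A hedged sketch of the SQS hand-off described above: once a piece of customer feedback is flagged, a standardized first-response task is queued for the existing customer-service application to consume; the queue URL and message fields are assumptions.

import json
import boto3

# Minimal sketch: enqueue a standardized "first response within 1 hour" task
# for an existing application to pick up. Queue URL and fields are assumptions.
sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/customer-first-response"

def queue_first_response(customer_id: str, channel: str, summary: str) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            "customer_id": customer_id,
            "channel": channel,          # call center, twitter, app store, ...
            "summary": summary,
            "sla_minutes": 60,           # standardized first response, regardless of channel
        }),
    )

queue_first_response("C-1042", "twitter", "Complaint about delayed delivery")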
"Redfin Hot Homes gives my clients the ultimate insider information," said Keith Thomas, a Redfin real estate agent in Orange County. "Now we know which homes we need to see today, and which ones can wait until next week."
Users:
Clickstream (View, Search, )
Contacts, Tours, Open Houses, Offers...
Properties:
Property facts & history
Neighborhood & POI
Agents:
Availability
Performance, Survey…
But there's a flaw to this, right? This is just the data. Traditionally, when we look at this, we say "start with all your data, then ... question mark ... question mark ... profit!" That's really hard! This is the most expensive way to go about it, because you're paying all your costs upfront; trying to capture everything to begin with and hoping that there's some results. But what we're really after is that we want things like revenue lift. We want to enter and expand in new markets. We want customer delight and brand advocacy. We want operational excellence. These are the real goals.
So what we'll talk about is how our customers are starting here, they're starting with what they need to get done and finding the shortest path with the least amount of data to get there. This is where we talk about these iterations and innovation cycles. So we'll cover some parts the platform -- what it is that you will use as you start your journey -- what's the smallest amount of "stuff" that you can use to get started.
We have a lot of services, right? AWS has got over seventy services, and if you're using AWS for the first time it might be hard to know exactly where to get started. If you're looking at things like scaling your analytics and beginning a Big Data project, this is where you can drill in and start. From a data warehousing standpoint, I've talked about there being a lot of cost benefits of moving to Redshift, but more importantly you rethink what it means to run a data warehouse. It's no longer about buying a massive appliance. You can start with really small clusters that are tuned for a particular group. So you can build a set of small, specialized data warehouses that are very inexpensive and very scalable. If you're working with unstructured data, which may include touching mobile data for the first time, and you want to run Hadoop, and you've got the skills internally but you're tired of trying to manage your own Hadoop clusters because it's not a great experience on-premises, then moving over to something that's fully managed lets you say "right now I'm running 10 nodes and I want to change and run 50 nodes for one hour." That's clicks of a button. When you're done with that, you just shut it down and stop paying for the Hadoop cluster, because you've decoupled the storage, which lives in S3. This combination of Redshift, plus EMR, plus S3 is a really common combination of services with our customers. If you're also doing real-time streaming, Amazon Kinesis, our fully managed stream processing service, is a fit for capturing and moving the data. This often goes together with DynamoDB, our fully managed NoSQL service, for the real-time serving of data to customers. These are often the key services used for Big Data and analytics initiatives. We also have predictive modeling with our machine learning service, Amazon Machine Learning. Now, a common pattern we see, because these are all interoperable, is that customers use one of these services and then drop the data back down into that very inexpensive storage layer, S3, consume from that, and then push it back down. And we also provide backup services with Glacier. So, this may be a pattern that is also attractive to you.
Types of Data
Database Records
Search Documents
Log Files
Messaging Events
Devices / Sensors / IoT Stream
Huge buffer…
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
Kinesis -> $52.14 per month
SQS -> $133.42 per month for puts, or $400/month (put, get, delete)
DynamoDB -> $3,809.88 per month (10 TB of storage alone costs $2,500/month)
Cost (100 rps x 35 KB):
Kinesis: $52/month
SQS: $133/month * 2 = $266/month
DynamoDB: ?
Amazon DynamoDB Service (US-East):
Provisioned throughput capacity: $120
Indexed data storage: $2,560.90
DynamoDB Streams: $1.30
Amazon SQS Service (US-East)
Pricing example
Let’s assume that our data producers put 100 records per second in aggregate, and each record is 35KB. In this case, the total data input rate is 3.4MB/sec (100 records/sec*35KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Please note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1MB/sec data input and supports 1000 records/sec, four shards provide a capacity of 4MB/sec data input and support 4000 records/sec. So a stream with four shards satisfies our required throughput of 3.4MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
Shard Hour: One shard costs $0.015 per hour, or $0.36 per day ($0.015*24). Our stream has four shards so that it costs $1.44 per day ($0.36*4). For a month with 31 days, our monthly Shard Hour cost is $44.64 ($1.44*31).
PUT Payload Unit (25KB): As our record is 35KB, each record contains two PUT Payload Units. Our data producers put 100 records or 200 PUT Payload Units per second in aggregate. That is 267,840,000 records or 535,680,000 PUT Payload Units per month. As one million PUT Payload Units cost $0.014, our monthly PUT Payload Units cost is $7.499 ($0.014*535.68).
Adding the Shard Hour and PUT Payload Unit costs together, our total Amazon Kinesis costs are $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
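The same arithmetic as a short Python sketch (prices as quoted above, 31-day month).

# Minimal sketch reproducing the Kinesis pricing arithmetic above
# (US-East prices as quoted, 31-day month).
records_per_sec = 100
record_kb = 35

shards = 4                                      # covers 3.4 MB/sec at 100 records/sec
shard_hour_price = 0.015                        # $ per shard-hour
put_unit_price_per_million = 0.014              # $ per million PUT payload units
put_units_per_record = -(-record_kb // 25)      # 35 KB -> 2 units of 25 KB each

seconds_per_month = 86_400 * 31
shard_cost = shards * 24 * 31 * shard_hour_price                      # $44.64
put_cost = (records_per_sec * put_units_per_record * seconds_per_month
            / 1_000_000 * put_unit_price_per_million)                 # ~$7.50

print(round(shard_cost, 2), round(put_cost, 2), round(shard_cost + put_cost, 2))  # 44.64 7.5 52.14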
2 x 2 Matrix
Structured
Level of query (from none to complex)
Draw down the slide
More : https://aws.amazon.com/blogs/aws/ec2-instance-update-x1-sap-hana-t2-nano-websites/
AWS helps customers maximize the value of Big Data investments while reducing overall IT costs. Amazon S3 provides secure, highly durable storage as low as $28.16 per terabyte. With Amazon Glacier, AWS provides low cost data archive platform that starts at only $7.17 per terabyte. That’s why customers like Netflix, Nasdaq and Pinterest store and process petabytes of data for analytics in S3.
AWS also provides a broad range of analytic options that provide customers with enterprise capabilities and performance without the typical high price and up-front investment of traditional enterprise software:
AWS provides a managed petabyte-scale data warehouse and a super-fast business intelligence and visualization service at 1/10th the cost of traditional software solutions. With Amazon Redshift you can analyze a petabyte of data for only $0.25/hour and then use Amazon QuickSight to explore that data for only $10 per user per month.
For streaming data, you can load a terabyte of streaming data with Amazon Kinesis Firehose for only $0.035 per GB.
You can spin up a 10-node managed Spark cluster to aggregate data with Amazon EMR for only $0.15 per hour.
https://na32.salesforce.com/06938000001bpTh
Will this customer leave us?
Add connector
Directed Acyclic Graphs?
Exactly once processing & DAG? – how do you do this??
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Cost:
Redshift – Moderate
Impala -
Presto – Low
S3A* is an open source connector. It is not in EMR 1.2.1 – using bootstrap you can install 2.2 (we have a bootstrap action)
Query speed
Redshift – Extremely fast SQL queries
Spark, Impala – Extremely fast to fast HiveQL
Hive, Tez – Moderately fast to slow HiveQL
Data Volume?
UDFs?
Manageability?
http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
https://amplab.cs.berkeley.edu/benchmark/
http://nerds.airbnb.com/redshift-performance-cost/
Applications & API
Analysis and Visualization
Notebooks
IDE
"Over the next two years as we move to our optimised platform, we'll be able to extract ... benefits of $170 million," in addition to benefits already realised from the transformation process begun in 2010, Snowball said.
Suncorp's vision for its "optimised platform" is digitally enabled customer-facing systems sitting atop simplified core administration systems that feed into a data lake that can drive predictive analytics and business intelligence across the group.
"Increasingly our customers want to connect digitally and we're living in a world of both mobility and technological disruption," Snowball said.
"To ensure that we stay ahead of the competition, we've been investing in systems that are digitally enabled to allow our customers and business partners to access us, how and where they want.
"Standing behind our digital frontend we are completing the development of four core administration systems: One policy and one claim system for all our general insurance businesses both here and in New Zealand, a world-class banking system and a new life administration system."
"These core systems will feed our customer, policy and claims data along with HR, finance and management data, into our single, centralised data lake," the group CEO said.
"This will allow us to establish a best in class business intelligence function providing forward-looking, predictive analytics to deliver better solutions and outcomes for our customers.
"All of this will sit in a secure and flexible cloud environment where our lean and agile capabilities will enable us to deliver new services at high speed and lower cost," Snowball said.
When you leave here and go back to your office, hopefully some of the things you've seen today will spark ideas of how you can build systems that will better enable the business. As there is increasingly a recognition that businesses need to be more data-driven to enable automation and to enhance the decision making of the business, now is a good time to really rethink how to go about that. More and more, we see our customers moving on from old, legacy approaches of buying large, expensive data infrastructure, which takes months or years to start delivering results. To be truly business focused, there are three ways they are thinking differently.
First is to start projects with specific business outcomes in mind. Start from the insights and actions you want, then work backwards to a streamlined design.
Second is experimentation. Start with a lean design. Use just enough data to test your ideas, and just enough services to test them. Design the system to scale up capacity as and when you need it. So, if you hit on a great result, you scale that one up, and the ones that didn't work out can just be turned off. Think... win quick and fail cheap.
Finally, speed. Our best customers are changing their markets; they're redefining what service levels and customer experience mean in those markets. Much of this comes from moving quickly. When an opportunity presents itself, and the business wants to move on it, they think in terms of weeks to design and minutes to deploy. This gives a material advantage over businesses that wait six months for approvals to buy an appliance and more storage.
We've had a lot of customers really succeeding here in Southeast Asia. In Singapore, we have some great customers, like Redmart and Grab, and a number of others that are fundamentally changing aspects of our daily lives as consumers. There are people in this room today who are going to be the ones we're talking about this time next year, and I'm really looking forward to sharing your success then. I want to thank you very much, and I hope the rest of the conference is great for you.