Hearst Publishing uses Amazon Kinesis and Amazon EMR to process clickstream data from over 200 Hearst properties worldwide in near real-time. Originally, Hearst used Pig on EMR to transform and analyze clickstream data in batch mode, but the 15-minute latency was too slow. Hearst then migrated to Apache Spark Streaming on EMR to process data from Amazon Kinesis in real-time windows of 5 minutes or less, enabling faster insights. This allowed Hearst to power features like Buzzing, which provides instant feedback on article engagement across their properties.
Join us for an Amazon Kinesis tutorial webinar. In this session we will provide a reference architecture and instructions for building a system that performs real-time sliding-window analysis over streaming clickstream data. We will use Amazon Kinesis for managed ingestion of streaming data at scale, with the ability to replay past data, and run the sliding-window computation using Apache Storm. In the webinar, we'll demonstrate how to build and deploy the system on AWS and walk through all the steps, from ingesting and processing to storing and visualizing the data in real time.
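The sliding-window computation at the heart of this session can be sketched in a few lines of plain Python. This is an illustration only, not Apache Storm code; the window length and the event fields (timestamp, URL) are assumptions:

```python
from collections import deque

class SlidingWindowCounter:
    """Count clicks per URL over a sliding time window (illustrative sketch)."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, url) pairs in arrival order

    def record(self, timestamp, url):
        self.events.append((timestamp, url))

    def counts(self, now):
        # Evict events older than the window, then count clicks per URL.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        result = {}
        for _, url in self.events:
            result[url] = result.get(url, 0) + 1
        return result

counter = SlidingWindowCounter(window_seconds=60)
counter.record(0, "/home")
counter.record(30, "/article")
counter.record(70, "/article")
print(counter.counts(now=80))  # "/home" at t=0 has aged out of the 60s window
```

A Storm topology would distribute this same eviction-and-count logic across bolts, with Kinesis shards feeding the spouts.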
An overview of Amazon Kinesis Firehose, Amazon Kinesis Analytics, and Amazon Kinesis Streams so you can quickly get started with real-time, streaming data.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv - Amazon Web Services
"Low-latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples."
This overview presentation discusses big data challenges and provides an overview of the AWS Big Data Platform by covering:
- How AWS customers leverage the platform to manage massive volumes of data from a variety of sources while containing costs.
- Reference architectures for popular use cases, including connected devices (IoT), log streaming, real-time intelligence, and analytics.
- The AWS big data portfolio of services, including Amazon S3, Kinesis, DynamoDB, Elastic MapReduce (EMR), and Redshift.
- Amazon Aurora, the latest relational database engine: MySQL-compatible and highly available, with up to five times the performance of MySQL at one-tenth the cost of a commercial database.
Created by: Rahul Pathak,
Sr. Manager of Software Development
"Amgen discovers, develops, manufactures, and delivers innovative human therapeutics, helping millions of people in the fight against serious illnesses. In 2014, Amgen implemented a solution to offload ETL data across a diverse data set (U.S. pharmaceutical prescriptions and claims) using Amazon EMR. The solution has transformed the way Amgen delivers insights and reports to its sales force. To support Amgen’s entry into a much larger market, the ETL process had to scale to eight times its existing data volume. We used Amazon EC2, Amazon S3, Amazon EMR, and Amazon Redshift to generate weekly sales reporting metrics.
This session discusses highlights in Amgen's journey to leverage big data technologies and lay the foundation for future growth: benefits of ETL offloading in Amazon EMR as an entry point for big data technologies; benefits and challenges of using Amazon EMR vs. expanding on-premises ETL and reporting technologies; and how to architect an ETL offload solution using Amazon S3, Amazon EMR, and Impala."
(ISM213) Building and Deploying a Modern Big Data Architecture on AWS - Amazon Web Services
"The AWS platform enables large enterprises to use data to solve business problems and uncover opportunities more easily and affordably than ever before. However, to truly take advantage of AWS, enterprises need a way to collect, store, process, analyze, and continually execute on their data.
Datapipe has been an AWS partner for more than five years. In that time, it has developed a proprietary process for the deployment of AWS environments, as well as the processing and evaluation of big data analytics to optimize these environments over time. This flexible solution includes automation tools, continuous monitoring, and cloud analytics. It protects against architectural sprawl and continually redesigns for scalability. This kind of continuous build environment allows Datapipe to examine the AWS environment as a complete picture and ensure the cloud environment is running as efficiently and effectively as possible, ultimately reducing overhead costs for the enterprise.
In this session, Jason Woodlee, Senior Director of Cloud Products at Datapipe, will discuss the technical details of designing and deploying a modern big data architecture on AWS, including application purpose and design, development environment and language overview, DevOps automation best practices, and continuous build and test frameworks. Session sponsored by Datapipe."
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac... - Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing, and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing, and interactive analytics. AWS services to be covered include Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10... - Amazon Web Services
"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suitable for real-time processing, and discuss how AWS designed Amazon Kinesis to help customers shift from a traditional batch-oriented processing of data to a continual real-time processing model. We’ll provide an overview of the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll also contrast with other approaches for streaming data ingestion and processing. Finally, we’ll also discuss how Kinesis fits as part of a larger big data infrastructure on AWS, including S3, DynamoDB, EMR, and Redshift."
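One of the key concepts the session covers is how a record reaches a shard: Kinesis hashes the record's partition key into a 128-bit space, and each shard owns a contiguous slice of that space. The sketch below mimics that routing in plain Python; the shard count and partition keys are hypothetical:

```python
import hashlib

# Illustrative sketch of Kinesis partition-key routing: MD5-hash the key
# into a 128-bit number, then map it onto one of NUM_SHARDS equal slices.
NUM_SHARDS = 4
HASH_SPACE = 2 ** 128

def shard_for(partition_key: str) -> int:
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return h * NUM_SHARDS // HASH_SPACE  # index of the shard owning this hash

for key in ("user-17", "user-42", "user-99"):
    print(key, "-> shard", shard_for(key))
```

The same partition key always lands on the same shard, which is what preserves per-key ordering and lets you scale throughput by adding shards.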
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 - Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315) - Amazon Web Services
During this session, Greg Brandt and Liyin Tang, data infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
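The "cleanly time-partitioned Hive tables" mentioned above come down to bucketing each record by its event time. A minimal sketch of that step, assuming a Hive-style `dt=`/`hr=` partition layout (the table name, field names, and layout are illustrative, not Airbnb's actual code):

```python
from datetime import datetime, timezone

def partition_path(table: str, event_ts: float) -> str:
    """Derive a Hive-style partition path from an epoch timestamp (UTC)."""
    dt = datetime.fromtimestamp(event_ts, tz=timezone.utc)
    return f"{table}/dt={dt:%Y-%m-%d}/hr={dt:%H}"

# A streaming ETL job would route each change event into its partition:
print(partition_path("bookings", 1460000000))
```

Partitioning on event time (rather than arrival time) keeps late-arriving records in the correct bucket, at the cost of occasionally rewriting a closed partition.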
Day 4 - Big Data on AWS - Redshift, EMR & the Internet of Things - Amazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce, a managed Hadoop framework, and Amazon Redshift, a fully managed, petabyte-scale data warehouse solution.
- Learn about real time data processing with Amazon Kinesis.
The AWS cloud computing platform has disrupted big data. Managing big data applications used to be only for well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, it is self-service and on-demand, with innovative technology and flexible pricing models at low cost and with no commitments. Learn from customer success stories as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion of the tutorials, public datasets, test drives, and our grants program: all of the resources needed to get you started quickly.
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Two years ago, if someone had claimed they could stand up a petabyte-scale data warehouse in under an hour and then have a non-technical business user querying it live 30 minutes later without knowing any SQL or coding language, they would have been laughed out of the room. These days, that's called taking advantage of disruptive technology. Amazon Web Services and Tableau Software have shifted the entire paradigm by which organizations not only store and access their data, but ultimately how they innovate with it. The fast, scalable, and inexpensive services that AWS provides for housing data, combined with Tableau's flexible and user-friendly visual analytics solution, mean that within hours an organization can securely put the power of its massive data assets into the hands of its domain experts, without expensive overhead or lengthy ramp-up time. Attend this webinar to learn how Amazon Web Services and Tableau Software are leveraged together every day to:
- Empower visual ad-hoc data discovery against big data
- Revolutionize corporate reporting and dashboards
- Promote data-driven decision making at every level
The presentation will include:
- A live demonstration of AWS and Tableau working together
- A real customer case study focused on fraud detection and online video metrics
- Live Q&A and an opportunity to trial both solutions
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who are looking to leverage the AWS Cloud to build and transform their business, who are new to the AWS Cloud, or who are looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Business Insights at Scale'.
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent... - Amazon Web Services
Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3, and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop), and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability, at the right cost.
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe... - Amazon Web Services
Originally, Hadoop was used as a batch analytics tool; however, this is rapidly changing as applications move towards real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size — from university students to Fortune 50 companies. We recently launched Amazon Kinesis, a managed service for real-time processing of high-volume streaming data. Amazon Kinesis enables a new class of big data applications that can continuously analyze data at any volume and throughput, in real time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution that can ingest and process terabytes of data per hour from hundreds of thousands of concurrent sources. Forever change how you process website clickstreams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift - Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
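As a rough illustration of the schema-design and loading practices the session covers, the snippet below holds example Redshift SQL: a table with a distribution key and sort key, and a parallel COPY load from S3. The table, column, bucket, and IAM role names are all hypothetical:

```python
# Example Redshift DDL and load command (names are illustrative only).
ddl = """
CREATE TABLE clicks (
    event_time TIMESTAMP,
    user_id    BIGINT,
    url        VARCHAR(2048)
)
DISTKEY (user_id)        -- co-locate each user's rows on one slice for joins
SORTKEY (event_time);    -- range-restricted scans can skip cold blocks
"""

copy_cmd = """
COPY clicks
FROM 's3://example-bucket/clicks/2016/04/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP DELIMITER '\\t';
"""

print(ddl.strip())
print(copy_cmd.strip())
```

COPY loads from S3 in parallel across the cluster's slices, which is why it is preferred over row-at-a-time INSERTs for bulk loading.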
AWS APAC Webinar Week - Big Data on AWS. Redshift, EMR, & IoT - Amazon Web Services
The world is producing an ever-increasing volume, velocity, and variety of data, including data from devices. As we step into the era of the Internet of Things (IoT), batch analytics is no longer enough for many consumers; they need sub-second analysis of fast-moving data. AWS delivers many technologies for solving big data and IoT problems. But which services should you use, why, when, and how? In this webinar, we simplify big data processing as a pipeline comprising various stages: ingest, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems.
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM... - Amazon Web Services
Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets and support hundreds of thousands of users. In this session, we'll demonstrate how you can easily get started with Amazon QuickSight: uploading files, connecting to S3 and Redshift, and creating analyses from visualizations that are optimized based on the underlying data. Once we've built our analysis and dashboard, we'll show you how easy it is to share it with colleagues and stakeholders in just a few seconds. And with SPICE, QuickSight's in-memory calculation engine, you can go from data to insights faster than ever.
Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.
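The filtration-and-transformation step described above can be sketched in plain Python. This is an illustration of the pattern, not the Kinesis Client Library itself, and the event field names are assumptions:

```python
import json

def transform(raw_records):
    """Filter a batch of raw JSON records and project the fields we keep."""
    out = []
    for raw in raw_records:
        event = json.loads(raw)
        if event.get("event_type") != "click":   # filtration: drop non-clicks
            continue
        out.append({                             # transformation: project fields
            "user": event["user_id"],
            "url": event["url"].lower(),
        })
    return out

batch = [
    json.dumps({"event_type": "click", "user_id": 1, "url": "/Home"}),
    json.dumps({"event_type": "heartbeat"}),
]
print(transform(batch))
```

In a real consumer this function would sit inside the record-processing loop, with the transformed batch buffered and flushed to S3, DynamoDB, or Redshift.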
Structured, Unstructured, and Streaming Big Data on AWS - Amazon Web Services
It has never been easier or more affordable to use AWS to solve business problems and uncover new opportunities with data. Now, businesses of all sizes and across all industries can take advantage of big data technologies and easily collect, store, process, analyze, and share their data. Gain a thorough understanding of what AWS offers across the big data lifecycle and learn architectural best practices for applying these technologies to your projects. We will also deep dive into how to use AWS services such as Kinesis, DynamoDB, Redshift, and QuickSight to optimize logging, build real-time applications, and analyze and visualize data at any scale.
Clickstream Data Warehouse - Turning clicks into customers - Albert Hui
As the web becomes a main channel for reaching customers and prospects, clickstream data generated by websites has become another important enterprise data source, alongside traditional business sources such as store transactions, CRM data, and call center logs. As simple as recording every click a customer makes may sound, clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been underutilized. However, the benefits come with a problem: Amazon records 5 billion clicks a day, and the US as a whole generates 400 billion clicks, equivalent to 3.4 petabytes, a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.
This presentation will use big data technology to help solve this big data problem; the presenter will cover clickstream data end to end: benefits, challenges, and the solution. The solution will include a proposed data architecture, ETL, and various machine learning algorithms. A real-world success story will also be presented to help the audience better grasp the concept and its applications. Sample code and a demo will be shown so the audience can apply them in their respective areas.
Deep Dive and Best Practices for Real Time Streaming ApplicationsAmazon Web Services
Get answers to technical questions frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming running on Amazon EMR.
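The duplicate-minimizing guideline above rests on one idea: Kinesis delivers records at least once, so a consumer that tracks producer-assigned record IDs can skip replays and approximate exactly-once processing. A minimal sketch, with illustrative IDs and an in-memory seen-set standing in for a durable store:

```python
class DedupingConsumer:
    """Skip replayed records by remembering producer-assigned IDs (sketch)."""

    def __init__(self):
        self.seen = set()     # in production this would be a durable store
        self.processed = []

    def handle(self, record_id, payload):
        if record_id in self.seen:
            return False      # replayed duplicate: skip
        self.seen.add(record_id)
        self.processed.append(payload)
        return True

consumer = DedupingConsumer()
consumer.handle("evt-1", "click:/home")
consumer.handle("evt-1", "click:/home")  # redelivery after a producer retry
consumer.handle("evt-2", "click:/article")
print(consumer.processed)
```

The design choice here is idempotency on the consumer side: since retries and shard-reader restarts can redeliver records, deduplicating on a stable ID is cheaper than trying to prevent duplicates at the source.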
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Amazon Web Services
AWS has a large and growing portfolio of big data management and analytics services, designed to be integrated into solution architectures that meet the needs of your business. In this session, we look at analytics through the eyes of a business intelligence analyst, a data scientist, and an application developer, and we explore how to quickly leverage Amazon Redshift, Amazon QuickSight, RStudio, and Amazon Machine Learning to create powerful, yet straightforward, business solutions.
Data processing and analysis is where big data is most often consumed - driving business intelligence (BI) use cases that discover and report on meaningful patterns in the data. In this session, we will discuss options for processing, analyzing and visualizing data. We will also look at partner solutions and BI-enabling services from AWS. Attendees will learn about optimal approaches for stream processing, batch processing and Interactive analytics. AWS services to be covered include: Amazon Machine Learning, Elastic MapReduce (EMR), and Redshift.
Introducing Amazon Kinesis: Real-time Processing of Streaming Big Data (BDT10...Amazon Web Services
"This presentation will introduce Kinesis, the new AWS service for real-time streaming big data ingestion and processing.
We’ll provide an overview of the key scenarios and business use cases suitable for real-time processing, and discuss how AWS designed Amazon Kinesis to help customers shift from a traditional batch-oriented processing of data to a continual real-time processing model. We’ll provide an overview of the key concepts, attributes, APIs and features of the service, and discuss building a Kinesis-enabled application for real-time processing. We’ll also contrast with other approaches for streaming data ingestion and processing. Finally, we’ll also discuss how Kinesis fits as part of a larger big data infrastructure on AWS, including S3, DynamoDB, EMR, and Redshift."
AWS Webcast - Managing Big Data in the AWS Cloud_20140924Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)Amazon Web Services
During this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse, using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
Day 4 - Big Data on AWS - RedShift, EMR & the Internet of ThingsAmazon Web Services
Big Data is everywhere these days. But what is it and how can you use it to fuel your business? Data is as important to organizations as labour and capital, and if organizations can effectively capture, analyze, visualize and apply big data insights to their business goals, they can differentiate themselves from their competitors and outperform them in terms of operational efficiency and the bottom line.
Join this session to understand the different AWS Big Data and Analytics services such as Amazon Elastic MapReduce (Hadoop), Amazon Redshift (Data Warehouse) and Amazon Kinesis (Streaming), when to use them and how they work together.
Reasons to attend:
- Learn how AWS can help you process and make better use of your data with meaningful insights.
- Learn about Amazon Elastic MapReduce and Amazon Redshift, fully managed petabyte-scale data warehouse solutions.
- Learn about real time data processing with Amazon Kinesis.
Join us for a series of introductory and technical sessions on AWS Big Data solutions. Gain a thorough understanding of what Amazon Web Services offers across the big data lifecycle and learn architectural best practices for applying those solutions to your projects.
We will kick off this technical seminar in the morning with an introduction to the AWS Big Data platform, including a discussion of popular use cases and reference architectures. In the afternoon, we will deep dive into Machine Learning and Streaming Analytics. We will then walk everyone through building your first Big Data application with AWS.
The AWS cloud computing platform has disrupted big data. Managing big data applications used to be for only well-funded research organizations and large corporations, but not any longer. Hear from Ben Butler, Big Data Solutions Marketing Manager for AWS, to learn how our customers are using big data services in the AWS cloud to innovate faster than ever before. Not only is AWS technology available to everyone, but it is self-service, on-demand, and featuring innovative technology and flexible pricing models at low cost with no commitments. Learn from customer success stories, as Ben shares real-world case studies describing the specific big data challenges being solved on AWS. We will conclude with a discussion around the tutorials, public datasets, test drives, and our grants program - all of the resources needed to get you started quickly.
In this session, we'll review the features and architecture of the new AWS Data Pipeline service and explain how you can use it to better manage your data-driven workloads. We'll then go over a few examples of setting up and provisioning a pipeline in the system.
Two years ago, if someone had claimed they could stand up a petabyte-scale data warehouse in under an hour and then have a non-technical business user querying it live 30 minutes later without knowing any SQL or coding language, they would have been laughed out of the room. These days, that’s called taking advantage of disruptive technology. Amazon Web Services and Tableau Software have shifted the entire paradigm by which organizations not only store and access their data, but ultimately how they innovate with it. The fast, scalable, and inexpensive services that AWS provides for housing data, combined with Tableau’s unbelievably flexible and user-friendly visual analytics solution, mean that within hours an organization can securely put the power of its massive data assets into the hands of its domain experts, without expensive overhead or a lengthy ramp-up time.
Attend this webinar to learn how Amazon Web Services and Tableau Software are leveraged together every day to:
• Empower visual ad-hoc data discovery against big data
• Revolutionize corporate reporting and dashboards
• Promote data-driven decision making at every level
The presentation will include:
• A live demonstration of AWS and Tableau working together
• A real customer case study focused on fraud detection and online video metrics
• Live Q&A and an opportunity to trial both solutions
The AWS Workshop Series Online is a series of live webinars designed for IT professionals who want to leverage the AWS Cloud to build and transform their business, are new to the AWS Cloud, or are looking to further expand their skills and expertise. In this series, we will cover 'Modern Data Architectures for Business Insights at Scale'.
AWS Storage and Database Architecture Best Practices (DAT203) | AWS re:Invent... - Amazon Web Services
Learn about architecture best practices for combining AWS storage and database technologies. We outline AWS storage options (Amazon EBS, Amazon EC2 instance storage, Amazon S3 and Amazon Glacier) along with AWS database options including Amazon ElastiCache (in-memory data store), Amazon RDS (SQL database), Amazon DynamoDB (NoSQL database), Amazon CloudSearch (search), Amazon EMR (Hadoop) and Amazon Redshift (data warehouse). Then we discuss how to architect your database tier by using the right database and storage technologies to achieve the required functionality, performance, availability, and durability - at the right cost.
Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapRe... - Amazon Web Services
Originally, Hadoop was used as a batch analytics tool; however, this is rapidly changing, as applications move towards real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size — from university students to Fortune 50 companies. We recently launched Amazon Kinesis – a managed service for real-time processing of high volume, streaming data. Amazon Kinesis enables a new class of big data applications which can continuously analyze data at any volume and throughput, in real-time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution which can ingest and process terabytes of data per hour from hundreds of thousands of different concurrent sources. Forever change how you process web site click-streams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift - Amazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use workload management.
AWS APAC Webinar Week - Big Data on AWS: Redshift, EMR, & IoT - Amazon Web Services
The world is producing an ever-increasing volume, velocity, and variety of data, including data from devices. As we step into the era of the Internet of Things (IoT), for many consumers, batch analytics is no longer enough; they need sub-second analysis of fast-moving data. AWS delivers many technologies for solving big data and IoT problems. But what services should you use, why, when, and how? In this webinar, we simplify big data processing as a pipeline comprising various stages: ingest, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, and durability. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems.
AWS re:Invent 2016: Visualizing Big Data Insights with Amazon QuickSight (BDM... - Amazon Web Services
Amazon QuickSight is a fast BI service that makes it easy for you to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight is built to harness the power and scalability of the cloud, so you can easily run analysis on large datasets and support hundreds of thousands of users. In this session, we’ll demonstrate how you can easily get started with Amazon QuickSight: uploading files, connecting to S3 and Redshift, and creating analyses from visualizations that are optimized based on the underlying data. Once we’ve built our analysis and dashboard, we’ll show you how easy it is to share it with colleagues and stakeholders in just a few seconds. And with SPICE, QuickSight’s in-memory calculation engine, you can go from data to insights faster than ever.
Amazon Kinesis is the AWS service for real-time streaming big data ingestion and processing. This talk gives a detailed exploration of Kinesis stream processing. We'll discuss in detail techniques for building and scaling Kinesis processing applications, including data filtration and transformation. Finally, we'll address tips and techniques for emitting data into S3, DynamoDB, and Redshift.
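As a concrete sketch of the filtration and transformation steps mentioned above, here is a minimal, self-contained example. The record layout (a base64-encoded JSON payload under a `data` key) and the field names are illustrative assumptions, not the talk's actual code:

```python
import base64
import json

def transform_records(raw_records):
    """Filter and transform a batch of records, Kinesis-consumer style.

    Each raw record is assumed to carry a base64-encoded JSON payload,
    mirroring how Kinesis client code typically receives data. Records
    without a 'page' field are filtered out; the rest are reshaped for
    downstream storage in S3, DynamoDB, or Redshift.
    """
    out = []
    for rec in raw_records:
        payload = json.loads(base64.b64decode(rec["data"]))
        if "page" not in payload:        # filtration step
            continue
        out.append({                     # transformation step
            "page": payload["page"],
            "user": payload.get("user", "anonymous"),
        })
    return out
```

In a real consumer this function would run inside the record-processing loop of the Kinesis Client Library, with the emitted batch written to the chosen sink.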
Structured, Unstructured and Streaming Big Data on the AWS - Amazon Web Services
It has never been easier or more affordable to use AWS to solve business problems and uncover new opportunities with data. Now, businesses of all sizes and across all industries can take advantage of big data technologies and easily collect, store, process, analyze, and share their data. Gain a thorough understanding of what AWS offers across the big data lifecycle and learn architectural best practices for applying these technologies to your projects. We will also deep dive into how to use AWS services such as Kinesis, DynamoDB, Redshift, and QuickSight to optimize logging, build real-time applications, and analyze and visualize data at any scale.
Clickstream Data Warehouse - Turning clicks into customers - Albert Hui
As the web becomes a main channel for reaching customers and prospects, the clickstream data generated by websites has become another important enterprise data source, alongside traditional business sources such as store transactions, CRM data, and call center logs. As simple as recording every click a customer makes may sound, clickstream data offers a wide range of opportunities for modelling user behaviour and gaining valuable customer insights. It is a data source that has been under-utilized. However, the benefits come with a problem: Amazon records 5 billion clicks a day, and the US as a whole generates 400 billion clicks, equivalent to 3.4 petabytes a day. This immense volume gives enterprises and their IT professionals a big data problem to solve before they can fully utilize this insight-rich data source.
This presentation uses big data technology to help solve that problem; the presenter explains clickstream data in depth, covering its benefits, challenges, and a solution. The end-to-end solution includes a proposed data architecture, ETL, and various machine learning algorithms. A real-world success story is also presented to help the audience better grasp the concept and its applications, along with sample code and a demo that attendees can apply in their respective areas.
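One of the first ETL steps such a clickstream solution typically performs is sessionization: grouping a visitor's clicks into sessions separated by a period of inactivity. A minimal sketch, assuming clicks arrive as time-sorted (timestamp, url) pairs and using the conventional 30-minute gap:

```python
def sessionize(clicks, gap_seconds=1800):
    """Group one user's clicks into sessions split by inactivity.

    `clicks` is a list of (timestamp_seconds, url) tuples, assumed
    sorted by time. A gap longer than `gap_seconds` (30 minutes by
    default) starts a new session. This is a toy version of the
    sessionization step an ETL pipeline runs before behavioural
    modelling.
    """
    sessions = []
    current = []
    last_ts = None
    for ts, url in clicks:
        if last_ts is not None and ts - last_ts > gap_seconds:
            sessions.append(current)   # close the previous session
            current = []
        current.append((ts, url))
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```

At warehouse scale the same logic is usually expressed as a window function over user and timestamp rather than a Python loop.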
Deep Dive and Best Practices for Real Time Streaming Applications - Amazon Web Services
Get answers to technical questions frequently asked by those starting to work with streaming data. Learn best practices for building a real-time streaming data architecture on AWS with Amazon Kinesis, Spark Streaming, AWS Lambda, and Amazon EMR. First, we will focus on building a scalable, durable streaming data ingestion workflow from data producers like mobile devices, servers, or even web browsers. We will provide guidelines to minimize duplicates and achieve exactly-once processing semantics in your stream-processing applications. Then, we will show some of the proven architectures for processing streaming data using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming running on Amazon EMR.
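The duplicate-minimization guideline above usually comes down to idempotent processing keyed on a record identifier such as the Kinesis sequence number. A toy sketch, where the in-memory `seen` set stands in for a durable checkpoint store (e.g., DynamoDB) that a real application would use:

```python
class DedupingProcessor:
    """Sketch of turning at-least-once delivery into effectively-once
    processing.

    Streams like Kinesis can deliver a record more than once, so a
    consumer that must not double-count tracks the sequence numbers
    (or idempotency keys) it has already applied. Here the 'seen' set
    lives in memory; durably checkpointing it alongside the output is
    what makes the approach safe across restarts.
    """
    def __init__(self):
        self.seen = set()
        self.total = 0

    def process(self, seq_no, value):
        if seq_no in self.seen:   # duplicate delivery: skip
            return False
        self.seen.add(seq_no)
        self.total += value       # the side effect we must not repeat
        return True
```

The same pattern generalizes: any side effect keyed by a unique record ID can be retried safely.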
(BDT320) New! Streaming Data Flows with Amazon Kinesis Firehose - Amazon Web Services
Amazon Kinesis Firehose is a fully-managed, elastic service to deliver real-time data streams to Amazon S3, Amazon Redshift, and other destinations. In this session, we start with overviews of Amazon Kinesis Firehose and Amazon Kinesis Analytics. We then discuss how Amazon Kinesis Firehose makes it even easier to get started with streaming data, without writing a stream processing application or provisioning a single resource. You learn about the key features of Amazon Kinesis Firehose, including its companion agent that makes emitting data from data producers even easier. We walk through capture and delivery with an end-to-end demo, and discuss key metrics that will help developers and architects understand their streaming data flow. Finally, we look at some patterns for data consumption as the data streams into S3. We show two examples: using AWS Lambda, and how you can use Apache Spark running within Amazon EMR to query data directly in Amazon S3 through EMRFS.
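Under the hood, Firehose delivery rests on buffering: records accumulate and are delivered as a batch once a size threshold or a time interval is reached, whichever comes first. The sketch below is an illustrative model of that behaviour only; real Firehose buffering hints are configured on the delivery stream, not written as client code, and the class and parameter names here are invented:

```python
class Buffer:
    """Toy model of Firehose-style buffering.

    Records accumulate and are flushed as a batch once a record-count
    threshold or a maximum age is reached, whichever comes first. The
    clock is passed in explicitly so behaviour is deterministic and
    easy to test.
    """
    def __init__(self, max_records=3, max_age=60):
        self.max_records = max_records
        self.max_age = max_age
        self.records = []
        self.opened_at = None
        self.flushed = []            # batches "delivered" downstream

    def put(self, record, now):
        if not self.records:
            self.opened_at = now     # batch starts with its first record
        self.records.append(record)
        if (len(self.records) >= self.max_records
                or now - self.opened_at >= self.max_age):
            self.flushed.append(self.records)
            self.records = []
```

This trade-off (bigger batches vs. lower latency) is exactly what Firehose's buffer size and buffer interval settings tune.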
DPACC Acceleration Progress and Demonstration - OPNFV
The session provides an update on the DPACC project within OPNFV, with a brief discussion of APIs and implementation progress. It will review the API definition progress and follow up with a demo highlighting a common application as the vNF running on top of the DPACC-defined layers. The demo will highlight the use of both hardware and software acceleration via the DPACC-defined acceleration layers, and will show the progress in optimizing the performance and latency characteristics of a platform to realize the vision of NFV while meeting the stringent requirements, particularly for certain workloads, imposed by carriers.
Analytics & Reporting for Amazon Cloud Logs - Cloudlytics
A deep dive into the Cloudlytics Reports section, covering the following reports in detail and how they can help with your business use case:
- Geo Tracker Report
- IP Tracker Report
- Timeline Report
- ELB Tracker
- CloudFront Cost Analyzer
- Custom Function
World's Best AWS Cloud Log Analytics & Management Tool - Cloudlytics
An overview of Cloudlytics features, pricing details, offers for AWS Activate customers, and AWS Marketplace info, plus a sneak preview of all the analytics (the Reports section will be covered in detail in our next presentation).
AWS re:Invent 2016 | GAM301 | How EA Leveraged Amazon Redshift and AWS Partner... - Amazon Web Services
In November 2015, Capital Games launched a mobile game accompanying a major feature film release. The back end of the game is hosted in AWS and uses big data services like Amazon Kinesis, Amazon EC2, Amazon S3, Amazon Redshift, and AWS Data Pipeline. Capital Games will describe some of the challenges in their initial setup and usage of Amazon Redshift and Amazon EMR. They will then go over their engagement with AWS Partner 47Lining and talk about specific best practices regarding solution architecture, data transformation pipelines, and system maintenance using AWS big data services. Attendees of this session can expect a candid view of the process of implementing a big data solution, from problem statement identification to visualizing data, with an in-depth look at the technical challenges and hurdles along the way.
(MBL303) Get Deeper Insights Using Amazon Mobile Analytics | AWS re:Invent 2014 - Amazon Web Services
Choosing the right mobile analytics solution can help you understand user behavior, engage users, and maximize user lifetime value. After this session, you will understand how you can learn more about your users and their behavior quickly across platforms with just one line of code using Amazon Mobile Analytics.
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK - Nate Wiger
See the latest analytics architectures for companies succeeding in the free-to-play space, such as Supercell, GREE, and Rovio. Also see how to create a real-time analytics pipeline to connect to your players, enabling you to deliver deeper experiences.
• What is a web log?
• Where do they come from?
• Why are they relevant?
• How can we analyze them?
• What about Clickstream?
Facing these questions, I did some personal research and put together a synthesis, which helped me clarify some ideas. This presentation does not claim to be exhaustive on the subject, but it may bring you some useful insights.
(BDT306) Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesi... - Amazon Web Services
Organizations processing mission critical high-volume data must be able to achieve high levels of throughput and durability in data processing workflows. In this session, we will learn how DataXu is using Amazon Kinesis, Amazon S3, and Amazon EMR for its patented approach to programmatic marketing. Every second, the DataXu Marketing Cloud processes over 1 Million ad requests and makes more than 40 billion decisions to select and bid on ad impressions that are most likely to convert. In addition to addressing the scalability and availability of the platform, we will explore Amazon Kinesis producer and consumer applications that support high levels of scalability and durability in mission-critical record processing.
(GAM302) EA's Real-World Hurdles with Millions of Players in The Simpsons: Ta... - Amazon Web Services
How do you really architect a game that can handle 5, 6, or 7 million daily active users? Learn about the scalability challenges that EA had to overcome for The Simpsons: Tapped Out. Hear how EA had to redesign their MySQL-based database layer on the fly, migrating over to Amazon DynamoDB, while keeping the game running. See how EA added AWS Elastic Beanstalk and Auto Scaling to simplify their deployments, while also lowering costs by enabling them to respond to changing player counts. EA shows how they switched from sticky sessions to Amazon ElastiCache, solving player disconnects and allowing further scaling out. Finally, EA shares some interesting statistics about The Simpsons: Tapped Out, as well as their overall learnings about how best to develop, deploy, and monitor a game on AWS.
AWS ML and SparkML on EMR to Build Recommendation Engine - Amazon Web Services
Machine Learning
A managed supervised learning environment for building different model types, including binary classification, multi-class classification, and regression. The first demo uses a dataset of banking customers with demographics to predict the likelihood of default using binary classification; the second predicts traffic at a UK bike rental shop using linear regression; and the third predicts rainforest soil type using multi-class classification.
Benefits: A managed, on-demand environment for supervised learning algorithms, available as batch processing or a real-time API.
Spark ML Cluster
Running Spark on an AWS-managed cluster, storing data on HDFS / S3 persistent storage, with modules including MLlib and Zeppelin (a web notebook), to build a movie recommendation engine based on collaborative filtering. The dataset contains 10M ratings provided by GroupLens from the MovieLens website.
Benefits: Fully managed clusters, with HA, scalability, elasticity, and Spot Instance pricing.
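To make the collaborative-filtering idea concrete without a cluster, here is a toy user-based recommender in plain Python. Spark MLlib's ALS solves the same problem at scale via matrix factorization; the function names and the tiny rating dicts below are purely illustrative:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two rating dicts {item: rating}."""
    common = set(u) & set(v)
    num = sum(u[i] * v[i] for i in common)
    den = (sqrt(sum(x * x for x in u.values()))
           * sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(target, others, k=1):
    """User-based collaborative filtering in miniature.

    Find the user most similar to `target` and suggest items they
    rated highly that `target` has not seen. This captures the
    'users with similar tastes' intuition that MLlib's ALS realizes
    at scale through matrix factorization.
    """
    best = max(others, key=lambda o: cosine(target, o))
    candidates = [i for i in best if i not in target]
    return sorted(candidates, key=lambda i: -best[i])[:k]
```

With 10M MovieLens ratings the similarity computation must be distributed, which is exactly the job EMR and Spark take on.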
AWS Data Transfer Services: Data Ingest Strategies Into the AWS Cloud - Amazon Web Services
Different types and sizes of data require different strategies. In this session, learn about the various features and services available for migrating data, be it small ongoing transactional data or large multi-petabyte volumes. Come learn how customers are using the latest network, streaming and large scale ingest features for their cloud data migrations to AWS storage services.
(DVO312) Sony: Building At-Scale Services with AWS Elastic Beanstalk - Amazon Web Services
Learn about Sony's efforts to build a cloud-native authentication and profile management platform on AWS. Sony engineers demonstrate how they used AWS Elastic Beanstalk to deploy, manage, and scale their applications. They also describe how they use AWS CloudFormation for resource provisioning, Amazon DynamoDB for the main database, and AWS Lambda and Amazon Redshift for log handling and analysis. This discussion focuses on best practices, security considerations, tradeoffs, and the final architecture and implementation. By the end of the session, you will clearly understand how to use Elastic Beanstalk as a platform to quickly and easily build at-scale web applications on AWS, and how to use Elastic Beanstalk with other AWS services to build cloud-native applications.
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014 - Amazon Web Services
Log data contains some of the most valuable raw information you can gather and analyze about your infrastructure and applications. Amid the mess of confusing lines of seemingly random text can be hints about performance, security, flaws in code, user access patterns, and other operational data. Without the proper tools, finding insights in these logs can be like searching for a hay-colored needle in a haystack. In this session you learn what practices and patterns you can easily implement that can help you better understand your log files. You see how you can customize web logs to add more information to them, how to digest logs from around your infrastructure, and how to analyze your log files in near real time.
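As a small example of the log digestion described above, the sketch below tallies HTTP status codes from Common Log Format lines, skipping anything that does not parse. The regex covers the standard fields; treat the specifics as an assumption of this toy rather than the session's actual tooling:

```python
import re

# Common Log Format, e.g.:
# 127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def status_counts(lines):
    """Tally HTTP status codes across raw access-log lines.

    Unparseable lines are skipped -- the kind of first-pass digestion
    a near-real-time log pipeline performs before alerting on error
    rates or access patterns.
    """
    counts = {}
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            status = m.group("status")
            counts[status] = counts.get(status, 0) + 1
    return counts
```

Spikes in the 4xx or 5xx buckets of such a tally are often the first operational signal that something in the application has gone wrong.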
Learn best practices for building a real-time streaming data architecture on AWS with Spark Streaming, Amazon Kinesis, and Amazon Elastic MapReduce (EMR). Get a closer look at how to ingest streaming data scalably and durably from data producers like mobile devices, servers, and even web browsers, and design a stream processing application with minimal data duplication and exactly-once processing.
Presented by: Guy Ernest, Principal Business Development Manager, Amazon Web Services
Customer Guest: Harry Koch, Solutions Architecture, Philips
Ask an Amazon Redshift Customer Anything (ANT389) - AWS re:Invent 2018 - Amazon Web Services
Learn best practices from Hilton Hotels Worldwide as they built an Enterprise Data Lake/Management (EDM) platform on AWS to drive insights and analytics for their business applications, including worldwide hotel booking and reservation management systems. The EDM architecture is built with Hadoop clusters running on Amazon EC2 combined with Amazon Redshift and Amazon Athena for data warehousing and ad hoc SQL analytics. This is a great opportunity to get an unfiltered customer perspective on their road to data nirvana!
Amazon Kinesis provides services for you to work with streaming data on AWS. Learn how to load streaming data continuously and cost-effectively to Amazon S3 and Amazon Redshift using Amazon Kinesis Firehose without writing custom stream processing code. Get an introduction to building custom stream processing applications with Amazon Kinesis Streams for specialised needs.
Presented at the AWS Summit in London, here's a deep dive on getting started with Amazon Kinesis and use-case with Jampp, the world's leading mobile app marketing platform.
Real-Time Web Analytics with Amazon Kinesis Data Analytics (ADT401) - AWS re:... - Amazon Web Services
Knowing what users are doing on your websites in real time provides insights you can act on without waiting for delayed batch processing of clickstream data. Watching the immediate impact on user behavior after new releases, detecting and responding to anomalies, situational awareness, and evaluating trends are all benefits of real-time website analytics. In this workshop, we build a cost-optimized platform to capture web beacon traffic, analyze it for interesting metrics, and display it on a customized dashboard. We start by deploying the Web Analytics Solution Accelerator; once the core is complete, we extend the solution to capture new and interesting metrics, process them with Amazon Kinesis Analytics, and display new graphs on the custom dashboard. Participants come away with a fully functional system for capturing, analyzing, and displaying valuable website metrics in real time.
AWS April 2016 Webinar Series - Getting Started with Real-Time Data Analytics... - Amazon Web Services
It is becoming increasingly important to analyze real-time streaming data, as it allows organizations to remain competitive by uncovering relevant, actionable insights. AWS makes it easy to capture, store, and analyze real-time streaming data.
In this webinar, we will guide you through some of the proven architectures for processing streaming data, using a combination of tools including Amazon Kinesis Streams, AWS Lambda, and Spark Streaming on Amazon Elastic MapReduce (EMR). We will then talk about common use cases and best practices for real-time data analysis on AWS.
Learning Objectives:
Understand how you can analyze real-time data streams using Amazon Kinesis, AWS Lambda, and Spark running on Amazon EMR
Learn use cases and best practices for streaming data applications on AWS
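The windowed analysis these sessions describe boils down to maintaining per-key counts over a moving time window, evicting events as they age out. A minimal, framework-free sketch (the class and parameter names are illustrative, not any framework's API):

```python
from collections import deque

class SlidingWindowCounter:
    """Count events per key over the last `window` seconds.

    This is the core operation behind Spark Streaming's window
    transformations and Kinesis Analytics' windowed SQL: events enter
    in timestamp order, and old events are evicted as the clock
    advances past the window boundary.
    """
    def __init__(self, window=300):
        self.window = window
        self.events = deque()        # (timestamp, key) pairs in order

    def add(self, ts, key):
        self.events.append((ts, key))

    def counts(self, now):
        # Evict everything older than the window, then tally the rest.
        while self.events and self.events[0][0] <= now - self.window:
            self.events.popleft()
        result = {}
        for _, key in self.events:
            result[key] = result.get(key, 0) + 1
        return result
```

Distributed engines shard this state by key and checkpoint it, but the eviction-then-aggregate loop is the same idea.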
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304) - Amazon Web Services
In this session, you learn about the latest and hottest features of Amazon Redshift. Join Vidhya Srinivasan, General Manager of Amazon Redshift, to take a deep dive into the architecture and inner workings of Amazon Redshift. You discover how the recent availability, performance, and manageability improvements we’ve made can significantly enhance your end user experience. You also get a glimpse of what we are working on and our plans for the future.
AWS Summit 2013 | Singapore - Big Data Analytics, Presented by AWS, Intel and... - Amazon Web Services
Learn more about the tools, techniques and technologies for working productively with data at any scale. This session will introduce the family of data analytics tools on AWS which you can use to collect, compute and collaborate around data, from gigabytes to petabytes. We'll discuss Amazon Elastic MapReduce, Hadoop, structured and unstructured data, and the EC2 instance types which enable high performance analytics.
Path to the future #4 - Real-time data ingestion, processing, and analysis - Amazon Web Services LATAM
In this session we will run a demonstration of air traffic control and defense using real-time processing.
We will cover best practices for data ingestion, storage, processing, and visualization using AWS services such as Kinesis, DynamoDB, Lambda, Redshift, QuickSight, and Amazon Machine Learning.
Amazon Redshift Update and How Equinox Fitness Clubs Migrated to a Modern Dat... - Amazon Web Services
In this session, we provide an update on Amazon Redshift, and look at a case study from Equinox Fitness Clubs. We show you how Amazon Redshift queries data across your data warehouse and data lake, without the need or delay of loading data, to deliver insights you cannot obtain by querying independent data silos. Discover how Equinox Fitness Clubs transitioned from on-premises data warehouses and data marts to a cloud-based, integrated data platform, built on AWS and Amazon Redshift. Learn about their journey from static reports, redundant data, and inefficient data integration to a modern and flexible data lake and data warehouse architecture that delivers dynamic reports based on trusted data.
by Darin Briskman, Technical Evangelist, AWS
Amazon Kinesis Data Analytics gives us the tools to run SQL queries against active data streams. We'll look at how we can perform real-time log analytics and build entire streaming applications using SQL, so that you can gain actionable insights promptly.
Amazon Kinesis is a platform for streaming data on AWS, offering powerful services to make it easy to load and analyze streaming data. In this session, you'll learn about how AWS customers are transitioning from batch to real-time processing using Amazon Kinesis, and how to get started. We will provide an overview of streaming data applications and introduce the Amazon Kinesis platform and its services. We will walk through a production use case to demonstrate how to ingest streaming data, prepare it, and analyze it to gain actionable insights in real time using Amazon Kinesis. We will also provide pointers to tutorials and other resources so you can quickly get started with your streaming data application.
A data lake allows an organisation to store all of its data, structured and unstructured, in one centralised repository. Since data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data. In this session we will explore the architecture of a data lake on AWS and cover topics such as storage, processing and security.
Amazon Kinesis Platform – The Complete Overview - Pop-up Loft TLV 2017 - Amazon Web Services
Real-time streaming analytics has become popular across many verticals and use cases. In AdTech, gaming, financial services, and IoT, AWS customers are leveraging the Amazon Kinesis platform to ingest billions of events every day and process them in real time. In this session, we will discuss Amazon Kinesis Streams, Amazon Kinesis Firehose, and Amazon Kinesis Analytics. We will show best practices and design patterns for integrating the Amazon Kinesis platform with other services like Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, and AWS Lambda, as well as third-party connectors like Storm, Spark, and more.
Get Started with Real-Time Streaming Data in Under 5 Minutes - AWS Online Tec... - Amazon Web Services
Learning Objectives:
- Identify common problems that streaming data can help solve
- Understand the AWS services that are used to solve these problems, including Amazon Kinesis
- Try out one of 5+ different solutions powered by Amazon Kinesis through AWS CloudFormation templates
Social Media Analytics with Amazon QuickSight (ANT370) - AWS re:Invent 2018 - Amazon Web Services
Realizing the value of social media analytics can bolster your business goals. This type of analysis has grown in recent years due to the large amount of available information and the speed at which it can be collected and analyzed. In this workshop, we build a serverless data processing and machine learning (ML) pipeline that provides a multi-lingual social media dashboard of tweets within Amazon QuickSight. We leverage API-driven ML services, AWS Glue, Amazon Athena and Amazon QuickSight. These building blocks are put together with very little code by leveraging serverless offerings within AWS.
How to Build Forecasting Services Using ML and Deep Learn... - Amazon Web Services
Forecasting is an important process for a great many companies and is used in various areas to try to accurately predict the growth and distribution of a product, the resources needed on production lines, financial presentations, and much more. Amazon uses advanced forecasting techniques, and some of these services have been made available to all AWS customers.
In this session we will show how to pre-process data that contains a temporal component, and then use an algorithm that produces an accurate forecast based on the type of data analyzed.
Big Data for Startups: How to Create Big Data Applications in Server... - Amazon Web Services
The variety and quantity of data created every day is accelerating ever faster and represents a unique opportunity to innovate and create new startups.
However, managing large amounts of data can seem complex: building large-scale Big Data clusters looks like an investment accessible only to established companies. But the elasticity of the Cloud and, in particular, Serverless services let us break through these limits.
Let's see, then, how to develop Big Data applications quickly, without worrying about infrastructure, devoting all our resources to developing our ideas and creating innovative products.
You can now use Amazon Elastic Kubernetes Service (EKS) to run Kubernetes pods on AWS Fargate, the serverless compute engine built for containers on AWS. This makes it easier than ever to build and run your Kubernetes applications in the AWS cloud. In this session we will present the main features of the service and show how to deploy your application in a few steps.
Twenty years ago, Amazon went through a radical transformation aimed at increasing its pace of innovation. Over this period we learned how changing our approach to application development allowed us to greatly increase agility and release velocity and, ultimately, to build more reliable and scalable applications. In this session we will explain how we define modern applications and how building modern apps affects not only application architecture, but also organizational structure, development release pipelines, and even the operating model. We will also describe common approaches to modernization, including the approach used by Amazon.com itself.
How to Spend Up to 90% Less with Containers and Spot Instances - Amazon Web Services
The use of containers keeps growing.
When properly designed, container-based applications are very often stateless and flexible.
AWS ECS, EKS, and Kubernetes on EC2 can take advantage of Spot Instances, leading to average savings of 70% compared to On-Demand Instances. In this session we will explore the characteristics of Spot Instances and how they can easily be used on AWS. We will also learn how Spreaker uses Spot Instances to run applications of various kinds, in production, at a fraction of the on-demand cost!
In recent months, many customers have been asking us how to monetize Open APIs, simplify Fintech integrations, and accelerate adoption of the various Open Banking business models. AWS and FinConecta would therefore like to invite you to the Open Finance marketplace presentation on October 20th.
Event Agenda:
Open banking so far (short recap)
• PSD2, OB UK, OB Australia, OB LATAM, OB Israel
Intro to Open Finance marketplace
• Scope
• Features
• Tech overview and Demo
The role of the Cloud
The Future of APIs
• Complying with regulation
• Monetizing data / APIs
• Business models
• Time to market
One platform for all: a Strategic approach
Q&A
Make your startup's offering stand out in the market with Machine Learning services — Amazon Web Services
To create value and build a differentiated, recognizable offering, successful startups know how to combine established technologies with innovative, purpose-built components.
AWS provides ready-to-use services and, at the same time, lets you customize and build the differentiating elements of your own offering.
Focusing on Machine Learning technologies, we will see how to choose among the AI services offered by AWS and, with the help of a demo, how to build custom Machine Learning models using SageMaker Studio.
OpsWorks Configuration Management: automate the management and deployment of your EC2 instances — Amazon Web Services
With the traditional approach to IT, implementing DevOps practices was difficult for many years: they often involved manual activities that occasionally caused application downtime and interrupted users' work. With the advent of the cloud, DevOps techniques are now within everyone's reach at low cost for any kind of workload, providing greater system reliability and significant improvements in business continuity.
AWS offers AWS OpsWorks as a Configuration Management tool that aims to automate and simplify the management and deployment of EC2 instances using Chef and Puppet.
Learn how to use AWS OpsWorks to guarantee the reliability of your applications running on EC2 instances.
Microsoft Active Directory on AWS to support your Windows workloads — Amazon Web Services
Want to know your options for running Microsoft Active Directory on AWS? When moving Microsoft workloads to AWS, it is important to consider how to deploy Microsoft Active Directory to support Group Policy management, authentication, and authorization. In this session we will discuss the options for deploying Microsoft Active Directory on AWS, including AWS Directory Service for Microsoft Active Directory and deploying Active Directory on Windows on Amazon Elastic Compute Cloud (Amazon EC2). We cover topics such as integrating your on-premises Microsoft Active Directory environment into the cloud and using SaaS applications, such as Office 365, with AWS Single Sign-On.
From facial recognition to detecting fraud or manufacturing defects, image and video analysis powered by artificial-intelligence techniques is evolving and maturing at a rapid pace. In this webinar we will explore what AWS services make possible when applying state-of-the-art computer vision techniques to real-world scenarios.
Amazon Web Services and VMware are hosting a free virtual event on Wednesday, October 14th from 12:00 to 13:00 dedicated to VMware Cloud™ on AWS, the on-demand service that lets you run applications in cloud environments based on VMware vSphere® and access a broad range of AWS services, fully exploiting the potential of the AWS cloud while protecting existing VMware investments.
Build your first serverless ledger-based app with QLDB and NodeJS — Amazon Web Services
Many companies today build applications with ledger-like functionality, for example to verify the history of credits and debits in banking transactions, or to track the supply-chain flow of their products.
At the core of these solutions are ledger databases, which provide a transparent, immutable, and cryptographically verifiable transaction log, but which are complex and costly tools to manage.
Amazon QLDB removes the need to build complex custom systems by providing a fully managed, serverless ledger database.
In this session we will see how to build a complete serverless application that uses QLDB's capabilities.
With the rise of microservice architectures and rich mobile and web applications, APIs are more important than ever for delivering a great experience to end users. In this session we will learn how to tackle modern API design challenges with GraphQL, an open-source API query language used by Facebook, Amazon, and others, and how to use AWS AppSync, a managed serverless GraphQL service on AWS. We will dig into several scenarios, seeing how AppSync can help solve these use cases by building modern APIs with real-time and offline data-update capabilities.
We will also learn how Sky Italia uses AWS AppSync to deliver real-time sports updates to users of its web portal.
Oracle databases and VMware Cloud™ on AWS: myths debunked — Amazon Web Services
Many organizations reap the benefits of the cloud by migrating their Oracle workloads, securing significant gains in agility and cost efficiency.
Migrating these workloads can introduce complexity during application modernization and refactoring, along with performance risks that can arise when moving applications out of on-premises data centers.
In these slides, AWS and VMware experts present simple, practical tips to ease and simplify the migration of Oracle workloads and accelerate the transformation to the cloud; they also dive into the architecture and show how to fully exploit the potential of VMware Cloud™ on AWS.
Amazon Elastic Container Service (Amazon ECS) is a highly scalable container management service that simplifies managing Docker containers through an orchestration layer controlling deployment and lifecycle. In this session we will present the main features of the service, reference architectures for different workloads, and the few simple steps needed to quickly migrate one or more of your containers.
2. What to Expect from the Session
• Common patterns for clickstream analytics
• Tips on using Amazon Kinesis and Amazon EMR for clickstream processing
• Hearst’s big data journey in building the Hearst analytics stack for clickstream
• Lessons learned
• Q&A
3. Clickstream Analytics = Business Value

Verticals / Use Cases | Accelerated Ingest-Transform-Load to final destination | Continual Metrics / KPI Extraction | Actionable Insights
Ad Tech / Marketing Analytics | Advertising data aggregation | Advertising metrics like coverage, yield, conversion, scoring webpages | User activity engagement analytics, optimized bid/buy engines
Consumer Online / Gaming | Online customer engagement data aggregation | Consumer/app engagement metrics like page views, CTR | Customer clickstream analytics, recommendation engines
Financial Services | Digital assets, improve customer experience on bank website | Financial market data metrics | Fraud monitoring, value-at-risk assessment, auditing of market order data
IoT / Sensor Data | Fitness device, vehicle sensor, telemetry data ingestion | Wearable sensor operational metrics and dashboards | Devices / sensor operational intelligence
4. DataXu Records

APACHE ACCESS LOG
192.168.198.92 - - [22/Dec/2013:23:08:37 -0400] "GET / HTTP/1.1" 200 6394 www.yahoo.com "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1...)" "-"
192.168.198.92 - - [22/Dec/2013:23:08:38 -0400] "GET /images/logo.gif HTTP/1.1" 200 807 www.yahoo.com "http://www.some.com/" "Mozilla/4.0 (compatible; MSIE 6...)" "-"
192.168.72.177 - - [22/Dec/2013:23:32:14 -0400] "GET ...

JSON
{"cId":"10049","cdid":"5961","campID":"8","loc":"b","ip_address":"174.56.106.10","icctm_ht_athr":"","icctm_ht_aid":"","icctm_ht_attl":"Family Circus","icctm_ht_dtpub":"2011-04-05","icctm_ht_stnm":"SEATTLE POST-INTELLIGENCER","icctm_ht_cnocl":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","ts":"1422839422426","url":"http://www.seattlepi.com/comics-and-games/fun/Family_Circus","hash":"d98ace5874334232f6db3e1c0f8be3ab","load":"5.096","ref":"http://www.seattlepi.com/comics-and-games","bu":"HNP","brand":"SEATTLE POST-INTELLIGENCER","ref_type":"SAMESITE","ref_subtype":"SAMESITE","ua":"desktop:chrome"}

Clickstream Record
• Number of fields is not fixed
• Tag names change
• Multiple pages/sites
• Format can be defined as we store the data: AVRO, CSV, TSV, JSON
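Because the number of fields is not fixed and tag names change, downstream consumers should read these JSON records defensively rather than assuming a schema. A minimal Python sketch (the `get_field` helper is illustrative, not part of the Hearst code; the field names come from the sample record above):

```python
import json

def get_field(record, name, default=""):
    """Read a field from a parsed clickstream record, tolerating missing tags."""
    return record.get(name, default)

sample = json.loads(
    '{"cId": "10049", "load": "5.096", '
    '"url": "http://www.seattlepi.com/comics-and-games/fun/Family_Circus"}'
)

page_url = get_field(sample, "url")                # present in this record
load_secs = float(get_field(sample, "load", "0"))  # numeric fields arrive as strings
author = get_field(sample, "icctm_ht_athr")        # tag absent here -> default ""
```

Reading with defaults like this is what lets new tags be added upstream without breaking existing consumers.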
6. Clickstream Analytics – Common Patterns
• Flume → HDFS → Hive: batch, high latency on retrieve
• Flume → HDFS → Hive & Pig → SQL: batch, low latency on retrieve
• Flume / Sqoop → HDFS → Impala, SparkSQL, Presto, other: more options, batch with lower latency on retrieve
8. It’s All About the Pace, About the Pace…
Big data
• Hourly server logs: were your systems misbehaving 1 hr ago?
• Weekly / monthly bill: what you spent this billing cycle
• Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time
• Daily fraud reports: was there fraud yesterday?
Real-time big data
• Amazon CloudWatch metrics: what went wrong now
• Real-time spending alerts/caps: prevent overspending now
• Real-time analysis: what to offer the current customer now
• Real-time detection: block fraudulent use now
9. Clickstream Storage and Processing with Amazon Kinesis
An Amazon Kinesis stream (Shard 1, Shard 2, … Shard N, replicated across Availability Zones) sits behind an AWS endpoint and fans out to multiple applications:
• App 1: aggregate and ingest data to S3 (data lake)
• App 2: aggregate and ingest data to Amazon Redshift
• App 3: ETL/ELT and machine learning on EMR, with state in DynamoDB
• App N: live dashboard
10. Amazon EMR
• Managed, elastic Hadoop (1.x & 2.x) cluster
• Integrates with Amazon S3, Amazon DynamoDB, and Amazon Redshift
• Installs Storm, Spark, Hive, Pig, Impala, and end-user tools automatically
• Support for Spot Instances
• Integrated HBase NoSQL database
Amazon EMR with Apache Spark: on top of Apache Spark sit Spark SQL, Spark Streaming, MLlib, and GraphX.
15. Resize Nodes with Spot Instances
• 50% less run-time (14 → 7)
• 25% less cost (140 → 105)
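The slide's numbers are consistent with adding Spot task nodes that double cluster throughput at roughly half the On-Demand rate. A worked sketch of that arithmetic (the rates and the doubling assumption are illustrative; only the 14 → 7 and 140 → 105 figures come from the slide):

```python
baseline_hours = 14          # run-time with core nodes only
baseline_cost = 140.0        # total cost at the On-Demand rate
on_demand_rate = baseline_cost / baseline_hours   # 10.0 per cluster-hour

# Assume Spot task nodes match core capacity at ~50% of the On-Demand price,
# halving run-time while adding a cheaper second line item.
spot_rate = on_demand_rate / 2
new_hours = baseline_hours / 2
new_cost = new_hours * (on_demand_rate + spot_rate)

runtime_saving = 1 - new_hours / baseline_hours   # 50% less run-time
cost_saving = 1 - new_cost / baseline_cost        # 25% less cost
```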
16. Amazon EMR and Amazon Kinesis for Batch and Interactive Processing
• Streaming log analysis
• Interactive ETL
Data flows from Amazon Kinesis into Amazon EMR, and on to Amazon Redshift (for BI) and Amazon S3; data scientists use a separate Amazon EMR cluster running on Spot Instances.
18. Amazon Kinesis Applications – Tips
• Amazon Software License linking: add the ASL dependency to your SBT/Maven project (artifactId = spark-streaming-kinesis-asl_2.10)
• Shards: include head-room for catching up with data in the stream
• Tracking Amazon Kinesis application state (DynamoDB)
  • Kinesis application : DynamoDB table (1:1)
  • Created automatically
  • Make sure the application name doesn’t conflict with existing DynamoDB tables
  • Adjust DynamoDB provisioned throughput if necessary (default 10 reads per second & 10 writes per second)
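The 1:1 application-to-table mapping exists because the table stores, per shard, the last checkpointed sequence number, which a restarted worker reads in order to resume. A toy in-memory stand-in for that DynamoDB state (class and method names are invented for illustration):

```python
class KinesisAppState:
    """Simulates the per-application DynamoDB table the Kinesis client library keeps."""

    def __init__(self, app_name):
        self.app_name = app_name      # must not clash with an existing table name
        self.checkpoints = {}         # shard id -> last processed sequence number

    def checkpoint(self, shard_id, sequence_number):
        self.checkpoints[shard_id] = sequence_number

    def resume_from(self, shard_id):
        # None means no checkpoint yet: start from the initial stream position
        return self.checkpoints.get(shard_id)

state = KinesisAppState("hearst-clickstream-app")
state.checkpoint("shardId-000000000000", "49540000000000000000001")
```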
19. Spark on Amazon EMR – Tips
• Amazon EMR supports Spark as an application after version 3.8.0 (no need to run bootstrap actions)
• Use Spot Instances for time & cost savings, especially when using Spark
• Run in YARN cluster mode (--master yarn-cluster) for production jobs: the Spark driver runs in the application master (high availability)
• Data serialization: use Kryo if possible to boost performance (spark.serializer=org.apache.spark.serializer.KryoSerializer)
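Put together, the last two tips translate into a launch command along these lines (a sketch for Spark 1.x-era EMR; the jar name is a placeholder):

```
spark-submit --master yarn-cluster \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  clickstream-etl.jar
```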
20. The Life of a Click at Hearst
• Hearst’s journey with their big data analytics platform on AWS
• Demo
• Clickstream analysis patterns
• Lessons learned
23. BUSINESS MEDIA: operates more than 20 business-to-business properties with significant holdings in the automotive, electronic, medical and finance industries
MAGAZINES: publishes 20 U.S. titles and close to 300 international editions
BROADCASTING: comprises 31 television and two radio stations
NEWSPAPERS: owns 15 daily and 34 weekly newspapers
Hearst includes over 200 businesses in over 100 countries around the world
24. Data Services at Hearst – Our Mission
• Ensure that Hearst leverages its combined data assets
• Unify Hearst’s data streams
  • Development of a Big Data Analytics Platform using AWS services
• Promote enterprise-wide product development
  • Example: product initiative led by all of Hearst’s editors, Buzzing@Hearst
26. Business Value of Buzzing
• Instant feedback on articles from our audiences
• Incremental re-syndication of popular articles across properties (e.g. trending newspaper articles can be adopted by magazines)
• Inform the editors to write articles that are more relevant to our audiences, and learn which channels our audiences use to read our articles
• Ultimately, drive incremental value
  • 25% more page views and 15% more visitors, which lead to incremental revenue
27. Engineering Requirements of Buzzing…
• Throughput goal: transport data from all 250+ Hearst properties worldwide
• Latency goal: click-to-tool in under 5 minutes
• Agile: easily add new data fields into the clickstream
• Unique metrics requirements defined by the Data Science team (e.g., standard deviations, regressions, etc.)
• Data reporting windows ranging from 1 hour to 1 week
• Front end developed “from scratch”, so data exposed through the API must support the development team’s unique requirements
Most importantly, operation of existing sites cannot be affected!
28. What we had to work with… a “static” clickstream collection process on many Hearst sites
Users of Hearst properties generate a clickstream that lands, once per day, in a Netezza data warehouse in the corporate data center: ~30 GB per day containing basic web log data (e.g., referrer, url, user agent, cookie, etc.), used for ad hoc SQL-based reporting and analytics.
…now how do we get there?
29. …and we own Hearst’s tag management system
The JavaScript on our web pages collects the clickstream from users of Hearst properties. This not only gave us access to the clickstream but also to the JavaScript code that lives on our websites.
30. Phase 1 – Ingest Clickstream Data Using AWS
• Use the tag manager to easily deploy JavaScript to all sites: JavaScript on the sites calls an exposed endpoint and passes in query parameters
• Elastic Beanstalk with Node.JS (the proxy app) exposes an HTTP endpoint which asynchronously takes the data and feeds it to Amazon Kinesis as “raw JSON”
• Kinesis Client Libraries and Kinesis Connectors persist the raw data to Amazon S3 for durability
31. Node.JS – Push clickstream to Amazon Kinesis

// Modules and setup implied by the original snippet (added for completeness);
// guid(), streamName, imageGIF and app.set('port', ...) are defined elsewhere
var express = require('express');
var async = require('async');
var url = require('url');
var http = require('http');
var AWS = require('aws-sdk');
var app = express();
var kinesis = new AWS.Kinesis();

function pushToKinesis(data) {
  var params = {
    Data: data, /* required */
    PartitionKey: guid(),
    StreamName: streamName /* required */
  };
  kinesis.putRecord(params, function(err, data) {
    if (err) {
      console.log(err, err.stack); // an error occurred
    }
  });
}

app.get('/hearstkin.gif', function(req, res){
  async.series([function(callback){
    var queryData = url.parse(req.url, true).query;
    queryData.proxyts = new Date().getTime().toString();
    pushToKinesis(JSON.stringify(queryData));
    callback(null);
  }]);
  res.writeHead(200, {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'});
  res.end(imageGIF, 'binary');
});

http.createServer(app).listen(app.get('port'), function(){
  console.log('Express server listening on port ' + app.get('port'));
});

• Asynchronous calls: ensures no user experience interruption
• Server timestamp (proxyts): to create a unified timestamp. Amazon Kinesis now offers this out of the box!
• JSON format: this helps us downstream
• Kinesis partition key: guid() is a good partition key to ensure even distribution across the shards
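The reason guid() works well as a partition key: Kinesis assigns a record to a shard by MD5-hashing its partition key, so random keys land uniformly across shards. A sketch of that behavior (the modulo mapping simplifies the real hash-key ranges):

```python
import hashlib
import uuid

NUM_SHARDS = 4

def shard_for(partition_key):
    # Kinesis MD5-hashes the partition key; modulo stands in for its hash-key ranges
    digest = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return digest % NUM_SHARDS

counts = [0] * NUM_SHARDS
for _ in range(10000):
    counts[shard_for(str(uuid.uuid4()))] += 1
# counts ends up close to [2500, 2500, 2500, 2500]
```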
32. Ingest Monitoring – AWS
• Amazon Kinesis monitoring
• AWS Elastic Beanstalk monitoring
• Auto Scaling triggered by network in > 20 MB, then scale up to 40 instances
33. Phase 1 – Summary
• Use JSON formatting for payloads so more fields can be easily added without impacting downstream processing
• The HTTP call requires minimal code introduced to the actual site implementations
• Flexible to meet rollout and growing demand
  • Elastic Beanstalk can be scaled
  • The Amazon Kinesis stream can be re-sharded
  • Amazon S3 provides high-durability storage for raw data
• Once a reliable, scalable onboarding platform is in place, we can now focus on ETL!
34. Phase 2a – Data Processing, First Version (EMR)
ETL on Amazon EMR turns the “raw JSON” raw data into clean aggregate data.
• Amazon EMR was chosen initially for processing due to the ease of Amazon EMR cluster creation … and Pig because we knew how to code in PigLatin
• 50+ UDFs were written using Python … also because we knew Python
35. Processing Clickstream Data with Pig
Unfortunately, Pig was not performing well – 15 min latency

set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

REGISTER '/home/hadoop/PROD/parser.py' USING jython AS pyudf;
REGISTER '/home/hadoop/PROD/refclean.py' USING jython AS refs;

AA0 = LOAD 's3://BUCKET_NAME/rawjson/datadata.tsv.gz' USING TextLoader AS (line:chararray);
A0 = FILTER AA0 BY (pyudf.get_obj(line,'url') MATCHES '.*(a\\d+|g\\d+).*');
A1 = FOREACH A0 GENERATE
  pyudf.urlclean(pyudf.get_obj(line,'url')) AS url:chararray,
  pyudf.get_obj(line,'hash') AS hash:chararray,
  pyudf.get_obj(line,'icxid') AS icxid:chararray,
  pyudf.pubclean(pyudf.get_obj(line,'icctm_ht_dtpub')) AS pubdt:chararray,
  pyudf.get_obj(line,'icctm_ht_cnocl') AS cnocl:chararray,
  pyudf.get_obj(line,'icctm_ht_athr') AS author:chararray,
  pyudf.get_obj(line,'icctm_ht_attl') AS title:chararray,
  pyudf.get_obj(line,'icctm_ht_aid') AS cms_id:chararray,
  pyudf.num1(pyudf.get_obj(line,'mxdpth')) AS mxdpth:double,
  pyudf.num2(pyudf.get_obj(line,'load')) AS topsecs:double,
  refs.classy(pyudf.get_obj(line,'url'),1) AS bu:chararray,
  pyudf.get_obj(line,'ip_address') AS ip:chararray,
  pyudf.get_obj(line,'img') AS img:chararray;

• Gzip your output
• Regex in Pig!
• Python imports limited to what is allowed by Jython
36. Phase 2b – Data Processing (Spark Streaming)
Users of Hearst properties → clickstream → Node.JS proxy app → Amazon Kinesis → ETL on EMR → clean aggregate data
• Welcome Apache Spark: one framework for batch and realtime
• Benefits: using the same code for batch and real-time ETL
• Use Spot Instances: cost savings
• Drawbacks: Scala!
37. Using SQL with Scala (SparkSQL)
Since we knew SQL, we wrote Scala with embedded SQL queries.

Configuration:
endpointUrl = kinesis.us-west-2.amazonaws.com
streamName = hearststream
outputLoc.json.streaming = s3://hearstkinesisdata/processedsparkjson
window.length = 300
sliding.interval = 300
outputLimit = 5000000
query1Table = hearst1
query1 = SELECT
  simplestartq(proxyts, 5) as startq,
  urlclean(url) as url,
  hash,
  icxid,
  pubclean(icctm_ht_dtpub) as pubdt,
  classy(url,1) as bu,
  ip_address as ip,
  artcheck(classy(url,1),url) as artcheck,
  ref_type(ref,url) as ref_type,
  img,
  wc,
  contentSource
FROM hearst1

Scala driver:
val jsonRDD = sqlContext.jsonRDD(rdd1)
jsonRDD.registerTempTable(query1Table.trim)
val query1Result = sqlContext.sql(query1) //.limit(outputLimit.toInt)
query1Result.registerTempTable(query2Table.trim)
val query2Result = sqlContext.sql(query2)
query2Result.registerTempTable(query3Table.trim)
val query3Result = sqlContext.sql(query3).limit(outputLimit.toInt)
val outPartitionFolder = UDFUtils.output60WithRolling(slidingInterval.toInt)
query3Result.toJSON.saveAsTextFile("%s/%s".format(outputLocJSON, outPartitionFolder), classOf[org.apache.hadoop.io.compress.GzipCodec])
logger.info("New JSON file written to " + outputLoc + "/" + outPartitionFolder)
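With window.length and sliding.interval both set to 300 seconds, the windows tumble: each click belongs to exactly one non-overlapping five-minute batch. The bucketing can be sketched as follows (the timestamps are invented epoch seconds):

```python
from collections import Counter

WINDOW_SECONDS = 300  # window.length == sliding.interval -> tumbling windows

def window_start(epoch_seconds):
    """Map a click timestamp to the start of its five-minute window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

clicks = [1422839422, 1422839430, 1422839700, 1422839999]
per_window = Counter(window_start(ts) for ts in clicks)
# two adjacent windows, two clicks each
```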
38. Python UDF versus Scala

Python:
def artcheck(bu, url):
    try:
        if url and bu:
            cleanurl = url[0:url.find("?")].strip('/')
            tailurl = url[findnth(url, '/', 3)+1:url.find("?")].strip('/')
            revurl = cleanurl[::-1]
            root = revurl[0:revurl.find('/')][::-1]
            if (bu=='HMI' or bu=='HMG') and re.compile(r'a\d+|g\d+').search(tailurl) != None: return 'T'
            elif bu=='HTV' and root.isdigit()==True and re.compile('/search/').search(cleanurl)==None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix').search(url)!=None and re.compile(r'\S*[0-9]{4,4}/[0-9]{2,2}/[0-9]{2,2}\S*').search(tailurl)!=None: return 'T'
            elif bu=='HNP' and re.compile('businessinsider').search(url)!=None and re.compile(r'\S*[0-9]{4,4}-[0-9]{2,2}').search(root)!=None: return 'T'
            elif bu=='HNP' and re.compile('blog|fuelfix|businessinsider').search(url)==None and re.compile(r'\.php').search(url)!=None: return 'T'
            else: return 'F'
        else: return 'F'
    except:
        return 'F'

Scala:
def artcheck(bu: String, url: String) = {
  try {
    val cleanurl = UDFUtils.utilurlclean(url.trim).stripSuffix("/")
    val pathClean = UDFUtils.pathURI(cleanurl)
    val lastContext = pathClean.split("/").last
    var resp = "F"
    if (("HMI"==bu || "HMG"==bu) && Pattern.compile("/a\\d+|/g\\d+").matcher(pathClean).find()) resp = "T"
    else if ("HTV"==bu && StringUtils.isNumeric(lastContext) && !cleanurl.contains("/search/")) resp = "T"
    else if ("HNP"==bu && Pattern.compile("blog|fuelfix").matcher(url).find() && Pattern.compile("\\d{4}/\\d{2}/\\d{2}").matcher(pathClean).find()) resp = "T"
    else if ("HNP"==bu && Pattern.compile("businessinsider").matcher(url).find() && Pattern.compile("\\d{4}-\\d{2}").matcher(lastContext).find()) resp = "T"
    else if ("HNP"==bu && !Pattern.compile("blog|fuelfix|businessinsider").matcher(url).find() && Pattern.compile("\\.php").matcher(url).find()) resp = "T"
    resp
  } catch {
    case e: Exception => "F"
  }
}

Don’t be intimidated by Scala … if you know Python, the syntax can be similar:
re.compile(r'a\d+|g\d+')  ↔  Pattern.compile("a\\d+|g\\d+")
try: / except:  ↔  try {} catch {}
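The date-matching rules in the UDF above look for a yyyy/mm/dd path segment (HNP blog/fuelfix permalinks) and for an a/g-prefixed numeric article id (HMI/HMG). A quick check of how those patterns classify some invented sample paths:

```python
import re

date_path = re.compile(r'\d{4}/\d{2}/\d{2}')   # HNP blog/fuelfix permalink rule
article_id = re.compile(r'a\d+|g\d+')          # assumed reading of the HMI/HMG rule

blog_hit = bool(date_path.search('blog/fuelfix/2013/12/22/some-post'))
blog_miss = bool(date_path.search('blog/fuelfix/about'))
mag_hit = bool(article_id.search('/fashion/a12345/spring-trends/'))
```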
39. Phase 3a – Data Science!
Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science on EC2 → API-ready data
• We decided to perform our data science using SAS on Amazon EC2 initially because of the ability to perform both data manipulation and easily run complex data-science techniques (e.g., regressions)
• Great for exploration and initial development
• Performing data science using this method took 3-5 minutes to complete
40. SAS Code Example

data _null_;
  call system("aws s3 cp s3://BUCKET_NAME/file.gz /home/ec2-user/LOGFILES/file.gz");
run;

FILENAME IN pipe "gzip -dc /home/ec2-user/LOGFILES/file.gz" lrecl=32767;

data temp1;
  FORMAT startq DATETIME19.;
  infile IN delimiter='09'x MISSOVER DSD lrecl=32767 firstobs=1;
  input
    startq :YMDDTTM.
    url :$1000.
    pageviews :best32.
    visits :best32.
    author :$100.
    cms_id :$100.
    img :$1000.
    title :$1000.;
run;

proc sql;
  CREATE TABLE metrics AS
  SELECT
    url FORMAT=$1000.,
    SUM(pageviews) as pageviews,
    SUM(visits) as visits,
    SUM(fvisits) as fvisits,
    SUM(evisits) as evisits,
    MIN(ttct) as rec,
    COUNT(distinct startq) as frq,
    AVG(visits) as avg_visits_pp,
    SUM(visits1) as visits_soc,
    SUM(visits2) as visits_dir,
    SUM(visits3) as visits_int,
    SUM(visits4) as visits_sea,
    SUM(visits5) as visits_web,
    SUM(visits6) as visits_nws,
    SUM(visits7) as visits_pd,
    SUM(visits8) as visits_soc_fb,
    SUM(visits9) as visits_soc_tw,
    SUM(visits10) as visits_soc_pi,
    SUM(visits11) as visits_soc_re,
    SUM(visits12) as visits_soc_yt,
    SUM(visits13) as visits_soc_su,
    SUM(visits14) as visits_soc_gp,
    SUM(visits15) as visits_soc_li,
    SUM(visits16) as visits_soc_tb,
    SUM(visits17) as visits_soc_ot,
    CASE WHEN (SUM(v1) - SUM(v3)) > 20 THEN (SUM(v1) - SUM(v3)) / 2 ELSE 0 END as trending
  FROM temp1
  GROUP BY 1;

• Use pipe to read in S3 data and keep it compressed
• Use PROC SQL when possible for easier translation to Amazon Redshift for production later on
41. Phase 3b – Split Data Science into Development and Production
Amazon Kinesis → ETL on EMR → clean aggregate data → Data Science “Production” on Amazon Redshift → API-ready data. Data Science “Development” on EC2 produces statistical models, run once per day; S3 stores the data-science models and the aggregate data, and the models are applied using Amazon Redshift.
• Once the data-science models were established, we split the modeling and production
• Production was moved to Amazon Redshift, which provided a much faster ability to read Amazon S3 data and process it
• Data-science processing time went down to 100 seconds!
42. select
clean_url as url,
trim(substring(max(proxyts||domain) from 20 for 1000)) as domain,
trim(substring(max(proxyts||clean_cnocl) from 20 for 1000)) as cnocl,
trim(substring(max(proxyts||img) from 20 for 1000)) as img,
trim(substring(max(proxyts||title) from 20 for 1000)) as title,
trim(substring(max(proxyts||section) from 20 for 1000)) as section,
approximate count(distinct ic_fpc) as visits,
count(1) as hits
from kinesis_hits
where bu='HMG' and (article_id is not null or author is not null or title is not null)
group by 1;
Amazon Redshift Code Example
Cool trick to find the most recent value of a character field in one pass through the data
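A quick way to see why the trick works: a fixed-width timestamp prefix makes a lexical MAX() select the latest row, and stripping the first 19 characters recovers the field value. A small sqlite3 sketch (table and column names only loosely mirror the Redshift example; data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (url TEXT, proxyts TEXT, title TEXT)")
conn.executemany("INSERT INTO hits VALUES (?,?,?)", [
    ("a.html", "2015-10-01 00:00:01", "Old Title"),
    ("a.html", "2015-10-02 12:30:00", "New Title"),
])
# MAX() over 'timestamp || value' sorts lexically by the 19-character
# timestamp prefix; SUBSTR from position 20 then drops the prefix,
# leaving the most recent value of the character field.
row = conn.execute("""
    SELECT url, TRIM(SUBSTR(MAX(proxyts || title), 20)) AS title
    FROM hits
    GROUP BY 1
""").fetchone()
print(row)  # ('a.html', 'New Title')
```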
43. Phase 4a- Elasticsearch Integration
[Diagram: ETL on EMR and Data Science on Amazon Redshift write Models, Agg Data, and API-Ready Data to S3 Storage; Amazon EMR pushes to the Buzzing API]
Since we had the Amazon EMR cluster running already, we used a handy Pig jar that made it easy to push data to Elasticsearch.
44. REGISTER /home/hadoop/pig/lib/piggybank.jar;
REGISTER /home/hadoop/PROD/elasticsearch-hadoop-2.0.2.jar;
-- Define the Elasticsearch storage handler (DEV cluster settings)
DEFINE EsStorageDEV org.elasticsearch.hadoop.pig.EsStorage
('es.nodes = es-dev.hearst.io',
'es.port = 9200',
'es.http.timeout = 5m',
'es.index.auto.create = true');
-- Load the tab-separated section metrics from S3
SECTIONS = load 's3://hearstkinesisdata/ss.tsv' USING PigStorage('\t') as
(sectionid:chararray, cnt:long, visits:long, sectionname:chararray);
STORE SECTIONS INTO 'content-sections-sync/content-sections-sync' USING EsStorageDEV;
Pig Code – Push to ES Example
Use handy Pig jar to push data to Elasticsearch
The “Amazon EMR overhead” required to read small files added 2 min to latency
45. Phase 4b- Elasticsearch Integration Sped Up
[Diagram: ETL on EMR and Data Science on Amazon Redshift write Models, Agg Data, and API-Ready Data to S3 Storage, which feeds the Buzzing API]
Since the Amazon Redshift code was run in a Python wrapper, the solution was to push data directly into Elasticsearch.
46. # Convert the big JSON file into bulk-insert compatible (row-by-row) format
$bin/convert_json.php big.json create rowbyrow.json
# Get the mapping file
${aws} s3 cp s3://hearst/es_mapping es_mapping
# Create the new ES index (note the @ prefix so curl sends the file contents)
curl -XPUT http://es.hearst.io/content-web180-sync --data-binary @es_mapping -s
# Perform the bulk API call
curl -XPOST http://es.hearst.io/content-web180-sync/_bulk --data-binary @rowbyrow.json -s
Script to Push to Elasticsearch Directly
Converting one big input JSON file to a row-by-row JSON is a key step for making the data batch compatible.
Use a mapping file to manage the formatting in your index… very important for dates and numeric values that look like strings.
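The convert step could equally be sketched in Python: split one big JSON array into the newline-delimited action/document pairs the Elasticsearch _bulk API expects. The index name and documents below are illustrative, and older ES versions also require a _type in the action line:

```python
import json

def to_bulk(docs, index="content-web180-sync"):
    """Emit one 'index' action line plus one document line per record."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # a _bulk payload must end with a newline

# Invented stand-in for the big input JSON file
big = [{"url": "a.html", "visits": 12}, {"url": "b.html", "visits": 3}]
payload = to_bulk(big)
print(payload)
```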
47. Final Data Pipeline
[Diagram: Clickstream from a Node.JS app proxying users to Hearst properties → Amazon Kinesis → S3 Storage → ETL on EMR → Amazon Redshift (Models, Agg Data) → Data Science Application → API-Ready Data → Buzzing API]
LATENCY / THROUGHPUT by stage: milliseconds at 100GB/day; 5 seconds at 1G/day; 30 seconds at 5GB/day; 100 seconds at 1G/day
49. Pipeline versions (Transport → Storage → ETL → Storage → Analysis → Storage → Exposure, with latency):
V1: Amazon Kinesis → S3 → EMR-Pig → S3 → EC2-SAS → S3 → EMR to ElasticSearch (1 hour)
Today: Amazon Kinesis → Spark-Scala → S3 → Amazon Redshift → ElasticSearch (<5 min)
Tomorrow: Amazon Kinesis → PySpark + SparkR → ElasticSearch (<2 min)
Lessons learned ("No Duh's"): removing "stoppage" points, speeding up processing, and combining processes all improve latency.
50. Data Science Tool Box
[Diagram: the final pipeline from slide 47, with the Data Science Toolbox spanning the Data, Models, and Amazon Redshift layers]
• IPython Notebook
• On Spark and Amazon Redshift
• Code sharing (and insights)
• User-friendly development environment for data scientists
• Auto-convert .ipynb to .py
52. Next Steps
• Amazon EMR 4.1.0 with Spark 1.5 released, so we can do more with PySpark; look at Apache Zeppelin on Amazon EMR
• Amazon Kinesis just released a new feature to retain data for up to 7 days; we could do more ETL "in the stream"
• Amazon Kinesis Firehose and Lambda – zero touch (no Amazon EC2 maintenance)
• More complex data science that requires…
• Amazon Redshift UDFs
• Python shell that calls Amazon Redshift but also allows for complex statistical methods (e.g., using R or machine learning)
53. Conclusion
• Clickstreams are the new "data currency" of business
• AWS provides great technology to process data
• High speed
• Lower costs – Using Spot…
• Very agile
• Do more with less: this can all be done with a team of 2 FTEs!
• 1 developer (well versed in AWS) + 1 data scientist
54. Call To Action
[Diagram: Ingest → Store → Process → Analyze, from Click to Insight over Time, using Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora), AWS Lambda, KCL Apps, Amazon EMR, and Amazon Redshift]
55. Call To Action
Use Amazon Kinesis, EMR and Amazon Redshift for Clickstream
Open source connectors:
- http://docs.aws.amazon.com/kinesis/latest/dev/developing-consumers-with-kcl.html
AWS Big Data blog:
- http://blogs.aws.amazon.com/bigdata/
AWS re:Invent Big Data booth
AWS Big Data Marketplace and Partner ecosystem
Hearst Booth – Hall C1156: Learn more about the interesting things we are doing with data!