How we at Plumbee collect and process data at scale, and how this data is used to send relevant mobile push notifications to our players to keep them engaged.
Presented as part of a Tech Talk: http://engineering.plumbee.com/blog/2014/11/07/tech-talk-push-notifications-big-data/
Bellevue Big Data meetup: Dive Deep into Spark Streaming (Santosh Sahoo)
A discussion of the code and architecture behind building a real-time streaming application using Spark and Kafka. The demo presents use cases and patterns from different streaming frameworks.
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S... (Databricks)
At Strava we have extensively leveraged Apache Spark to explore our data of over a billion activities from tens of millions of athletes. This talk will be a survey of the more unique and exciting applications: A Global Heatmap gives a ~2 meter resolution density map of one billion runs, rides, and other activities, consisting of three trillion GPS points from 17 billion miles of exercise data. The heatmap was rewritten from a non-scalable system into a highly scalable Spark job, enabling great gains in speed, cost, and quality. Locality-sensitive hashing for GPS traces was used to efficiently cluster 1 billion activities. Additional processes categorize and extract data from each cluster, such as names and statistics. Clustering gives an automated process to extract worldwide geographical patterns of athletes.
Applications include route discovery, recommendation systems, and detection of events and races. A coarse spatiotemporal index of all activity data is stored in Apache Cassandra. Spark Streaming jobs maintain this index and compute all space-time intersections (“flybys”) of activities in it. Intersecting activity pairs are then checked for spatiotemporal correlation; connected components in the graph of highly correlated pairs form “Group Activities”, creating a social graph of shared activities and workout partners. Data from several hundred thousand runners was used to build an improved model of the relationship between running difficulty and elevation gradient (Grade Adjusted Pace).
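As a toy illustration of the locality-sensitive hashing idea (not Strava's actual implementation), one can snap each GPS trace to coarse grid cells so that similar traces share most of their cell hashes, then group traces greedily by overlap; the coordinates and thresholds below are made up:

```python
from collections import defaultdict

def trace_signature(points, cell_deg=0.001):
    """Hash a GPS trace to a set of coarse grid cells (~100 m at this latitude)."""
    return {(round(lat / cell_deg), round(lon / cell_deg)) for lat, lon in points}

def cluster_traces(traces, min_overlap=0.5):
    """Greedy clustering: a trace joins the first cluster whose cell set
    overlaps its own signature by at least min_overlap."""
    clusters = []  # list of (cluster signature set, member ids)
    for trace_id, points in traces.items():
        sig = trace_signature(points)
        for cluster_sig, members in clusters:
            overlap = len(sig & cluster_sig) / min(len(sig), len(cluster_sig))
            if overlap >= min_overlap:
                members.append(trace_id)
                cluster_sig |= sig  # widen the cluster's footprint in place
                break
        else:
            clusters.append((sig, [trace_id]))
    return [members for _, members in clusters]

traces = {
    "run_a": [(51.5000, -0.1200), (51.5010, -0.1210), (51.5020, -0.1220)],
    "run_b": [(51.5001, -0.1201), (51.5011, -0.1211), (51.5021, -0.1221)],
    "ride_c": [(40.7000, -74.0000), (40.7010, -74.0010)],
}
print(cluster_traces(traces))  # run_a and run_b share cells; ride_c stands alone
```

At scale the pairwise comparison would be replaced by grouping on the cell hashes themselves, which is exactly the kind of shuffle-friendly operation Spark handles well.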
Sorry - How Bieber broke Google Cloud at Spotify (Neville Li)
Talk at Scala Up North Jul 21 2017
We will talk about Spotify's story with Scala big data, our journey to migrate our entire data infrastructure to Google Cloud, and how Justin Bieber contributed to breaking it. We'll talk about Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and the technology behind it, including macros, algebird, chill and shapeless. There'll also be a live coding demo.
The presentation explains the reasons we picked Kafka as Streaming Hub, and the use of Kafka Streams to avoid common anti-patterns, streamline the development experience, improve resilience, enhance performance, and enable experimentation. A step-by-step example will be presented to introduce the Kafka Streams DSL and show what happens under the hood of a stateful streaming application.
Arbitrary Stateful Aggregations using Structured Streaming in Apache Spark (Databricks)
In this talk, we will introduce some of the new available APIs around stateful aggregation in Structured Streaming, namely flatMapGroupsWithState. We will show how this API can be used to power many complex real-time workflows, including stream-to-stream joins, through live demos using Databricks and Apache Kafka.
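flatMapGroupsWithState itself is a Scala/Java API on grouped Datasets; purely as a conceptual sketch (plain Python with invented event data, not Spark code), the idea of carrying arbitrary user-defined state per key across micro-batches looks like this:

```python
def flat_map_groups_with_state(batches, update_fn):
    """Conceptual analogue of flatMapGroupsWithState: for each key, feed the
    new events plus the previous state to update_fn, which returns
    (outputs to emit, new state to keep)."""
    state = {}  # key -> arbitrary user-defined state
    for batch in batches:                 # each micro-batch: {key: [events]}
        for key, events in batch.items():
            outputs, state[key] = update_fn(key, events, state.get(key))
            for out in outputs:
                yield out

def running_total(key, events, prev):
    """Example update function: keep a running sum per key and
    emit an update only once it crosses a threshold."""
    total = (prev or 0) + sum(events)
    return ([(key, total)] if total >= 10 else [], total)

batches = [{"user1": [4, 3]}, {"user1": [5], "user2": [12]}]
print(list(flat_map_groups_with_state(batches, running_total)))
# [('user1', 12), ('user2', 12)]
```

The real API adds timeouts and watermark handling on top of this shape, which is what makes use cases like sessionization and stream-to-stream joins practical.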
These are the slides for the Productionizing your Streaming Jobs webinar on 5/26/2016.
Apache Spark Streaming is one of the most popular stream processing frameworks, enabling scalable, high-throughput, fault-tolerant processing of live data streams. In this talk, we will focus on the following aspects of Spark Streaming:
- Motivation and most common use cases for Spark Streaming
- Common design patterns that emerge from these use cases and tips to avoid common pitfalls while implementing these design patterns
- Performance Optimization Techniques
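One recurring design pattern behind these bullet points is windowed aggregation over micro-batches. As a framework-free illustration (plain Python rather than Spark Streaming code; the data and window size are made up), reduceByKeyAndWindow-style counting can be sketched as:

```python
from collections import Counter, deque

def windowed_counts(micro_batches, window=3, slide=1):
    """Sliding-window word counts over a stream of micro-batches,
    mimicking the behavior of Spark Streaming's reduceByKeyAndWindow."""
    buffer = deque(maxlen=window)   # keep only the last `window` micro-batches
    results = []
    for i, batch in enumerate(micro_batches, start=1):
        buffer.append(Counter(batch))
        if i % slide == 0:          # emit once per slide interval
            total = Counter()
            for counts in buffer:
                total += counts
            results.append(dict(total))
    return results

stream = [["spark", "kafka"], ["spark"], ["flink", "spark"]]
print(windowed_counts(stream))
```

Spark avoids recomputing the whole buffer each slide by also subtracting the counts of batches that fall out of the window, one of the optimizations such talks typically cover.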
Scio - Moving to Google Cloud, A Spotify Story (Neville Li)
Talk at Philly ETE Apr 28 2017
We will talk about Spotify’s story of migrating our big data infrastructure to Google Cloud. Over the past year or so we moved away from maintaining our own 2500+ node Hadoop cluster to managed services in the cloud. We replaced two key components in our data processing stack, Hive and Scalding, with BigQuery and Scio, and are able to iterate at a much faster speed. We will focus on the technical aspects of Scio, a Scala API for Apache Beam and Google Cloud Dataflow, and how it changed the way we process data.
Enterprises are increasingly demanding realtime analytics and insights to power use cases like personalization, monitoring and marketing. We will present Pulsar, a realtime streaming system used at eBay which can scale to millions of events per second with high availability and SQL-like language support, enabling realtime data enrichment, filtering and multi-dimensional metrics aggregation.
We will discuss how Pulsar integrates with a number of open source Apache technologies like Kafka, Hadoop and Kylin (Apache incubator) to achieve high scalability, availability and flexibility. We use Kafka to replay unprocessed events to avoid data loss and to stream realtime events into Hadoop, enabling reconciliation of data between realtime and batch. We use Kylin to provide multi-dimensional OLAP capabilities.
Distributed Stream Processing - Spark Summit East 2017 (Petr Zapletal)
The demand for stream processing is increasing rapidly. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources, pushing the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways: the platforms target varied and sometimes overlapping use cases, and use different vocabularies for similar concepts. This can lead to confusion, longer development time, or costly wrong decisions.
Streaming Auto-scaling in Google Cloud Dataflow (C4Media)
Video and slides synchronized; mp3 and slide download available at http://bit.ly/1Z2JXhs.
Manuel Fahndrich describes how they tackled one particular resource allocation aspect of Google Cloud Dataflow pipelines, namely, horizontal scaling of worker pools as a function of pipeline input rate. Managing the redistribution of key ranges across new pool sizes and the associated persistent data storage was particularly challenging. Filmed at qconlondon.com.
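The policy described, sizing the worker pool as a function of pipeline input rate, can be sketched as a simple control rule. The numbers and the backlog-drain heuristic below are illustrative only, not Dataflow's actual algorithm:

```python
import math

def target_workers(input_rate, per_worker_throughput, backlog=0,
                   drain_seconds=60, min_workers=1, max_workers=100):
    """Pick a pool size that keeps up with the input rate and also
    drains any accumulated backlog within drain_seconds."""
    required = input_rate + backlog / drain_seconds  # events/sec to sustain
    workers = math.ceil(required / per_worker_throughput)
    return max(min_workers, min(max_workers, workers))

# Steady state: 5000 events/s, each worker handles ~800 events/s.
print(target_workers(input_rate=5000, per_worker_throughput=800))                 # 7
# Same rate plus a 48k-event backlog to drain within a minute.
print(target_workers(input_rate=5000, per_worker_throughput=800, backlog=48000))  # 8
```

The hard part the talk focuses on is not this arithmetic but what happens after a resize: redistributing key ranges and their persistent state across the new pool.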
Manuel Fahndrich earned his Ph.D. in C.S. from UC Berkeley in 1999. He spent the next 15 years as a Research Scientist at Microsoft, working on static and dynamic verification tools for object-oriented programs and system software. After joining Google in 2014 he has been working on data-parallel infrastructure, in particular auto-scaling for batch and streaming pipelines.
Apache Spark for Library Developers with William Benton and Erik Erlandson (Databricks)
As a developer, data engineer, or data scientist, you’ve seen how Apache Spark is expressive enough to let you solve problems elegantly and efficient enough to let you scale out to handle more data. However, if you’re solving the same problems again and again, you probably want to capture and distribute your solutions so that you can focus on new problems and so other people can reuse and remix them: you want to develop a library that extends Spark.
You faced a learning curve when you first started using Spark, and you’ll face a different learning curve as you start to develop reusable abstractions atop Spark. In this talk, two experienced Spark library developers will give you the background and context you’ll need to turn your code into a library that you can share with the world. We’ll cover:
- issues to consider when developing parallel algorithms with Spark
- designing generic, robust functions that operate on data frames and datasets
- extending data frames with user-defined functions (UDFs) and user-defined aggregates (UDAFs)
- best practices around caching and broadcasting, and why these are especially important for library developers
- integrating with ML pipelines
- exposing key functionality in both Python and Scala
- how to test, build, and publish your library for the community
We’ll back up our advice with concrete examples from real packages built atop Spark. You’ll leave this talk informed and inspired to take your Spark proficiency to the next level and develop and publish an awesome library of your own.
Imagine that self-driving cars now exist and are becoming widespread around the world. To facilitate the transition, it's necessary to set up a central service to monitor traffic conditions nationwide and to deploy sensors throughout the interstate system that monitor car speeds, pavement and weather conditions, as well as accidents, construction, and other sources of traffic tie-ups.
MongoDB has been selected as the database for this application. In this webinar, we will walk through designing the application’s schema that will both support the high update and read volumes as well as the data aggregation and analytics queries.
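One plausible shape for such a schema (the field names and one-minute bucketing below are our illustration, not the webinar's actual design) is the classic bucket pattern: one document per sensor per minute, so high-frequency readings become in-place updates rather than millions of tiny documents, while pre-aggregated summaries serve the analytics queries:

```python
from datetime import datetime, timezone

def make_bucket(sensor_id, minute_start):
    """One document per sensor per minute; readings are appended in place
    (via $push / $inc in a real MongoDB deployment)."""
    return {
        "_id": f"{sensor_id}:{minute_start.isoformat()}",
        "sensor_id": sensor_id,
        "minute": minute_start,
        "readings": [],
        "summary": {"count": 0, "speed_sum": 0.0},
    }

def add_reading(bucket, speed_mph, pavement, weather):
    """Record one sensor reading and keep the rollup summary current."""
    bucket["readings"].append(
        {"speed_mph": speed_mph, "pavement": pavement, "weather": weather}
    )
    bucket["summary"]["count"] += 1
    bucket["summary"]["speed_sum"] += speed_mph
    return bucket

bucket = make_bucket("I90-MM142", datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc))
add_reading(bucket, 61.0, "dry", "clear")
add_reading(bucket, 58.5, "dry", "clear")
print(bucket["summary"]["speed_sum"] / bucket["summary"]["count"])  # 59.75
```

Bucketing caps document growth and lets the aggregation pipeline compute, say, average speed per mile marker from the summaries without unwinding raw readings.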
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
AWS re:Invent 2016: Deep Dive on Amazon DynamoDB (DAT304) (Amazon Web Services)
Explore Amazon DynamoDB capabilities and benefits in detail and learn how to get the most out of your DynamoDB database. We go over best practices for schema design with DynamoDB across multiple use cases, including gaming, AdTech, IoT, and others. We explore designing efficient indexes, scanning, and querying, and go into detail on a number of recently released features, including JSON document support, DynamoDB Streams, and more. We also provide lessons learned from operating DynamoDB at scale, including provisioning DynamoDB for IoT.
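A core schema-design idea for IoT-style workloads like those mentioned is the composite primary key, optionally write-sharded so one hot device cannot overwhelm a single partition. A small sketch, with attribute names and the sharding rule chosen purely for illustration:

```python
from datetime import datetime, timezone

def to_item(device_id, ts, payload, shards=4):
    """Write-sharded composite key: spread a hot device's writes across
    `shards` partitions while keeping a time-ordered sort key so queries
    can still ask for a device's readings in a time range."""
    shard = int(ts.timestamp()) % shards          # deterministic, time-based spread
    return {
        "pk": f"{device_id}#{shard}",             # partition key
        "sk": ts.strftime("%Y-%m-%dT%H:%M:%S"),   # sort key: ISO-8601 timestamp
        **payload,
    }

item = to_item(
    "thermostat-17",
    datetime(2016, 11, 30, 9, 15, 0, tzinfo=timezone.utc),
    {"temp_c": 21.5},
)
print(item["pk"], item["sk"])
```

Reading a full time range then means querying each of the `shards` partition-key values and merging by sort key, the usual trade-off write sharding introduces.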
This presentation will describe how to go beyond a "Hello world" stream application and build a real-time data-driven product. We will present architectural patterns, go through tradeoffs and considerations when deciding on technology and implementation strategy, and describe how to put the pieces together. We will also cover necessary practical pieces for building real products: testing streaming applications, and how to evolve products over time.
Presented at highloadstrategy.com 2016 by Øyvind Løkling (Schibsted Products & Technology), joint work with Lars Albertsson (independent, www.mapflat.com).
Big commercial websites breathe data: they create a lot of it very fast, but also need the feedback based on the very same data to become better and better.
In this talk we're showing our ideas, the drawbacks, and the solutions for building your own big data infrastructure.
We further explore the possibilities to access and harness the data using map/reduce and near real-time approaches, in order to prepare you for the most challenging part of it all: gaining relevant knowledge you did not have before.
This talk was held at the Developer Conference 2013 (http://www.developer-conference.eu/session_post/log-everything/)
Your Guide to Push Notifications - Comparing GCM & APNS (Sparkbit)
Learn more about the basic concept of push notification and its current implementations. See the difference between Apple Push Notifications and Google Cloud Messaging.
Push notifications allow your users to opt in to timely updates from sites they love, and allow you to effectively re-engage them with customized, engaging content.
How to Choose Between Push Notifications and SMS | CM Telecom (CM.com)
Push notifications are cheaper, but SMS reaches virtually every type of handset. Push messaging requires an Internet connection, but text messages are more static.
It can be quite difficult for businesses to choose between push notifications and SMS as a communication channel. Both have advantages and disadvantages.
In this presentation you will learn the differences and how to choose between push notifications and SMS.
(MBL307) How Mobile Businesses and Enterprises Use Amazon SNS (Amazon Web Services)
Does your business need a scalable messaging solution to drive user engagement or enable communication across your service-tiers? Join us to learn how Amazon SNS can be used to send messages at scale to destinations such as mobile apps, desktop apps, HTTP endpoints, Amazon SQS queues, email addresses, and AWS Lambda functions. Additionally, we will discuss how customers are using Amazon SNS in conjunction with other AWS services to address business needs ranging from targeted mobile push notifications to messaging bus fabrics for server-less backends. We are also excited that Easy Taxi and Earth Networks will join us and share how SNS has helped them address their business needs.
A quick overview of how to configure web push notifications for your web application, covering Chrome, Firefox, Opera and Safari. https://lahiiru.github.io/browser-push
(MBL301) Beyond the App - Extend Your User Experience with Mobile Push Notifi... (Amazon Web Services)
Cross-platform push notifications that can engage your customers even when your app is in the background are becoming a central part of a mobile app user experience. Some customers may rarely open an app that provides useful information to them; for them, the notifications are the most important part. But great user experiences can break if your messages get dropped or delayed. How do you ensure your messages are delivered fast and reliably at scale? And how can you use them to extend the user experience of your app? In this session, we show you how Amazon SNS provides the performance and simplicity of a managed service, while also supporting interactive notifications, silent push, and broadcasts to large groups. We also learn from Mailbox, who rely on large-scale push notifications as a core part of the user experience, and who will share real-world design patterns.
This was a presentation on my book MapReduce Design Patterns, given to the Twin Cities Hadoop Users Group. Check it out if you are interested in seeing what my book is about.
Scaling up Uber's real-time data analytics (Xiang Fu)
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies have helped Uber scale and enabled SQL to power realtime decision making for city ops, data scientists, data analysts and engineers.
Big Data Day LA 2016 / Big Data Track - Portable Stream and Batch Processing w... (Data Con LA)
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013 (Amazon Web Services)
Get the most out of Amazon Redshift by learning about cutting-edge data warehousing implementations. Desk.com, a Salesforce.com company, discusses how they maintain a large concurrent user base on their customer-facing business intelligence portal powered by Amazon Redshift. HasOffers shares how they load 60 million events per day into Amazon Redshift with a 3-minute end-to-end load latency to support ad performance tracking for thousands of affiliate networks. Finally, Aggregate Knowledge discusses how they perform complex queries at scale with Amazon Redshift to support their media intelligence platform.
Serverless Streaming Data Processing using Amazon Kinesis Analytics (Amazon Web Services)
by Adrian Hornsby, Technical Evangelist, AWS
As more and more organizations strive to gain real-time insights into their business, streaming data has become ubiquitous. Typical streaming data analytics solutions require specific skills and complex infrastructure. However, with Amazon Kinesis Analytics, you can analyze streaming data in real-time with standard SQL—there is no need to learn new programming languages or processing frameworks. In this session, we dive deep into the capabilities of Amazon Kinesis Analytics using real-world examples. We’ll present an end-to-end streaming data solution using Amazon Kinesis Streams for data ingestion, Amazon Kinesis Analytics for real-time processing, and Amazon Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Amazon Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
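The semantics of such a standard-SQL query over a stream, for example a per-minute tumbling-window count, can be mimicked in plain Python to show what the service computes (the events below are illustrative; Kinesis Analytics itself expresses this as SQL over in-application streams):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Equivalent of a streaming SQL tumbling window, roughly:
    SELECT window_start, COUNT(*) ... GROUP BY FLOOR(event_time / 60)."""
    windows = defaultdict(int)
    for event_time, _payload in events:
        window_start = (event_time // window_seconds) * window_seconds
        windows[window_start] += 1
    return dict(sorted(windows.items()))

# (event_time_in_seconds, payload)
events = [(3, "click"), (45, "click"), (61, "view"), (119, "click"), (130, "view")]
print(tumbling_window_counts(events))  # {0: 2, 60: 2, 120: 1}
```

In the managed setup described above, Kinesis Streams would feed such a query continuously and Kinesis Firehose would persist each completed window's result.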
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv (Amazon Web Services)
Low latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing analytics on moving data using Amazon Kinesis and EMR/Spark Streaming, and share some best practices and real-world examples.
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis (Amazon Web Services)
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
This presentation was delivered at the AWS re:Invent re:Cap event by Ilho Kim, Solutions Architect at Amazon Web Services.
Summary: We cover best practices and architecture design patterns for data analysis using services such as Hadoop, Elastic MapReduce, Redshift, Kinesis, Data Pipeline, and S3, and take a look at the new compute-optimized Amazon EC2 C4 instances and the Amazon EBS volume size and performance improvements announced at re:Invent.
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK (Nate Wiger)
See the latest analytics architectures for companies succeeding in the free-to-play space, such as Supercell, GREE, and Rovio. Also see how to create a real-time analytics pipeline to connect to your players, enabling you to deliver deeper experiences.
Sergei Sokolenko "Advances in Stream Analytics: Apache Beam and Google Cloud ..." (Fwdays)
In this session, Sergei Sokolenko, the Google product manager for Cloud Dataflow, will share the implementation details of many of the unique features available in Apache Beam and Cloud Dataflow, including:
- autoscaling of resources based on data inputs;
- separating compute and state storage for better scaling of resources;
- simultaneous grouping and joining of 100s of terabytes in a hybrid in-memory/on-disk file system;
- dynamic rebalancing of work items away from overutilized worker nodes, and many others.
Customers benefit from these advances through faster execution of jobs, resource savings, and a fully managed data processing environment that runs in the Cloud and removes the need to manage infrastructure.
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsAmazon Web Services
As more and more organizations strive to gain real-time insights into their business, streaming data has become ubiquitous. Typical streaming data analytics solutions require specific skills and complex infrastructure. However, with Amazon Kinesis Analytics, you can analyze streaming data in real-time with standard SQL—there is no need to learn new programming languages or processing frameworks. In this session, we dive deep into the capabilities of Amazon Kinesis Analytics using real-world examples. We’ll present an end-to-end streaming data solution using Amazon Kinesis Streams for data ingestion, Amazon Kinesis Analytics for real-time processing, and Amazon Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Amazon Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scaleable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
Slides from the Cloudyna event in Katowice, Poland on November 14th, 2015. Data analysis is being used to transform businesses, increase efficiency, and drive innovation. The AWS Cloud has a comprehensive portfolio of analytics services to help you process data of any volume and automate how you put that data to work for your organization. In this session we'll see how to put those services at work on structured, unstructured and real-time data.
Serverless Streaming Data Processing using Amazon Kinesis AnalyticsAmazon Web Services
As more and more organizations strive to gain real-time insights into their business, streaming data has become ubiquitous. Typical streaming data analytics solutions require specific skills and complex infrastructure. However, with Amazon Kinesis Analytics, you can analyze streaming data in real-time with standard SQL—there is no need to learn new programming languages or processing frameworks. In this session, we dive deep into the capabilities of Amazon Kinesis Analytics using real-world examples. We’ll present an end-to-end streaming data solution using Amazon Kinesis Streams for data ingestion, Amazon Kinesis Analytics for real-time processing, and Amazon Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Amazon Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...Amazon Web Services
Analyze Big Data for Consumer Applications with Looker BI and Amazon Redshift Customizing the customer experience based on user behavior is a constant challenge for today’s consumer apps. Business intelligence helps analyze and model large amounts of data. Looker offers a modern approach to BI leveraging AWS that’s fast, agile, and easy to manage. Join this webinar to learn how MessageMe, which provides emotionally engaging messaging apps to consumers, leverages Looker business intelligence software and the Amazon Redshift data warehouse service to analyze billions of rows of customer data in seconds.
Webinar topics include:
• How MessageMe turns billions of rows of customer data stored in Amazon Redshift into actionable insights
• How Looker connects directly to Amazon Redshift in just a few clicks, enabling MessageMe to build a modern, big data analytics in the cloud. Who should attend
• Information or Solution Architects, Data Analysts, BI Directors, DBAs, Development Leads, Developers, or Technical IT Leaders.
Presenters:
• Justin Rosenthal, CTO, MessageMe
• Keenan Rice, VP, Marketing & Alliances, Looker
• Tina Adams, Senior Product Manager, AWS
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Monitoring Java Application Security with JDK Tools and JFR Events
Transforming Mobile Push Notifications with Big Data
1. Transforming Mobile
Push Notifications with
Big Data
Dennis Waldron, Data Engineering
Pablo Varela, Systems Engineering
2. Who is Plumbee?
● 12.8M Installs
● 209K Daily Active Users
● 818K Monthly Active Users
● Social Games Studio
● Mirrorball Slots & Bingo
● Facebook Canvas, iOS
3. Data Providers
In-house data = 99.9% of all data
In Total:
● 98TB (907 days of data)
● All stored in Amazon S3
Daily:
● 78GB compressed
● ~450M events/day
● 4,800 events/second (peak)
5. Amazon Web Services
Application/Game Servers
End Users (Desktop & Mobile)
● Collect everything!
● RPC events intercepted by
annotated endpoints. (Requests)
● All mutating state changes
recorded:
○ DynamoDB, MySQL, Memcache
(Blobs Updates)
● Custom Telemetry (Other):
○ Client: click tracking, loading time
statistics, GPU data...
○ Server: promotions, transactions,
Facebook user data...
[Diagram: application/game servers generate Game Data: RPC requests (77%), blob updates to DynamoDB, MySQL, and Memcache (9%), other custom telemetry (15%)]
6. Game Data - Example RPC Endpoint Annotation
/**
* Example annotation
*/
@SQSRequestLog(requestMessage = SpinRequest.class)
@RequestMapping("/spin")
public SpinResponse spin(SpinRequest spinRequest) {
…
}
7. Example Event - userStats
● All events are recorded in JSON.
● Structure:
○ Headers
○ Categorization Data (metadata)
○ Payload (message)
● Important Headers:
○ timestamp
○ testVariant
○ plumbeeUid
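A sketch of what one such event might look like on the wire. The header names come from the slide; the metadata and payload fields are invented for illustration, not Plumbee's actual schema:

```python
import json

# Illustrative event only: "headers" keys are from the slide;
# "metadata" and "message" contents are hypothetical examples.
event = {
    "headers": {
        "timestamp": 1416268800000,
        "testVariant": "control",
        "plumbeeUid": 466264,
    },
    "metadata": {                # categorization data
        "type": "rpc",
        "subType": "rpc-spin",
    },
    "message": {                 # payload
        "coinsWagered": 500,
        "coinsWon": 1200,
    },
}

line = json.dumps(event)         # one JSON event per line
print(line)
```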
9. Data Collection (I) - PUT
[Diagram: Application/Game Servers (producers) → Events (JSON) → SQS Queue → Log Aggregators (consumers)]
What is SQS (Simple Queue Service)?
A cloud-based message queue for transmitting
messages between producers and consumers
SQS Provides:
● ACK/FAIL semantics
● Unlimited number of messages
● Scales transparently
● Buffer zone
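The ACK/FAIL and buffering semantics above can be sketched with a toy in-memory queue. This models SQS-style visibility timeouts only; it is not the real AWS API:

```python
import time

class ToyQueue:
    """In-memory model of SQS-style ACK/FAIL semantics."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.messages = {}            # id -> (body, invisible_until)
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = (body, 0.0)
        self.next_id += 1

    def receive(self):
        """Hand out one visible message; hide it until ACKed or timed out."""
        now = time.monotonic()
        for mid, (body, invisible_until) in self.messages.items():
            if invisible_until <= now:
                self.messages[mid] = (body, now + self.visibility_timeout)
                return mid, body
        return None

    def ack(self, mid):
        """ACK = delete. A consumer that crashes never ACKs, so the
        message becomes visible again after the timeout (FAIL)."""
        del self.messages[mid]

q = ToyQueue()
q.send('{"type": "rpc-spin"}')
mid, body = q.receive()
q.ack(mid)                        # processed successfully
print(len(q.messages))            # → 0
```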
10. Data Collection (II) - GET
What is Apache Flume?
A distributed, reliable, and available service
for efficiently collecting, aggregating, and
moving large amounts of log data
[Diagram: SQS Queue → Apache Flume (consumers) → Amazon S3 (Simple Storage Service)]
S3 Data:
● Partitioned by: date / type / sub_type
● Compressed with: Snappy
● Aggregated in 512MB chunks
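A minimal sketch of how the date / type / sub_type partitioning might map onto S3 keys. The prefix, file naming, and the .snappy suffix are assumptions, not Plumbee's actual layout:

```python
def s3_key(event_date, event_type, event_sub_type, chunk_id):
    # Partition order from the slide: date / type / sub_type.
    # Everything else here (prefix, part numbering, suffix) is illustrative.
    return ("eventlog/date=%s/type=%s/sub_type=%s/part-%05d.snappy"
            % (event_date, event_type, event_sub_type, chunk_id))

print(s3_key("2014-11-18", "rpc", "rpc-spin", 7))
# → eventlog/date=2014-11-18/type=rpc/sub_type=rpc-spin/part-00007.snappy
```

Keying on the partition columns first means a query engine (Hive, in this deck) can prune whole prefixes instead of scanning every object.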
11. Data Collection (III) - Flume
Flume Agent
Source
(Custom)
Sink
(HDFS)
SQS Queue
Channel
(File Based)
● Pluggable component architecture
● Durability via transactions
● File channels use Elastic Block Store (EBS) volumes (network-attached storage)
○ Protects against hardware failure
● SQS Flume Plugin: https://github.com/plumbee/flume-sqs-source
[Diagram: → S3 Bucket. Transactions cover each hop: A + B + C = Flow]
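An agent like the one above might be wired up with a Flume properties file along these lines. The source class name, paths, and bucket are illustrative assumptions, not taken from the plugin's documentation:

```properties
# Hypothetical Flume agent config sketching the Source -> Channel -> Sink flow
agent.sources  = sqs
agent.channels = file-ch
agent.sinks    = s3

# Custom SQS source (class name is an assumption; see the
# plumbee/flume-sqs-source repository for the real one)
agent.sources.sqs.type = com.example.SqsSource
agent.sources.sqs.channels = file-ch

# Durable, EBS-backed file channel
agent.channels.file-ch.type = file
agent.channels.file-ch.checkpointDir = /mnt/ebs/flume/checkpoint
agent.channels.file-ch.dataDirs = /mnt/ebs/flume/data

# HDFS sink writing Snappy-compressed files to S3
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = file-ch
agent.sinks.s3.hdfs.path = s3n://example-bucket/eventlog/%Y-%m-%d/
agent.sinks.s3.hdfs.fileType = CompressedStream
agent.sinks.s3.hdfs.codeC = snappy
```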
13. Extract, Transform, Load
● Daily activity
● Orchestrated by Amazon DataPipeline
● Includes generation of reports
● Configured with JSON
What is DataPipeline?
A cloud-based data workflow service that
helps you process and move data between
different AWS services
RESOURCE COMMAND SCHEDULE
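The RESOURCE / COMMAND / SCHEDULE triple maps onto a pipeline definition along these lines. This is a hedged sketch using standard Data Pipeline object types, not Plumbee's actual pipeline:

```json
{
  "objects": [
    {
      "id": "DailySchedule",
      "type": "Schedule",
      "period": "1 day",
      "startAt": "FIRST_ACTIVATION_DATE_TIME"
    },
    {
      "id": "EtlCluster",
      "type": "EmrCluster",
      "schedule": { "ref": "DailySchedule" }
    },
    {
      "id": "EtlActivity",
      "type": "EmrActivity",
      "runsOn": { "ref": "EtlCluster" },
      "step": "s3://example-bucket/jobs/etl.jar,arg1"
    }
  ]
}
```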
14. Extract & Transform (I)
What is Elastic Map Reduce?
A cloud-based MapReduce implementation for
processing vast amounts of data, built on top of
the open-source Hadoop framework.
Two phases:
● Map() Procedure -> Filtering & Sorting
● Reduce() -> Summary operation
[Diagram: word-count example. Raw data of Penguin/Horse/Cake tokens → MAP() emits and sorts into per-key queues → REDUCE() summarizes each queue → result: Penguin: 4, Horse: 3, Cake: 2]
15. Extract & Transform (II)
What is Hive?
An open-source Apache project which provides a
SQL-like interface to summarize, query, and
analyze large datasets by leveraging Hadoop’s
MapReduce infrastructure.
● Not really SQL, HQL -> HiveQL
● No transactions, materialized views,
limited subquery support, ...
SELECT plumbeeuid,
COUNT(*) AS spins
FROM eventlog
-- Partitioned data access
WHERE event_date = '2014-11-18'
AND event_type = 'rpc'
AND event_sub_type = 'rpc-spin'
-- Aggregation
GROUP BY plumbeeuid;
Table: Eventlog
● Mounted on top of raw data
● SerDe provides JSON parsing
● Target data via partition filters
16. Extract & Transform (III)
● Hive has limitations!
○ Speed, JSON
● Most of our transformations use:
Streaming MapReduce Jobs
What is Streaming?
“A Hadoop utility that allows you to create
and run MapReduce jobs using any
executable script as a mapper or reducer”
import sys, json

for line in sys.stdin:
    data = json.loads(line)
    print('%s\t1' % data['plumbeeUid'])
Emits key-value pairs:
466264 => 1, 376166 => 1
983131 => 1, 466264 => 1
Hadoop sorts and shuffles the data making sure
matching keys are processed by a single reducer!
import sys
from collections import defaultdict

results = defaultdict(int)
for line in sys.stdin:
    plumbee_uid, count = line.split('\t')
    results[plumbee_uid] += int(count)
print(dict(results))
[Diagram: JSON rpc-spin data → map() → reduce() → Result: { 466264: 2, 376166: 1, 983131: 1 }]
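The two scripts can be exercised locally without Hadoop by piping sample events through the same map → sort → reduce sequence; here `sorted()` stands in for Hadoop's sort-and-shuffle phase:

```python
import json
from collections import defaultdict

def map_lines(lines):
    # Mapper: one (plumbeeUid, 1) pair per rpc-spin event
    for line in lines:
        data = json.loads(line)
        yield '%s\t1' % data['plumbeeUid']

def reduce_lines(lines):
    # Reducer: sum the counts per key (after the shuffle, Hadoop
    # guarantees all pairs for a key reach the same reducer)
    results = defaultdict(int)
    for line in lines:
        uid, count = line.split('\t')
        results[uid] += int(count)
    return dict(results)

events = [
    '{"plumbeeUid": 466264}',
    '{"plumbeeUid": 376166}',
    '{"plumbeeUid": 983131}',
    '{"plumbeeUid": 466264}',
]
# sorted() plays the role of Hadoop's sort-and-shuffle
counts = reduce_lines(sorted(map_lines(events)))
print(counts)   # → {'376166': 1, '466264': 2, '983131': 1}
```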
17. Load (I) - Problem
[Diagram: Raw S3 JSON Data → EMR Transformation (Hive & Streaming Jobs) → Aggregated Data (5.4TB)]
EMR-transformed data:
● Referred to as aggregates
● Stored in S3
● Accessible via EMR cluster
Problem
● We don’t run long-lived EMR clusters.
EMR also:
● Requires specialist knowledge
● Is slow to boot and process (“offline”)
Use Amazon Redshift for fast “online” data access
18. What is Redshift?
A column-oriented database which uses
Massively Parallel Processing (MPP) techniques
to support analytics-style, SQL-based
workloads across large datasets.
Power comes from:
● Query parallelization
● Column-oriented design
Redshift Provides:
● Low latency JDBC and ODBC access
● Fault Tolerance
● Automated Backups
Load (II) - Redshift
Redshift (x3 nodes): 0.33s
EMR (x20 nodes): 135.46s
19. Load (II) - Column-Oriented Databases
Row-oriented Database - MySQL
ID First Name Last Name Country
1 Penguin Situation GB
2 Cheese Labs US
3 Horse Barracks GB
Column-oriented Database - Redshift
ID First Name Last Name Country
1 Penguin Situation GB
2 Cheese Labs US
3 Horse Barracks GB
● Easy to add/modify records
● May read irrelevant data
● Great for fast lookups (OLTP)
● Only read in relevant data
● Adding rows requires multiple
updates to column data.
● Great for aggregation queries
(OLAP)
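The trade-off above can be made concrete with a toy model of the two layouts (illustrative only, using the sample table from this slide):

```python
# Same three records, stored two ways
rows = [
    (1, "Penguin", "Situation", "GB"),
    (2, "Cheese",  "Labs",      "US"),
    (3, "Horse",   "Barracks",  "GB"),
]

# Column-oriented: one array per column
columns = {
    "id":         [1, 2, 3],
    "first_name": ["Penguin", "Cheese", "Horse"],
    "last_name":  ["Situation", "Labs", "Barracks"],
    "country":    ["GB", "US", "GB"],
}

# OLAP-style aggregate: how many users per country?
# A row store touches every field of every row...
row_scan = [r[3] for r in rows]      # reads 12 values to use 3
# ...while a column store reads only the column it needs.
col_scan = columns["country"]        # reads exactly 3 values

print(col_scan.count("GB"))          # → 2
```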
32. User targeting
Run SQL queries directly against Redshift
[Diagram: SQL Query → Amazon Redshift → User Segment]
33. User targeting: Query example
-- Target all mobile users
SELECT plumbee_uid, arn
FROM mobile_user
34. User targeting: Query example (II)
-- Target lapsed users (1 week lapse)
SELECT plumbee_uid, arn
FROM mobile_user
WHERE last_play_time < DATEADD(day, -7, GETDATE())
44. Amazon SNS: Mobile Push
private void publishMessage(UserData userData, String jsonPayload) {
    amazonSNS.publish(new PublishRequest()
        .withTargetArn(userData.getEndpoint())
        .withMessageStructure("json")
        .withMessage(jsonPayload));
}
Payload example
{"default": "The 5 day Halloween Challenge has started today! Touch to play NOW!"}
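With MessageStructure set to "json", SNS expects a "default" entry plus optional per-platform entries (e.g. "APNS") whose values are themselves JSON strings. A small sketch of building such a payload; the APNS alert shape is a typical example, not Plumbee's actual payload:

```python
import json

def build_push_payload(message, badge=1):
    # SNS "json" message structure: top-level values must be strings,
    # so per-platform payloads are JSON-encoded twice.
    apns = json.dumps({"aps": {"alert": message, "badge": badge}})
    return json.dumps({
        "default": message,   # fallback for platforms not listed
        "APNS": apns,         # delivered to iOS endpoints
    })

payload = build_push_payload(
    "The 5 day Halloween Challenge has started today! Touch to play NOW!")
print(payload)
```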