Twitter's 250 million users generate tens of billions of tweet views per day. Aggregating these events in real time - in a robust enough way to incorporate into our products - presents a massive scaling challenge. In this presentation I introduce TSAR (the TimeSeries AggregatoR), a robust, flexible, and scalable service for real-time event aggregation designed to solve this problem and a range of similar ones. I discuss how we built TSAR using Python and Scala from the ground up, almost entirely on open-source technologies (Storm, Summingbird, Kafka, Aurora, and others), and describe some of the challenges we faced in scaling it to process tens of billions of events per day.
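The core idea of real-time time-series aggregation can be sketched in a few lines of Python. This is an illustrative toy, not TSAR's actual API (the real system runs on Storm and Summingbird): events are bucketed into fixed-width time windows and counted, and range queries sum the buckets.

```python
from collections import defaultdict

# Hypothetical in-memory sketch; TSAR's real pipeline is distributed.
class TimeSeriesAggregator:
    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        self.counts = defaultdict(int)  # (metric, bucket_start) -> count

    def record(self, metric, timestamp):
        # Round the timestamp down to the start of its bucket.
        bucket = int(timestamp) - int(timestamp) % self.bucket_seconds
        self.counts[(metric, bucket)] += 1

    def query(self, metric, start, end):
        # Sum every bucket whose start falls inside [start, end).
        return sum(c for (m, b), c in self.counts.items()
                   if m == metric and start <= b < end)

agg = TimeSeriesAggregator()
for ts in (0, 10, 59, 61, 130):
    agg.record("tweet_views", ts)
print(agg.query("tweet_views", 0, 120))  # 4 (buckets 0 and 60)
```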
One of the biggest time sinks and challenges for mobile application developers is developing, accessing, and managing all of the disparate data sources involved in delivering delightful, collaborative, real-time mobile experiences, while also enabling offline capabilities for when a user is not connected but still wants to use the app. In this session, you'll be introduced to the new AWS AppSync service, which speeds up and simplifies these tasks for developers by using GraphQL to provide a data abstraction layer and easy query and update statements, without requiring knowledge of the underlying data sources.
(Presentation slides are in English.)
We are producing an ever-greater volume of data, and businesses need this information analyzed within a second (or even a millisecond). AWS provides technologies to solve Big Data problems, but which services should I use, and why, when, and how? In this session we will discuss the different phases of data analysis: ingestion, storage, processing, and visualization, and how to choose the right technology for each of them.
Streaming, Analytics and Reactive Applications with Apache Cassandra - Cédrick Lunven
Apache Cassandra is one of the first choices when it comes to working with time series, as it allows thousands of writes per second and linear scalability. In this session we will demonstrate through code reviews what can be achieved during and after data ingestion using the extra capabilities provided by DataStax Enterprise (Spark | Search | Graph):
- Event streaming and machine learning (forecasting)
- Storing in a multi-model database for multiple usages (search, graph)
- Displaying dashboards in real time, leveraging a reactive driver in a web application.
(APP304) AWS CloudFormation Best Practices | AWS re:Invent 2014 - Amazon Web Services
With AWS CloudFormation you can model, provision, and update the full breadth of AWS resources. You can manage anything from a single Amazon EC2 instance to a multi-tier application.
If you are familiar with AWS CloudFormation or are using it already, this session is for you. If you are familiar with it, you may have questions such as 'How do I plan my stacks?', 'How do I deploy and bootstrap software on my stacks?', and 'Where does AWS CloudFormation fit in a DevOps pipeline?' If you are already using it, you may have questions such as 'How do I manage my templates at scale?', 'How do I safely update stacks?', and 'How do I audit changes to my stack?' This session is intended to answer those questions.
If you are new to AWS CloudFormation, get up to speed for this session by completing the Working with CloudFormation lab in the self-paced Labs Lounge.
by Lin Chunyong and Ryan Deivert, Airbnb
AWS Data & Analytics Week is an opportunity to learn about Amazon's family of managed analytics services. These services provide easy, scalable, reliable, and cost-effective ways to manage your data in the cloud. We explain the fundamentals and take a technical deep dive into the Amazon Redshift data warehouse; Data Lake services including Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum; log analytics with Amazon Elasticsearch Service; and data preparation and placement services with AWS Glue and Amazon Kinesis. You'll learn how to get started, how to support applications, and how to scale.
Crossing the Streams: Event Streaming with Kafka Streams
Viktor Gamov, Developer Advocate, Confluent
https://www.meetup.com/Lille-Kafka/events/270888493/
Increasingly, developers are breaking their applications apart into smaller components and distributing them across a pool of compute resources. Docker is fast becoming a core component of these architectures, but going from one or a small number of containers to a distributed application is not trivial. In this session we will talk about some of the core architectural principles underlying Amazon EC2 Container Service (ECS) and how they are designed to help you scale your applications and run them in production. We will talk about how containers can be used as the foundation for new computing primitives and how these are being used by our customers for increased agility and productivity.
by Gautam Srinivasan, Solutions Architect and Karan Desai, Solutions Architect, AWS
We'll import data from multiple sources and run analytics on our Data Lake. You’ll need a laptop with a Firefox or Chrome browser.
AWS May Webinar Series - Deep Dive: Infrastructure as Code - Amazon Web Services
If you are interested to know more about AWS Chicago Summit, please use the following to register: http://amzn.to/1RooPPL
Many AWS customers have adopted a DevOps model for faster and more reliable software delivery. Applying software engineering best practices such as revision control and continuous delivery to your infrastructure is essential for adopting DevOps. In this webinar, find out how AWS CloudFormation allows you to leverage a DevOps model by treating infrastructure as code and applying software engineering best practices to your AWS infrastructure.
Learning Objectives: • Understand the basic CloudFormation terminology, concepts, and workflow • Deploy applications and provision infrastructure through a CloudFormation template • Use CloudFormation with a CICD pipeline, AWS OpsWorks, and AWS Lambda
Who Should Attend: • DevOps Engineers, Solutions Architects, Systems Integrators
AWS CloudTrail to Track AWS Resources in Your Account (SEC207) | AWS re:Inven... - Amazon Web Services
Customers using AWS resources such as EC2 instances, EC2 Security Groups and RDS instances would like to track changes made to such resources and who made those changes. In this session, customers will learn about gaining visibility into user activity in their account and aggregating logs across multiple accounts into a single bucket. Customers will also learn about how they can use the user activity logs to meet the logging guidelines/requirements of different compliance standards. AWS Advanced Technology Partners Splunk/Sumologic (exact partners TBD) will demonstrate applications for analyzing user activity within an AWS account.
Deep Dive into Concepts and Tools for Analyzing Streaming Data on AWS - AWS Germany
Querying streaming data with SQL to derive actionable insights at the point of impact in a timely and continuous fashion offers various benefits over querying data in a traditional database. However, although it is desirable for many use cases to transition to a stream based paradigm, stream processing systems and traditional databases are fundamentally different: in a database, the data is (more or less) fixed and the queries are executed in an ad-hoc manner, whereas in stream processing systems, the queries are fixed and the data flows through the system in real-time. This leads to different primitives that are required to model and query streaming data.
In this session, we will introduce basic stream processing concepts and discuss strategies that are commonly used to address the challenges that arise from querying of streaming data. We will discuss different time semantics, processing guarantees and elaborate how to deal with reordering and late arriving of events. Finally, we will compare how different streaming use cases can be implemented on AWS by leveraging Amazon Kinesis Data Analytics and Apache Flink.
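To make the time-semantics discussion concrete, here is a toy Python sketch of a tumbling event-time window with an allowed-lateness bound. This is not the Kinesis Data Analytics or Flink API; the class and its parameters are assumptions for illustration only.

```python
from collections import defaultdict

class TumblingWindowCounter:
    def __init__(self, window=60, allowed_lateness=30):
        self.window = window
        self.allowed_lateness = allowed_lateness
        self.watermark = 0              # highest event time seen so far
        self.counts = defaultdict(int)  # window_start -> event count
        self.dropped_late = 0

    def on_event(self, event_time):
        self.watermark = max(self.watermark, event_time)
        window_start = event_time - event_time % self.window
        window_end = window_start + self.window
        # Accept out-of-order events while the window is still within
        # the lateness bound; drop events that arrive after that.
        if self.watermark <= window_end + self.allowed_lateness:
            self.counts[window_start] += 1
        else:
            self.dropped_late += 1

c = TumblingWindowCounter()
for t in (5, 20, 70, 65, 10, 200, 15):
    c.on_event(t)
print(dict(c.counts), "late:", c.dropped_late)  # {0: 3, 60: 2, 180: 1} late: 1
```

The event at t=10 is out of order but still accepted (the watermark is only at 70); the event at t=15 arrives after the watermark jumped to 200, so its window has already been finalized and it is dropped.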
Mastering Access Control Policies (SEC302) | AWS re:Invent 2013 - Amazon Web Services
This session will take an in-depth look at using the AWS Access Control Policy language. We will start with the basics of understanding the policy language and how to create policies for users and groups. Next we’ll take a look at how to use policy variables to simplify the management of policies. Finally, we’ll cover some of the common use cases such as granting a user secure access to an S3 bucket, allowing an IAM user to manage their own credentials and passwords, and more!
AWS Analytics Immersion Day - Build BI System from Scratch (Day1, Day2 Full V... - Sungmin Kim
How to build a Business Intelligence system from scratch on AWS (Day1, Day2)
------------------------------------------------------------------------------------------
These are the full presentation materials from the Online AWS Analytics Immersion Day, held online over two days from 2020-03-18 (Wed) to 2020-03-19 (Thu).
The materials were created to explain how AWS Analytics services can be used when designing a BI (Business Intelligence) system.
Target Audience
-------------------
The Online Analytics Immersion Day is aimed at the following audiences:
- Those who know the basic concepts of AWS Analytics services (e.g. Kinesis, Athena, Redshift, EMR, etc.) but wonder how to use these services and how to build a data analytics system with them
- Those who have built a data analytics system before but wonder how to evaluate and validate their own system from an architectural perspective
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale - Sriram Krishnan
The Data Platform at Twitter supports engineers and data scientists running batch jobs on Hadoop clusters of several thousand nodes, and real-time jobs on top of systems such as Storm. In this presentation, I discuss the overall Data Platform stack at Twitter. In particular, I talk about enabling real-time and batch analytics at scale with the help of Scalding, which is a Scala DSL for batch jobs using MapReduce; Summingbird, which is a framework for combined real-time and batch processing; and TSAR, which is a framework for real-time time-series aggregations.
Rapid prototyping of Eclipse RCP applications - EclipseCon Europe 2017 - Patrik Suzzi
As an Eclipse RCP developer, you have different options for quickly creating prototypes for your customers.
In this talk, I will share my experience in building RCP applications by using a simple, efficient and extensible technology stack based on:
POJO data model
JAXB and JPA annotations for XML and DB persistence
Eclipse E4 as Application Platform
WindowBuilder as a visual designer to build UIs and manage data bindings.
With the help of a case study from the banking domain, you will see all the steps needed to create a complete prototype that you can easily re-use and transform into a real world application.
During the presentation, we'll examine the key points for the creation of the prototype, and you will see a few demos on how to use the E4 Editor and WindowBuilder efficiently.
At the end of the talk, you will be able to repeat the process and understand and adapt the provided source code to develop your applications quickly.
If you want to prototype Eclipse RCP applications with XML/DB persistence and a relatively complex UI, you should attend this talk.
Ceilometer is a tool that collects usage and performance data, while Heat orchestrates complex deployments on top of OpenStack. Heat aims to autoscale its deployments, scaling up when they're running hot and scaling back when idle.
Ceilometer can access decisive data and trigger the appropriate actions in Heat. The result of these two OpenStack projects meeting is value creation in the form of an alarming API in Ceilometer and its consumption in Heat.
Slides presented at the Fall OpenStack Design Summit in Hong Kong
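The alarm-to-action loop described above can be illustrated with a toy Python sketch. This is not the Ceilometer or Heat API; the function names, thresholds, and evaluation rule are invented for illustration of the idea: evaluate a metric against a threshold over consecutive periods, then trigger a scaling action.

```python
# Hypothetical sketch: fire when the last `periods` samples all
# exceed the threshold, mirroring a multi-period alarm evaluation.
def evaluate_alarm(samples, threshold, periods=3):
    recent = samples[-periods:]
    return len(recent) == periods and all(s > threshold for s in recent)

def autoscale(cpu_samples, current_instances):
    # Scale up when running hot (CPU consistently above 80%).
    if evaluate_alarm(cpu_samples, threshold=80):
        return current_instances + 1
    # Scale down when idle (CPU consistently below 20%),
    # expressed by negating samples and threshold.
    if evaluate_alarm([-s for s in cpu_samples], threshold=-20):
        return max(1, current_instances - 1)
    return current_instances

print(autoscale([50, 85, 90, 95], 2))  # 3: last three samples above 80
```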
Azure Functions are great for a wide range of scenarios, including working with data on a transactional or event-driven basis. In this session, we'll look at how you can interact with Azure SQL, Cosmos DB, Event Hubs, and more so you can see how you can take a lightweight but code-first approach to building APIs, integrations, ETL, and maintenance routines.
Social media analytics using Azure Technologies - Koray Kocabas
Social media are computer-mediated tools that allow people to create, share, or exchange information, ideas, and pictures/videos in virtual communities and networks. In short, social media is everything to your customers, and your company needs to listen to them to understand them, make custom offers, improve loyalty, and so on. The Azure Stream Analytics and HDInsight platforms can solve this problem for you. We'll focus on how to get Twitter data using Stream Analytics, how to do data enrichment and storage using HDInsight, and the challenges of sentiment analysis using Azure Machine Learning.
An experiment in a distributed approach to processing the real-time data generated by a large scale social media campaign. Presented at Cambridge Geek Nights 13.
This session is recommended for anyone interested in building real-time streaming applications using AWS. In this session, you will get a deep understanding of how data can be ingested by Amazon Kinesis and made available for real-time analysis and processing. We’ll also show how you can leverage the Kinesis client to make your applications highly available and fault tolerant. We’ll explore various design considerations in implementing real-time solutions and explain key concepts against the backdrop of an actual use case. Finally, we’ll situate stream processing in the broader context of your big data applications.
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da... - WSO2
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
To avoid layers that are tightly coupled to each other, we tend to adopt layered architectures for apps. This talk covers a simple layered architecture, especially for the data and repository layers, which sit beneath the UI-facing layers such as the ViewModel.
Gaining actionable insights in real time enables organizations to seize opportunities and avert threats. Sensing the world, detecting actionable insights, and acting upon them has become far easier than ever with the advancements of streaming SQL. The topics discussed in this deck are:
- Building stream processing applications using streaming SQL
- Deploying and monitoring streaming applications
- Scaling streaming applications
- Building domain specific business UIs
- Visualizing stream processing outputs via dashboards
Timeseries - data visualization in Grafana - OCoderFest
This presentation deals with proper application and resource monitoring. It covers tools that help with the presentation layer (Grafana), storage (InfluxDB), and shipping measurements to their destination (Telegraf).
Presented by Marek Szymeczko
Meetup Streaming data loaded and analysed successfully.
Streaming events data is loaded through Spark Streaming using custom receivers and handled as asynchronous HTTP requests.
Historical events and RSVP data are analysed with Spark MLlib to build group-member recommendations based on a K-means clustering model.
Code: https://github.com/ssushmanth/meetup-stream
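As a rough illustration of the clustering step (an assumed sketch, not the code in the linked repository), a minimal K-means over member activity features might look like this; the feature names and data are invented:

```python
import random

def kmeans(points, k, iters=20, seed=1):
    """Plain K-means: assign points to nearest center, recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centers[i])))
            clusters[nearest].append(p)
        # New center = mean of its cluster; keep old center if empty.
        centers = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Members described by hypothetical features (rsvps_per_month, events_attended).
members = [(1, 0), (2, 1), (1, 1), (9, 8), (10, 9), (8, 10)]
centers, clusters = kmeans(members, k=2)
# Members in the same cluster get each other's groups recommended.
```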
Apache Spark Streaming: Architecture and Fault Tolerance - Sachin Aggarwal
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Accumulo Summit 2015: Building Aggregation Systems on Accumulo [Leveraging Ac... - Accumulo Summit
Talk Abstract
Aggregation has long been a use case of Accumulo Iterators. Iterators' ability to reduce data during compaction and scanning can greatly simplify an aggregation system built on Accumulo. This talk will first review how Accumulo's Iterators/Combiners work in the context of aggregating values. I'll then step back and look at the abstraction of aggregation functions as commutative operations and the several benefits that result by making this abstraction. We will see how it becomes no harder to introduce powerful operations such as cardinality estimation and approximate top-k than it is to sum integers. I will show how to integrate these ideas into Accumulo with an example schema and Iterator. Finally, a practical aggregation use case will be discussed to highlight the concepts from the talk.
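The abstraction the talk describes can be sketched in Python (the real Accumulo Combiners are Java Iterators; this toy only shows the algebra): any merge function that is associative and commutative can be applied to partial results in any order, which is exactly what compaction-time and scan-time reduction require, and swapping in an approximate top-k is no harder than summing integers.

```python
class SumCombiner:
    """Commutative merge: integer addition."""
    zero = 0
    @staticmethod
    def merge(a, b):
        return a + b

class TopKCombiner:
    """Approximate top-k: merge per-key count maps, truncate to k."""
    def __init__(self, k=3):
        self.k = k
        self.zero = {}

    def merge(self, a, b):
        out = dict(a)
        for key, n in b.items():
            out[key] = out.get(key, 0) + n
        top = sorted(out.items(), key=lambda kv: -kv[1])[: self.k]
        return dict(top)

def reduce_partials(combiner, partials):
    # Order of merging does not matter for a commutative combiner,
    # so this reduction can run safely at compaction or scan time.
    acc = combiner.zero
    for p in partials:
        acc = combiner.merge(acc, p)
    return acc

print(reduce_partials(SumCombiner, [1, 2, 3]))  # 6
topk = TopKCombiner(k=2)
print(reduce_partials(topk, [{"a": 1}, {"b": 5}, {"a": 2}]))  # {'b': 5, 'a': 3}
```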
Speakers
Gadalia O'Bryan
Senior Solutions Architect, Koverse
Gadalia O'Bryan is a Sr. Solutions Architect at Koverse, where she leads customer projects and contributes to key feature and algorithm design, such as Koverse's Aggregation Framework. Prior to Koverse, Gadalia was a mathematician for the National Security Agency. She has an M.A. in mathematics from UCLA and has been working with Accumulo for the past 6 years.
Bill Slacum
Software Engineer, Koverse
Bill is an Accumulo committer and PMC member who has been working on large scale query and analytic frameworks since 2010. He holds BS's in computer science and financial economics from UMBC. Having never used his passport to leave the United States, he is currently a national man of mystery.
Building and deploying microservices with event sourcing, CQRS and Docker (QC... - Chris Richardson
In this talk we share our experiences developing and deploying a microservices-based application. You will learn about the distributed data management challenges that arise in a microservices architecture. We will describe how we solved them using event sourcing to reliably publish events that drive eventually consistent workflows and update CQRS-based views. You will also learn how we build and deploy the application using a Jenkins-based deployment pipeline that creates Docker images that run on Amazon EC2.
The Fine Art of Time Travelling: implementing Event Sourcing - Andrea Saltarello
If there is a common practice in architecting software systems, it is to have them store the last known state of business entities in a relational database: this practice trades the easiness of implementation with the cost of losing the history of such entities. Event Sourcing provides a pivotal solution to this problem, giving systems the capability of restoring the state they had at any given point in time. Furthermore, injecting mock-up events and having them replayed by the business logic allows for an easy implementation of simulations and “what if” scenarios.
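A minimal sketch of the idea, assuming a toy account aggregate (not the speaker's actual framework): the current state is never stored; it is rebuilt by replaying the event log, and replaying only the events up to a timestamp yields the state the system had at that point in time.

```python
from dataclasses import dataclass

@dataclass
class Event:
    at: int       # event timestamp
    kind: str     # "deposited" | "withdrawn"
    amount: int

@dataclass
class Account:
    balance: int = 0
    def apply(self, e: Event):
        if e.kind == "deposited":
            self.balance += e.amount
        elif e.kind == "withdrawn":
            self.balance -= e.amount

def replay(events, as_of=None):
    # Rebuild state from scratch; stop at `as_of` for time travel.
    acc = Account()
    for e in sorted(events, key=lambda e: e.at):
        if as_of is None or e.at <= as_of:
            acc.apply(e)
    return acc

log = [Event(1, "deposited", 100), Event(5, "withdrawn", 30), Event(9, "deposited", 10)]
print(replay(log).balance)           # 80: full history
print(replay(log, as_of=5).balance)  # 70: state as of time 5
```

Injecting mock-up events into the same `replay` function is all a "what if" simulation needs.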
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... - John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...2023240532
Quantitative data Analysis
Overview
Reliability Analysis (Cronbach Alpha)
Common Method Bias (Harman Single Factor Test)
Frequency Analysis (Demographic)
Descriptive Analysis
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead, Prasad and Procure.FYI's Co-Found
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers will come present about related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to keep rising and supply to evolve, driven by institutional investment rotating out of offices and into work from home (“WFH”), and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
5. TimeSeries Aggregation at Twitter
A common problem
⇢Data products (analytics.twitter.com, embedded analytics, internal dashboards)
⇢Business metrics
⇢Site traffic, service health, and user engagement monitoring
Hard to do at scale
⇢10s of billions of events/day in real time
Hard to maintain aggregation services once deployed
⇢Complex tooling is required
7. TimeSeries Aggregation at Twitter
Many time-series applications look similar
⇢Common types of aggregations
⇢Similar service stacks
Multi-year effort to build general solutions
⇢Summingbird - abstraction library for generalized distributed computation
TSAR - an end-to-end aggregation service built on Summingbird
⇢Abstracts away everything except the application's data model and business logic
15. Example: API aggregates
⇢Bucket each API call
⇢Dimensions - endpoint, datacenter, client application ID
⇢Compute - total event count, unique users, mean response time, etc.
⇢Write the output to Vertica
17. Example: Impressions by Tweet
⇢Bucket each impression by tweet ID
⇢Compute total count, unique users
⇢Write output to a key-value store
⇢Expose output via a high-SLA query service
⇢Write a sample of the data to Vertica for cross-validation
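The bucketing step above can be sketched in a few lines. This is a simplified stand-in, not TSAR's actual pipeline: events are plain (tweet_id, user_id) pairs rather than Thrift structs, and uniques are exact sets rather than the approximation sketches a production system would use.

```python
from collections import defaultdict

def aggregate_impressions(events):
    """Bucket impression events by tweet ID and compute, per bucket,
    the total impression count and the number of unique viewers.

    `events` is an iterable of (tweet_id, user_id) pairs -- a
    simplified stand-in for TSAR's Thrift event schema."""
    totals = defaultdict(int)
    uniques = defaultdict(set)
    for tweet_id, user_id in events:
        totals[tweet_id] += 1
        uniques[tweet_id].add(user_id)
    return {t: (totals[t], len(uniques[t])) for t in totals}

events = [(1, "a"), (1, "b"), (1, "a"), (2, "c")]
print(aggregate_impressions(events))  # {1: (3, 2), 2: (1, 1)}
```

At Twitter's volumes the per-tweet sets would be replaced by a cardinality sketch (e.g. HyperLogLog from Algebird) so that partial results stay small and mergeable.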
19. Problems!
⇢Service interruption: can we retrieve lost data?
⇢Data schema coordination: store output as log data, in a key-value data store, in cache, and in relational databases
⇢Flexible schema change
⇢Easy backfill and update/repair of historical data
Most important: solve these problems in a general way
21. TSAR’s design principles
1) Hybrid computation: build on Summingbird, process each event twice - in real time & in batch (at a later time)
⇢Gives the stability and reproducibility of batch
⇢And the streaming recency of realtime
Leverage the Summingbird ecosystem:
⇢Abstraction framework over computing platforms
⇢Rich library of approximation monoids (Algebird)
⇢Storage abstractions (Storehaus)
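A minimal sketch of why monoidal aggregates suit hybrid batch/realtime computation: if merging is associative and has an identity, the batch layer and the streaming layer can each produce partial aggregates that combine into one answer. Algebird supplies many such monoids (HyperLogLog, Count-Min Sketch, etc.); here a plain per-key counter stands in for them.

```python
def merge(a, b):
    """Monoid 'plus' for per-key counts; the empty dict is the identity."""
    out = dict(a)
    for key, value in b.items():
        out[key] = out.get(key, 0) + value
    return out

# Batch layer: stable, reproducible counts over historical data.
batch_counts = {"/timeline": 1000, "/search": 400}
# Realtime layer: counts over only the most recent events.
realtime_counts = {"/timeline": 12, "/users": 3}

# Serving a query just merges the two partial aggregates.
print(merge(batch_counts, realtime_counts))
# {'/timeline': 1012, '/search': 400, '/users': 3}
```

Because `merge` is associative, it does not matter how events were partitioned across layers, workers, or time windows: any grouping of partial sums merges to the same final result.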
24. TSAR’s design principles
2) Separate event production from event aggregation
The user specifies how to extract events from source data
Bucketing and aggregating events is managed by TSAR
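The division of labor can be illustrated with a toy sketch: the user writes only an event extractor, while bucketing and counting are generic framework code. The names here (`run_aggregation`, `extract_events`) are illustrative, not TSAR's actual API.

```python
from collections import Counter

def run_aggregation(source_records, extract_events, dimensions):
    """Framework side: bucket every extracted event by its dimension
    tuple and count occurrences per bucket."""
    counts = Counter()
    for record in source_records:
        for event in extract_events(record):
            counts[tuple(event[d] for d in dimensions)] += 1
    return counts

# User side: only how to turn a raw log record into events.
def extract_events(record):
    yield {"endpoint": record["path"], "client": record["app_id"]}

logs = [{"path": "/timeline", "app_id": "ios"},
        {"path": "/timeline", "app_id": "ios"},
        {"path": "/search", "app_id": "web"}]
print(run_aggregation(logs, extract_events, ["endpoint", "client"]))
# Counter({('/timeline', 'ios'): 2, ('/search', 'web'): 1})
```

Changing what gets counted, or which dimensions to group by, never touches the aggregation machinery; that is the point of the separation.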
26. TSAR’s design principles
3) Unified data schema:
⇢Data schema specified in a datastore-independent way
⇢Managed schema evolution & data transformation
Store data on:
⇢HDFS
⇢Manhattan (key-value)
⇢Vertica/MySQL
⇢Cache
Easily extensible to other datastores (Cassandra, HBase, etc.)
31. Tweet Impressions in TSAR
⇢Annotate each tweet with an impression count
⇢Count = unique users who saw that tweet
⇢Massive scalability challenge:
• >500MM tweets/day
• tens of billions of impressions
⇢Want realtime updates
⇢Production ready and robust
55. What has been specified?
⇢Our event schema (in Thrift)
⇢How to produce these events
⇢Dimensions to aggregate on
⇢Time granularities to aggregate on
⇢Sinks (Manhattan / MySQL) to use
56. What do you not specify?
⇢How to represent the aggregated data
⇢How to represent the schema in MySQL / Manhattan
⇢How to perform the aggregation
⇢How to locate and connect to underlying services (Hadoop, Storm, MySQL, Manhattan, …)
57. Operational simplicity
End-to-end service infrastructure with a single command:
$ tsar deploy --env=prod
⇢Launch Hadoop jobs
⇢Launch Storm jobs
⇢Launch Thrift query service
⇢Launch loader processes to load data into MySQL / Manhattan
⇢Mesos configs for all of the above
⇢Alerts for the batch & Storm jobs and the query service
⇢Observability for the query service
⇢Auto-create tables and views in MySQL or Vertica
⇢Automatic data regression and data anomaly checks
66. Backfill tooling
But what about historical data?
$ tsar backfill --start=<start> --end=<end>
A backfill runs in parallel with the production job
Useful for repairing historical data as well
68. Aggregating on different granularities
We have been computing only daily aggregates
We now wish to add alltime aggregates
Output(sink = Sink.Manhattan, width = 1 * Day)
Output(sink = Sink.Manhattan, width = Alltime)
New aggregation granularity
Configuration File
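Adding a granularity means each event simply lands in one more bucket. The sketch below, a Python stand-in for the Scala `Output(width = ...)` configuration above, shows how daily and alltime aggregates of the same event stream differ only in the bucket key; the names and constants are illustrative.

```python
from collections import Counter

DAY = 86_400       # seconds in a day
ALLTIME = None     # sentinel for the unbounded granularity

def bucket(timestamp, width):
    """Map a timestamp to its bucket start for the given width."""
    return "alltime" if width is ALLTIME else (timestamp // width) * width

def aggregate(events, widths):
    """events: iterable of (timestamp, key); one count per (width, bucket, key)."""
    counts = Counter()
    for timestamp, key in events:
        for width in widths:
            counts[(width, bucket(timestamp, width), key)] += 1
    return counts

events = [(0, "t1"), (100, "t1"), (DAY + 5, "t1")]
agg = aggregate(events, [DAY, ALLTIME])
print(agg[(DAY, 0, "t1")])              # events in the first day: 2
print(agg[(ALLTIME, "alltime", "t1")])  # events ever: 3
```

Because counts form a monoid, the alltime aggregate can also be maintained incrementally by folding each new day's bucket into it, rather than rescanning history.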
77. Support for multiple sinks
So far, only persisting data to Manhattan
Persist data to MySQL as well
Output(sink = Sink.Manhattan, width = 1 * Day)
Output(sink = Sink.Manhattan, width = Alltime)
Output(sink = Sink.MySQL, width = Alltime)
New sink
Configuration File
90. Value Packing - Key Indexer
Some keys are always queried together
91. Value Packing - Key Indexer
Naive approach:
(CampaignIdField, CountryField) --> Impressions
Would have to query:
(CampaignIdField, USA) --> Impressions_USA
(CampaignIdField, UK) --> Impressions_UK
…
Better approach:
(CampaignIdField) --> Map[CountryField -> Impressions]
Would have to query:
(CampaignIdField) --> Map[USA -> Impressions_USA, UK -> Impressions_UK, …]
⇢Limits key fanout
⇢Implicit index on CountryField - don’t need to know which countries to query for
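The packed layout can be sketched directly: instead of one store key per (campaign, country) pair, keep one key per campaign whose value is a country-to-impressions map. One lookup then returns every country, and the map's keys act as the implicit index over CountryField. This is a toy illustration, not TSAR's storage code.

```python
from collections import defaultdict

def pack(rows):
    """rows: iterable of (campaign_id, country, impressions).
    Returns campaign_id -> {country: impressions}, i.e. one store
    key per campaign instead of one per (campaign, country)."""
    packed = defaultdict(dict)
    for campaign_id, country, impressions in rows:
        packed[campaign_id][country] = impressions
    return packed

rows = [(42, "USA", 1000), (42, "UK", 300), (7, "USA", 50)]
store = pack(rows)
# One lookup returns all countries for campaign 42 -- no need to
# enumerate country codes in the query.
print(store[42])  # {'USA': 1000, 'UK': 300}
```

The trade-off is value size: packing works when the inner dimension (countries per campaign) is small and bounded, since the whole map is read and written as one value.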
95. Conclusion: Three basic problems
⇢Computation management
Describe and execute computational logic
Specify aggregation dimensions, metrics and time granularities
⇢Dataset management
Define, deploy and evolve data schemas
Coordinate data migration, backfill and recovery
⇢Service management
Define query services, observability, alerting, regression checks; coordinate deployment across all underlying services
TSAR gives you all of the above