This document discusses analytics for the real-time web. It describes how real-time web, enabled by mobile devices and social sharing, requires real-time incremental processing instead of batch processing. The author's system, Triggy, is presented as a solution. Triggy extends the Cassandra distributed key-value store with push-style processing and synchronization to incrementally update aggregate results. Experiments show Triggy can handle high throughput workloads like tweet counting. Similar systems like Yahoo! S4 and Google Percolator are also discussed. Potential applications mentioned include social media optimization, real-time news recommendations, advertising, and game analytics.
Glynn Bird - Building the "microservices way" involves breaking monolithic IT systems into small, decoupled services that each do one job well. This talk builds a practical microservices architecture live, using small Node.js apps that perform storage, analytics and visualisation tasks. Learn how to orchestrate your own microservice architecture using simple, easily-tested building blocks.
Using Druid for interactive count distinct queries at scale - Itai Yaffe
At NMC (Nielsen Marketing Cloud) we need to present to our clients the number of unique users who meet a given criterion. The condition is typically a set-theoretic expression over a stream of events for a given time range. Historically, we used Elasticsearch to answer these types of questions; however, we encountered major scaling issues. In this presentation we will detail the journey of researching, benchmarking and productionizing a new technology, Druid, with DataSketches, to overcome the limitations we were facing.
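The core trick behind sketch-based distinct counting is to keep a small, fixed-size summary of the stream instead of the full set of user IDs. DataSketches itself provides Theta and HLL sketches; as a rough, illustrative stand-in (not the library's actual API), here is a minimal k-minimum-values (KMV) sketch in Python that supports estimation and set union:

```python
import bisect
import hashlib

def stable_hash(value):
    """Map a value to a uniform float in (0, 1] using a stable hash."""
    digest = hashlib.md5(str(value).encode()).digest()
    return (int.from_bytes(digest, "big") + 1) / 2.0**128

class KMVSketch:
    """K-minimum-values sketch: track the k smallest hash values seen.

    If hashes are uniform in (0, 1], the k-th smallest is about k/n for
    n distinct items, so n can be estimated as (k - 1) / kth_smallest.
    """

    def __init__(self, k=1024):
        self.k = k
        self.mins = []                  # sorted, at most k smallest hashes

    def add(self, value):
        x = stable_hash(value)
        if x in self.mins:              # duplicate item: no effect
            return
        if len(self.mins) < self.k:
            bisect.insort(self.mins, x)
        elif x < self.mins[-1]:
            self.mins.pop()
            bisect.insort(self.mins, x)

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))   # fewer than k distinct: exact
        return (self.k - 1) / self.mins[-1]

    def union(self, other):
        """Set union: keep the k smallest hashes across both sketches."""
        out = KMVSketch(self.k)
        out.mins = sorted(set(self.mins) | set(other.mins))[: self.k]
        return out
```

Because the summary is a fixed-size, mergeable state, sketches for different event segments can be combined at query time, which is what makes set-theoretic expressions over large streams tractable. The relative error shrinks as roughly 1/sqrt(k).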
Improving Organizational Knowledge with Natural Language Processing Enriched ... - DataWorks Summit
This document summarizes a presentation on improving organizational knowledge with natural language processing and enriched data pipelines. The system discussed ingests unstructured text data from various sources using Apache Kafka and Apache NiFi/MiNiFi. The data is then processed by Apache OpenNLP microservices to extract entities, sentences, tokens and perform sentiment analysis. The extracted structured data is stored in a database and Elasticsearch for visualization in Apache Superset dashboards. The system is designed to be scalable, extensible and repeatable using infrastructure as code deployed on Amazon Web Services.
This document discusses big data and AWS tools for managing it. It defines big data as data with high volume, velocity and variety. AWS provides scalable tools like EC2, EMR, Kinesis and Redshift to handle the ingestion, storage, processing and analysis of large and diverse datasets in the cloud. These tools work together in an integrated environment and auto-scale based on demand, providing a cost-effective solution for big data challenges. An example use case of real-time IoT analytics is presented to illustrate how different AWS products interact to build scalable data pipelines.
HBaseCon 2012 | Unique Sets on HBase and Hadoop - Elliot Clark, StumbleUpon (Cloudera, Inc.)
Determining the number of unique users that have interacted with a web page, game, or application is a very common use case. HBase is becoming an increasingly accepted tool for calculating sets or counts of unique individuals who meet some criteria. Computing these statistics can range in difficulty from very simple to very difficult. This session will explore how different approaches have worked or not worked at scale for counting uniques on HBase with Hadoop.
Infinitely Scalable Clusters - Grid Computing on Public Cloud - New York - Hentsū
Hentsū helps hedge funds and asset managers run research clusters, big data and high performance computing solutions using the public cloud.
This workshop was hosted in our NY offices and covered an introduction to grid computing using the public cloud. We also specifically looked at Google's BigQuery for running analytics across terabytes of depth market data, with some live demos.
MongoDB World 2016: Scaling Targeted Notifications in the Music Streaming Wor... - MongoDB
This document summarizes key information about Saavn, India's largest music streaming service. Some key points:
- Saavn has 18 million global monthly active users, with 14 million in India. The majority (64%) use Android devices to access over 25 million tracks.
- Push notifications are a primary driver of mobile app growth for Saavn. They send over 30 million notifications per day and see 3x more engagement from targeted notifications.
- Saavn stores user notification messages and activity data in MongoDB. They upgraded to the WiredTiger storage engine for its document-level locking and high performance. Maintaining over 500 GB of user data required implementing sharding and migrating the data.
- Tools like
Journey to the Real-Time Analytics in Extreme Growth - SingleStore
The document summarizes AppsFlyer's journey to implement a real-time analytics solution to handle their extreme growth and increasing data volumes. They had been using TokuDB, but it was failing weekly and could not scale. They tried Druid, but it did not meet their requirements. They then implemented MemSQL, an in-memory database, which provided faster query latency, recoverability, and the ability to scale to handle 30x more data while reducing costs. Their current architecture uses Kafka to ingest data, MemSQL clusters for real-time queries, and a daily batch process into a columnstore for history.
Amazon Redshift is a cloud-hosted data warehouse service from AWS that allows for petabyte-scale analytics on large datasets using massive parallel processing. It stores data in a column-oriented format and integrates with other AWS services like S3, DynamoDB, and EMR. Redshift provides features like columnar storage, parallel query processing across multiple nodes, automated backups and restores, encryption, and integration with SQL and BI tools. The document demonstrates using Redshift alongside S3, Pipeline, EC2/MySQL, and Qlik Sense to build a scalable data warehouse solution in the cloud.
Real Time Data Infrastructure team overview - Monal Daxini
Netflix is hiring for a Senior Software Engineer role to work on their Real Time Data Infrastructure project which processes over 1 trillion events per day. The role involves helping to build out their greenfield Stream Processing as a Service platform called Keystone which will offer reusable components and schema support to process streaming data at massive scale for Netflix.
This document summarizes Anahit Pogosova's presentation on serverless data streaming at scale. It discusses using AWS services like Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics to collect, store, and analyze large amounts of streaming data from Yle, Finland's national public broadcasting company. It also outlines some gotchas and lessons learned, such as understanding service limits and monitoring metrics like iterator age and throttling. The presentation provides an overview of serverless data architectures and best practices for streaming data at massive scales.
This document summarizes 5 papers related to big data architecture and deep learning. Paper 1 discusses the Lambda architecture for balancing real-time and batch data processing. Paper 2 introduces Delta Lake for efficient ACID-compliant storage over object stores. Paper 3 proposes the Lakehouse architecture which unifies data warehousing and analytics using Delta Lake. Paper 4 presents the Conformer model that combines transformers and convolutions for speech recognition. The last paper applies intent detection and slot filling to Vietnamese text using BERT. These papers are relevant to the author's graduation thesis on traffic prediction using speech data analysis.
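The Lambda architecture mentioned in the first paper can be reduced to a very small read-path sketch: a batch layer periodically recomputes a complete view from the full event log, a speed layer keeps incremental state for events that arrived since the last batch run, and the serving layer merges both at query time. The names below are illustrative, not from any of the papers:

```python
from collections import Counter

batch_view = Counter()   # rebuilt from the full event log, e.g. nightly
speed_view = Counter()   # updated per event, reset after each batch run

def recompute_batch(event_log):
    """Batch layer: full, accurate recomputation over all events."""
    global batch_view, speed_view
    batch_view = Counter(e["key"] for e in event_log)
    speed_view = Counter()   # real-time view resets once batch catches up

def on_event(event):
    """Speed layer: incremental, low-latency update for one event."""
    speed_view[event["key"]] += 1

def query(key):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view[key] + speed_view[key]
```

The design trade-off is that correctness comes from the periodically recomputed batch view, while freshness comes from the cheap incremental deltas layered on top.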
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015 - NoSQLmatters
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230 TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases, and how it really performs.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily from various sources into Druid. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times with many dimensions, some stored as lists, and data cleanup steps to reduce cardinality, such as replacing values. Segment sizing and partitioning are also discussed, along with the ingestion, querying, and optimization work needed to scale Druid for Fyber's analytics needs.
1) The presentation discusses Druid, an open source analytics engine that can perform aggregations on memory-mapped data in sub-second time.
2) It describes how Druid fits into their software stack at the API layer and how they extend its capabilities through a SQL interface, addressing limitations such as restricted querying and missing features like distinct counts.
3) Examples of SQL queries against Druid demonstrate its capabilities, including group by, filtering, joins, and handling of time-series data.
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real World - WSO2
Abundant data is all around. What matters most is how you as an organization can access the data, process it, and present information to the relevant stakeholders on time. To gain competitive advantage, the means of accessing, processing and presenting the data should be optimal, highly available and scalable.
In this talk, we will discuss how you can leverage WSO2 Data Analytics Server, WSO2 IoT Server, WSO2 Enterprise Service Bus and other WSO2 products in order to analyze the data. We will also discuss different deployment patterns that can provide you with a suitable solution that lets you analyze relevant data historically, in real-time or interactively and predict future states to make better decisions for your organization’s success.
Taboola's data processing architecture has evolved over time from directly writing to databases to using Apache Spark for scalable real-time processing. Spark allows Taboola to process terabytes of data daily across multiple data centers for real-time recommendations, analytics, and algorithm calibration. Key aspects of Taboola's architecture include using Cassandra for event storage, Spark for distributed computing, Mesos for cluster management, and Zookeeper for coordination across a large Spark cluster.
Proofpoint: Fraud Detection and Security on Social Media - DataStax Academy
Social media has become the new frontier for cyber-attackers. The explosive growth of this new communications platform, combined with the potential to reach millions of people through a single post, has provided a low barrier for exploitation. In this talk, we will focus on how Cassandra is used to enable our fight against bad actors on social media. In particular, we will discuss how we use Cassandra for anomaly detection, social mob alerting, trending topics, and fraudulent classification. We will also speak about our Cassandra data models, integration with Spark Streaming, and how we use KairosDB for our time series data. Watch us don our superhero-Cassandra capes as we fight against the bad guys!
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics... - Yann Cluchey
My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers.
Cogenta uses SQL as its canonical store and for complex reporting, and Elasticsearch for real-time processing and to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene and enables the data and platform cost to scale linearly. But… synchronising your existing data in two places presents some interesting challenges, such as aggregation and concurrency control. This talk will take a detailed look at how Cogenta overcame those challenges with a perpetually changing and asynchronously updated dataset.
http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch
Kafka Summit NYC 2017 - Simplifying Omni-Channel Retail at Scale - Confluent
This document summarizes Aaron Strey's presentation on how Target uses Apache Kafka to support omni-channel retail operations at scale. Some key points:
Target uses Kafka for log aggregation, threat detection, clickstream analysis, and business event messaging. They have over 1,800 stores and 38 distribution centers in the US, serving over 26 million online visitors per month.
Target's large Kafka deployment includes up to 300 topics per cluster, with 10-20 thousand consumer requests per second, and compaction is widely used. They aim for exactly-once semantics across a diverse set of clients.
Strey suggests reinventing log aggregation to allow querying log streams directly from Kafka as easily as current methods using Elastic, to avoid indexing ter
Organizational success depends on our ability to sense the environment, seize opportunities, and eliminate threats in real time. Such real-time processing is now available to all organizations (with or without a big data background) through the new WSO2 Stream Processor.
These slides present WSO2 Stream Processor's new features and improvements and explain how they help an organization excel in the current competitive marketplace. Some key features we will consider are:
* WSO2 Stream Processor’s highly productive developer environment, with graphical drag-and-drop, and the Streaming SQL query editor
* The ability to process real-time queries that span from seconds to years
* Its interactive visualization and dashboarding features with improved widget generation
* Its ability to process at scale via distributed deployments with full observability
* Default support for HTTP analytics, distributed message trace analytics, and Twitter analytics
The document discusses the Fermilab HEPCloud facility, which provides computing resources for high energy physics experiments. HEPCloud integrates commercial cloud resources from Amazon Web Services (AWS) with Fermilab's physically owned resources to provide elastic computing capacity. This allows experiments to burst to peak usage levels when needed. Several challenges are discussed around optimizing performance, provisioning, storage, networking, and monitoring when running scientific workflows on AWS. Examples of experiments using HEPCloud include NOvA processing datasets, searches for gravitational wave counterparts by the Dark Energy Survey, and CMS Monte Carlo simulations. HEPCloud aims to provide resources efficiently whether demand is high or low.
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix - HostedbyConfluent
The document discusses using Apache Kafka to improve data upload availability from 99.9% to 99.99% when moving data between on-premises and cloud storage. It describes using Kafka to trigger uploads to the cloud from on-premises storage at 99.9% availability, then splitting uploads between cloud and on-premises storage and rehydrating failed on-premises uploads from the cloud to reach 99.99%. The presentation concludes that Kafka provides the high throughput and persistence needed to design effective data rehydration strategies across cloud and on-premises storage for very high availability.
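The rehydration idea above can be sketched in a few lines. This is an illustration of the pattern only, with hypothetical names and a plain queue standing in for a Kafka topic, not Netflix's actual implementation: every object gets a durable cloud copy, on-premises write failures are recorded as messages, and a background consumer later repairs the on-premises store from the cloud.

```python
from collections import deque

cloud_store, onprem_store = {}, {}
rehydration_queue = deque()   # stands in for a Kafka topic of failed uploads

def upload(key, data, onprem_ok=True):
    """Write the durable cloud copy; treat the on-prem copy as best-effort."""
    cloud_store[key] = data                 # always succeeds in this sketch
    if onprem_ok:
        onprem_store[key] = data
    else:
        rehydration_queue.append(key)       # record the miss for later repair

def rehydrate():
    """Background consumer: replay failed on-prem writes from the cloud."""
    while rehydration_queue:
        key = rehydration_queue.popleft()
        onprem_store[key] = cloud_store[key]
```

The availability gain comes from decoupling the two writes: the user-facing upload only depends on the more reliable path, while the durable, replayable log guarantees the slower store eventually converges.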
An introduction to cloud computing with Amazon Web Services and MongoDB - Samuel Demharter
This document provides an introduction to cloud computing using Amazon Web Services (AWS) and MongoDB. It defines cloud computing and describes the various service models, including Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). It outlines some of the key AWS computing, storage, database, and other services, such as EC2, S3, DynamoDB, and ElastiCache. It also introduces MongoDB as a scalable, document-oriented NoSQL database and compares some of its features to SQL databases. Finally, it provides two examples of using AWS and MongoDB for DNA sequencing and genome analysis.
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap... - DataStax Academy
At Hulu, we deal with scaling our web services to meet the demands of an ever-growing number of users. During this talk, we will discuss our initial use case for Cassandra at Hulu: the video progress tracking service known as hugetop. While Cassandra provides a fantastic platform on which to build scalable applications, there are some dark corners of which to be cautious. We will provide a walkthrough of hugetop and some of the design decisions that went into the hugetop keyspace, our hardware choices, and our experiences operating Cassandra in a high-traffic environment.
Triggy is a system that provides real-time analytics capabilities. It is based on Cassandra, a distributed key-value store, and extends it to support push-style incremental processing of streaming data using a modified MapReduce programming model. Triggy scales computation and data storage together across nodes and can handle high volumes of streaming data with low latency for applications like real-time advertising and social media analytics. Similar systems, such as Yahoo!'s S4 and Google's Percolator, also target real-time analytics but use different approaches and may not combine incremental processing and scaling in the same way.
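Push-style incremental MapReduce, as described above, can be reduced to a small sketch: instead of re-running a batch job over the whole dataset, each write to the store fires a trigger that runs the mapper on just that event and folds the emitted pairs into the aggregate immediately. The names here are illustrative, not Triggy's actual API:

```python
from collections import defaultdict

aggregates = defaultdict(int)   # the incrementally maintained result table

def mapper(event):
    """E.g. tweet counting: emit one (hashtag, 1) pair per tag."""
    for tag in event.get("hashtags", []):
        yield tag, 1

def reducer(current, value):
    """Fold one emitted value into the running aggregate."""
    return current + value

def on_write(event):
    """Trigger fired by the store on every insert (push-style)."""
    for key, value in mapper(event):
        aggregates[key] = reducer(aggregates[key], value)
```

The key constraint is that the reduce function must be an incremental fold over the existing aggregate, so each event is touched once and the result is always up to date without any batch recomputation.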
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe... - RightScale
Your database is the foundation of your application. With cloud comes new advantages and considerations for architecting and deployment. Find out how RightScale uses SQL and NoSQL databases such as MySQL, MongoDB, and Cassandra to provide a scalable, distributed, and highly available service around the globe.
Amazon Redshift is a cloud-hosted data warehouse service from AWS that allows for petabyte-scale analytics on large datasets using massive parallel processing. It stores data in a column-oriented format and integrates with other AWS services like S3, DynamoDB, and EMR. Redshift provides features like columnar storage, parallel query processing across multiple nodes, automated backups and restores, encryption, and integration with SQL and BI tools. The document demonstrates using Redshift alongside S3, Pipeline, EC2/MySQL, and Qlik Sense to build a scalable data warehouse solution in the cloud.
Real Time Data Infrastructure team overviewMonal Daxini
Netflix is hiring for a Senior Software Engineer role to work on their Real Time Data Infrastructure project which processes over 1 trillion events per day. The role involves helping to build out their greenfield Stream Processing as a Service platform called Keystone which will offer reusable components and schema support to process streaming data at massive scale for Netflix.
This document summarizes Anahit Pogosova's presentation on serverless data streaming at scale. It discusses using AWS services like Kinesis Data Streams, Kinesis Firehose, and Kinesis Data Analytics to collect, store, and analyze large amounts of streaming data from Yle, Finland's national public broadcasting company. It also outlines some gotchas and lessons learned, such as understanding service limits and monitoring metrics like iterator age and throttling. The presentation provides an overview of serverless data architectures and best practices for streaming data at massive scales.
This document summarizes 5 papers related to big data architecture and deep learning. Paper 1 discusses the Lambda architecture for balancing real-time and batch data processing. Paper 2 introduces Delta Lake for efficient ACID-compliant storage over object stores. Paper 3 proposes the Lakehouse architecture which unifies data warehousing and analytics using Delta Lake. Paper 4 presents the Conformer model that combines transformers and convolutions for speech recognition. The last paper applies intent detection and slot filling to Vietnamese text using BERT. These papers are relevant to the author's graduation thesis on traffic prediction using speech data analysis.
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015NoSQLmatters
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day ? With over 2 billion ads served and 230Tb of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, ""an open-source data store designed for real-time exploratory analytics on large data sets"". We will explore Druid's architecture and noticeable concepts, how relevant they are for some use cases and how it really performs.
This document discusses Druid in production at Fyber, a company that indexes 5 terabytes of data daily from various sources into Druid. It describes the hardware used, including 30 historical nodes and 2 broker nodes. Issues addressed include slow query times with many dimensions, some as lists, and data cleanup steps to reduce cardinality like replacing values. Segment sizing and partitioning are also discussed. Hardware, data ingestion, querying, and optimizations used to scale Druid for Fyber's analytics needs are covered in under 3 sentences.
1) The presentation discusses Druid, an open source analytics engine that can perform aggregations on memory mapped data in sub-second time.
2) It describes how Druid fits into their software stack at the API layer and how they extend its capabilities through a SQL interface and addressing limitations like limited querying and missing features like distinct counts.
3) Examples of SQL queries against Druid are shown to demonstrate its capabilities like group by, filtering, joins, and handling of timeseries data.
WSO2Con ASIA 2016: Patterns for Deploying Analytics in the Real WorldWSO2
Abundant data is all around. The most important aspect is how you as an organization can access the data, process it, and present information to the relevant authorities on time. To gain competitive advantage the means of accessing, processing and presenting the data should be optimal, highly available and scalable.
In this talk, we will discuss how you can leverage WSO2 Data Analytics Server, WSO2 IoT Server, WSO2 Enterprise Service Bus and other WSO2 products in order to analyze the data. We will also discuss different deployment patterns that can provide you with a suitable solution that lets you analyze relevant data historically, in real-time or interactively and predict future states to make better decisions for your organization’s success.
Taboola's data processing architecture has evolved over time from directly writing to databases to using Apache Spark for scalable real-time processing. Spark allows Taboola to process terabytes of data daily across multiple data centers for real-time recommendations, analytics, and algorithm calibration. Key aspects of Taboola's architecture include using Cassandra for event storage, Spark for distributed computing, Mesos for cluster management, and Zookeeper for coordination across a large Spark cluster.
Proofpoint: Fraud Detection and Security on Social MediaDataStax Academy
Social media has become the new frontier for cyber-attackers. The explosive growth of this new communications platform, combined with the potential to reach millions of people through a single post, has provided a low barrier for exploitation. In this talk, we will focus on how Cassandra is used to enable our fight against bad actors on social media. In particular, we will discuss how we use Cassandra for anomaly detection, social mob alerting, trending topics, and fraudulent classification. We will also speak about our Cassandra data models, integration with Spark Streaming, and how we use KairosDB for our time series data. Watch us don our superhero-Cassandra capes as we fight against the bad guys!
GOTO Aarhus 2014: Making Enterprise Data Available in Real Time with elastics...Yann Cluchey
My talk from GOTO Aarhus, 30th September 2014. Cogenta is a retail intelligence company which tracks ecommerce web sites around the world to provide competitive monitoring and analysis services to retailers. Using its proprietary crawler technology, Lucene and SQL Server, a stream of 20 million raw product data entries is captured and processed each day. This case study looks at how Cogenta uses Elasticsearch to break the shackles imposed by the RDBMS (and a limited budget) to make the data available in real time to its customers.
Cogenta uses SQL as its canonical store & for complex reporting, and Elasticsearch for real-time processing & to drive its SaaS web applications. Elasticsearch is easy to use, delivers the powerful features of Lucene and enables the data & platform cost to scale linearly. But… synchronising your existing data in two places presents some interesting challenges such as aggregation and concurrency control. This talk will take a detailed look at how Cogenta how overcame those challenges, with a perpetually changing and asynchronously updated dataset.
http://gotocon.com/aarhus-2014/presentation/Cogenta%20-%20Making%20Enterprise%20Data%20Available%20in%20Real%20Time%20with%20Elasticsearch
Kafka Summit NYC 2017 - Simplifying Omni-Channel Retail at Scaleconfluent
This document summarizes Aaron Strey's presentation on how Target uses Apache Kafka to support omni-channel retail operations at scale. Some key points:
Target uses Kafka for log aggregation, threat detection, clickstream analysis, and business event messaging. They have over 1,800 stores and 38 distribution centers in the US, serving over 26 million online visitors per month.
Target's large Kafka deployment includes up to 300 topics per cluster, with 10-20 thousand consumer requests per second and compaction widely used. They aim for exactly once semantics across a diverse set of clients.
Strey suggests reinventing log aggregation to allow querying log streams directly from Kafka as easily as current methods using Elastic, to avoid indexing ter
Organizational success depends on our ability to sense the environment, grab opportunities and eliminate threats that are present in real-time. Such real-time processing is now available to all organizations (with or without a big data background) through the new WSO2 Stream Processor.
These slides present WSO2 Stream Processor’s new features and improvements and explain how they make an organization excel in the current competitive marketplace. Some key features we will consider are:
* WSO2 Stream Processor’s highly productive developer environment, with graphical drag-and-drop, and the Streaming SQL query editor
* The ability to process real-time queries that span from seconds to years
* Its interactive visualization and dashboarding features with improved widget generation
* Its ability to process at scale via distributed deployments with full observability
* Default support for HTTP analytics, distributed message trace analytics, and Twitter analytics
The document discusses the Fermilab HEPCloud facility, which provides computing resources for high energy physics experiments. HEPCloud integrates commercial cloud resources from Amazon Web Services (AWS) with Fermilab's physically owned resources to provide elastic computing capacity. This allows experiments to burst to peak usage levels when needed. Several challenges are discussed around optimizing performance, provisioning, storage, networking, and monitoring when running scientific workflows on AWS. Examples of experiments using HEPCloud include NOvA processing datasets, searches for gravitational wave counterparts by the Dark Energy Survey, and CMS Monte Carlo simulations. HEPCloud aims to provide resources efficiently whether demand is high or low.
Going from three nines to four nines using Kafka | Tejas Chopra, Netflix (HostedbyConfluent)
The document discusses using Apache Kafka to improve data upload availability from 99.9% to 99.99% when moving data between on-premise and cloud storage. It describes using Kafka to trigger uploads to the cloud from on-premise storage with 99.9% availability and using Kafka to split uploads between cloud and on-premise storage as well as rehydrating failed on-premise uploads from the cloud to achieve 99.99% availability. The presentation concludes that Kafka provides high throughput and persistence needed to design effective data rehydration strategies across cloud and on-premise storage for very high availability.
An introduction to cloud computing with Amazon Web Services and MongoDB (Samuel Demharter)
This document provides an introduction to cloud computing using Amazon Web Services (AWS) and MongoDB. It defines cloud computing and describes the various service models including Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). It outlines some of the key AWS computing, storage, database, and other services like EC2, S3, DynamoDB, and ElastiCache. It also introduces MongoDB as a scalable and natural document-oriented NoSQL database and compares some of its features to SQL databases. Finally, it provides two examples of using AWS and MongoDB for DNA sequencing and genome analysis.
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap... (DataStax Academy)
At Hulu, we deal with scaling our web services to meet the demands of an ever growing number of users. During this talk, we will discuss our initial use case for Cassandra at Hulu: the video progress tracking service known as hugetop. While Cassandra provides a fantastic platform on which to build scalable applications, there are some dark corners of which to be cautious. We will provide a walkthrough of hugetop and some design decisions that went into the hugetop keyspace, our hardware choices, and our experiences operating Cassandra in a high-traffic environment.
Triggy is a system that provides real-time analytics capabilities. It is based on Cassandra, a distributed key-value store, and extends it to support push-style incremental processing of streaming data using a modified MapReduce programming model. Triggy scales computation and data storage together across nodes and can handle high volumes of streaming data with low latency for applications like real-time advertising and social media analytics. Other similar systems like Yahoo!’s S4 and Google’s Percolator also aim to enable real-time analytics but use different approaches and may not support real-time processing or incremental scaling in the same way.
[RightScale Webinar] Architecting Databases in the cloud: How RightScale Doe... (RightScale)
Your database is the foundation of your application. With cloud comes new advantages and considerations for architecting and deployment. Find out how RightScale uses SQL and NoSQL databases such as MySQL, MongoDB, and Cassandra to provide a scalable, distributed, and highly available service around the globe.
Big data is generated from a variety of sources at a massive scale and high velocity. Hadoop is an open source framework that allows processing and analyzing large datasets across clusters of commodity hardware. It uses a distributed file system called HDFS that stores multiple replicas of data blocks across nodes for reliability. Hadoop also uses a MapReduce processing model where mappers process data in parallel across nodes before reducers consolidate the outputs into final results. An example demonstrates how Hadoop would count word frequencies in a large text file by mapping word counts across nodes before reducing the results.
An overview of modern scalable web development (Tung Nguyen)
The document provides an overview of modern scalable web development trends. It discusses the motivation to build systems that can handle large amounts of data quickly and reliably. It then summarizes the evolution of software architectures from monolithic to microservices. Specific techniques covered include reactive system design, big data analytics using Hadoop and MapReduce, machine learning workflows, and cloud computing services. The document concludes with an overview of the technologies used in the Septeni tech stack.
This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
By 2020, 50% of all new software will process machine-generated data of some sort (Gartner). Historically, machine data use cases have required non-SQL data stores like Splunk, Elasticsearch, or InfluxDB.
Today, new SQL DB architectures rival the non-SQL solutions in ease of use, scalability, cost, and performance. Please join this webinar for a detailed comparison of machine data management approaches.
Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence... (Perficient, Inc.)
This document discusses big data tools and trends that enable real-time business intelligence from machine logs. It provides an overview of Perficient, a leading IT consulting firm, and introduces the speakers Eric Roch and Ben Hahn. It then covers topics like what constitutes big data, how machine data is a source of big data, and how tools like Hadoop, Storm, Elasticsearch can be used to extract insights from machine data in real-time through open source solutions and functional programming approaches like MapReduce. It also demonstrates a sample data analytics workflow using these tools.
Social Media, Cloud Computing, Machine Learning, Open Source, and Big Data An... (Open Analytics)
This document discusses using social media, cloud computing, machine learning, open source, and big data analytics to analyze Twitter data. It describes how to collect tweets using the Twitter API, classify tweets in real-time using machine learning models on AWS, store classified tweets in MongoDB on AWS, and present results. Cost estimates for real-time classification of 1 million tweets per day are provided. Use cases described include tracking food poisoning reports and disease occurrence. Future directions discussed include developing turnkey services and linking to additional open data sources.
Designing your SaaS Database for Scale with Postgres (Ozgun Erdogan)
If you’re building a SaaS application, you probably already have the notion of tenancy built into your data model. Typically, most information relates to tenants / customers / accounts and your database tables capture this natural relation.
With smaller amounts of data, it’s easy to throw more hardware at the problem and scale up your database. As these tables grow however, you need to think about ways to scale your multi-tenant (B2B) database across dozens or hundreds of machines.
In this talk, we're first going to talk about motivations behind scaling your SaaS (multi-tenant) database and several heuristics we found helpful on deciding when to scale. We'll then describe three design patterns that are common in scaling SaaS databases: (1) Create one database per tenant, (2) Create one schema per tenant, and (3) Have all tenants share the same table(s). Next, we'll highlight the tradeoffs involved with each design pattern and focus on one pattern that scales to hundreds of thousands of tenants. We'll also share an example architecture from the industry that describes this pattern in more detail.
Last, we'll talk about key PostgreSQL properties, such as semi-structured data types, that make building multi-tenant applications easy. We'll also mention Citus as a method to scale out your multi-tenant database. We'll conclude by answering frequently asked questions on multi-tenant databases and Q&A.
Mongo db 2.4 time series data - Brignoli (Codemotion)
Time series data is event data that is recorded and analyzed over time. This document discusses considerations for modeling time series data in MongoDB such as resolution, retention policies, and schema design. It provides examples of different schema designs including embedding data at different granularities like per document, per minute, or per hour. It also discusses use cases for time series data like operational intelligence and monitoring. Overall, the document outlines best practices for modeling, aggregating, analyzing, and scaling time series data in MongoDB.
DynamoDB is a key-value database that achieves high availability and scalability through several techniques:
1. It uses consistent hashing to partition and replicate data across multiple storage nodes, allowing incremental scalability.
2. It employs vector clocks to maintain consistency among replicas during writes, decoupling version size from update rates.
3. For handling temporary failures, it uses sloppy quorum and hinted handoff to provide high availability and durability guarantees when some replicas are unavailable.
The document summarizes the key components of the big data stack, from the presentation layer where users interact, through various processing and storage layers, down to the physical infrastructure of data centers. It provides examples like Facebook's petabyte-scale data warehouse and Google's globally distributed database Spanner. The stack aims to enable the processing and analysis of massive datasets across clusters of servers and data centers.
This document discusses a study on using a separation-based approach for analyzing big data from various sources. It proposes a system with four main components: a main switch to separate relevant data types, a key generator to encrypt data, a subnet switch master to assign work to map workers, and map workers that process assigned data. This bottom-up approach aims to more easily analyze large amounts of data by starting with smaller subsets. The document also covers challenges of big data, existing grid computing systems, Hadoop tools, and concludes the proposed framework would improve upon previous methods for big data analysis.
In 2013:
- 1.4 Trillion digital interactions happen per month.
- 2.9 million emails are sent every second.
- 72.9 products are ordered on Amazon per second.
That is a lot of connected data, graphs are truly everywhere. Companies are finding that graph database technology is helping them make sense of their big data.
Objectivity’s Nick Quinn, Chief Architect of InfiniteGraph, shows us just how popular graph databases have become and where they are being used, as well as showing us the ins and outs.
Do you want to build technology that does great things with big data? You might want to find out what your colleagues are Tweeting about, make recommendations for apps, music or other retail that result in higher purchase rates, discover hidden connections between new and recorded medical research data, or maybe even leverage intel across government agencies to catch the bad guys.
All this is possible with a graph database.
1. In Memory Grids break problems into parts that can be solved using multiple resources on a network, using main memory instead of disk for faster file I/O.
2. In Memory Compute Grids allow computation tasks to be split and executed in parallel across grid nodes, while In Memory Data Grids provide applications with the ability to keep frequently accessed data in memory across multiple JVMs for high availability and low latency access.
3. Reference architectures show how In Memory Grids distribute data, computation tasks, and resources across a cluster for real-time processing of large datasets.
Scalable Similarity-Based Neighborhood Methods with MapReduce (sscdotopen)
This document summarizes a research paper that proposes using MapReduce to scale up similarity-based neighborhood recommendation methods. The authors rephrase these algorithms to be efficiently parallelized across large datasets. They express common similarity measures like Jaccard coefficient in terms of canonical functions that can be embedded in their MapReduce approach. Experiments on a Yahoo! Music dataset with over 700 million ratings showed their method provided linear speedup and scalability with increasing data and cluster size.
Relational cloud, A Database-as-a-Service for the Cloud (Hossein Riasati)
Relational Cloud is a database-as-a-service that runs relational databases in the cloud. It aims to efficiently handle multi-tenancy, provide elastic scalability, and ensure database privacy. Key challenges include efficiently sharing machines among multiple databases, scaling databases across multiple nodes elastically based on workload, and providing privacy through encryption methods like CryptDB that allow useful query processing on encrypted data. Experiments show Relational Cloud can efficiently consolidate workloads from multiple servers onto fewer physical machines with low overhead, scale databases across many nodes to handle increasing load, and incur only a 22.5% performance reduction using CryptDB encryption.
Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases (maria.grineva)
This document discusses using web-based user-generated knowledge bases like Wikipedia and Twitter to perform semantic data search and analysis. It describes developing a technology that extracts semantic information from these sources and applies it to organize news, blogs, and enterprise documents. The goal is to build a scalable open source framework that performs tasks like word sense disambiguation, semantic search, and personalized news generation using real-time information from Twitter.
This document discusses using Twitter lists to filter social media content by topic. It notes that 75% of online news consumers get news shared through social media, and over half share links. Twitter lists allow manually grouping users by topic discussed. The approach aims to automatically identify the niche topic of a Twitter list in real-time, improving the topic identification based on the global Twitter stream. Both textual and social features would be used to classify tweets, examining the interconnectedness of users in lists to identify central vs outlier users.
Architecture of Native XML Database Sedna (maria.grineva)
Sedna is a document database with APIs for C, Java, Scheme, OmniMark, Python, PHP, and .Net. The core C API allows for session management, transactions, query execution and result processing, and data loading. The Sedna architecture also includes an open socket protocol and extensibility of the basic C API to create new APIs.
XQuery Triggers in Native XML Database Sedna (maria.grineva)
XQuery triggers allow triggering actions in response to XML document changes in Sedna, an XML database. Triggers are defined using XQuery and can fire before or after insert, delete, or replace operations on nodes or entire statements. Triggers enable capabilities like integrity constraints and statistics monitoring. Sedna implements triggers efficiently using fixators on the schema to quickly detect triggered updates.
Extracting Key Terms From Noisy and Multi-theme Documents (maria.grineva)
The document summarizes a method for extracting key terms from documents using Wikipedia as a knowledge base. It models documents as semantic graphs connecting terms based on their relatedness as computed from Wikipedia. It then uses network analysis techniques to detect communities in these graphs, ranking the communities to select those most likely to contain key terms over noise or ambiguous terms. An evaluation found the method outperformed statistical and graph-based approaches on both noise-free and noisy, multi-theme documents without requiring training data.
2. Real-Time Web
• Web 2.0 + mobile devices = Real-Time Web
• People share what they do now, discuss breaking news on Twitter, share their current locations on Foursquare...
3. Analytics for the Real-Time Web: new requirements
• Batch processing (MapReduce) is too slow
• New requirements:
• real-time processing: aggregate values incrementally, as new data arrives
• database-intensive: aggregate values are stored in a database that is constantly being updated
4. Our System: Triggy
• Based on Cassandra, a distributed key-value store
• Provides a programming model similar to MapReduce, adapted to push-style processing
• Extends Cassandra with
• push-style procedures - to immediately propagate the data to computations;
• synchronization - to ensure consistency of aggregate results (counters)
• Easily scalable
5. Cassandra Overview
Data Model
• Data Model: key-value
• Extends basic key-value with 2 levels of nesting
• Super column - if the second level is present
• Column family ~ table; key-value pair ~ record
• Keys are stored ordered
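The two levels of nesting described above can be pictured as plain nested maps. A minimal sketch with made-up keys and values (illustrative only, not Cassandra's client API):

```python
# Illustrative only: Cassandra's nested key-value model as Python dicts.

# Standard column family: row key -> {column_name: value}
users = {
    "user:42": {"name": "alice", "signup": "2011-03-01"},
}

# Super column family: row key -> {super_column: {column_name: value}}
timeline = {
    "user:42": {
        "2011-03-01": {"tweet:1": "hello", "tweet:2": "world"},
    },
}

# Column family ~ table; one key-value pair ~ one record.
# Cassandra additionally keeps row keys in sorted order, which is
# what makes range scans over keys possible.
print(timeline["user:42"]["2011-03-01"]["tweet:1"])  # hello
```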
6. Cassandra Overview
Incremental Scalability
• Incremental scalability requires a mechanism to dynamically partition data over the nodes
• Data is partitioned by key using consistent hashing
• Advantage of consistent hashing: the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected
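A minimal sketch of that property, assuming one hash point per node (real Cassandra uses virtual nodes and its own partitioners; names here are invented):

```python
import hashlib
from bisect import bisect_right

def h(s):
    # Map a string onto a numeric ring position.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Minimal consistent-hash ring (illustrative sketch only)."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        # A key is owned by the first node clockwise from its position.
        pos = bisect_right(self.ring, (h(key), ""))
        return self.ring[pos % len(self.ring)][1]

ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}

# Add one node: only keys falling on the new node's arc change owner;
# every other key keeps its previous owner.
ring = Ring(["node-a", "node-b", "node-c", "node-d"])
after = {k: ring.node_for(k) for k in map(str, range(1000))}
moved = sum(before[k] != after[k] for k in before)
print(moved, "of 1000 keys moved")
```

Every key that moves is picked up by the arriving node, which is exactly the "only immediate neighbors are affected" property on the slide.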
8. Triggy
Programming Model
• Modified MapReduce to support push-style processing
• Modified only the reduce function: reduce*
• reduce* incrementally applies a new input value to an already existing aggregate value
Map(k1, v1) -> list(k2, v2)
Reduce(k2, list(v2)) -> (k2, v3)
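One way to read the reduce* idea: instead of folding a complete list(v2), the function merges a single incoming v2 into the aggregate already stored under k2. A hedged sketch (invented signatures, not Triggy's actual API), using a running sum as the aggregate:

```python
# reduce*(aggregate, new_value) -> aggregate
# Unlike Reduce(k2, list(v2)), reduce* is invoked once per incoming
# value and folds it into the stored aggregate, so results update
# incrementally as each new record arrives.

store = {}  # stands in for the aggregate column family

def reduce_star(old, v2):
    # Example aggregate: a running sum (e.g. a counter).
    return (old or 0) + v2

def push(k2, v2):
    # Push-style update, triggered per incoming (k2, v2) pair.
    store[k2] = reduce_star(store.get(k2), v2)

for k2, v2 in [("a", 3), ("b", 1), ("a", 4)]:
    push(k2, v2)

print(store)  # {'a': 7, 'b': 1}
```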
11. Triggy
Synchronization
• reduce* functions have to be synchronized for the same key to guarantee correct results
• we make use of Cassandra’s partitioning strategy: all updates for the same key are routed to the same node
• synchronization within a node: locks on keys that are being processed right now
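The per-key locking can be sketched as follows (illustrative only, not Triggy's implementation): concurrent reduce* calls for the same key are serialized, while different keys proceed in parallel.

```python
import threading

# Per-key synchronization sketch: the lock is held only for the key
# being updated, so the read-modify-write of its aggregate is safe
# under concurrency without blocking updates to other keys.

store = {"user:1": 0}
lock_for = {"user:1": threading.Lock()}  # pre-created here; in a real
                                         # system lock creation itself
                                         # must also be guarded

def apply_update(key, value):
    with lock_for[key]:                  # serialize updates per key
        store[key] = store[key] + value  # now a safe read-modify-write

threads = [threading.Thread(target=apply_update, args=("user:1", 1))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store["user:1"])  # 100
```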
12. Triggy
Fault Tolerance and Scalability
• No fault tolerance guarantees
• Intermediate data and data in queue can be
lost
• Triggy is easily scalable because the
execution and data storage are tightly
coupled
• A new node is placed near the most loaded
node, and part of the data is transferred
13. Experiments
• Generated workload: tweets with user ids (1 .. 100000) in uniform
distribution
• The load generator issues as many requests as the system with N
nodes can handle
• Application: count the number of words posted by each user
Map: tweet => (user_id, number_of_words_in_tweet)
Reduce: (user_id, number_of_words_total, number_of_words_in_tweet) =>
(user_id, number_of_words_total)
14. Similar Systems: Yahoo!’s S4
• Distributed stream processing engine:
• Programming interface: Processing
Elements written in Java
• Data routed between Processing Elements by
key
• No database. All processing in memory
• Used to estimate Click-Through-Rate using
user’s behavior within a time window
15. Similar Systems:
Google’s Percolator
• Percolator is database-intensive: based on BigTable
• BigTable:
• the same data model as in Cassandra
• the same log-structured storage
• BigTable - a distributed system with a master; Cassandra - peer-to-peer
• Percolator extends BigTable with
• observers (similar to database triggers for push-style processing)
• ACID transactions
• Triggy vs. Percolator:
• MapReduce programming model
• No ACID transactions (intermediate data can be lost) - less overhead. (What is
the real overhead of full transaction support? )
16. Application
Social Media Optimization for news sites
• A/B testing for headlines of news stories
• Optimization of front page to attract more clicks
17. Application
Real-Time News Recommendations
• TwitterTim.es - news recommendations via
Twitter’s friend graph
• Now - rebuilt every 2 hours; goal - a real-time
updating newspaper
19. Application
Real-Time Advertising
• Real-Time bidding:
• Sites track your browsing behavior via cookies and sell it to
advertising services
• Web publishers offer up display inventory to advertising services
• No fixed CPM, instead: each ad impression is sold to the highest
bidder
• Retargeting (remarketing)
• Advertisers can do remarketing after the following events: (1) the user
visited your site and left (assume the site is within the Google content
network); (2) the user visited your site, added products to their
shopping cart, then left; (3) the user went through the purchase process
but stopped somewhere.
• Potentially interesting to use information from social networks
20. Other Applications
• Recommendations on location checkins:
Foursquare, Facebook places...
• Social Games: monitoring events from millions
of users in real-time, react in real-time
The Web 2.0 era is characterized by the emergence of large amounts of user-generated content. People started to generate and contribute data on different Web services: blogs, social networks, Wikipedia.

Today, with the emergence of mobile devices constantly connected to the Internet, the nature of user-generated content has changed. People now contribute more often, posts have become smaller, and their life-span has become shorter.

New Web services appear that encourage real-time usage:
1) Twitter. The lifespan of a tweet is shorter than that of a blog post; the Twitter stream is almost real-time.
2) Location-based social networks: Foursquare, Facebook Places. People share their current location (check in) at real venues. This data is real-time sensitive: the user reveals their current location, and recommendations of nearby friends and other interesting places must be made immediately, while the user is still there.
So far, analyzing and making use of Web 2.0 data has been accomplished using batch-style processing: data produced over a certain period of time is accumulated and then processed. MapReduce has become the state-of-the-art approach for analytical batch processing of user-generated data.

Today, Web 2.0 data has become more real-time, and this change implies new requirements for analytical systems. Processing data in batches is too slow for real-time-sensitive data; accumulated data can lose its importance in hours or even minutes. Therefore, analytical systems must aggregate values in real time, incrementally, as new data arrives. It follows that workloads are database-intensive, because aggregate values are not produced all at once, as in batch processing, but stored in a database that is constantly being updated. For example, Google's new web indexing system, Percolator, is no longer based on MapReduce: Percolator achieves lower document processing latencies by updating the web index incrementally (a database-intensive workload).
We are working on a system that can process analytical tasks in real time for large amounts of data.

Our system is based on the Cassandra distributed key-value store. We add two extensions to Cassandra in order to turn it into a system for real-time analytics: push-style procedures and synchronization.

Push-style procedures act like triggers: you set them on a table, and they fire when a new key-value record is inserted. They make the computation real-time, as they immediately propagate the inserted data to the analytical computations.

Synchronization: Cassandra is a simple key-value store, with no mechanism to update a value based on its existing value. For example, to maintain counters, when we need to increment the existing value we first have to query it and then insert a new value. Cassandra has no transactions, which means another client can update the value between our query and our update. That leads to inconsistent counters. We add local synchronization into Cassandra, which can synchronize data within a node.

Furthermore, our system provides a programming model similar to MapReduce, adapted to push-style processing, and is scalable in terms of both computation and data storage.
In a nutshell, the Cassandra data model can be described as follows:
1) Cassandra is based on a key-value model. A database consists of column families; a column family is a set of key-value pairs. Drawing an analogy with relational databases, you can think of a column family as a table and a key-value pair as a record in a table.
2) Cassandra extends the basic key-value model with two levels of nesting. At the first level, the value of a record is in turn a sequence of key-value pairs. These nested key-value pairs are called columns, where the key is the name of the column. In other words, a record in a column family has a key and consists of columns. At the second level, the value of a nested key-value pair can be a sequence of key-value pairs as well. When the second level of nesting is present, the outer key-value pairs are called super columns, with the key being the name of the super column, and the inner key-value pairs are called columns.

Let's consider a classical example of a Twitter database to demonstrate these points. The column family Tweets contains records representing tweets. The key of a record is of the TimeUUID type and is generated when the tweet is received (we will use this feature in the User_Timelines column family below). The record consists of columns (no super columns here), which simply represent the attributes of tweets, very similar to how one would store them in a relational database.

The next example is User_Timelines (i.e. tweets posted by a user). Records are keyed by user IDs (referenced by the User_ID columns in the Tweets column family). User_Timelines demonstrates how column names can be used to store values, tweet IDs in this case. The type of the column names is defined as TimeUUID, which means the tweet IDs are kept ordered by the time of posting. That is very useful, as we usually want to show the last N tweets for a user. The values of all columns are set to an empty byte array (denoted "-") as they are not used.

To demonstrate super columns, let us assume that we want to collect statistics about the URLs posted by each user. For that we need to group all the tweets posted by a user by the URLs contained in the tweets. This can be stored using super columns as follows: in User_URLs, the names of the super columns store URLs, and the names of the nested columns are the corresponding tweet IDs.
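The two nesting levels can be pictured with plain Python dicts. This is only an illustration of the data model's shape (the keys and values are invented to match the Tweets / User_URLs examples above), not how Cassandra stores data internally.

```python
# Column family ~ table; a record = key -> columns (key-value pairs).
tweets = {
    "tweet-uuid-1": {                 # record key (TimeUUID in Cassandra)
        "User_ID": "alice",           # column name -> column value
        "Text": "check http://example.com",
    },
}

# Second nesting level: the outer key names the super column (a URL),
# the inner keys are columns (tweet IDs) whose values are unused.
user_urls = {
    "alice": {                        # record key = user ID
        "http://example.com": {       # super column = URL
            "tweet-uuid-1": b"",      # column = tweet ID, empty value "-"
        },
    },
}
```

Looking up all tweets by a user that mention a given URL is then a two-step dict access, which mirrors how super columns group related columns under one name.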
One of the key features of Cassandra is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes. Cassandra's partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts.

In consistent hashing, the output range of a hash function (normally MD5) is treated as a fixed circular space, or ring: the largest hash value wraps around to the smallest hash value.

Each node in the system is assigned a random value within this space, which represents its position on the ring. Each data item identified by a key is assigned to a node by hashing the data item's key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item's position. That node is deemed the coordinator for the key. Thus, each node becomes responsible for the region of the ring between itself and the previous node on the ring.

The principal advantage of consistent hashing is that the departure or arrival of a node affects only its immediate neighbors; other nodes remain unaffected.

The problem with using the MD5 hash function for node distribution is that the random position assignment of each node on the ring leads to non-uniform load and data distribution. That is why Cassandra analyzes load information on the ring and inserts new nodes near highly loaded nodes, so that an overloaded node can transfer part of its data to the new node.
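The ring walk described above can be sketched in a few lines. This is a toy model assuming MD5 positions as in the talk, not Cassandra's actual implementation; the class and node names are invented for the example.

```python
import bisect
import hashlib

def _pos(s):
    # Position on the ring = MD5 of the string, as a big integer.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node gets a position on the ring; keep them sorted.
        self._ring = sorted((_pos(n), n) for n in nodes)

    def coordinator(self, key):
        # Walk clockwise to the first node at or past the key's position;
        # past the largest position, wrap around to the smallest (the ring).
        i = bisect.bisect_left(self._ring, (_pos(key),))
        return self._ring[i % len(self._ring)][1]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.coordinator("user:42")   # deterministic for a given key
```

Because a key's coordinator depends only on the positions adjacent to it, adding or removing one node moves only the keys in that node's arc, which is the "only immediate neighbors are affected" property.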
Cassandra is optimized for write-intensive workloads, which is a useful feature for us, as computing aggregate values for analytical tasks implies heavy updates to the system.

Cassandra uses so-called log-structured storage, which was used successfully in BigTable. The idea is that write operations go to a buffer in main memory (the memtable). When the buffer is full, it is written to disk as an SSTable, so the buffer is flushed to disk periodically. A separate thread merges the different SSTable versions; this process is called compaction.

A read operation looks up the value first in the memtable and then, if it is not found there, in the SSTable versions, moving from the most recent one backwards.

Such storage is highly optimized for writes and, of course, makes queries slower, which is the usual trade-off for databases.
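The write and read paths just described can be sketched as a toy store. This is a deliberately simplified model (no compaction, no disk, invented class name) meant only to show why writes are cheap and reads may have to check several tables newest-first.

```python
class LogStructuredStore:
    """Toy memtable + SSTable model of the log-structured write path."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}          # in-memory write buffer
        self.sstables = []          # flushed, immutable tables; newest last
        self.limit = memtable_limit

    def put(self, key, value):
        # Writes only touch memory; this is why writes are fast.
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.sstables.append(self.memtable)   # "flush to disk"
            self.memtable = {}

    def get(self, key):
        # Reads check the memtable first, then SSTables most-recent-first.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            if key in table:
                return table[key]
        return None
```

Compaction would periodically merge `self.sstables` into fewer tables so that reads scan less; it is omitted here to keep the sketch short.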
MapReduce is a well-established programming model for expressing analytical applications. To support real-time analytical applications, we modify this programming model to support push-style data processing. In particular, we modify the reduce function. Originally, reduce combined a list of input values into a single aggregate value. Our modified function, reduce*, incrementally applies a new input value to an already existing aggregate value. This modification makes it possible to apply a new input value to the aggregate value as soon as it is produced; in other words, we are able to push new values to the reduce function.

Figure 1 depicts our modified programming model. reduce* takes as parameters a key, a new value, and the existing aggregate value. It outputs a key-value pair with the same key and the new aggregate value. We did not modify the map function, as it already allows push-style processing. The difference between map and reduce* is that multiple maps can be executed in parallel for the same key, while the execution of reduce* has to be synchronized for the same key to guarantee correct results.

Note that reduce* has some limitations compared with the original reduce. Not every reduce function can be converted to its incremental counterpart. For example, to compute the median of a set of values, the previous median and the new value are not enough; the complete set of values must be stored to compute the new median.
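The limitation above can be made concrete. A mean is incrementalizable if the aggregate carries (sum, count) rather than the mean itself, whereas for a median no fixed-size aggregate suffices; the counterexample below is our own illustration, not from the talk.

```python
# The mean as an incremental reduce*: the stored aggregate is (sum, count).
def mean_reduce_star(agg, v):
    total, count = agg or (0, 0)
    return (total + v, count + 1)     # new aggregate

def mean(agg):
    total, count = agg
    return total / count

# Why the median cannot work the same way: [1, 9] and [5, 5] both have
# median 5, yet after adding 100 their medians differ (9 vs 5). So
# (previous median, new value) underdetermines the new median, and the
# full value set must be kept.
```

This is why the paper says not every reduce has an incremental counterpart: reduce* requires an aggregate from which the next result is computable without revisiting history.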
In order to set up a map/reduce* job, the developer has to provide implementations for both functions and define the input table, from which the data is fed into map, and the output table, to which the output of reduce* is written.
Example: implementation of WordCountMapReducer
The difference between map and reduce* is that multiple maps can be executed in parallel for the same key, while the execution of reduce* has to be synchronized for the same key to guarantee correct results.

For that, we extended the nodes of the key-value store with queues and worker threads. Figure 2 shows our extensions. Each node maintains a queue that buffers map and reduce* tasks. Worker threads drain the queues and execute the buffered tasks. Buffering map and reduce* tasks makes it possible to handle bursts of input data. Furthermore, the size of the queue allows a rough estimation of the load of a node.

How to execute map. As described, for each map the developer has to define an input table. Whenever a new key-value pair is written to this table, the node handling the write schedules a new map task by putting it into its local queue. Eventually, a worker thread will execute the map task at this node. Map tasks can be executed in parallel at any node in the system and do not require synchronization, because they do not share any data.

How to execute reduce*. In contrast to map, the execution of reduce* needs to be synchronized, because several reduce* tasks could otherwise update the same aggregate value in parallel, leading to inconsistent data. Cassandra does not provide any synchronization mechanisms. In our system, synchronization is realized in two steps: (1) routing all key-value pairs output by map with the same key to a single node, and (2) synchronizing the execution of reduce* within a node using locks. Routing is implemented by reusing Cassandra's partitioning strategy (consistent hashing): each key-value pair output by map is routed to the node that is primarily responsible for the respective key. At the receiver node, a new reduce* task is submitted to the queue. Multiple worker threads execute these reduce* tasks by reading and incrementing the latest aggregate value. Worker threads are synchronized such that only one worker executes a reduce* task for a given key. For that, we use a lock table that contains the keys currently being processed by each worker. The output of the reduce* task is written to the table specified in the reduce definition. The table may be replicated to achieve reliability. By writing the result, the node might fire a subsequent map/reduce* task. The result of reduce* can be queried using the key-value store's standard query interface.

The figure shows the execution of map and reduce* inside our system. Two key-value pairs (k1, v1) and (k1, v2) are written to nodes N1 and N5 of the key-value store. These writes fire map tasks defined on the updated table. Therefore, receiver node N1 puts a map task for pair (k1, v1) into its queue (denoted by m in Figure 2). Similarly, node N5 puts a map task for pair (k1, v2) into its queue. The execution of the map tasks results in three intermediate key-value pairs. Determined by Cassandra's partitioning strategy, the intermediate pair with key k2 is routed to node N2, while the pairs with key k3 are routed to node N3. Nodes N2 and N3 put reduce* tasks into their respective queues (denoted by r*). As described, reduce* tasks are executed locally using locks. New aggregate values are computed and stored in the result table.
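The per-key locking step can be sketched as follows. This is a minimal model of the idea (a lock table handing out one lock per key), with invented names; it is not Triggy's actual worker implementation.

```python
import threading
from collections import defaultdict

class LockTable:
    """Hands out one lock per key so that at most one worker at a time
    executes a reduce* task for a given key."""

    def __init__(self):
        self._guard = threading.Lock()               # protects the table
        self._locks = defaultdict(threading.Lock)    # key -> its lock

    def lock_for(self, key):
        with self._guard:
            return self._locks[key]

def run_reduce_star(locks, table, key, value):
    # Read-increment-write is safe because the key's lock serializes it.
    with locks.lock_for(key):
        table[key] = table.get(key, 0) + value

locks, table = LockTable(), {}
threads = [threading.Thread(target=run_reduce_star,
                            args=(locks, table, "user:1", 1))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# table["user:1"] == 8: no increments are lost despite 8 concurrent workers
```

Without the per-key lock, two workers could both read the same aggregate, both increment it, and one update would be lost, which is exactly the counter inconsistency described earlier.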
Our implementation does not provide fault-tolerance guarantees for the execution of map/reduce* tasks. If the node responsible for executing a map fails while the map task is still in the queue, the map task will never be executed. Also, our synchronization mechanism requires intermediate key-value pairs to be routed to a single node; these intermediate pairs might be lost in case of failures. Nevertheless, once a map/reduce* task has been executed successfully, its results are stored reliably at a number of replica nodes. Thus, only intermediate data can be lost.

There are a number of reasons for this design decision. First, for many analytical applications losing intermediate data is not critical; for such applications it is more important to see a general trend than exact numbers. Second, only those map/reduce* tasks that are waiting in the queue at the moment a node fails can be lost. If there is no burst of input data, the queues are usually empty, so losing intermediate data happens rarely. Third, the execution of map and reduce* tasks is distributed across all nodes of the system, so only a portion of the intermediate data is lost if a single node fails.

In order to provide stronger consistency guarantees in case of node failures, we would have to provide exactly-once semantics. Relatively lightweight methods that provide at-least-once semantics are not suitable, as repeated executions invalidate aggregate values. Providing exactly-once semantics requires additional storage and computation overhead and is argued to be too expensive and hard to scale.

Scalability. In our system, the execution of map and reduce* is distributed across the nodes according to the data partitioning strategy of the key-value store. This makes it easy to scale the system, as execution and data storage are tightly coupled. By default, Cassandra provides a mechanism for scaling the data storage: any new node is placed near the most loaded node of the system, and part of the data from the loaded node is transferred to the new node, thus shedding load between the nodes. We extended Cassandra's load measurement formula to include execution load as well. As in the SEDA architecture, we use the length of the queue to measure execution load. It is a good criterion because it reflects any bottleneck at a node, such as CPU overload or network saturation.
Yahoo! recently open-sourced S4, a system that is close to ours.

The differences:

1) Triggy has the MapReduce programming model many developers are familiar with. The programming model of S4 is more general.

2) Our system is tightly coupled with the database, while S4 processes tasks in memory. Why we think a database-intensive solution is important:

a) With Triggy, you don't have to worry about the window. You can compute analytics using historical data, within a window as well as without one, or with windows of different sizes for different parameters. For example, when monitoring users' browsing behavior via cookies for advertising, some users show enough interest in a certain ad within a short time period, while others you may have to monitor and wait on much longer.

b) Triggy is easily scalable. You don't have to scale the computation separately from the database; the tightly coupled design allows scaling the system with a single knob.
News sites use real-time analytics to optimize their sites and attract more readers.

1) A/B testing for headlines of news stories. When a news story is first published on the site, there are two different headlines for it. For the first 5 minutes, part of the readers get one headline while the rest get the other. Then the headline that attracts more clicks during those first 5 minutes is chosen.

2) Optimizing the news layout. The system analyzes clicks, likes, and retweets to understand which news stories spark discussion in social media, then puts the most discussed stories on the front page to attract even more readers.
The Twitter Tim.es, a personalized news service (http://twittertim.es), uses your friend relationships on Twitter to recommend news to you.

Currently, The Twitter Tim.es newspapers are rebuilt every 2 hours (batch processing). It would be nice to have push-style processing, where a news story reaches the newspaper as soon as it is published on Twitter.
What is real-time bidding? Here's the basic gist:

1) Sites across the web track your browsing behavior via cookies and sell basic data about you to ad service companies. For example, the Google Content Network covers 80% of internet users.

2) Web publishers offer up display inventory to the RTB market through ad services; rather than signing up for a fixed CPM, they sell each individual ad impression to the highest bidder, based on whom that individual ad is being served to. For example, a retailer agrees to run a display ad campaign for a shoe sale at $5 per 1,000 impressions. That retailer, however, can specify that it will pay $10 per 1,000 impressions for ads that include running shoes if it knows that a browser has previously visited the athletics section of its web site.

The real-time bidding auction happens within the milliseconds it takes a page to open. Advertisers have to run their algorithms to decide what ad to show, and at what price, within that time.

Google retargeting (or remarketing). What is remarketing? A travel company has a site featuring holiday vacations. Users may come to this website, browse the offers, and think about booking a trip, but decide that the deal is still not cheap enough; then they continue to browse the web. If the travel company later decides to offer discounted deals to the Caribbean, it can target the users who already visited its site (interested users) via display ads that these users will see later on other sites.

Advertisers can do remarketing after the following events: 1) the user visited your site and left (assuming the site is within the Google Content Network); 2) the user visited your site, added products to their shopping cart, then left; 3) the user went through the purchase process but stopped somewhere; etc.

These events can be extended with information from social networks. For example, the system could track what a user posts on Twitter and estimate their interest in different products that can be advertised later.

You can then pay per click for these people as they search and browse the web (ads will be shown in the search or content network). For retargeting you need to aggregate information about a user in a database; a window approach is not applicable here, because there is no single time frame.