Voxxed Days Thessaloniki 21/10/2016 - Streaming Engines for Big Data - Stavros Kontopoulos
This document discusses streaming engines for big data and provides a case study on Spark Streaming. It begins with an overview of streaming concepts like streams, stream processing, and time in modern data stream analysis. Next, it covers key design considerations for streaming engines and examples of state-of-the-art stream analysis tools like Apache Flink, Spark Streaming, and Apache Beam. It then focuses on Spark Streaming, describing its DStream and Structured Streaming APIs. Code examples are provided for the DStream API and Structured Streaming. The document concludes with a recommendation to first consider Flink, Spark, or Kafka Streams when choosing a streaming engine.
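As a rough illustration of the Structured Streaming API mentioned above, here is a minimal word-count sketch in Scala; the socket source and names are illustrative assumptions, not code taken from the talk.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("structured-streaming-wordcount")
      .getOrCreate()
    import spark.implicits._

    // Read lines from a socket source (feed it with `nc -lk 9999`).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and maintain a running count per word.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Print the full updated result table on every micro-batch.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```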
This document provides a high-level summary of streaming data processing and the Lambda architecture. It begins with a brief history of batch and streaming systems for big data. It then introduces the Lambda architecture as a way to handle both batch and streaming data using separate batch and speed layers. The document discusses advantages and disadvantages of the Lambda architecture, as well as use cases, implementation tips, and approaches that have emerged beyond the Lambda architecture like Kappa and FastData architectures.
This document discusses fast data and streaming systems. It provides a history of big data processing from MapReduce to streaming. Fast data refers to data in motion that is processed in real time from streaming sources. Streaming systems allow for processing unbounded datasets using techniques like windows, watermarks and triggers; a short sketch of these follows below. The document discusses streaming architectures and the SMACK stack (Spark, Mesos, Akka, Cassandra and Kafka), which provides technologies for building high-performing streaming systems. It provides an example IoT application and shows how machine learning could be added. Streaming systems like Flink and Spark Streaming are compared.
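The windows and watermarks mentioned here can be made concrete with a short Structured Streaming sketch, assuming a spark-shell style session where `spark` is in scope; the Kafka topic, schema, and thresholds below are invented for illustration.

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Hypothetical IoT event stream; `deviceId` and `iot-events` are made-up names,
// and the broker-assigned `timestamp` stands in for a proper event-time field.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "iot-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS deviceId", "timestamp AS eventTime")

// Accept events up to 10 minutes late, then count per device
// over 5-minute tumbling windows keyed by event time.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"deviceId")
  .count()

counts.writeStream
  .outputMode("update")
  .format("console")
  .start()
```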
Big data real time architectures -
How do we do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls do they contain?
WSO2 Product Release Webinar: WSO2 Data Analytics Server 3.0 - WSO2
To view the recording of this webinar, please use the URL below:
WSO2 Data Analytics Server (WSO2 DAS) version 3.0 is the successor of WSO2 Business Activity Monitor 2.5. It is based on the latest technologies and is an evolutionary upgrade to the current system. WSO2 DAS comes with a comprehensive set of new features, including support for pluggable data sources, batch processing with Apache Spark, distributed data indexing, a new dashboard, and unified data querying with analytics REST APIs.
The WSO2 DAS combines real-time, batch, interactive, and predictive (via machine learning) analysis of data into a single integrated platform. This webinar will present and demonstrate the following key features and capabilities in detail:
Pluggable data sources support with its new data abstraction layer
Batch analytics using the Apache Spark analytics engine
Interactive analysis powered by Apache Lucene
An analytics dashboard to visualize results
Activity monitoring capabilities for tracking related events in a system
Talk I gave at StratHadoop in Barcelona on November 21, 2014.
In this talk I discuss our experience with real-time analysis of high-volume event data streams.
Abundant data is all around. The most important aspect is how you, as an organization, can access the data, process it, and present information to the relevant authorities on time. To gain competitive advantage, the means of accessing, processing and presenting the data should be optimal, highly available and scalable.
In this talk, we will discuss different deployment patterns that can provide you with a suitable solution that lets you analyze relevant data in batch, real-time or interactively and predict future states. We will discuss how you can leverage and deploy WSO2 Data Analytics Server, WSO2 IoT Server, WSO2 Enterprise Service Bus and other WSO2 products in order to make better decisions for your organization’s success.
Batch and Interactive Analytics: From Data to Insight - WSO2
This document provides an overview of batch and interactive analytics. It defines batch analytics as processing stored data through time-consuming tasks, while interactive analytics allows ad-hoc querying of stored data for quick results. The document then outlines technologies used for batch and interactive analytics like Spark, Elasticsearch and Solr. It provides details on the WSO2 analytics architecture and how it supports both batch and interactive processing, alerts, and mixing of real-time and batch data. Example solutions like service monitoring, activity monitoring and log analysis are also presented.
A primer on building real time data-driven products - Lars Albertsson
This document provides an overview of building real-time data products using stream processing. It discusses why stream processing is useful for providing low-latency reactions to data from 1 second to 1 hour. Key aspects covered include using a unified log to decouple producers and consumers, common stream processing building blocks like filtering and joining, and technologies like Spark Streaming, Kafka Streams, and Flink. The document also addresses challenges like out-of-order events and software bugs, and architectural patterns for handling imperfections in streams.
The document introduces the WSO2 Analytics Platform, which allows users to collect, store, analyze, visualize and communicate data. It discusses how the platform can help organizations reduce costs, improve customer satisfaction and efficiency. The key capabilities of the platform include interactive, batch, real-time and predictive analytics. It also provides tools for developers, solutions for various use cases, and discusses how to get started with the platform.
Introduction to Data Science and Analytics - Srinath Perera
This webinar serves as an introduction to WSO2 Summer School. It will discuss how to build a data pipeline for your organization and, for each use case, the technology and tooling choices that need to be made.
This session will explore analytics under four themes:
Hindsight (what happened)
Oversight (what is happening)
Insight (why is it happening)
Foresight (what will happen)
Recording http://t.co/WcMFEAJHok
Improve your SQL workload with observability - OVHcloud
Most of OVH's information system relies on relational databases (PostgreSQL, MySQL, MariaDB). In terms of volume, this represents 400 databases weighing more than 20 TB of data, spread across 60 clusters in two geographic zones, all powering 3,000 applications.
How can we see everything across our fleet? Better yet, how can we let everyone follow the activity of their own database? That is the challenge we set for ourselves; one year later, we can share our experience.
What if observability were not just a buzzword, but had a real impact on production?
This document discusses principles for applying continuous delivery practices to machine learning models. It begins with background on the speaker and their company Indix, which builds location and product-aware software using machine learning. The document then outlines four principles for continuous delivery of machine learning: 1) Automating training, evaluation, and prediction pipelines using tools like Go-CD; 2) Using source code and artifact repositories to improve reproducibility; 3) Deploying models as containers for microservices; and 4) Performing A/B testing using request shadowing rather than multi-armed bandits. Examples and diagrams are provided for each principle.
How Spark is Enabling the New Wave of Converged Applications - MapR Technologies
Apache Spark has become the de-facto compute engine of choice for data engineers, developers, and data scientists because of its ability to run multiple analytic workloads with a single compute engine. Spark is speeding up data pipeline development, enabling richer predictive analytics, and bringing a new class of applications to market.
This document provides an overview of monitoring in big data frameworks. It discusses the challenges of monitoring large-scale cloud environments running big data applications. Several open-source monitoring tools are described, including Hadoop Performance Monitoring UI, SequenceIQ, Ganglia, Apache Chukwa, and Nagios. Key requirements for monitoring big data platforms are also outlined, such as scalability, timeliness, and handling constant changes. The document concludes by introducing the DICE monitoring platform, which collects metrics from Hadoop, YARN, Spark, Storm and Kafka using Collectd and stores the data in Elasticsearch for analysis and visualization with Kibana.
This document describes a parallel and scalable approach called Big-SeqSB-Gen for generating large synthetic sequence databases. It implements Whitney enumerators to generate distinct sequences and uses a parallel sequence generator (PSG) built on Hadoop MapReduce. The PSG was tested on a French Grid5000 cluster and achieved generation of over 18 billion sequences in under 2 hours, demonstrating good scalability and throughput. Future work involves mining patterns from large real sequence datasets.
Production-Ready BIG ML Workflows - from zero to hero - Daniel Marcous
Data science isn't an easy task to pull off.
You start with exploring data and experimenting with models.
Finally, you find some amazing insight!
What now?
How do you transform a little experiment to a production ready workflow? Better yet, how do you scale it from a small sample in R/Python to TBs of production data?
Building a BIG ML Workflow - from zero to hero, is about the work process you need to take in order to have a production ready workflow up and running.
Covering:
* Small-to-medium experimentation (R)
* Big data implementation (Spark MLlib and ML pipelines; a sketch follows this list)
* Setting Metrics and checks in place
* Ad hoc querying and exploring your results (Zeppelin)
* Pain points & Lessons learned the hard way (is there any other way?)
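To make the Spark MLlib bullet above concrete, here is a minimal pipeline sketch in Scala, assuming a session where `spark` is in scope; the input path, columns, and stages are hypothetical, standing in for whatever was prototyped in R.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Hypothetical training set with `text` and `label` columns.
val training = spark.read.parquet("/data/training")

// The stages prototyped on a small sample become one reusable pipeline.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)

// Persist the fitted model so the production workflow can reload it as-is.
model.write.overwrite().save("/models/experiment-v1")
```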
The document discusses data warehouse systems and architectures for processing large datasets. It begins with an overview of typical data warehouse architectures and issues. It then discusses benchmarks for assessing large-scale systems. The remainder of the document discusses various technologies for data warehousing and analytics including classical relational systems, Apache Hadoop, Pig Latin, and the Lambda architecture. It provides examples and descriptions of how these different approaches can be used to build scalable data warehousing and analytics systems.
This talk provides an engineering perspective on privacy protection. The intended audience is architects, developers, data scientists, and engineering managers that build applications handling user data. We highlight topics that require attention at an early design stage, and go through pitfalls and potentially expensive architectural mistakes. We describe a number of technical patterns for complying with privacy regulations without sacrificing the ability to use data for product features. The content of the talk is based on real world experience from handling privacy protection in large scale data processing environments.
Many companies have data with great potential. There are many ways to go wrong with Big Data projects, however; the difference between a successful and a failed project can be huge, both in cost and in return on investment. In this talk, we will describe the most common pitfalls and how to avoid them. You will learn to:
- Be aware of the existing risk factors in your organisation that may cause a data project to fail.
- Learn how to recognise the most common and costly causes of project failure.
- Learn how to avoid or mitigate project problems in order to ensure return on investment in a lean manner.
The State of Postgres | Strata San Jose 2018 | Umur Cubukcu - Citus Data
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.
Topics include: a framework for thinking about modern workloads, the evolution of database infrastructure, extensibility for the database and PostgreSQL as an ecosystem
This document discusses supporting parallel OLAP (online analytical processing) over big data. It presents different data partitioning schemes for distributed warehouses and evaluates their performance using the TPC-H benchmark. Experimental results show improved query response times when fragmenting and distributing tables over multiple database backends compared to a single backend. The authors also introduce derived data techniques to further optimize query performance. They conclude more work is needed to automate data partitioning and support larger datasets.
The Life of Data at Altocloud. Altocloud connects your business with the right customers at the right time in their journey – improving conversions and enhancing customer experience. These slides were presented at the Altocloud In-Company event, as part of AlanTec Festival 2016. Presenters, Maciej Dabrowski, Chief Data Scientist and Darragh Kirwan, Full Stack Engineer.
Monday 16th May 2016.
Open core summit: Observability for data pipelines with OpenLineage - Julien Le Dem
This document discusses Open Lineage and the Marquez project for collecting metadata and data lineage information from data pipelines. It describes how Open Lineage defines a standard model and protocol for instrumentation to collect metadata on jobs, datasets, and runs in a consistent way. This metadata can then provide context on the data source, schema, owners, usage, and changes. The document outlines how Marquez implements the Open Lineage standard by defining entities, relationships, and facets to store this metadata and enable use cases like data governance, discovery, and debugging. It also positions Marquez as a centralized but modular framework to integrate various data platforms and extensions like Datakin's lineage analysis tools.
Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow at 8 seconds. They implemented ElasticSearch for its near real-time search, high scalability, and out-of-the-box functionality. Logstash was used to seed data from MySQL and MongoDB into ElasticSearch, and custom analyzers and mappings were developed. Search times then dropped to 230 ms and aggregations to 200 ms, allowing the business to scale as transactional data grew 3000% and leads 250%.
Data pipelines observability: OpenLineage & Marquez - Julien Le Dem
This document discusses OpenLineage and Marquez, which aim to provide standardized metadata and data lineage collection for data pipelines. OpenLineage defines an open standard for collecting metadata as data moves through pipelines, similar to metadata collected by EXIF for images. Marquez is an open source implementation of this standard, which can collect metadata from various data tools and store it in a graph database for querying lineage and understanding dependencies. This collected metadata helps with tasks like troubleshooting, impact analysis, and understanding how data flows through complex pipelines over time.
This document discusses using Apache Spark to build an academic alert system that can predict at-risk students from large datasets. It presents the use case, architecture, and data volumes, and the problem that Weka cannot handle large datasets. The solution involves using Spark on a 3-node Hadoop cluster to sample, impute, and model the data using logistic regression, random forest, and Naive Bayes algorithms. Spark achieved higher accuracy and recall than Weka in less time due to its ability to perform distributed computing on large datasets. Some current Spark challenges are also outlined.
Cancer Outlier Profile Analysis using Apache Spark - Mahmoud Parsian
The document describes Cancer Outlier Profile Analysis (COPA), a method for identifying outlier genes in cancer gene expression data using Apache Spark. COPA normalizes gene expression data, ranks genes by their expression values, and identifies the top outliers. It was the first algorithm to discover the ERG rearrangement in prostate cancer. The document outlines the COPA algorithm, which median centers data and ranks genes by their percentile scores to identify outlier profiles. It also discusses implementing COPA at large scale using Apache Spark to handle thousands of studies, each with hundreds of samples and over 100,000 gene expression pairs.
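As a hedged sketch of the COPA-style computation described above (median-centering followed by percentile scoring), here is one way it could look on a Spark DataFrame, assuming `spark` is in scope. The schema is assumed, and the full COPA statistic also normalizes by the median absolute deviation, which is omitted here for brevity.

```scala
import org.apache.spark.sql.functions._

// Assumed schema: one row per (gene, sample) with an `expression` value.
val expressions = spark.read.parquet("/data/gene-expression")

// Median-center each gene across its samples.
val medians = expressions.groupBy("gene")
  .agg(expr("percentile_approx(expression, 0.5)").as("med"))

val centered = expressions.join(medians, "gene")
  .withColumn("centered", col("expression") - col("med"))

// Score each gene by an upper percentile of its centered values;
// genes with extreme scores are outlier-profile candidates.
val copaScores = centered.groupBy("gene")
  .agg(expr("percentile_approx(centered, 0.95)").as("copaScore"))

copaScores.orderBy(desc("copaScore")).show(20)
```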
Getting Apache Spark Customers to Production - Cloudera, Inc.
This document discusses common challenges customers face in getting Spark applications to production and provides recommendations to address them. It covers issues like misconfiguration, resource declaration, YARN configuration mismatches, data-dependent tuning like adjusting partitions, and ensuring security in shared clusters through authentication, encryption, and authorization measures. The document also recommends techniques like using dynamic allocation, reducing shuffles, and enabling multi-tenancy with YARN to improve cluster utilization for multiple customers.
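The tuning advice summarized above maps onto a handful of standard Spark properties; the values below are generic starting points, not Cloudera's recommendations verbatim.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("production-etl")
  // Let YARN grow and shrink the executor pool with the workload.
  .config("spark.dynamicAllocation.enabled", "true")
  // The external shuffle service is required for dynamic allocation on YARN.
  .config("spark.shuffle.service.enabled", "true")
  // Size shuffle parallelism to the data volume instead of the default 200.
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()
```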
Kodu and Project Spark are programming tools that can be used in the teaching process to introduce programming concepts in a fun and creative way. These tools let students create games and virtual worlds without needing to know traditional programming languages.
Miklos Christine is a solutions architect at Databricks who helps customers build big data platforms using Apache Spark. Databricks is the main contributor to the Apache Spark project. Spark is an open source engine for large-scale data processing that can be used for machine learning. Spark ML provides machine learning algorithms and pipelines to make machine learning scalable and easier to use at an enterprise level. Spark 2.0 includes improvements to Spark ML such as new algorithms and better support for Python.
Record linkage, a real use case with Spark ML - Paris Spark meetup Dec 2015 - Modern Data Stack France
Record Linkage, a Spark ML use case, by Alexis Seigneurin
Record Linkage is the process of finding, within a data set, the records that represent the same entity. This operation is particularly complicated when, as in our case, you work with anonymized data. That is where Machine Learning comes to the rescue! We implemented a Record Linkage algorithm with Spark SQL (DataFrames) and Spark ML rather than using static rules. We will walk through the Feature Engineering process, explain why we had to extend Spark DataFrames to preserve metadata across the processing pipeline, and show how we used Machine Learning to reconcile the records. Finally, we will see how we put this application into production.
Alexis Seigneurin: A developer for 15 years, I care deeply about the problems of processing, analyzing, and storing data. At Ippon, I mainly work on consulting and architecture engagements around big data technologies. I also run the Spark training course at Ippon.
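A plausible shape for the Spark ML part of such a record-linkage pipeline, assuming candidate pairs with pre-computed similarity features and `spark` in scope; the column names and path are invented for illustration:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical input: candidate record pairs with similarity features
// (e.g. name and address distances) and a `matched` training label.
val pairs = spark.read.parquet("/data/candidate-pairs")

val assembler = new VectorAssembler()
  .setInputCols(Array("nameSimilarity", "addressSimilarity", "dobMatch"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("matched")
  .setFeaturesCol("features")

// The fitted model predicts whether both records in a pair represent
// the same entity, replacing hand-written static matching rules.
val model = new Pipeline().setStages(Array(assembler, lr)).fit(pairs)
```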
Not Your Father's Database: How to Use Apache Spark Properly in Your Big Data... - Databricks
This session will cover a series of problems that are adequately solved with Apache Spark, as well as those that require additional technologies to implement correctly. Here's an example outline of some of the topics that will be covered in the talk. Problems that are perfectly solved with Apache Spark: 1) Analyzing a large set of data files. 2) Doing ETL of a large amount of data. 3) Applying Machine Learning & Data Science to a large dataset. 4) Connecting BI/Visualization tools to Apache Spark to analyze large datasets internally.
By Vida Ha at Spark Summit East 2016.
Building a Turbo-fast Data Warehousing Platform with Databricks - Databricks
Traditionally, data warehouse platforms have been perceived as cost prohibitive, challenging to maintain and complex to scale. The combination of Apache Spark and Spark SQL – running on AWS – provides a fast, simple, and scalable way to build a new generation of data warehouses that revolutionizes how data scientists and engineers analyze their data sets.
In this webinar you will learn how Databricks - a fully managed Spark platform hosted on AWS - integrates with a variety of AWS services, including Amazon S3, Kinesis, and VPC. We'll also show you how to build your own data warehousing platform in a very short amount of time and how to integrate it with other tools such as Spark's machine learning library and Spark Streaming for real-time processing of your data.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1FQYcP0.
Gian Merlino presents the advantages, challenges, and best practices to deploying and maintaining lambda architectures in the real world, using the infrastructure at Metamarkets as a case study. Filmed at qconsf.com.
Gian Merlino is a senior software engineer at Metamarkets, responsible for the infrastructure behind its data ingestion pipelines and is a committer on the Druid project.
Real-Time Log Analysis with Apache Mesos, Kafka and Cassandra - Joe Stein
Slides for the solution we developed using Mesos, Docker, Kafka, Spark, Cassandra and Solr (DataStax Enterprise Edition), all written in Go, for doing real-time log analysis at scale. Many organizations either need or want log analysis in real time, where you can see within a second what is happening across your entire infrastructure. Today, with the hardware available and the software systems we have in place, you can develop, build, and run these solutions as a service.
Real Time BOM Explosions with Apache Solr and Spark - QAware GmbH
Apache Big Data Conference 2016, Vancouver BC: Talk by Andreas Zitzelsberger (@andreasz82, Principal Software Architect at QAware)
Abstract: Bills of materials (BOMs) are at the heart of every manufacturing process. Especially large BOMs can be found in the automotive industry, where a complex and highly variable product meets high production volumes.
Drawing from the experiences made in an ongoing real world project for a major car manufacturer, Andreas provides an in-depth view how Apache Solr and Apache Spark were used to power an innovative architecture that provides lightning-fast BOM explosions, demand forecasts and scenario-based planning on 20 billion records per scenario.
Spark Summit EU talk by Christos Erotocritou - Spark Summit
This document discusses Apache Ignite and how it can be used with Apache Spark for fast data applications. It provides an overview of Ignite's in-memory data fabric capabilities, how it compares to Spark, and how Ignite can be integrated with Spark to provide shared resilient storage and distributed computing. Examples are given of reading and writing data between Ignite and Spark and using Ignite's in-memory file system and SQL support from Spark.
What is Kafka? What is real-time streaming? What is a data pipeline? What is a message queuing system? This presentation answers these questions and explains the importance of a powerful real-time streaming platform for data scientists.
Wrangling Big Data in a Small Tech Ecosystem - Shalin Hai-Jew
This document summarizes the process of analyzing large datasets from a university's learning management system (LMS) with limited resources. It describes conceptualizing questions, reviewing available data, extracting and processing the data, validating findings, and presenting results. The key challenges identified are that the LMS data comes in "flat files" without defined relationships between variables, making it difficult to answer granular questions. The process involves loading the data into a database program like Access to enable analyzing the entire datasets, which can have millions of rows.
Streaming datasets for personalization - Shriya Arora
Streaming applications have historically been complex to design and implement because of the significant infrastructure investment. However, recent active developments in various streaming platforms provide an easy transition to stream processing, and enable analytics applications/experiments to consume near real-time data without massive development cycles. In this session, we will present our experience with stream processing of unbounded datasets in the personalization space. The datasets consisted of, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These datasets, when ultimately consumed by our machine learning models, directly affect the customer's personalized experience. We'll talk about the experiments we did to compare Apache Spark and Apache Flink, and the challenges we faced.
Kafka Streams: The Stream Processing Engine of Apache Kafka - Eno Thereska
This document discusses Kafka Streams, which is the stream processing engine of Apache Kafka. It provides an overview of Kafka Streams and how it can be used to build real-time applications and services. Some key features of Kafka Streams include its declarative programming model using the Kafka Streams DSL, ability to perform continuous computations on data streams and tables, and building event-driven microservices without external real-time processing frameworks. The document also provides examples of how to build applications that perform operations like joins, aggregations and filtering using the Kafka Streams API.
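As a small taste of the declarative DSL described above, here is a word-count topology using the Kafka Streams Scala API; the topic names are illustrative, and note that the Serdes import path moved in Kafka 2.4 (older releases use org.apache.kafka.streams.scala.Serdes).

```scala
import java.util.Properties
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.serialization.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object WordCountApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder
  builder.stream[String, String]("text-input")   // illustrative topic name
    .flatMapValues(_.toLowerCase.split("\\W+"))
    .filter((_, word) => word.nonEmpty)          // declarative filtering...
    .groupBy((_, word) => word)                  // ...and continuous aggregation
    .count()
    .toStream
    .to("word-counts")

  new KafkaStreams(builder.build(), props).start()
}
```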
Online learning with structured streaming, Spark Summit Brussels 2016 - Ram Sriharsha
This document summarizes an online presentation about online learning with structured streaming in Spark. The key points are:
- Online learning updates model parameters for each data point as it arrives, unlike batch learning which sees the full dataset before updating.
- Structured streaming in Spark provides a single API for batch, streaming, and machine learning workloads. It offers exactly-once guarantees and understands external event time.
- Streaming machine learning on structured streaming works by having a stateful aggregation query that picks up the last trained model and performs a distributed update and merge on each trigger interval. This allows modeling streaming data in a fault-tolerant way.
This document provides an overview of streaming analytics, including definitions, common use cases, and key concepts like streaming engines, processing models, and guarantees. It also provides examples of analyzing data streams using Apache Spark Structured Streaming, Apache Flink, and Kafka Streams APIs. Code snippets demonstrate windowing, triggers, and working with event-time.
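For the triggers mentioned in this summary, a Structured Streaming sink can be told to fire micro-batches at a fixed cadence; `counts` below stands for any streaming DataFrame, such as the windowed counts sketched earlier.

```scala
import org.apache.spark.sql.streaming.Trigger

// Fire a micro-batch every 30 seconds instead of as fast as possible.
counts.writeStream
  .outputMode("update")
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()
```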
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...) - confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Things we’ll discuss about decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end-to-end walkthrough of our processes that allow non-programmers to write and deploy event-driven data flows.
– Showing, end to end, the use of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and the broadcast state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
Evolving Beyond the Data Lake: A Story of Wind and Rain - MapR Technologies
This document discusses how companies are increasingly investing in next-generation technologies like big data, cloud computing, and software/hardware related to these areas. It notes that 90% of data will be on next-gen technologies within four years. It then discusses how a converged data platform can help organizations gain insights from both historical and real-time data through applications that combine operational and analytical uses. Key benefits include the ability to seamlessly access and analyze both types of data.
Streaming Analytics and Internet of Things - Geesara Prathap - WithTheBest
This document discusses streaming analytics and the Internet of Things. It describes some challenges of IoT such as data volume, processing speed requirements, and data storage needs. It then discusses WSO2's IoT and analytics platforms, which can perform real-time, batch, and predictive analytics on streaming IoT and other data. The analytics platform uses Siddhi to analyze event streams using techniques like filtering, windowing, pattern detection, joins, and more. Real-time analytics examples include alerts, counting, correlation, and learning models.
A reference architecture for a service built from 100% open-source components, ready for deployment in the cloud, with enterprise-grade scalability and reliability.
Anton Ovchinnikov, Grid Dynamics
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
An introduction to streaming data; the difference between batch processing and stream processing; research issues in streaming data processing; performance evaluation metrics; and tools for stream processing.
The Lyft data platform: Now and in the future - markgrover
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Real-Time Analytics with Confluent and MemSQL - SingleStore
This document discusses enabling real-time analytics for IoT applications. It describes how industries like auto, transportation, energy, warehousing and logistics, and healthcare need real-time analytics to handle streaming data from IoT sensors. It also discusses how Confluent's Kafka stream processing platform can be used to build applications that ingest IoT data at high speeds, transform the data, and power real-time analytics and user interfaces. MemSQL's in-memory database is presented as a fast and scalable storage option to support real-time analytics on the large volumes of IoT data.
Kostas Tzoumas - Stream Processing with Apache Flink® - Ververica
In this talk the basics of Apache Flink are covered: why the project exists, where it came from, what gap it fills, how it differs from all the other stream processing projects, what it is being used for, and where it is headed. In short, streaming data is now the new trend, and for very good reasons. Most data is produced continuously, and it makes sense that it is processed and analysed continuously. Whether it is the need for more real-time products, adopting micro-services, or building continuous applications, stream processing technology offers to simplify the data infrastructure stack and reduce the latency to decisions.
Debunking Common Myths in Stream Processing - Kostas Tzoumas
This document discusses stream processing with Apache Flink. It begins by defining streaming as the continuous processing of never-ending data streams. It then debunks four common myths about stream processing: 1) that there is always a throughput/latency tradeoff, showing that Flink can achieve high throughput and low latency; 2) that exactly-once processing is not possible, but Flink provides exactly-once state guarantees with checkpoints; 3) that streaming is only for real-time applications, whereas it can also be used for historical data; and 4) that streaming is too hard, whereas most data problems are actually streaming problems. The document concludes by discussing Flink's community and examples of companies using Flink in production.
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and no longer such a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis later. You have to be able to include part of your analytics right after you consume the event streams. Products for doing event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructures such as Apache Kafka. In this talk I will present the theoretical foundations for event and stream processing, show what differences you might find between the more traditional CEP solutions and the more modern Stream Processing solutions, and argue that a combination of both will bring the most value.
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre - HPCC Systems
Data Centric Approach: Our platform is built on the premise of absorbing data from multiple data sources and transforming it into highly intelligent social network graphs that can be processed to reveal non-obvious relationships.
Big Stream Processing Systems, Big Graphs - Petr Novotný
Big Data, a recent phenomenon. Everyone talks about it, but do you really know what Big Data is? Join our four-part series about Big Data and you will get answers to your questions!
We will cover an introduction to Big Data and the available platforms we can use to deal with it. And in the end, we are going to give you an insight into the possible future of dealing with Big Data.
After the two previous episodes you know the basics about Big Data. Yet, it might get a bit more complicated than that, usually when you have to deal with data that is generated in real time. In this case, you are dealing with a Big Stream.
This episode of our series will be focussed on processing systems capable of dealing with Big Streams. But analysing data lacking graphical representation will not be very convenient for us. And this is where we have to use a platform capable of visualising Big Graphs. All these topics will be covered in today’s presentation.
#CHEDTEB
www.chedteb.eu
Apache Flink for IoT: How Event-Time Processing Enables Easy and Accurate Ana... - Big Data Spain
This document discusses Apache Flink for IoT event-time stream processing. It begins by introducing streaming architectures and Flink. It then discusses how IoT data has important properties like continuous data production and event timestamps that require event-time based processing. Examples are provided of companies like King and Bouygues Telecom using Flink for billions of events per day with challenges like out-of-order data and flexible windowing. Event-time processing in Flink is able to handle these challenges through features like watermarks.
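A hedged sketch of the event-time handling described above, against the classic Flink 1.x Scala DataStream API; the Reading type and the ten-second bound are assumptions, and newer Flink versions replace these calls with WatermarkStrategy and explicit window assigners.

```scala
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(deviceId: String, timestamp: Long, value: Double)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// `fromElements` stands in for a real source such as a Kafka consumer.
val readings: DataStream[Reading] = env.fromElements(
  Reading("sensor-1", 1000L, 0.5), Reading("sensor-1", 4000L, 0.7))

readings
  // Watermarks trail the highest timestamp seen by 10 seconds, so events
  // up to 10 seconds out of order still land in the correct window.
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[Reading](Time.seconds(10)) {
      override def extractTimestamp(r: Reading): Long = r.timestamp
    })
  .keyBy(_.deviceId)
  .timeWindow(Time.minutes(1))
  .maxBy("value")
  .print()

env.execute("event-time-demo")
```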
Managing Large Scale Financial Time-Series Data with Graphs - Objectivity
Slides from a recent webinar by Objectivity showing how the ThingSpan platform is ideal for graph analytics to uncover patterns and insights within large, complex data sets in order to make efficient decisions.
Afterwork big data and data viz - from the lake to your screen - Joseph Glorieux
This document discusses a data visualization workshop hosted by OCTOSuisse on exploring and visualizing big data from a data lake. It provides an overview of OCTO's big data capabilities and projects. It then uses a case study of Swiss public transportation data to demonstrate data exploration, analysis, and visualization techniques using tools like Tableau. The goal is to understand data, identify insights, and effectively communicate findings to others.
Slim Baltagi, director of Enterprise Architecture at Capital One, gave a presentation at Hadoop Summit on major trends in big data analytics. He discussed 1) increasing portability between execution engines using Apache Beam, 2) the emergence of stream analytics driven by data streams, technology advances, business needs and consumer demands, 3) the growth of in-memory analytics using tools like Alluxio and RocksDB, 4) rapid application development using APIs, notebooks, GUIs and microservices, 5) open sourcing of machine learning systems by tech giants, and 6) hybrid cloud computing models for deploying big data applications both on-premise and in the cloud.
Similar to Voxxed Days Thessaloniki 2016 - Streaming Engines for Big Data
This document discusses various topics related to career building and job searching. It includes two sample job specifications for software developer roles, highlighting desired skills, technologies, and company culture. It also discusses the importance of passion for one's work, gaining diverse experience, and taking charge of one's own career path rather than relying solely on a single company. The overall message is that building a successful career is a long-term process that requires planning, learning, and navigating different roads and opportunities.
The document discusses strategies for scaling React.js applications. It covers grouping components by feature instead of type, isolating styles with CSS Modules, using redux-saga for asynchronous logic, optimizing performance with techniques like shouldComponentUpdate, and code splitting with Webpack.
Voxxed Days Thessaloniki 2016 - Web assembly : the browser vm we were waiting...Voxxed Days Thessaloniki
WebAssembly is a new compilation target and virtual machine that is designed to run compiled code nearly as fast as native machine code. It aims to be a universal compilation target that runs across browsers and other environments. WebAssembly code is compiled to a binary format that is faster to parse and run than JavaScript. It integrates well with JavaScript and the web while providing better performance than asm.js for tasks like games, 3D graphics, and data parallel computation. WebAssembly is currently supported in all major browsers and provides a compilation target for C/C++ and other languages via compilers like Emscripten.
The document discusses various tactics for avoiding writing software documentation, including:
1. Using constructive laziness to think about documentation but delay writing it.
2. Using techniques like "just-in-time documentation" to avoid writing documentation upfront by pretending it's stored elsewhere or feigning ignorance.
3. Recognizing that while documentation avoidance has its place, compromises like writing minimal documentation may be necessary to explain code to future team members or users.
Voxxed Days Thesaloniki 2016 - Rightsize Your Services with WildFly & WildFly...Voxxed Days Thessaloniki
This document discusses WildFly Swarm, a tool that packages a Java application together with just the pieces of the WildFly application server runtime that it needs. It allows building self-contained applications that can run independently of any pre-installed server. The document outlines the architecture and components of WildFly Swarm, how it allows selecting specific Java EE APIs and frameworks, and how it generates a single executable jar file with all dependencies included. It also provides an example of creating a simple RESTful application using WildFly Swarm.
This document discusses best practices for running microservices in production. It advocates for a metrics-first approach and using containers and Kubernetes for infrastructure. Key points include collecting standard metrics in Kubernetes for visibility, packaging apps with all dependencies like metrics endpoints, and taking an iterative approach to continuously evolve systems through experimentation, measurement, and learning from experience.
This document provides a summary of an HTTP2 presentation. It includes sections on the history and development of HTTP protocols. Key features of HTTP2 like multiplexing, header compression, server push and stream prioritization are explained. Implementation details of HTTP2 such as frame types and the HPACK header compression algorithm are covered at a high level. Real-world adoption statistics and examples of enabling HTTP2 in servers like Apache and Nginx are also mentioned. The presentation concludes by discussing future developments like cache digests and the QUIC protocol.
This document discusses machine learning and how it can be used by developers. It covers topics like supervised learning, unsupervised learning, reinforcement learning, and different machine learning algorithms. It also discusses tools for machine learning like Amazon EMR, Spark, Amazon Machine Learning service, and deep learning with DSSTNE. Finally, it provides an example of how to build a smart mobile app using serverless AWS services like Lambda, Kinesis, S3, Cognito and others with machine learning models.
Voxxed Days Thessaloniki 2016 - Continuous Delivery: Jenkins, Docker and Spri...Voxxed Days Thessaloniki
This document discusses continuous delivery using Jenkins, Docker, and Spring Boot. It defines continuous delivery as getting changes safely and quickly into production. It describes using continuous integration with Jenkins to run automated tests on code commits. It advocates treating servers like "cattle, not pets" by using Docker containers with pre-built images from a registry for deployments. This allows applications to be deployed to different environments like staging and production more easily and consistently.
Voxxed Days Thesaloniki 2016 - A journey to Open Source Technologies on AzureVoxxed Days Thessaloniki
1) Microsoft Azure provides a platform for building and deploying Java applications on virtual machines, containers, and platform as a service (PaaS) offerings.
2) Azure supports the full Java ecosystem including frameworks, tools, and databases and has strong partnerships with the Eclipse Foundation and Linux Foundation.
3) Many large Java projects like Jenkins use Azure to host their infrastructure due to Azure's support for open source technologies and large Java developer community.
Voxxed Days Thessaloniki 2016 - Keynote - JDK 9 : Big Changes To Make Java Sm...Voxxed Days Thessaloniki
This document discusses the major changes coming in JDK 9, including encapsulating unsupported APIs, removing some supported APIs, and introducing a new modular structure. The key changes are the introduction of a module system that groups code into modular units and defines dependencies, changes to the accessibility of APIs between modules, and the restructuring of the JDK itself into modules. This modularization is a major change that will impact application development but aims to make Java applications more secure, maintainable and flexible.
SMS API Integration in Saudi Arabia| Best SMS API ServiceYara Milbes
Discover the benefits and implementation of SMS API integration in the UAE and Middle East. This comprehensive guide covers the importance of SMS messaging APIs, the advantages of bulk SMS APIs, and real-world case studies. Learn how CEQUENS, a leader in communication solutions, can help your business enhance customer engagement and streamline operations with innovative CPaaS, reliable SMS APIs, and omnichannel solutions, including WhatsApp Business. Perfect for businesses seeking to optimize their communication strategies in the digital age.
What to do when you have a perfect model for your software but you are constrained by an imperfect business model?
This talk explores the challenges of bringing modelling rigour to the business and strategy levels, and talking to your non-technical counterparts in the process.
Microservice Teams - How the cloud changes the way we workSven Peters
A lot of technical challenges and complexity come with building a cloud-native and distributed architecture. The way we develop backend software has fundamentally changed in the last ten years. Managing a microservices architecture demands a lot of us to ensure observability and operational resiliency. But did you also change the way you run your development teams?
Sven will talk about Atlassian’s journey from a monolith to a multi-tenanted architecture and how it affected the way the engineering teams work. You will learn how we shifted to service ownership, moved to more autonomous teams (and the challenges that came with that), and established platform and enablement teams.
Measures in SQL (SIGMOD 2024, Santiago, Chile)Julian Hyde
SQL has attained widespread adoption, but Business Intelligence tools still use their own higher level languages based upon a multidimensional paradigm. Composable calculations are what is missing from SQL, and we propose a new kind of column, called a measure, that attaches a calculation to a table. Like regular tables, tables with measures are composable and closed when used in queries.
SQL-with-measures has the power, conciseness and reusability of multidimensional languages but retains SQL semantics. Measure invocations can be expanded in place to simple, clear SQL.
To define the evaluation semantics for measures, we introduce context-sensitive expressions (a way to evaluate multidimensional expressions that is consistent with existing SQL semantics), a concept called evaluation context, and several operations for setting and modifying the evaluation context.
A talk at SIGMOD, June 9–15, 2024, Santiago, Chile
Authors: Julian Hyde (Google) and John Fremlin (Google)
https://doi.org/10.1145/3626246.3653374
Artificial Intelligence and XPath Extension FunctionsOctavian Nadolu
The purpose of this presentation is to provide an overview of how you can use AI from XSLT, XQuery, Schematron, or XML Refactoring operations, the potential benefits of using AI, and some of the challenges we face.
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...XfilesPro
Wondering how X-Sign gained popularity in a quick time span? This eSign functionality of XfilesPro DocuPrime has many advancements to offer for Salesforce users. Explore them now!
Mobile app Development Services | Drona InfotechDrona Infotech
Drona Infotech is one of the best mobile app development companies in Noida, offering maintenance and ongoing support. Mobile app development services can help you maintain and support your app after it has been launched. This includes fixing bugs, adding new features, and keeping your app up to date with the latest
The most important new features of Oracle 23c for DBAs and developers. You can get more details from my YouTube channel video: https://youtu.be/XvL5WtaC20A
Need for Speed: Removing speed bumps from your Symfony projects ⚡️Łukasz Chruściel
No one wants their application to drag like a car stuck in the slow lane! Yet it’s all too common to encounter bumpy, pothole-filled solutions that slow the speed of any application. Symfony apps are not an exception.
In this talk, I will take you for a spin around the performance racetrack. We’ll explore common pitfalls - those hidden potholes on your application that can cause unexpected slowdowns. Learn how to spot these performance bumps early, and more importantly, how to navigate around them to keep your application running at top speed.
We will focus in particular on tuning your engine at the application level, making the right adjustments to ensure that your system responds like a well-oiled, high-performance race car.
Voxxed Days Thesaloniki 2016 - Streaming Engines for Big Data
1. Streaming Engines for Big Data
Spark Streaming: a case study
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
21st October 2016, Thessaloniki
2. Who Am I?
Fast Data Team Engineer @ Lightbend
OSS contributor (Apache Spark on Mesos)
https://github.com/skonto
3. Agenda
● A bit of history...
● Streaming Engines for Big Data
○ Key concepts - Design Considerations
○ Modern analysis of infinite streams
○ Streaming Engines Examples
○ Which one to use?
● Spark Streaming: A Case Study
○ DStream API
○ Structured Streaming
6. Big Data - The story
● One decade ago people started looking at the problem of how to process
massive data sets (Velocity, Variety, Volume).
● The Apache Hadoop project appeared at that time and became the golden
solution for batch processing on commodity hardware. It later grew into an
ecosystem of several other projects: Pig, Hive, HBase, etc.
Timeline:
2003: GFS paper
2004: MapReduce paper
2006: Hadoop project, 0.1.0 release
2009: Hadoop sorts 1 Petabyte
2010: HBase, Pig, Hive graduate
2013: Spark on YARN by Cloudera, YARN in production
2014: Hadoop 2.4, 2.5, 2.6 releases
2015: Hadoop 2.7 release
... present
7. Big Data - The story
[Diagram: the MapReduce model. Input splits (X, Y, Z) flow through parallel MAP tasks, a SHUFFLE phase regroups the intermediate keys (A, B), and REDUCE tasks produce the final output (Q, W).]
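To make the diagram concrete, here is a toy word count written with plain Scala collections. This is only a sketch of the three phases, not Hadoop code; in the real framework each phase runs in parallel across many machines.

// MAP: emit (word, 1) pairs from every input line.
val lines = Seq("a rose is a rose", "big data is big")
val mapped = lines.flatMap(_.split(" ")).map(word => (word, 1))

// SHUFFLE: group the pairs by key, as the framework does over the network.
val shuffled = mapped.groupBy { case (word, _) => word }

// REDUCE: sum the counts for each word.
val counts = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
counts.foreach(println) // e.g. (rose,2), (is,2), (big,2), ...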
8. Big Data - The story
Hadoop pros/cons
● Batch jobs usually take hours, if not days, to complete; in many applications
that is no longer acceptable.
● Traditionally the focus is on throughput rather than latency, and frameworks
like Hadoop were designed with that in mind.
● Accuracy is the best you can get: results are computed over the complete
data set.
9. Big Data - The story
● Giuseppe DeCandia et al., "Dynamo: Amazon's highly available key-value
store", changed the database world in 2007.
● NoSQL databases, along with general-purpose systems like Hadoop, solve
problems that cannot be solved with traditional RDBMSs.
● Technology facts: cheap memory, SSDs, HDDs are the new tape, more CPUs
over more powerful CPUs.
10. Big Data - The story
● Disruptive companies need to utilize ML and the latest information to make
smart decisions sooner.
● And so we need streaming in the enterprise… We no longer talk about Big
Data only; it’s Fast Data first.
Examples: searching, recommendations, real-time financial activities,
fraud detection.
11. Big Data - The story
OpsClarity Report Summary:
● 92% plan to increase their investment in stream processing applications in the
next year
● 79% plan to reduce or eliminate investment in batch processing
● 32% use real time analysis to power core customer-facing applications
● 44% agreed that it is tedious to correlate issues across the pipeline
● 68% identified lack of experience and underlying complexity of new data
frameworks as their barrier to adoption
http://info.opsclarity.com/2016-fast-data-streaming-applications-report.html
13. Streams
● A stream is a flow of data. The flow consists of ephemeral data elements
flowing from a source to a sink.
● Streams become useful when a set of operations/transformations is applied
to them.
● A stream can be infinite or finite in size. This translates to the notions of
unbounded/bounded data.
14. Stream Processing
Stream processing: processing done on an (un)bounded data stream; not all of
the data is available at once.
[Diagram: Source → Processing → Sink]
16. Stream Processing
Processing can be…
● Stream management: connect, iterate…
● Data manipulation: map, flatMap…
● Input/Output
A graph is the abstraction for defining how all the pieces are put together and
how data flows between them. Some systems use a DAG (see the sketch after the
diagram below).
[Diagram: an example processing graph. Data read from a DFS flows through Map,
Reduce, Count, and Distinct operators and is written out to a DB and a DFS.]
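For example, with Spark's RDD API (covered later in this talk) a job in the spirit of that graph might be expressed by chaining operators. This is a hypothetical sketch; the paths and the SparkContext `sc` are assumptions:

// DFS source -> Map -> Reduce -> Distinct -> DFS sink (hypothetical paths).
val counts = sc.textFile("hdfs:///input")   // read from a DFS
  .flatMap(_.split(" "))                    // Map
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // Reduce (causes a shuffle)

counts.distinct()                           // Distinct
  .saveAsTextFile("hdfs:///output")         // write back to a DFS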
18. Stream Processing - Execution Model
Map your graph to an execution plan and run it.
Execution model abstractions: Job, Task, etc.
Actors: JobManager, TaskManager.
Where do the TaskManager and the Tasks run? Threads, nodes, etc.
Important: code runs close to the data. The task code, along with any
dependencies, is serialized and sent over the network, and the results are
communicated back to the application.
19. Stream vs Batch Processing
Batch processing is processing done on a finite data set, with all data available.
There are two types of engines, batch and streaming engines, and each can
actually be used for both types of processing!
21. Streaming Engines for Big Data
Streaming engines allow you to build streaming applications
(Engine + API = Streaming App).
Streaming engines for Big Data provide in addition:
● A rich ecosystem built around them, for example connectors for common
sources, outputs to different sinks, etc.
● Fault tolerance, scalability (cluster management support), management of
stragglers
● ML, graph, and CEP processing capabilities
22. Streaming Engines for Big Data
A big data system at minimum needs:
● A data processing framework, e.g. a streaming engine.
● A Distributed File System.
24. Design Considerations of A Streaming Engine
● Strong consistency: if a machine fails, how are my results affected?
○ Exactly-once processing.
○ Checkpointing (see the sketch after this list).
● Appropriate semantics for integrating time. What about late data?
● API (language support, DAG, SQL support, etc.)
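As a concrete instance of the checkpointing point above, here is how it might be enabled with Spark's DStream API; the application name and directory are hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("demo"), Seconds(1))
// Periodically persist metadata and state so the driver can recover after
// a failure instead of recomputing everything or losing results.
ssc.checkpoint("hdfs:///checkpoints/demo") // hypothetical directory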
25. Design Considerations of A Streaming Engine
● Execution Model - integration with cluster manager(s)
● Elasticity - Dynamic allocation
● Performance: Throughput vs Latency
● Libraries for CEP, Graph, ML, SQL based processing
26. Design Considerations of A Streaming Engine
● Deployment modes: local vs cluster mode
● Streaming vs batch mode: does the code look the same?
● Logging
● Local state management
● Support for session state
27. Design Considerations of A Streaming Engine
● Backpressure
● Off Heap Management
● Caching
● Security
● UI
● CLI env for interactive sessions
29. Analyzing Infinite Data Streams
● Recent advances in streaming are the result of pioneering work:
○ MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB
2013.
○ The Dataflow Model: A Practical Approach to Balancing Correctness,
Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data
Processing, Proceedings of the VLDB Endowment, vol. 8 (2015), pp.
1792-1803
30. Analyzing Infinite Data Streams
● Two cases for processing:
○ Single event processing: event transformation, triggering an alarm on an error event
○ Event aggregations: summary statistics, group-by, join, and similar queries. For example,
compute the average temperature over the last 5 minutes from a sensor data stream.
31. Analyzing Infinite Data Streams
● Event aggregation introduces the concept of windowing with respect to the
selected notion of time:
○ Event time (the time events happen): important for most use cases where context and
correctness matter at the same time. Examples: billing applications, anomaly detection.
○ Processing time (the time events are observed during processing): use cases where I only
care about what I process in a window. Example: accumulated clicks on a page per second.
○ System arrival or ingestion time (the time events arrive at the streaming system).
● Ideally, event time = processing time. In reality, there is skew.
32. Time in Modern Data Stream Analysis
Windows come in different flavors:
● Tumbling windows discretize a stream into non-overlapping windows.
○ E.g. report all distinct users every 10 seconds.
● Sliding windows slide over the stream of data.
○ E.g. report all distinct users for the last 10 minutes, every 1 minute
(see the sketch below).
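A sketch of both window flavors using Spark's DStream API. The socket source is hypothetical, and the batch interval is chosen so that it evenly divides all the window and slide durations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val ssc   = new StreamingContext(new SparkConf().setAppName("windows"), Seconds(10))
val users = ssc.socketTextStream("localhost", 9999) // hypothetical stream of user ids

// Tumbling window: every 10 seconds, over the last 10 seconds only.
users.window(Seconds(10), Seconds(10))
  .transform(_.distinct())
  .count()
  .print()

// Sliding window: every 1 minute, over the last 10 minutes.
users.window(Minutes(10), Minutes(1))
  .transform(_.distinct())
  .count()
  .print()

ssc.start()
ssc.awaitTermination()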
33. Analyzing Infinite Data Streams
● Watermarks: a watermark indicates that no elements with a timestamp older
than or equal to the watermark timestamp should arrive for the specific
window of data.
○ Allows us to mark late data. Late data can either be added to the window or
discarded (see the sketch below).
● Triggers: decide when the window is evaluated or purged.
○ Allow complex logic for window processing.
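For instance, with Flink's (1.x-era) Scala API, a stream carrying event timestamps can be assigned watermarks that tolerate a bounded amount of out-of-orderness. A rough sketch; the Reading type and the inline sample elements are assumptions made for illustration:

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Reading(sensor: String, value: Double, timestamp: Long)

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

env.fromElements(Reading("s1", 21.5, 1000L), Reading("s1", 22.0, 4000L))
  // The watermark trails the highest timestamp seen by 10 seconds; anything
  // arriving behind it for an already-closed window counts as late data.
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[Reading](Time.seconds(10)) {
      override def extractTimestamp(r: Reading): Long = r.timestamp
    })
  .keyBy(_.sensor)
  .timeWindow(Time.minutes(5)) // 5-minute event-time tumbling window
  .max("value")
  .print()

env.execute("event-time sketch")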
34. Analyzing Infinite Data Streams
● Apache Beam is the open-source successor of Google’s Dataflow.
● It is becoming the standard API for streaming. It provides the advanced
semantics needed by current streaming applications.
36. Streaming Engines for Big Data - Pick one
Many criteria: the use case at hand, existing infrastructure, performance,
customer support, cloud vendor, features.
We recommend looking first at:
● Apache Flink for low latency and advanced semantics
● Apache Spark for its maturity and rich set of functionality: ML, SQL, GraphX
● Apache Kafka Streams for simple data transformations from and back to
Kafka topics
38. Spark in a Nutshell
Apache Spark: a memory-optimized distributed computing framework.
It supports caching of data in memory to speed up computations.
39. Spark in a Nutshell - RDDs
Represents a bounded dataset as an RDD (Resilient Distributed Dataset).
An RDD can be seen as an immutable distributed collection.
Two types of operations can be applied on an RDD: transformations like map
and actions like collect.
Transformations are lazy while actions trigger computation on the cluster.
Operations like groupBy cause a shuffle of data across the network (see the
sketch below).
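A minimal sketch of the lazy-evaluation model, assuming an existing SparkContext `sc`:

// Transformations only build a lineage graph; nothing runs yet.
val numbers = sc.parallelize(1 to 1000)
val evens   = numbers.filter(_ % 2 == 0) // transformation: lazy
val squares = evens.map(n => n * n)      // transformation: lazy

// Actions trigger actual computation on the cluster.
val total = squares.count()              // action: runs a job
val first = squares.take(5)              // action

// groupBy repartitions the data by key, causing a network shuffle at runtime.
val byLastDigit = squares.groupBy(_ % 10)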
40. Spark in a Nutshell - Deployment Mode
41. Spark in a Nutshell - Basic Components
43. Spark in a Nutshell - Key Features
Dynamic Allocation
Memory management (Project Tungsten + off heap operations)
Cluster managers: YARN, Standalone, Mesos
Scala, Python, Java, R
Micro-batch engine
SQL API, ML library, GraphX
Monitoring UI
44. Spark Streaming
Two flavors of Streaming:
● DStream API (Spark 1.x) -> a mature API
● Structured Streaming (alpha, Spark 2.0) -> don’t go to production yet
“Based on Spark SQL. User does not need to
reason about streaming end to end”
45. Spark Streaming DStream API
Discretizes the stream based on batchDuration (the batch interval), which is
configured once.
Provides exactly-once semantics with the Kafka direct approach for DStreams, or
with the WAL enabled for reliable receivers/drivers, plus checkpointing for
driver context recovery.
Many of the transformations and actions you get on an RDD you also get on a
DStream (see the sketch below).
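A minimal DStream sketch: the classic word count over a socket source. The host and port are hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch")
val ssc  = new StreamingContext(conf, Seconds(1))    // batch interval, fixed once

val lines  = ssc.socketTextStream("localhost", 9999) // hypothetical source
val counts = lines.flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                                // the same ops as on RDDs
counts.print()

ssc.start()                                          // start receiving and processing
ssc.awaitTermination()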
46. Spark Structured Streaming
● Integrates with the DataFrame and Dataset APIs (Spark SQL) for structured queries
● Allows end-to-end exactly-once for specific sources/sinks (HDFS/S3)
○ Requires replayable sources and idempotent sinks
● Input is fed to a query, and the output of the query is written to a sink.
Two types of output are implemented (see the sketch after this list):
● Complete Mode: the entire updated Result Table is written to the external storage. It is up to
the storage connector to decide how to handle writing the entire table.
● Append Mode: only the new rows appended to the Result Table since the last trigger are
written to the external storage. This is applicable only to queries where existing rows in the
Result Table are not expected to change.
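A sketch of a Spark 2.0 structured streaming query showing where the output mode comes in; the socket source options are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ss-sketch").getOrCreate()

// Unbounded input: each line arriving on the socket becomes a new row.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost") // hypothetical source
  .option("port", 9999)
  .load()

// An aggregation's whole Result Table can change on every trigger,
// so Complete mode rewrites it each time.
val counts = lines.groupBy("value").count()
counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()

// A plain projection only ever adds rows, so Append mode would fit:
// lines.select("value").writeStream.outputMode("append").format("console").start()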
47. Spark Structured Streaming - Not Yet Implemented
● More Sources/Sinks
● Watermarks
● Late data management
● Session state
51. Structured Streaming
The code for computing a mean is the same as in the batch case:
● readStream instead of read
● writeStream instead of write
● Session creation is the same as in the batch case
https://github.com/skonto/talks/tree/master/voxxed-days-thess-2016
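To illustrate the point, here is a hypothetical mean computation in both flavors. The input path and schema are assumptions for the sketch, not the talk's actual code (which is at the URL above):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("mean-sketch").getOrCreate()

// Batch: read a static directory of JSON records (hypothetical path).
val batchDf = spark.read.json("/data/readings")
batchDf.groupBy("sensor").agg(avg("value")).show()

// Streaming: the query is identical; only read -> readStream and
// write -> writeStream change (the file source needs an explicit schema).
val streamDf = spark.readStream.schema(batchDf.schema).json("/data/readings")
streamDf.groupBy("sensor").agg(avg("value"))
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()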