Sparkta is an open source real-time analytics platform based on Apache Spark. It lets users define aggregation policies in JSON documents, with no coding required, and processes streaming data in real time. The platform uses technologies such as Apache Kite, Spark Streaming, and Kafka to ingest data from various sources and store aggregated outputs. Stratio is developing Sparkta to be a fully featured, distributed, high-volume, and pluggable analytics framework.
Apache Spark & Cassandra use case at Telefónica CBS, by Antonio Alcocer (Stratio)
Spark & Cassandra Use Case at Telefónica CyberSecurity (CBS). Antonio Alcocer (antonio@stratio.com), Oscar Mendez (oscar@stratio.com, @omendezsoto). #CassandraSummit 2014
An efficient data mining solution by integrating Spark and Cassandra (Stratio)
Integrating C* and Spark gives us a system that combines the best of both worlds. The goal of this integration is to obtain better results than using Spark over HDFS, because Cassandra's philosophy is much closer to the RDD philosophy than HDFS's is. The aim is a system that mines all the information stored in C* far more efficiently than if that information were stored in HDFS. Cassandra's data storage and Spark's data mining power: an unrivalled mix.
Multiplatform Solution for Graph Datasources (Stratio)
One of the top banks in Europe needed a system that would provide better performance, scale almost linearly with the amount of information to be analyzed, and allow the processes then running on the Host to be moved to a Big Data infrastructure. Over the course of a year we built a system that gives the user greater agility, flexibility and simplicity when viewing profiling information, and that can now analyze the structure of profile data. It is a powerful way to run online queries against a graph database, integrated with Apache Spark and different graph libraries. Essentially, we obtain all the necessary information through Cypher queries sent to a Neo4j database.
Using the latest Big Data technologies, such as Spark DataFrames, HDFS, Stratio Intelligence and Stratio Crossdata, we have developed a solution that can obtain critical information from multiple data sources, such as text files or graph databases.
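To make the Cypher-over-Neo4j piece of the abstract above concrete, here is a minimal sketch of building such a query. The node label, relationship type, and property names are invented for illustration; a real deployment would send the query through a Neo4j driver and feed results into Spark.

```python
# Hedged sketch: constructing a parameterized Cypher query for profile graphs.
# Labels (Customer), relationship types (RELATED_TO), and parameters are
# hypothetical examples, not taken from the actual bank project.

def build_profile_query(depth=2):
    """Build a Cypher query fetching a customer's profile subgraph."""
    return (
        "MATCH (c:Customer {id: $customer_id})"
        f"-[r:RELATED_TO*1..{depth}]->(n) "
        "RETURN c, r, n"
    )

query = build_profile_query()
params = {"customer_id": "42"}  # passed to the driver alongside the query
print(query)
```

Parameterizing `$customer_id` rather than interpolating it keeps the query plan cacheable and avoids injection issues.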
Realtime streaming architecture in INFINARIO, by Jozo Kovac
About our experience with real-time analyses on a never-ending stream of user events. We discuss the Lambda architecture, the Kappa architecture, Apache Kafka, and our own approach.
Lambda architecture for real-time big data, by Trieu Nguyen
Lambda Architecture in Real-time Big Data Project
Concepts & Techniques “Thinking with Lambda”
Case study in some real projects
Why is the Lambda architecture the correct solution for big data?
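The three layers named in the Lambda architecture can be sketched in a few lines of plain Python. This is an illustrative toy, not the talk's implementation: the page-view events, view names, and merge logic are invented to show the batch/speed/serving split.

```python
# Library-free sketch of the Lambda architecture's three layers.
from collections import Counter

def batch_layer(master_dataset):
    """Recompute the batch view from the full, immutable master dataset."""
    return Counter(event["page"] for event in master_dataset)

def speed_layer(realtime_view, event):
    """Incrementally update the real-time view as each new event arrives."""
    realtime_view[event["page"]] += 1

def serving_layer(batch_view, realtime_view, page):
    """Answer a query by merging the batch and real-time views."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

master = [{"page": "/home"}, {"page": "/home"}, {"page": "/cart"}]
batch_view = batch_layer(master)               # periodic full recompute
realtime_view = Counter()
speed_layer(realtime_view, {"page": "/home"})  # event not yet in the batch view
print(serving_layer(batch_view, realtime_view, "/home"))  # 3
```

The key property shown: the query stays correct even though the newest event has not yet been absorbed by a batch recompute.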
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark (Databricks)
Around the world, businesses are turning to AI to transform the way they operate and serve their customers. But before they can implement these technologies, companies must address the roadblock of moving from batch analytics to making real-time decisions by rapidly accessing and analyzing the relevant information amidst a sea of data. Yaron will explain how to make Spark handle multivariate real-time, historical and event data simultaneously to provide immediate and intelligent responses. He will present several time sensitive use-cases including fraud detection, prevention of outages and customer recommendations to demonstrate how to perform predictive analytics and real-time actions with Spark.
Speaker: Yaron Ekshtein
Time Series Analysis Using an Event Streaming Platform, by Dr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and for protecting safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
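The "reconstruction of correlation networks" mentioned above can be illustrated with a small pure-Python example: compute pairwise Pearson correlations between time series and keep an edge wherever the correlation is strong. The sensor names, data, and 0.9 threshold are invented for the sketch.

```python
# Illustrative sketch: building a correlation network from time series.
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

series = {
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [2.1, 4.0, 6.2, 8.1],  # tracks sensor_a closely
    "sensor_c": [5.0, 1.0, 4.0, 2.0],  # unrelated
}

# Keep an edge whenever |correlation| exceeds the chosen threshold.
threshold = 0.9
names = sorted(series)
edges = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
         if abs(pearson(series[a], series[b])) > threshold]
print(edges)  # [('sensor_a', 'sensor_b')]
```

The resulting edge list is the reconstructed graph; at scale this all-pairs step is what the streaming platform would parallelize.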
Data Warehousing with Spark Streaming at Zalando (Databricks)
Zalando's AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate already-outdated data. Modern data integration pipelines need to deliver fast, easy-to-consume data sets of high quality. Based on Spark Streaming and Delta, the central data warehousing team was able to deliver widely used master data as S3 or Kafka streams and as snapshots at the same time.
The talk will cover challenges in our fashion data platform and provide a detailed architectural deep dive into separating integration from enrichment, providing streams as well as snapshots, and feeding the data to distributed data marts. Finally, lessons learned and best practices around Delta's MERGE command, the Scala API vs. Spark SQL, and schema evolution give more insights and guidance for similar use cases.
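For readers unfamiliar with the MERGE command mentioned above, here is the general shape of a Delta Lake upsert. The table and column names are invented for illustration; only the MERGE INTO / WHEN MATCHED / WHEN NOT MATCHED structure is standard Delta SQL.

```python
# Hedged sketch of the Delta Lake MERGE (upsert) pattern.
# Schema and table names (warehouse.articles, staging.article_updates)
# are hypothetical, not from Zalando's platform.
merge_sql = """
MERGE INTO warehouse.articles AS target
USING staging.article_updates AS source
ON target.article_id = source.article_id
WHEN MATCHED THEN
  UPDATE SET target.price = source.price, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (article_id, price, updated_at)
  VALUES (source.article_id, source.price, source.updated_at)
"""
# With PySpark and Delta available, this would run as: spark.sql(merge_sql)
print(merge_sql.strip().splitlines()[0])
```

Matched rows are updated in place and unmatched rows inserted, which is what lets a streaming pipeline maintain a continuously current snapshot table.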
Simplify and Scale Data Engineering Pipelines with Delta Lake (Databricks)
We’re always told to ‘Go for the Gold!’, but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
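The Bronze/Silver/Gold hops can be sketched without any Spark or Delta dependency; here each hop is just a Python list standing in for a table, with invented example records, to show what "progressively adding structure" means.

```python
# Library-free sketch of the 'multi-hop' (Bronze/Silver/Gold) pattern.
# In a real pipeline each hop would be a Delta table; here they are lists.
import json

raw = ['{"id": 1, "amount": "10.5"}', '{"id": 2, "amount": "bad"}',
       '{"id": 3, "amount": "4.5"}']

# Bronze: ingest raw records as-is, preserving the original payload.
bronze = [{"raw": line} for line in raw]

# Silver: parse, type-cast, and drop records that fail validation.
def parse(rec):
    try:
        d = json.loads(rec["raw"])
        return {"id": d["id"], "amount": float(d["amount"])}
    except (ValueError, KeyError):
        return None  # quarantine/skip malformed input

silver = [r for r in map(parse, bronze) if r is not None]

# Gold: business-level aggregate ready for ML or BI.
gold = {"total_amount": sum(r["amount"] for r in silver)}
print(gold)  # {'total_amount': 15.0}
```

Because Bronze keeps the raw payload, Silver and Gold can always be rebuilt from the single source of truth when parsing rules change.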
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli (Spark Summit)
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis... (Spark Summit)
Redis accelerates Apache Spark execution by 45 times when used as a shared, distributed in-memory datastore for Spark in analyses such as time-series range queries. With the Redis module for machine learning, redis-ml, Spark ML models gain a new real-time serving layer that offloads model processing directly to Redis, lets multiple applications reuse the same models, and speeds up classification and execution of these models by 13x. Join this session to learn more about the Redis Labs connector for Apache Spark that enhances production implementations of real-time big data processing.
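The time-series range query mentioned above maps naturally onto a score-ordered structure like a Redis sorted set. As a stand-in that needs no Redis server, this sketch does the same range lookup with `bisect` over timestamp-sorted points; the sample data is invented, and real code would go through the spark-redis connector.

```python
# Sketch of the time-range query idea behind Redis sorted sets (ZRANGEBYSCORE),
# simulated with bisect over (timestamp, value) pairs kept sorted by timestamp.
import bisect

points = sorted([(1000, 10.0), (1005, 10.2), (1010, 9.8), (1020, 10.5)])
timestamps = [t for t, _ in points]

def range_query(t_start, t_end):
    """Return all points with t_start <= timestamp <= t_end."""
    lo = bisect.bisect_left(timestamps, t_start)
    hi = bisect.bisect_right(timestamps, t_end)
    return points[lo:hi]

print(range_query(1005, 1015))  # [(1005, 10.2), (1010, 9.8)]
```

Keeping data ordered by score is what lets such range scans avoid touching the full dataset, which is the property the talk attributes to the Redis-backed setup.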
Slides from my presentation on Lambda Architecture at Indix, presented at Fifth Elephant 2014.
It talks about our experience in using the Lambda Architecture at Indix to build a large-scale analytics system on unstructured, dynamically changing data sources using Hadoop, HBase, Scalding, Spark and Solr.
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and... (Databricks)
Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. Workday is a “pure SaaS” company, providing a suite of Financial and HCM (Human Capital Management) apps to about 2,000 companies around the world, including more than 30% of the Fortune 500. There are significant business and technical challenges in supporting millions of concurrent users and hundreds of millions of daily transactions. A memory-centric, graph-based architecture allowed us to overcome most of these problems.
As Workday grew, transactions from existing and new customers generated vast amounts of valuable and highly sensitive data. The next big challenge was to provide an in-app analytics platform that could handle the multiple types of accumulated data and also allow blending in external datasets. Workday users wanted it to be super-fast, but also intuitive and easy to use, both for financial and HR analysts and for regular, less technical users. Existing backend technologies were not a good fit, so we turned to Apache Spark.
In this presentation, we will share the lessons we learned when building highly scalable multi-tenant analytics service for transactional data. We will start with the big picture and business requirements. Then describe the architecture with batch and interactive modules for data preparation, publishing, and query engine, noting the relevant Spark technologies. Then we will dive into the internals of Prism’s Query Engine, focusing on Spark SQL, DataFrames and Catalyst compiler features used. We will describe the issues we encountered while compiling and executing complex pipelines and queries, and how we use caching, sampling, and query compilation techniques to support interactive user experience.
Finally, we will share the future challenges for 2018 and beyond.
Optimizing industrial operations using the big data ecosystem (DataWorks Summit)
GE Digital is undertaking a journey to optimize the reliability, availability, and efficiency of assets in the industrial sector and to converge IT and OT. To do so, GE Digital is building cloud-based products that enable customers to analyze asset data, detect anomalies, and receive recommendations for operating plants efficiently while increasing productivity. In energy sectors such as oil and gas, power, or renewables, a single plant comprises multiple complex assets, such as steam turbines, gas turbines, and compressors, to generate power. Each system contains various sensors that detect the operating conditions of the assets, generating a large volume and variety of data. A highly scalable distributed environment is required to analyze such volumes of data and provide operating insights in near real time.
In this session I will share the challenges encountered when analyzing large volumes of data and doing in-stream analysis, how we standardized the industrial data using data frames, and how we approached performance tuning.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz... (Databricks)
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
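The "on-demand watchpoint with a guard predicate" described above can be pictured as a tap on a data pipeline that captures only matching intermediate records while letting everything flow through unchanged. This is a plain-Python analogy with made-up data, not BigDebug's actual Spark API.

```python
# Toy illustration of a guard-predicate watchpoint on a pipeline.
captured = []

def watchpoint(records, guard):
    """Yield records unchanged; capture those matching the guard predicate."""
    for r in records:
        if guard(r):
            captured.append(r)  # ship only selected records back for inspection
        yield r                 # downstream computation is unaffected

data = [3, -1, 7, -5, 2]
# Downstream transformation keeps running at full throughput...
doubled = [x * 2 for x in watchpoint(data, lambda x: x < 0)]
# ...while the watchpoint captured only the suspicious negatives.
print(doubled, captured)  # [6, -2, 14, -10, 4] [-1, -5]
```

The point of the pattern, as in BigDebug, is that inspection is selective: only records satisfying the predicate are transferred, so the distributed computation is never paused.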
The BigDebug system should contribute to improving Spark developer productivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Big Data Meets Learning Science: Keynote by Al Essa (Spark Summit)
How do we learn and how can we learn better? Educational technology is undergoing a revolution fueled by learning science and data science. The promise is to make a high-quality personalized education accessible and affordable by all. In this presentation Alfred will describe how Apache Spark and Databricks are at the center of the innovation pipeline at McGraw Hill for developing next-generation learner models and algorithms in support of millions of learners and instructors worldwide.
2024 February 28 - NYC - Meetup: Unlocking Financial Data with Real-Time Pipelines, by Timothy Spann
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipelines
https://www.meetup.com/futureofdata-newyork/events/298660453/
Unlocking Financial Data with Real-Time Pipelines
(Flink Analytics on Stocks with SQL)
By Timothy Spann
Financial institutions thrive on accurate and timely data to drive critical decision-making processes, risk assessments, and regulatory compliance. However, managing and processing vast amounts of financial data in real-time can be a daunting task. To overcome this challenge, modern data engineering solutions have emerged, combining powerful technologies like Apache Flink, Apache NiFi, Apache Kafka, and Iceberg to create efficient and reliable real-time data pipelines. In this talk, we will explore how this technology stack can unlock the full potential of financial data, enabling organizations to make data-driven decisions swiftly and with confidence.
Introduction: Financial institutions operate in a fast-paced environment where real-time access to accurate and reliable data is crucial. Traditional batch processing falls short when it comes to handling rapidly changing financial markets and responding to customer demands promptly. In this talk, we will delve into the power of real-time data pipelines, utilizing the strengths of Apache Flink, Apache NiFi, Apache Kafka, and Iceberg, to unlock the potential of financial data. I will be utilizing NiFi 2.0 with Python and Vector Databases.
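To give a feel for the "Flink Analytics on Stocks with SQL" part of the talk, here is what a tumbling-window average over stock ticks expresses, sketched in plain Python. The tick data and the 60-second window size are invented; in Flink SQL this would be a `GROUP BY TUMBLE(ts, INTERVAL '60' SECOND), symbol` aggregation.

```python
# Plain-Python sketch of a tumbling-window average price per symbol.
from collections import defaultdict

ticks = [  # (epoch_seconds, symbol, price) - invented sample data
    (0, "ACME", 100.0), (30, "ACME", 102.0), (70, "ACME", 104.0),
]

WINDOW = 60  # seconds per tumbling window

sums = defaultdict(lambda: [0.0, 0])
for ts, symbol, price in ticks:
    key = (symbol, ts // WINDOW)  # assign each tick to exactly one window
    sums[key][0] += price
    sums[key][1] += 1

averages = {k: total / n for k, (total, n) in sums.items()}
print(averages)  # {('ACME', 0): 101.0, ('ACME', 1): 104.0}
```

Tumbling windows partition time into non-overlapping buckets, which is why each tick contributes to exactly one average.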
Timothy Spann
Principal Developer Advocate, Cloudera
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Kafka, Apache Pulsar, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming. Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
https://twitter.com/PaaSDev
https://www.linkedin.com/in/timothyspann/
https://medium.com/@tspann
https://github.com/tspannhw/FLiPStackWeekly/
Open Blueprint for Real-Time Analytics in Retail: Strata Hadoop World 2017 S... (Grid Dynamics)
This presentation outlines key business drivers for real-time analytics applications in retail and describes the emerging architectures based on In-Stream Processing (ISP) technologies. The slides present a complete open blueprint for an ISP platform - including a demo application for real-time Twitter Sentiment Analytics - designed with 100% open source components and deployable to any cloud.
To learn more, read an adjoining blog series on this topic here : https://blog.griddynamics.com/in-stream-processing-service-blueprint
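As a taste of the demo application's real-time Twitter sentiment scoring, here is a minimal lexicon-based scorer applied per message. The word lists and messages are invented; the actual blueprint runs this kind of per-record scoring inside a full in-stream processing stack.

```python
# Hedged sketch: tiny lexicon-based sentiment scoring over a message stream.
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "terrible", "awful"}

def sentiment(tweet):
    """Classify a message by counting positive vs. negative lexicon hits."""
    words = tweet.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

stream = ["I love this store", "terrible checkout experience", "just browsing"]
print([sentiment(t) for t in stream])  # ['positive', 'negative', 'neutral']
```

Because the function is stateless per message, it parallelizes trivially across stream partitions, which is what makes it a good fit for an in-stream processing platform.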
Fast Data – Fast Cars: How Apache Kafka is revolutionizing the data world (Confluent)
For the automotive industry, as for every other sector, the digital transformation is at the same time a digital revolution: new market players, new technologies, and ever-larger volumes of data create new opportunities but also new challenges, and call for entirely new ways of thinking as well as new IT architectures.
60% of Fortune 500 companies rely on the comprehensive distributed streaming platform Apache Kafka® for their data streaming projects, among them AUDI AG.
In this webinar you will learn:
How Kafka serves as the foundation both for data pipelines and for applications that consume and process real-time data streams.
How Kafka Connect and Kafka Streams support business-critical applications
How Audi, with the help of Kafka and Confluent, built a Fast Data IoT platform that is revolutionizing the “Connected Car” space
Speakers:
David Schmitz, Principal Architect, Audi Electronics Venture GmbH
Kai Waehner, Technology Evangelist, Confluent
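The consume-transform-produce pattern that Kafka Streams provides for such applications can be pictured with a plain-Python analogy. Topics are simulated with lists and the sensor records, keys, and 80-degree threshold are invented; real code would use the Streams DSL (or a Kafka client library) against live topics.

```python
# Plain-Python analogy of a stateless Kafka Streams transform:
# read from an input topic, process each record, write to an output topic.

input_topic = [("car-1", "21.5"), ("car-2", "22.3"), ("car-1", "95.0")]
output_topic = []

def process(record):
    """Per-record transform: parse the value and flag implausible readings."""
    key, value = record
    temp = float(value)
    status = "alert" if temp > 80.0 else "ok"
    return (key, {"temp": temp, "status": status})

for rec in input_topic:            # the 'stream' loop a framework would run
    output_topic.append(process(rec))

print([v["status"] for _, v in output_topic])  # ['ok', 'ok', 'alert']
```

Keeping the transform stateless and keyed, as here, is what lets a streaming platform scale it out by partition.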
Data Engineer's Lunch #81: Reverse ETL Tools for Modern Data Platforms (Anant Corporation)
During this lunch, we’ll review open-source reverse ETL tools to uncover how to send data back to SaaS systems.
Developing Enterprise Consciousness: Building Modern Open Data Platforms (ScyllaDB)
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people who need it, when they need it, anytime, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications, using open source tools and technologies as well as more modern low-code ETL/reverse ETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for big companies
- What can ScyllaDB do for smaller companies
EDA Meets Data Engineering – What's the Big Deal? (confluent)
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today’s enterprises, bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? How is it more than just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
Oracle Digital Business Transformation and Internet of Things by Ermin Prašović (Bosnia Agile)
This session discusses solutions and Oracle's strategy to support digital transformation for companies interested in their business transformation path, and how to align with the modern trends brought by digitalization. The second part of the session covers what is new in Oracle's offering for Internet of Things (IoT) services, including solutions based on IoT.
How do you reinvent your organization in an iterative and pragmatic way? This is the result of using our digital toolbox. It allows you to transform your business model and expand your ecosystem by setting up your digital platform. This reinvention is also supported by the adaptation of your governance, allowing you to innovate while guaranteeing the performance of your organization. For any information / suggestion / collaboration - william.poos@nrb.be
Processing Real-Time Data at Scale: A streaming platform as a central nervous... (confluent)
(Marcus Urbatschek, Confluent)
Presentation during Confluent’s streaming event in Munich. This three-day hands-on course focused on how to build, manage, and monitor clusters using industry best-practices developed by the world’s foremost Apache Kafka™ experts. The sessions focused on how Kafka and the Confluent Platform work, how their main subsystems interact, and how to set up, manage, monitor, and tune your cluster.
SnapLogic has been gaining traction in big-data integration. It recently announced the Fall 2015 release of its Elastic Integration Platform, which adds capabilities for big-data integration that now include Spark (an open source in-memory data-processing framework), a new Snap (preconfigured connector) for Cassandra (an open source distributed ‘big’ database) and support for Microsoft Cortana Analytics. SnapLogic is positioning this release as a self-service hybrid cloud integration offering, and it is intended to strengthen its position among Microsoft customers and others seeking cloud-based big-data analytics.
Technical Deep Dive: Using Apache Kafka to Optimize Real-Time Analytics in Fi... (confluent)
Watch this talk here: https://www.confluent.io/online-talks/using-apache-kafka-to-optimize-real-time-analytics-financial-services-iot-applications
When it comes to the fast-paced nature of capital markets and IoT, the ability to analyze data in real time is critical to gaining an edge. It’s not just about the quantity of data you can analyze at once, it’s about the speed, scale, and quality of the data you have at your fingertips.
Modern streaming data technologies like Apache Kafka and the broader Confluent platform can help detect opportunities and threats in real time. They can improve profitability, yield, and performance. Combining Kafka with Panopticon visual analytics provides a powerful foundation for optimizing your operations.
Use cases in capital markets include transaction cost analysis (TCA), risk monitoring, surveillance of trading and trader activity, compliance, and optimizing profitability of electronic trading operations. Use cases in IoT include monitoring manufacturing processes, logistics, and connected vehicle telemetry and geospatial data.
This online talk will include in-depth practical demonstrations of how Confluent and Panopticon together support several key applications. You will learn:
-Why Apache Kafka is widely used to improve performance of complex operational systems
-How Confluent and Panopticon open new opportunities to analyze operational data in real time
-How to quickly identify and react immediately to fast-emerging trends, clusters, and anomalies
-How to scale data ingestion and data processing
-How to build new analytics dashboards in minutes
Azure Data Explorer deep dive - review 04.2020 (Riccardo Zamana)
Full review (04.2020) of the Azure Data Explorer service. The slide deck is a review of Kusto in terms of usage, ingestion techniques, querying and exporting data, and using anomaly detection and clustering methods.
Getting insights from IoT data with Apache Spark and Apache Bahir (Luciano Resende)
The Internet of Things (IoT) is all about connected devices that produce and exchange data, and producing insights from these high volumes of data is challenging. In this session, we will start by providing a quick introduction to the MQTT protocol, and then focus on using AI and machine learning techniques to provide insights from data collected from IoT devices. We will present some common AI concepts and techniques used by the industry to deploy state-of-the-art smart IoT systems. These techniques allow systems to determine patterns from the data, predict and prevent failures, and suggest actions that can minimize or avoid IoT device breakdowns in an intelligent way, beyond rule-based and database search approaches. We will finish with a demo that puts together all the techniques discussed in an application that uses Apache Spark and Apache Bahir support for MQTT.
Mesos Meetup - Building an enterprise-ready analytics and operational ecosyst... (Stratio)
On November 6th, we got together at Google Campus to talk about Mesos and DC/OS.
Ignacio Mulas, Sparta & Spark Product Owner at Stratio, explained how to build an environment that can secure and govern its data for operational and analytical applications on top of the DC/OS platform. He showed that analytical and machine learning pipelines can be combined with operational processes while maintaining security, with governance tools provided to manage our data. He focused on the architecture and tools needed to achieve such an ecosystem and showed a demo of it. He also explained how pipelines can be developed interactively with auto-discovered data catalogs, and how to explore the results.
Find out more: https://www.stratio.com/events/discover-how-to-deploy-a-secure-big-data-pipeline-with-dcos/
Can an intelligent system exist without awareness? BDS18 (Stratio)
Marco Baena, Head of AI at Stratio, presented at Big Data Spain 2018 "Can an intelligent system exist without awareness?"
Description:
Is a system without awareness a truly intelligent system? Here he showed why pursuing awareness is a mandatory approach and how an Awareness-Centric model is the key to any competitive modern business. In addition, he also showed the technical benefits of such systems and how Stratio's proposal is a reference when following this model.
On September 6th, we got together at Campus Madrid to learn about Kafka and KSQL. Discover with Antonio Abril, Software Architect at Stratio, how we can use Kafka to process real-time social media data.
Find out more about the event: https://www.stratio.com/blog/events/apache-kafka-and-ksql-in-action/
On July 19th, we got together at Google Campus to talk about how to increase and complete existing data to improve Machine Learning Models.
Fernando Velasco, Data Scientist at Stratio, and Raúl de la Fuente, Presales at Stratio, talked about techniques of image processing like Data Augmentation and other more modern techniques that involve the use of Deep Learning models.
More info: http://www.stratio.com/blog/events/planet-data-scientist-live-meet-the-wild-data/
Using Kafka on Event-driven Microservices Architectures - Apache Kafka Meetup (Stratio)
On July 18th, we got together at Campus Madrid to discover all about Kafka. Discover with Óscar Gómez, Software Architect at Stratio, how Kafka can help us with our event-driven Microservices Architectures.
Find out more: http://www.stratio.com/blog/events/all-about-kafka-origins-ecosystem-and-future-directions/
Within the use of Machine Learning models for prediction, one set of techniques that stands out is model combination. We will study these combinations with Fernando Velasco, Data Scientist at Stratio, who will explain what they are, and why and when to use them. Two of the main general techniques will be explained, boosting and bagging, and finally, how to perform feature selection through ensembles.
Looking to build Kappa architectures or SMACK applications, or to use distributed technologies such as Spark, Kafka, Elasticsearch and Hadoop? Do you want to build your own data-centric platform? Are you lacking a development team with Scala and Spark skills? Are you trying to move towards full Digital Transformation?
Have our questions stressed you out enough? Then you are ready for our latest release!
Our next product release will include Sparta 2.0, ready to solve the above-mentioned issues. Stratio clients will be able to build Big Data processes and data pipeline workflows in minutes with an amazing UI, integrated with Spark, Kafka and several Big Data technologies.
In this session, we will show how Sparta 2.0 and its workflow jobs are a key piece in a data-centric platform and how it is integrated with a PaaS (DC/OS). We will also show how Stratio has made sure all pieces within the platform, including Sparta 2.0 are secured.
By: José Carlos García and Javier Yuste
A data-centric platform integrates multiple Big Data open source technologies. For example, at Stratio we use Spark, Kafka, Elasticsearch and many more. Most of these technologies do not offer native security. This lack of security not only leaves companies open to critical risks like data leakage, insecure communications or DoS attacks, but is also a major barrier to complying with different regulations such as LOPD, PCI-DSS or the upcoming GDPR. This talk gives a technical and innovative overview of how companies can face the challenge of protecting the data and services in their data-centric platform, focusing on three main aspects: implementing network segmentation, managing AAA and securing data processing.
By: Carlos Gómez
Our data lake is full of data, our business intelligence is squeezing every byte of information and our operational applications are just great… why do I still feel I can do better? Having big data gives you a competitive advantage, but using big data in your daily operations will give you much more. Taking the best of both worlds, we aim for systems in which big data analysis is performed on operational data in real-time and our applications embed the extracted intelligence in their everyday operations. The good news is that combining both is perfectly possible using a data-centric approach together with well-known industry patterns and a few good practices.
By: Nacho Mulas
Artificial Intelligence on Data Centric Platform (Stratio)
Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it? This training will include a demonstration in a distributed data-centric platform which provides a data intelligence layer, composed of artificial intelligence models able to make use of a whole company’s data.
Nowadays, one of the most innovative techniques in the realm of artificial intelligence is Deep Neural Nets. Among the many applications, language modelling, machine translation and image generation are receiving particular attention. Deep nets are also powerful in predictive modelling domains such as stock pricing and the energy industry. We will address a few case studies modeled with TensorFlow, running on Stratio’s data-centric product in a distributed cluster.
By: Fernando Velasco
Opening of our Deep Learning Lunch & Learn series. First session: introduction to Neural Networks, Gradient descent and backpropagation, by Pablo J. Villacorta, with a prologue by Fernando Velasco
“A Distributed Operational and Informational Technological Stack” Stratio
By Loreto Fernández Costas and Adrián Doncel Gabaldó.
Digital Transformation starts with data. What if a solution existed that put data at the center, in a single place, serving all applications around it – A distributed data centric solution that combined the operational and the informational, managed by a single data center operating system?
This session will provide a detailed explanation of such a solution, bringing the concept of data centricity to life. We will cover the details of the array of open source technologies that come together to create a transformational solution to the historic problem of physical companies: from multiple data stores, distributed run-time engines and SQL engines based on Spark, to microservices, Machine Learning and Deep Learning algorithms. Big Data 3.0 is just around the corner.
Meetup: How to monitor and optimize Spark processes using the Spark Web UI (Stratio)
Apache Spark is already a reality in the computing world, and now we need not only to know what the technology is capable of, but also to make it productive. To do so, we need to know how to audit each of its processes in a simple way, without requiring deep knowledge of the technology.
Jorge López-Malla shows which tools the Apache Spark framework itself provides to monitor the performance of algorithms, how to take advantage of them to improve Apache Spark jobs, both streaming and batch, and what magic SparkSQL performs when you want to do a simple join.
Nowadays, combining Machine Learning models is very popular: in most Kaggle competitions, the top entries usually employ some variant of these techniques. We talk about what leads us to use them, when and how. We focus on resampling techniques such as bagging and boosting, work through the implementation of a simple AdaBoost over binary trees, and finally see an application of Random Forest for feature selection.
More information:
www.stratio.com
https://www.youtube.com/StratioBD
Big Data projects have moved past their POC phases, where security was, at best, a secondary concern. Big Data tools, and more specifically those used to process data, therefore need to catch up on security.
Tools like Spark were not designed with security in mind. That is why Abel and Jorge want to share the hacks they had to apply to Spark in order to use Kerberos to authenticate against secured services.
Classification algorithms play an important role in different business areas, such as fraud detection, cross-selling or customer behavior. In the business context, interpretability is a very desirable property, sometimes even a hard requirement. However, interpretable algorithms are usually outperformed by other non-interpretable algorithms such as Random Forest. In this talk Antonio Soriano and Mateo Alvarez presented a distributed implementation in Spark of the Logistic Model Tree (LMT) algorithm (Landwehr, et al. (2005). Machine Learning, 59(1-2), 161-205.), which consists of a decision tree with logistic classifiers in the leaves. While being highly interpretable, the LMT consistently performs equally or better than other popular algorithms in several performance metrics such as accuracy, precision/recall or area under the ROC curve.
Stratio's Cassandra Lucene index: Geospatial use cases - Big Data Spain 2016 (Stratio)
Stratio’s Cassandra Lucene Index, derived from Stratio Cassandra, is an open-source plugin for Apache Cassandra that extends its index functionality to provide near real-time search, as in Elasticsearch or Solr, including full-text search capabilities and free multivariable, geospatial and bitemporal search. It is achieved through an Apache Lucene-based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio’s Cassandra indexes are one of the core modules on which Stratio’s Big Data platform is based.
Andres de la Peña discusses the recently added geospatial search features in Stratio's Cassandra Lucene index using some Nephila Capital use cases. These new features include indexing complex polygons, nearest neighbour search, and the application of chained geometrical transformations such as bounding box, convex hull, centroid, union, intersection, exclusion and distance buffer.
5. [Architecture diagram] The Stratio platform around a customer lake: Stratio Ingestion (ingests and transforms data from CRM, ERP, call center, BI, and internal and external data sources), Stratio Streaming (analyzes and processes real-time streaming), Stratio Quantum, Stratio Deep (processes and combines with Spark), Stratio Crossdata (a unified SQL interface with Machine Learning and algorithms, exposed through ODBC, JDBC and a REST API) and Stratio Datavis (creates and designs dashboards and reports for BI, ad hoc and application consumers). Storage backends: HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis, Oracle, DB2 and other databases.
6. [Architecture diagram] The same platform with the ingestion path detailed: Stratio Ingestion pulls streaming data through Apache Kite and Apache Flume from CRM, ERP, call center, BI, and internal and external data sources; Stratio Streaming analyzes and processes it; Stratio Crossdata combines data from any source with Spark through its unified SQL interface (ODBC, JDBC, REST API), with MLlib-based Machine Learning and algorithms; Stratio Deep processes and combines with Spark over the customer lake (HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis, Oracle, DB2 and other databases); Stratio Datavis creates and designs dashboards and reports.
7. [Architecture diagram] Data combination through time on the same platform: real-time data lives in ephemeral tables, past data in stored tables and future data in Quantum tables. Stratio Crossdata lets users consult and analyze all of them through its SQL interface (ODBC, JDBC, REST API) with MLlib-based Machine Learning and algorithms.
8. [Architecture diagram] Informational + operational without the need to replicate data: alongside the analytical backends (HDFS, S3, Elasticsearch, MongoDB, Cassandra, Redis), the same SQL interface queries the operational stores (Oracle, DB2, MongoDB, Teradata and other databases).
10. The time is NOW
We all know this story already. Social media and networking sites are a part of the fabric of everyday life, changing the way the world shares and accesses information. The overwhelming amount of information gathered not only from messages, updates and images but also from readings from sensors, GPS signals and many other sources was the origin of a (big) technological revolution. Remember? VOLUME, VARIETY & VELOCITY
CONFERENCE10
11. Look at these sexy infographics!
We all love data visualization. Insights from this vast amount of data allow us to learn from the users and explore our own world. We can follow in real-time the evolution of a topic, an event or even an incident just by exploring aggregated data.
12. Delivering real-time business on the Internet
But beyond cool visualizations, there are some core services delivered in real-time, using aggregated data to answer common questions in the fastest way. These services are the heart of the business behind their nice logos. Site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards… Aggregated data feeds directly to end users, publishers, and advertisers, among others.
13. Pushing business processes to perform faster
Digital companies, born to develop their services in real-time, have changed the expectations of many other businesses. Real-time information makes it possible for a company to be much more agile than its competitors, improving business answers and gaining insights on their performance…
14. Listen to your data…
CLIENT, TPV, accounts, loans and credits, insurances, broker, mortgages, cards, deposits, ATM, online gateway, application logs, social networks, transactions, geolocation, CRM.
Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a process. And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...
15. …and start delivering real-time services
Real-time monitoring could be really nice, but your company needs to work in the same way as digital companies:
• Rethinking existing processes to deliver them faster, better.
• Creating new opportunities for competitive advantages.
17. Real-time fraud monitoring
[Pipeline diagram: DATA RECEIVER → REAL-TIME AGGREGATION → CONSOLIDATION (dashboarding, reporting) → FRAUD DETECTION]
Leveraging the power of Spark Streaming, we have developed some fraud detection solutions, aggregating data in real-time to work better with machine learning algorithms.
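The slides don't include code, but the core of the real-time aggregation step described above can be sketched in plain Python: bucket incoming transaction events into fixed time windows per card, so a downstream rule or ML model sees per-window features rather than raw events. Everything here (field names, the 60-second window, the velocity threshold) is an illustrative assumption, not Stratio's implementation; in Spark Streaming this would be a windowed reduce over a DStream.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed aggregation window, not from the slides

def window_key(ts: float) -> int:
    """Assign an event timestamp to a fixed 60-second window."""
    return int(ts // WINDOW_SECONDS)

def aggregate(events):
    """Aggregate raw transaction events into per-card, per-window features.

    Each event is a dict: {"card": str, "ts": float, "amount": float}.
    Returns {(card, window): {"count": n, "total": sum, "max": max_amount}}.
    """
    agg = defaultdict(lambda: {"count": 0, "total": 0.0, "max": 0.0})
    for e in events:
        key = (e["card"], window_key(e["ts"]))
        a = agg[key]
        a["count"] += 1
        a["total"] += e["amount"]
        a["max"] = max(a["max"], e["amount"])
    return dict(agg)

def suspicious(agg, max_tx_per_window=5):
    """Toy velocity rule: too many transactions on one card in one window."""
    return [key for key, a in agg.items() if a["count"] > max_tx_per_window]

events = [{"card": "c1", "ts": i, "amount": 10.0} for i in range(7)] + \
         [{"card": "c2", "ts": 10.0, "amount": 99.0}]
print(suspicious(aggregate(events)))  # card c1 has 7 transactions in window 0
```

A real deployment would feed these per-window features to the fraud-detection model instead of a hard-coded threshold.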
18. Extract, Transform and Aggregate
By combining Apache Flume and Spark Streaming we have deployed complex topologies to deal with data coming from heterogeneous sources. The full solution allows us to transform and aggregate data on-the-fly (data cleaning, normalization and enrichment).
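The on-the-fly cleaning, normalization and enrichment mentioned above can be sketched as a chain of pure functions applied to each record before aggregation; in Spark Streaming each step would be a map over a micro-batch. The field names and the country lookup table are invented for illustration.

```python
COUNTRY_BY_PREFIX = {"34": "ES", "44": "GB"}  # assumed enrichment lookup

def clean(record):
    """Drop obviously broken records (missing phone, negative amount)."""
    if not record.get("phone") or record.get("amount", -1) < 0:
        return None
    return record

def normalize(record):
    """Canonical phone format and lower-cased channel name."""
    record["phone"] = record["phone"].replace(" ", "").lstrip("+")
    record["channel"] = record.get("channel", "unknown").lower()
    return record

def enrich(record):
    """Attach the country inferred from the phone prefix."""
    record["country"] = COUNTRY_BY_PREFIX.get(record["phone"][:2], "??")
    return record

def transform(stream):
    """clean -> normalize -> enrich, skipping rejected records."""
    for raw in stream:
        rec = clean(raw)
        if rec is None:
            continue  # cleaning rejected the record
        yield enrich(normalize(rec))

raw = [
    {"phone": "+34 600 111 222", "amount": 12.5, "channel": "WEB"},
    {"phone": "", "amount": 3.0},               # rejected by clean()
    {"phone": "44 7700 900123", "amount": 8.0}, # enriched as GB
]
for rec in transform(raw):
    print(rec["phone"], rec["channel"], rec["country"])
```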
19. Custom data sources and storage
Each project requires specific inputs and data storages, dealing with different kinds of events. From click-stream activity to bank transactions...
[Pipeline diagram: DATA STREAM → LOADING → TRANSFORM, fed by custom logs]
20. Towards a generic real-time aggregation platform
At Stratio, we have implemented several real-time analytic projects based
on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.
These technologies were always a perfect fit, but soon we found ourselves
writing the same pieces of integration code over and over again.
This is how SPARKTA was born.
22. #1 RainBird from Twitter
Some folks from Twitter shared their thoughts about real-time needs at Strata (2011). They worked on a "generic" platform in order to deal with pre-calculated data from a huge number of events.
It allows them to deal with:
• Data structures
• Hierarchical aggregation
• Temporal aggregation
• Multiple formulas
CURRENT STATE: still not open source
http://goo.gl/ykvQa
23. #2 Countandra
Countandra is a hierarchical distributed counting engine exploiting the excellent write & read performance of Cassandra.
It supports:
• Geographically distributed counting.
• An easy HTTP-based interface to insert counts.
• Hierarchical counting, such as com.mywebsite.music.
• Retrieving counts, sums and squares in near real-time.
• Simple HTTP queries providing the desired output in JSON format.
• Queries sliced by period, such as lasthour or lastyear, for minutely, hourly, daily and monthly values.
https://github.com/milindparikh/Countandra
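Countandra-style hierarchical counting is easy to sketch: each increment of a dotted key also bumps every one of its prefixes. This is a toy illustration, not Countandra's actual implementation:

```python
from collections import Counter

def hierarchical_increment(counts, key, n=1):
    # Increment every prefix of a dotted key, so a count for
    # "com.mywebsite.music" also bumps "com.mywebsite" and "com"
    parts = key.split(".")
    for i in range(1, len(parts) + 1):
        counts[".".join(parts[:i])] += n

counts = Counter()
hierarchical_increment(counts, "com.mywebsite.music")
hierarchical_increment(counts, "com.mywebsite.video")
# counts["com.mywebsite"] == 2, counts["com.mywebsite.music"] == 1
```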
CURRENT STATE: rather deprecated
24. #3 ThunderRain from Intel
ThunderRain is a Real-Time Analytical Processing
(RTAP) example using Spark and Shark, which
can be best characterized by the following four
salient properties:
• Data continuously streamed in & processed in near real-time
• Real-time data queried and presented in an online fashion
• Real-time and history data combined and mined interactively
• Predominant RAM-based processing
https://github.com/thunderain-project/thunderain
CURRENT STATE: rather deprecated
25. #4 TSAR from Twitter
TSAR (the TimeSeries AggregatoR) is a
flexible, reusable, end-to-end service
architecture on top of Summingbird.
Twitter really needs a truly robust real-time aggregation service considering their scaling and evolving needs.
They realized that many time-series
applications call for essentially the same
architecture, with only slight variations in
the data model.
https://blog.twitter.com/2014/tsar-a-timeseries-aggregator
CURRENT STATE: still not open source
26. Towards a generic real-time aggregation platform
Some initiatives have tried to solve this problem, but until now most of them
were complex or obsolete while others were not open source.
For this reason, Stratio created SPARKTA: an open source and full-featured
platform for real-time analytics, based on Apache Spark.
This is why SPARKTA was conceived
28. Distributed, high-volume & pluggable analytics framework
Our goals:
Since Aryabhata invented zero, mathematicians such as John von Neumann have been in pursuit of efficient counting, and architects have constantly built systems that compute counts quicker. In this age of social media, where hundreds of thousands of events take place every second, we designed an aggregation engine to deliver real-time services.
• Pure Spark!
• No need to code: only declarative aggregation workflows
• Data continuously streamed in & processed in near real-time
• Ready to use out of the box
• Plug & play: flexible workflows (inputs, outputs, parsers, etc.)
• High performance
• Scalable and fault-tolerant
29. Sparkta: A first look
DRIVER - SUPERVISOR
[Diagram: aggregation policy → driver/supervisor → aggregation workflow → query services and other outputs]
The aggregation policy definition is sent to the engine. Multiple applications can be defined, each bound to a context that executes the aggregation workflow.
30. Sparkta: Deploy any number of real-time aggregation policies
DRIVER - SUPERVISOR
You can start several workflows at any time, and also stop or monitor them.
32. Sparkta: Define your real-time needs
AGGREGATION POLICY
Remember: no need to code anything.
Define your workflow in a JSON document, including:
INPUT: Where is the data coming from?
OUTPUT(s): Where should aggregated data be stored?
DIMENSION(s): Which fields will you need for your real-time analytics?
ROLLUP(s): How do you want to aggregate the dimensions?
TRANSFORMATION(s): Which functions should be applied before aggregation?
SAVE RAW DATA: Do you want to save raw events?
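Such a policy document might look roughly like the sketch below. The field names here are illustrative only and do not reflect Sparkta's actual JSON schema:

```json
{
  "input": { "type": "kafka", "topic": "transactions" },
  "outputs": [ { "type": "mongodb", "collection": "aggregates" } ],
  "dimensions": [ "city", "product" ],
  "rollups": [ { "dimensions": ["city"], "timeGranularity": "minute" } ],
  "transformations": [ { "type": "dateParser", "field": "timestamp" } ],
  "saveRawData": true
}
```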
33. Sparkta: Key Technologies
ROLLUPS
• Pass-through
• Time-based: secondly, minutely, hourly, daily, monthly, yearly...
• Hierarchical
• GeoRange: areas of different sizes (rectangles)
OPERATORS
• Max, min, count, sum
• Average, median
• Stdev, variance, count distinct
• Last value
• Full-text search
KiteSDK
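A time-based rollup applying a few of these operators can be sketched in Python. This is a simplified, single-machine stand-in for what Sparkta computes over Spark; the row layout and the hard-coded "amount" field are assumptions:

```python
from collections import defaultdict

def rollup(rows, dimensions, time_field, granularity_seconds):
    # Group rows by (dimension values, time bucket) and apply a few of
    # the operators listed above (count, sum, max, average) to "amount"
    out = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})
    for row in rows:
        bucket = row[time_field] - row[time_field] % granularity_seconds
        key = tuple(row[d] for d in dimensions) + (bucket,)
        agg = out[key]
        agg["count"] += 1
        agg["sum"] += row["amount"]
        agg["max"] = max(agg["max"], row["amount"])
    # Derive the average from the accumulated count and sum
    return {k: dict(v, avg=v["sum"] / v["count"]) for k, v in out.items()}
```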
34. Sparkta SDK
INPUT
OUTPUT(s)
DIMENSION(s)
OPERATORS
TRANSFORMATION(s)
Sparkta has been conceived as an SDK. You can extend several points of the platform to fulfill your needs, such as adding new inputs, outputs, operators and dimension types.
You can also add new functions to Apache Kite in order to extend the data cleaning, enrichment and normalization capabilities.
36. Next steps in our roadmap (1)
Sparkta is a work in progress, so we still have some nice features to
develop…
QUERY
SERVICES
ALARMS
Creating a REST services layer in order to query the aggregated data allows us to isolate the final consumer from the specific data storage.
Features:
- Time ranges
- Aggregation on time ranges
- Best rollup selection
For example, I want to know if we have earned over $3000 in
London in the last hour...
Remember operational intelligence!
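Best-rollup selection can be illustrated with a small sketch: pick the coarsest pre-aggregated granularity whose buckets align exactly with the query range, so the query reads the fewest rows. The granularity names and the alignment rule here are assumptions, not Sparkta's actual logic:

```python
# Hypothetical set of pre-computed rollup granularities (in seconds)
GRANULARITIES = {"minute": 60, "hour": 3600, "day": 86400}

def best_rollup(start, end):
    # Walk granularities from finest to coarsest, keeping the coarsest
    # one whose buckets align exactly with the query range; fall back
    # to the finest granularity when nothing coarser aligns
    best = "minute"
    for name, size in sorted(GRANULARITIES.items(), key=lambda kv: kv[1]):
        if start % size == 0 and end % size == 0:
            best = name
    return best
```

For the "$3000 in London in the last hour" example, a query aligned to hour boundaries would be served from the hourly rollup rather than summing sixty minutely buckets.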
37. Next steps in our roadmap (II)
WEB
APPLICATION
DEPLOYING &
MONITORING
How about a nice web interface to create and manage policies? Forget the JSON file and use your mouse to define the workflow :)
We have been working with Spark JobServer & YARN, but it would be nice to support Mesos, for example.
Hey, did you miss something? Do you have a great idea?
Let us know!
MORE AWESOMENESS
39. OPEN TO YOUR IDEAS
www.stratio.com
@StratioBD
https://github.com/stratio/sparkta
SPARKTA is fully open source, under the Apache 2 license.
We are open to contributors & ideas
41. Do you want to try SPARKTA?
Use a full-featured sandbox to start trying SPARKTA. Just open a shell and type:
vagrant init "stratio/sparkta"
vagrant up
42. Do you want to try SPARKTA?
Getting some real-time stats from #StrataHadoop.
Our real-time policy defines some rollups in order to identify chatty users, hot hashtags and heatmaps from StrataConf tweets.
We are using the standard Twitter input from Spark Streaming, the ElasticSearch output and Kibana to display the results.