At improve digital we collect and store large volumes of machine generated and behavioural data from our fleet of ad servers. For some time we have performed mostly batch processing through a data warehouse that combines traditional RDBMs (MySQL), columnar stores (Infobright, impala+parquet) and Hadoop.
We wish to share our experiences in enhancing this capability with systems and techniques that process the data as streams in near-realtime. In particular we will cover:
• The architectural need for an approach to data collection and distribution as a first-class capability
• The different needs of the ingest pipeline required by streamed realtime data, the challenges faced in building these pipelines and how they forced us to start thinking about the concept of production-ready data.
• The tools we used, in particular Apache Kafka as the message broker, Apache Samza for stream processing and Apache Avro to allow schema evolution; an essential element to handle data whose formats will change over time.
• The unexpected capabilities enabled by this approach, including the value in using realtime alerting as a strong adjunct to data validation and testing.
• What this has meant for our approach to analytics and how we are moving to online learning and realtime simulation.
This is still a work in progress at Improve Digital with differing levels of production-deployed capability across the topics above. We feel our experiences can help inform others embarking on a similar journey and hopefully allow them to learn from our initiative in this space.
Some notes about spark streming positioning give the current players: Beam, Flink, Storm et al. Helpful if you have to choose an Streaming engine for your project.
Stateful stream processing with Apache FlinkKnoldus Inc.
Nowadays, many stream processing applications have sophisticated business logic, strict correctness guarantees, high performance, low latency, fault-tolerant, and maintain terabytes of state. There are many stream processing frameworks available in the market which helps businesses to write robust stateful stream processing applications.
In this session, we will talk about Apache Flink, a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. It can efficiently run such applications at a large scale in a fault-tolerant manner. In this session, we will see what is stateful stream processing in detail, and how Flink takes on stateful stream processing. We'll get to know how checkpointing mechanism works in Flink.
Along with the arrival of BigData, a parallel yet less well known but significant change to the way we process data has occurred. Data is getting faster! Business models are changing radically based on the ability to be first to know insights and act appropriately to keep the customer, prevent the breakdown or save the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk will discuss the need for streaming, business models it can change, new applications it allows and why Apache Flink enables these applications. Apache Flink is a top Level Apache Project for real time stream processing at scale. It is a high throughput, low latency, fault tolerant, distributed, state based stream processing engine. Flink has associated Polyglot APIs (Scala, Python, Java) for manipulating streams, a Complex Event Processor for monitoring and alerting on the streams and integration points with other big data ecosystem tooling.
Some notes about spark streming positioning give the current players: Beam, Flink, Storm et al. Helpful if you have to choose an Streaming engine for your project.
Stateful stream processing with Apache FlinkKnoldus Inc.
Nowadays, many stream processing applications have sophisticated business logic, strict correctness guarantees, high performance, low latency, fault-tolerant, and maintain terabytes of state. There are many stream processing frameworks available in the market which helps businesses to write robust stateful stream processing applications.
In this session, we will talk about Apache Flink, a distributed stream processor with intuitive and expressive APIs to implement stateful stream processing applications. It can efficiently run such applications at a large scale in a fault-tolerant manner. In this session, we will see what is stateful stream processing in detail, and how Flink takes on stateful stream processing. We'll get to know how checkpointing mechanism works in Flink.
Along with the arrival of BigData, a parallel yet less well known but significant change to the way we process data has occurred. Data is getting faster! Business models are changing radically based on the ability to be first to know insights and act appropriately to keep the customer, prevent the breakdown or save the patient. In essence, knowing something now is overriding knowing everything later. Stream processing engines allow us to blend event streams from different internal and external sources to gain insights in real time. This talk will discuss the need for streaming, business models it can change, new applications it allows and why Apache Flink enables these applications. Apache Flink is a top Level Apache Project for real time stream processing at scale. It is a high throughput, low latency, fault tolerant, distributed, state based stream processing engine. Flink has associated Polyglot APIs (Scala, Python, Java) for manipulating streams, a Complex Event Processor for monitoring and alerting on the streams and integration points with other big data ecosystem tooling.
More complex streaming applications generally need to store some state of the running computations in a fault-tolerant manner. This talk discusses the concept of operator state and compares state management in current stream processing frameworks such as Apache Flink Streaming, Apache Spark Streaming, Apache Storm and Apache Samza.
We will go over the recent changes in Flink streaming that introduce a unique set of tools to manage state in a scalable, fault-tolerant way backed by a lightweight asynchronous checkpointing algorithm.
Talk presented in the Apache Flink Bay Area Meetup group on 08/26/15
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward
Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, state-of-the-art systems for approximate computing, such as BlinkDB, ApproxHadoop, primarily target batch analytics, where the input data remains unchanged during the course of sampling. Thus, they are not well-suited for stream analytics. In this talk, we will present the design of StreamApprox, a Flink-based stream analytics system for approximate computing. StreamApprox implements an online stratified reservoir sampling algorithm in Apache Flink to produce approximate output with rigorous error bounds.
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...confluent
Kafka Streams is a flexible and powerful framework. The Domain Specific Language (DSL) is an obvious place from which to start, but not all requirements fit the DSL model. Many people are unaware of the Processor API (PAPI) – or are intimidated by it because of sinks, sources, edges and stores – oh my! But most of the power of the PAPI can be leveraged, simply through the DSL ”#process” method, which lets you attach the general building block ”Processor” interface to your -easy to use- DSL topology, to combine the best of both worlds.
In this talk you’ll get a look at the flexibility of the DSL’s process method and the possibilities it opens up. We’ll use real world use-cases borne from extensive experience in the field with multiple customers to explore power of direct write access to the state stores and how to perform range sub-selects. We’ll also see the options that punctuators bring to the table, as well as opportunities for major latency optimisations.
Key takeaways:
* Understanding of how to combine DSL and Processors
* Capabilities and benefits of Processors
* Real-world uses of Processors
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Time Series Analysis… using an Event Streaming Platformconfluent
Time Series Analysis… using an Event Streaming Platform, Mirko Kämpf, Solutions Architect, Confluent
Meetup Link: https://www.meetup.com/Apache-Kafka-Germany-Munich/events/272827528/
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and to protect safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
More complex streaming applications generally need to store some state of the running computations in a fault-tolerant manner. This talk discusses the concept of operator state and compares state management in current stream processing frameworks such as Apache Flink Streaming, Apache Spark Streaming, Apache Storm and Apache Samza.
We will go over the recent changes in Flink streaming that introduce a unique set of tools to manage state in a scalable, fault-tolerant way backed by a lightweight asynchronous checkpointing algorithm.
Talk presented in the Apache Flink Bay Area Meetup group on 08/26/15
Presenter: Kenn Knowles, Software Engineer, Google & Apache Beam (incubating) PPMC member
Apache Beam (incubating) is a programming model and library for unified batch & streaming big data processing. This talk will cover the Beam programming model broadly, including its origin story and vision for the future. We will dig into how Beam separates concerns for authors of streaming data processing pipelines, isolating what you want to compute from where your data is distributed in time and when you want to produce output. Time permitting, we might dive deeper into what goes into building a Beam runner, for example atop Apache Apex.
Flink Forward Berlin 2017: Pramod Bhatotia, Do Le Quoc - StreamApprox: Approx...Flink Forward
Approximate computing aims for efficient execution of workflows where an approximate output is sufficient instead of the exact output. The idea behind approximate computing is to compute over a representative sample instead of the entire input dataset. Thus, approximate computing — based on the chosen sample size — can make a systematic trade-off between the output accuracy and computation efficiency. Unfortunately, state-of-the-art systems for approximate computing, such as BlinkDB, ApproxHadoop, primarily target batch analytics, where the input data remains unchanged during the course of sampling. Thus, they are not well-suited for stream analytics. In this talk, we will present the design of StreamApprox, a Flink-based stream analytics system for approximate computing. StreamApprox implements an online stratified reservoir sampling algorithm in Apache Flink to produce approximate output with rigorous error bounds.
Beyond the DSL-Unlocking the Power of Kafka Streams with the Processor API (A...confluent
Kafka Streams is a flexible and powerful framework. The Domain Specific Language (DSL) is an obvious place from which to start, but not all requirements fit the DSL model. Many people are unaware of the Processor API (PAPI) – or are intimidated by it because of sinks, sources, edges and stores – oh my! But most of the power of the PAPI can be leveraged, simply through the DSL ”#process” method, which lets you attach the general building block ”Processor” interface to your -easy to use- DSL topology, to combine the best of both worlds.
In this talk you’ll get a look at the flexibility of the DSL’s process method and the possibilities it opens up. We’ll use real world use-cases borne from extensive experience in the field with multiple customers to explore power of direct write access to the state stores and how to perform range sub-selects. We’ll also see the options that punctuators bring to the table, as well as opportunities for major latency optimisations.
Key takeaways:
* Understanding of how to combine DSL and Processors
* Capabilities and benefits of Processors
* Real-world uses of Processors
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
Time Series Analysis… using an Event Streaming Platformconfluent
Time Series Analysis… using an Event Streaming Platform, Mirko Kämpf, Solutions Architect, Confluent
Meetup Link: https://www.meetup.com/Apache-Kafka-Germany-Munich/events/272827528/
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and to protect safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
In this talk Manager of Data Platform Architecture Jeff Magnusson from Netflix discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D.
While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases around using Lipstick at Netflix (e.g. examples of using Lipstick to improve speed of development and efficiency of new and existing scripts).
Jeff manages the Data Platform Architecture group at Netflix where he is helping to build a service oriented architecture that enables easy access to large scale cloud based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida focusing on database system implementation.
Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013
This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts.
Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.
(Berkeley CS186 guest lecture) Big Data Analytics Systems: What Goes Around C...Reynold Xin
(Berkeley CS186 guest lecture)
Big Data Analytics Systems: What Goes Around Comes Around
Introduction to MapReduce, GFS, HDFS, Spark, and differences between "Big Data" and database systems.
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
This talk explores deploying a series of small and large batch and streaming pipelines locally, to Spark and Flink clusters and to Google Cloud Dataflow services to give the audience a feel for the portability of Beam, a new portable Big Data processing framework recently submitted by Google to the Apache foundation. This talk will look at how the programming model handles late arriving data in a stream with event time, windows, and triggers.
Receiving data from a source that produces 5-10 GBytes per hour, and presenting analysis results as the data streams in has some interesting challenges.
We used MongoDB running on Amazon EC2 to house the data, map reduce to analyze it and Django-non-rel to present the results in near-real-time.
(Slides from my presentation at MongoDB Boston)
Architecting Big Data Ingest & ManipulationGeorge Long
Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about :
o What it actually takes to get the data you need to add that one metric to your report/dashboard?
o What's it like to navigate the early conversations of an analytic solution?
o How is one technology selected over another and how do those selections impact or define other selections?
Business intelligence requirements are changing and business users are moving more and more from historical reporting into predictive analytics in an attempt to get both a better and deeper understanding of their data. Traditionally, building an analytical platform has required an expensive infrastructure and a considerable amount of time for setup and deployment. Here we look at a quick and simple alternative.
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
With a current zoo of technologies and different ways of their interaction it's a big challenge to architect a system (or adopt existed one) that will conform to low-latency BigData analysis requirements. Apache Kafka and Kappa Architecture in particular take more and more attention over classic Hadoop-centric technologies stack. New Consumer API put significant boost in this direction. Microservices-based streaming processing and new Kafka Streams tend to be a synergy in BigData world.
Explore big data at speed of thought with Spark 2.0 and SnappydataData Con LA
Abstract:
Data exploration often requires running aggregation/slice-dice queries on data sourced from disparate sources. You may want to identify distribution patterns, outliers, etc and aid the feature selection process as you train your predictive models. As you begin to understand your data, you want to ask ad-hoc questions expressed through your visualization tool (which typically translates to SQL queries), study the results and iteratively explore the data set through more queries. Unfortunately, even when data sets can be in-memory, large data set computations take time breaking the train of thought and increasing time to insight . We know Spark can be fast through its in-memory parallel processing. But, Spark 1.x isn’t quite there. Spark 2.0 promises to offer 10X better speed than its predecessor. Spark 2.0 ushers some impressive improvements to interactive query performance. We first explore these advances - compiling the query plan eliminating virtual function calls, and other improvements in the Catalyst engine. We compare the performance to other popular popular query processing engines by studying the spark query plans. We then go through SnappyData (an open source project that integrates Spark with a database that offers OLTP, OLAP and stream processing in a single cluster) where we use smarter data colocation and Synopses data (.e.g. Stratified sampling) to dramatically cut down on the memory requirements as well as the query latency. We explain the key concepts in summarizing data using structures like stratified sampling by walking through some examples in Apache Zeppelin notebooks (a open source visualization tool for spark) and demonstrate how we can explore massive data sets with just your laptop resources while achieving remarkable speeds.
Bio:
Jags is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire and other distributed in-memory Bio:
Jags Ramnarayan is a founder and the CTO of SnappyData. Previously, Jags was the Chief Architect for “fast data” products at Pivotal and served in the extended leadership team of the company. At Pivotal and previously at VMWare, he led the technology direction for GemFire and other distributed in-memory products.
Similar to Moving Towards a Streaming Architecture (20)
Estimating mental states of a depressed person with Bayesian NetworksGabriele Modena
In this work in progress paper we present an approach based on Bayesian Networks to model the relationship between mental states and empirical observations in a depressed person. We encode relationships and domain expertise as a Hierarchical Bayesian Network. Mental states are represented as latent (hidden) variables and the measurements found in the data are encoded as a probability distribution generated by such latent variables; we provide examples of how the network can be used to estimate mental states.
Analysis of grid log data with Affinity PropagationGabriele Modena
In this paper we present an unsupervised learning approach to detect meaningful job traffic patterns in Grid log data. Manual anomaly detection on modern Grid environments is troublesome given their in- creasing complexity, the distributed, dynamic topology of the network and heterogeneity of the jobs being executed. The ability to automat- ically detect meaningful events with little or no human intervention is therefore desirable. We evaluate our method on a set of log data col- lected on the Grid. Since we lack a priori knowledge of patterns that can be detected and no labelled data is available, an unsupervised learning method is followed. We cluster jobs executed on the Grid using Affinity Propagation. We try to explain discovered clusters using representative features and we label them with the help of domain experts. Finally, as a further validation step, we construct a classifier for five of the detected clusters and we use it to predict the termination status of unseen jobs.
Approximation algorithms for stream and batch processingGabriele Modena
At Improve Digital (http://www.improvedigital.com) we collect and process large amounts of machine generated and behavioral data. Our systems address a variety of use cases that involve both batch and streaming technologies. One common denominator of the overall architecture is the need to share models and workflows across both worlds. Another one is that the analysis of large amounts of data often requires trade-offs; for instance trading accuracy for timeliness in streaming applications. One approach to satisfy these constraints is to make "big data" small. In this talk we will review a number of approximation methods for sketching, summarization and clustering and discuss how they are starting to change the way we think about certain types of analytics, and how they are being integrated into our data pipelines.
Recent developments in Hadoop version 2 are pushing the system from the traditional, batch oriented, computational model based on MapRecuce towards becoming a multi paradigm, general purpose, platform. In the first part of this talk we will review and contrast three popular processing frameworks. In the second part we will look at how the ecosystem (eg. Hive, Mahout, Spark) is making use of these new advancements. Finally, we will illustrate "use cases" of batch, interactive and streaming architectures to power traditional and "advanced" analytics applications.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
6. Data sources
A geographically distributed fleet of ad servers push out several types of data,
primarily:
- 1 minute metrics packets
- 15 minute full log files
Previously used ad hoc system to process the former, Hadoop for the latter
7. Conceptualizing the data
We think of event streams when considering the operational system
But have this sliced into many log files for practical reasons
Recreating the stream-based view gets us back to what is happening
And brings data creation and processing closer together
8. How did we get here
Need to process more data, faster
- Reporting
- Decision Making
Evolution of the system to meet scalability requirements
Stream processing by definition
It is changing the way we think about data collection and distribution
Not everything is obvious
It opened new doors
9. Let’s take a step back
Aggregates grow in number and size
Keep a copy in HDFS
Run job, then go and make coffee
Generate new / other aggregates / ad hoc datasets
Analytics datawarehouse (column store), denormalized data model
BI, reporting
Hadoop for logs and event processing (eg. ETL, simulation, learning, optimization)
11. Incremental processing
Ingest data every eg. 24 hours (/data/raw/event/d=YYYY-MM-DD)
Scheduling and partitioning are straightforward
Schema migrations: Avro + HCatalog
Fixed length time windows
- What is a good length?
- Ok if it takes 1 hour to process 15 minutes of data
Need to reprocess data
It can be made efficient (eg. DataFu Hourglass)
MapReduce
12. A more interactive system
We added tools to circumvent MapReduce
- Latency, really
- Still a “batch” mindset
Cloudera Impala (+ Parquet)
- Fast, familiar
Move computation closer to data
- Spark (among other things) for prototyping
- Python/R integration, batch and “streaming”
13. Outcome
Multiple overlapping data sources, processing tools
- point-to-point
Different pipelines for analytics and production models
- Divide
- Where is the “correct” data ?
14. All in all
Data collection and distribution is critical
We are still looking at “yesterday’s” data
We need a different level of abstraction
15.
16. Challenges of near-realtime data
Instructive to ask why near-realtime data analysis is hard
In Hadoop batch world we’d be happy processing 24hrs of data in 4 hrs
So why do we flinch to process 24hrs in 24hrs?
What allows batch workloads to process in sub-linear time:
- Scheduling
- Parallelization
- Partitioning
17. The streaming view
Consider these from the streaming perspective:
- Scheduling is challenging because of unknown and variable data rates
- Parallelization breaks our nice logical model of a single data stream
- Partitioning can’t be simple file size/no of blocks
- The method of partitioning the stream is fundamental
18. Kafka
We use Kafka (http://kafka.apache.org) for all data we view as streams
It is resilient, performant and provides a very clear topic-based interface
Producers write to topics, consumers read from topics
Topics can be partitioned based on a producer-provided key
Provides at-least once message delivery semantics
Messages are persistently stored – topics can be re-read
19. Samza
We use Samza (http://samza.incubator.apache.org) for stream processing
Analogous to MapReduce atop YARN atop HDFS
We have Samza atop YARN atop Kafka
E L E M O D E N A L E A R N IN G H A D O O P 2
A p a c h e S a m z a
afka for streaming !
arn for resource
anagement and exec!
amzaAPI for
ocessing!
weet spot: second,
inutes
Y A R N
S a m z a A P I
K a fk a
20. Samza API
A simple call-back API that should be familiar to MapReduce developers:
public interface StreamTask{
void process(IncomingMessageEnvelope envelope,
MessageCollector collector, TaskCoordinator) ;
}
public interface WindowableTask{
void window(MessageCollector collector,
StreamCoordinator coordinator);
}
21. Kafka/Samza integration
A Samza job is comprised of multiple tasks
Each Kafka partition topic is consumed by only one Samza task
Each YARN container runs multiple Samza tasks
The output of a task is often written to a different topic
Overall applications are then constructed by composition of jobs
22. Stream job output
You’ve processed the data, now what?
Need consider how the output of each stream partition will be processed
A downstream destination with an idempotent interface is ideal
Otherwise try to have expressions of unique state
Hardest is when the outputs need applied in sequence
23. Example: log data
It’s a common model: server process rolls a log every x minutes
Each resultant file is ingested into Kafka and read by multiple Samza jobs
Each Samza task receives a series of potentially interleaved log chunks
Consider a task receiving data from servers S1, S2 and S3 across time periods
T1..T3
It’s view may become something like:
S1[T1..T2], S2[T1..T2], S1[T2..T3], S3[T1..T2]…
24. Example - Streaming aggregation
Input is a Stream of records with the form (key, value)
Output is a series of (key, aggregate)
3 approaches to producing these outputs:
- Each task pushes local counts downstream
- Each task pushes to another stream aggregated by a different job
- Each task uses local state to produce final aggregate values
25. Comparing stream aggregation
Option 1: each stream partition outputs local values for each key it sees across a
time window
- Each partial result is an update record (e.g. increment key k at time t by x)
- Downstream consumer has the responsibility of combining partial results
Option 2 : aggregate partial results in a second stream processing job
- The output becomes a series of state changes
(set the value of key k at time t to x)
26. Comparing stream aggregation contd
Option 3 - push state into the state processors
- Samza tasks can hold persistent local state
- Creates a key/value store modelled as a Kafka topic
- Ensure that all records for each key are sent to the same partition
- State processing task can then create unique state aggregates locally
27. A side effect of Kafka
A firehose for event data
- Can tap into with Samza
- Can tap into with Spark
Unifies how we access and publish data
- Streams are defined by application / use case
- These can come from other teams
- Single entry point, multiple outputs
28. Analytics on streams
Incremental processing pipelines as a basis
Tried to reuse logic
Not that simple
Conceptualization vs. implementation
29. Time windows
Temporal constraints
When is data ready?
- The key is defining “ready”, which is weaker than “complete”
- Enough to make an informed decision (this varies case by case)
- We can update the dataset to improve a confidence interval
When are we done processing?
- Tradeoff between timeliness and accuracy
30. Schema evolution
Schema changes appear within / across time windows
- Not a solved problem
Stop streaming, propagate changes, replay or resume
SAMZA-317
- Avro SerDe
- Schema stream
31. Challenges
Data completeness
Windows length
Tradeoff between finding answers / satisfying temporal constraints
Need to adjust our approach (representation and data structures)
- approximate
- probabilistic
- iterations to convergence
This is a work in progress
32. Model based outlier detection
Previously:
Batch and real time datasets = different approaches
Now:
Same dataset
Model based anomaly detection
Reuse knowledge and infrastructure
What is normal?
How do we set thresholds ?
Literature review (q-digest, t-digest)
33. Testing and alerting
Some features are hard to test
Functionally correct but unexpected consequences
unexpected ~ hard to predict
advertisment is a complex system
Bugs are outliers we can detect (sort of)
34. Testing as predictive modelling
Monitoring (next to testing)
Make software monitorable
Think of testing as an experiment
Establish short, continuous, feedback loops
35. Implications
How do we use real time output
- “live” dashboards
- Instrumentation and alerting (predict failure)
- feed optimization models
- back propagating methods into ETL
How we think about feedback loops
Make data more consumable
36. Conclusion
Streaming is part of the broader system
We are fine tuning how the two fit together
Data collection and distribution is important
publish results follows
Kafka + Samza
data and processing model that fits our applications well
Conceptualization vs. implementation
Real time != incremental
A certain degree of uncertainty
Tradeoffs