Spark and Online Analytics: Spark Summit East talky by Shubham ChopraSpark Summit
Apache Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg’s Spark Server as a server runtime for online analytics. Our framework implements certain useful patterns applicable to online query processing and is centered on the idea of “Managed” DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting where Spark is required to respond to queries with high SLAs. In this talk, we try to identify specific areas where slow-down or failures can result in the largest hits on online-query performance and potential solutions to address these.
Building a real time Tweet map with Flink in six weeksMatthias Kricke
In this talk we present OSTMap, a tool which was build by 6 students over the course of 6 weeks. Each student only has to do as little as 5-10h per week and no experience with BigData or the used frameworks. We also present the concept of geotemporal indicies for our use-case.
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.
To achieve those needs we rely on MLFlow but have also integrated that with many of our other systems. So have we written Airflow operators for MLFlow to ease the retraining of our models, have we integrated MLFlow deeply with our CI pipelines and have we integrated it with our model monitoring tooling.
In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.
Zoltán Zvara - Advanced visualization of Flink and Spark jobs Flink Forward
http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Flink and Spark although offering useful Web UI components for monitoring and understanding the logical plan of the jobs, both lack a tool that helps to understand the physical plan of the scheduler and the possibility to monitor execution at a very low level, along with the communication that occur between parallel vertex instances. We propose a tool that allows users to real-time monitor and later to replay, examine job executions on any cluster currently supported by Flink or Spark. The tool also offers monitoring of the distribution of keys in a data stream and can lead to optimizing data partitioning across parallel subtasks in the future.
Just because you can, doesn’t mean you should. But in this case, you definitely should! Learn how this one weird trick (Jinja templating) will supercharge your analytics workflows and help you do more, better, faster with SQL.
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaHostedbyConfluent
Apache Kafka and Apache Flink together are a winning stack for data analytics that is used by many companies across industries.
The two projects complement each other perfectly: Kafka offers a world-class log for event stream storage and transport, while Flink is a powerful system for analytics and applications on top of those event streams.
This talk will demonstrate how to use Kafka and Flink together for "unified analytics": Analytics that seamlessly combine processing of real-time data and historic data.
Using SQL as the language for our sample applications, we will walk though various scenarios for unified analytics, such as
- Running the same query for processing real-time data from Kafka and for batch-accelerated processing of the historic data stored in Kafka.
- Writing queries that combine data in Kafka with tables in external systems (like S3)
- Switching between streams of historic data (from S3) and real-time streams in Kafka.
The audience will learn how combining real-time and historic data is becoming convenient with the combination of Kafka and Flink.
Kafka for Real-Time Event Processing in Serverless Environmentsconfluent
(Jeff Sharpe + Alex Srisuwan, Capital One) Kafka Summit SF 2018
Using Kafka as a platform messaging bus is common, but bridging communication between real-time and asynchronous components can become complicated, especially when dealing with serverless environments. This has become increasingly common in modern banking where events need to be processed at near-real-time speed. Serverless environments are well-suited to address these needs, and Kafka remains an excellent solution for providing the reliable, resilient communication layer between serverless components and dedicated stream processing services.
In this talk, we will examine some of the strengths and weaknesses of using Kafka for real-time communication, some tips for efficient interactions with Kafka and AWS Lambda, and a number of useful patterns for maximizing the strengths of Kafka and serverless components.
As many industries, banking is undergoing a fundamental change because of the software revolution. No longer are banks competing only on interest rates and having the best traders, these days customer experience and having the best engineers are the focus. In this changing world, banks compete with new start-ups, the so-called Fintechs, and with large platform organisations such as Google, Facebook and Apple. At ING, we believe that staying ahead of the game means changing how we interact with our customers, no longer a traditional model of waiting for the customers to come to the bank through our website or apps, but to actively reach out to the customer with information that is relevant to him or her in order to make their financial life frictionless. Many of these changes are driven by reacting to all events that are relevant to the customer, and using streaming analytics to be able to reach out to the customer in milliseconds after the event occurs. Apache Flink is key for ING to achieve this. This presentation addresses how ING approaches the challenge, the role that Apache Flink plays, and the consequences regulations have on how we work with Open Source in general, and with Apache Flink (and data Artisans) in particular. This keynote takes place at Kino 3.
Spark and Online Analytics: Spark Summit East talky by Shubham ChopraSpark Summit
Apache Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg’s Spark Server as a server runtime for online analytics. Our framework implements certain useful patterns applicable to online query processing and is centered on the idea of “Managed” DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting where Spark is required to respond to queries with high SLAs. In this talk, we try to identify specific areas where slow-down or failures can result in the largest hits on online-query performance and potential solutions to address these.
Building a real time Tweet map with Flink in six weeksMatthias Kricke
In this talk we present OSTMap, a tool which was build by 6 students over the course of 6 weeks. Each student only has to do as little as 5-10h per week and no experience with BigData or the used frameworks. We also present the concept of geotemporal indicies for our use-case.
Consolidating MLOps at One of Europe’s Biggest AirportsDatabricks
At Schiphol airport we run a lot of mission critical machine learning models in production, ranging from models that predict passenger flow to computer vision models that analyze what is happening around the aircraft. Especially now in times of Covid it is paramount for us to be able to quickly iterate on these models by implementing new features, retraining them to match the new dynamics and above all to monitor them actively to see if they still fit the current state of affairs.
To achieve those needs we rely on MLFlow but have also integrated that with many of our other systems. So have we written Airflow operators for MLFlow to ease the retraining of our models, have we integrated MLFlow deeply with our CI pipelines and have we integrated it with our model monitoring tooling.
In this talk we will take you through the way we rely on MLFlow and how that enables us to release (sometimes) multiple versions of a model per week in a controlled fashion. With this set-up we are achieving the same benefits and speed as you have with a traditional software CI pipeline.
Zoltán Zvara - Advanced visualization of Flink and Spark jobs Flink Forward
http://flink-forward.org/kb_sessions/advanced-visualization-of-flink-and-spark-jobs/
Understanding the physical plan of a big data application is often crucial for tracking down bottlenecks and faulty behavior. Flink and Spark although offering useful Web UI components for monitoring and understanding the logical plan of the jobs, both lack a tool that helps to understand the physical plan of the scheduler and the possibility to monitor execution at a very low level, along with the communication that occur between parallel vertex instances. We propose a tool that allows users to real-time monitor and later to replay, examine job executions on any cluster currently supported by Flink or Spark. The tool also offers monitoring of the distribution of keys in a data stream and can lead to optimizing data partitioning across parallel subtasks in the future.
Just because you can, doesn’t mean you should. But in this case, you definitely should! Learn how this one weird trick (Jinja templating) will supercharge your analytics workflows and help you do more, better, faster with SQL.
A unified analytics platform with Kafka and Flink | Stephan Ewen, VervericaHostedbyConfluent
Apache Kafka and Apache Flink together are a winning stack for data analytics that is used by many companies across industries.
The two projects complement each other perfectly: Kafka offers a world-class log for event stream storage and transport, while Flink is a powerful system for analytics and applications on top of those event streams.
This talk will demonstrate how to use Kafka and Flink together for "unified analytics": Analytics that seamlessly combine processing of real-time data and historic data.
Using SQL as the language for our sample applications, we will walk though various scenarios for unified analytics, such as
- Running the same query for processing real-time data from Kafka and for batch-accelerated processing of the historic data stored in Kafka.
- Writing queries that combine data in Kafka with tables in external systems (like S3)
- Switching between streams of historic data (from S3) and real-time streams in Kafka.
The audience will learn how combining real-time and historic data is becoming convenient with the combination of Kafka and Flink.
Kafka for Real-Time Event Processing in Serverless Environmentsconfluent
(Jeff Sharpe + Alex Srisuwan, Capital One) Kafka Summit SF 2018
Using Kafka as a platform messaging bus is common, but bridging communication between real-time and asynchronous components can become complicated, especially when dealing with serverless environments. This has become increasingly common in modern banking where events need to be processed at near-real-time speed. Serverless environments are well-suited to address these needs, and Kafka remains an excellent solution for providing the reliable, resilient communication layer between serverless components and dedicated stream processing services.
In this talk, we will examine some of the strengths and weaknesses of using Kafka for real-time communication, some tips for efficient interactions with Kafka and AWS Lambda, and a number of useful patterns for maximizing the strengths of Kafka and serverless components.
As many industries, banking is undergoing a fundamental change because of the software revolution. No longer are banks competing only on interest rates and having the best traders, these days customer experience and having the best engineers are the focus. In this changing world, banks compete with new start-ups, the so-called Fintechs, and with large platform organisations such as Google, Facebook and Apple. At ING, we believe that staying ahead of the game means changing how we interact with our customers, no longer a traditional model of waiting for the customers to come to the bank through our website or apps, but to actively reach out to the customer with information that is relevant to him or her in order to make their financial life frictionless. Many of these changes are driven by reacting to all events that are relevant to the customer, and using streaming analytics to be able to reach out to the customer in milliseconds after the event occurs. Apache Flink is key for ING to achieve this. This presentation addresses how ING approaches the challenge, the role that Apache Flink plays, and the consequences regulations have on how we work with Open Source in general, and with Apache Flink (and data Artisans) in particular. This keynote takes place at Kino 3.
Sub-Second SQL Search, Aggregations and Joins with Kafka and Rockset | Dhruba...HostedbyConfluent
We often need to build applications that analyze Kafka data to unlock the most value from event streams, so how can organizations build these real-time analytics applications? In this talk, we examine an indexing approach that enables fast SQL analytics on data from Kafka, without data flattening or denormalization. Rockset is the real-time indexing database that builds an inverted index, a columnar index and a row index on all fields of your Kafka messages, including nested fields and arrays. This Converged Index accelerates various types of analytic queries–search, aggregations and joins–without the need to denormalize or transform data for performance reasons. With indexing delivering significant gains in query performance, we also need to index new data in a timely manner. We discuss several strategies used for efficient ingestion and indexing from Kafka, including rollups, write optimizations on the underlying RocksDB storage engine, and the disaggregation of ingest and query compute.
Telling the LivePerson Technology Story at Couchbase [SF] 2013LivePerson
As part of Couchbase[SF]2013, Ido Shilon, R&D Group Leader at LivePerson, discusses LivePerson's project to re-architect their LiveEngage platform backend, with a focus on LivePerson's decision to use NoSQL technologies, the challenges encountered, and the lessons learned.
Video: http://www.youtube.com/watch?v=rYKWFmJEHX0
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...HostedbyConfluent
Apache Kafka users who want to leverage Google Cloud Platform's (GCPs) data analytics platform and open source hosting capabilities can bridge their existing Kafka infrastructure on-premise or in other clouds to GCP using Confluent's replicator tool and managed Kafka service on GCP. Using actual customer examples and a reference architecture, we'll showcase how existing Kafka users can stream data to GCP and use it in popular tools like Apache Beam on Dataflow, BigQuery, Google Cloud Storage (GCS), Spark on Dataproc, and Tensorflow for data warehousing, data processing, data storage, and advanced analytics using AI and ML.
Kafka, Killer of Point-to-Point Integrations, Lucian Litaconfluent
With 60+ products and over 24% of the US GDP flowing through it, system integration is a tough problem for Intuit. Seasonality, scale, and massive peaks in products like TurboTax, QuickBooks, and Mint.com add extra layers of difficulty when building shared data services around transaction and user graphs, clickstream processing, a/b testing, and personalization. To reduce complexity and latency, we’ve implemented Kafka as the backbone across these data services. This allows us to asynchronously trigger relevant processing, elegantly scaling up and down as needed around peaks, all without the need for point-to-point integrations.
In this talk, we share what we’ve learned about Kafka at Intuit and describe our data services architecture. We found that Kafka is invaluable in achieving a scalable, clean architecture, allowing engineering teams to focus less on integration and more on product development.
Kafka and Stream Processing, Taking Analytics Real-time, Mike Spicerconfluent
Do you think that analytics are run on stored data sets? Think again, the combination of Apache Kafka and Stream Processing enables analytics on real-time data streams.
First, we give a brief overview of stream processing and how it differs from the request response model of analytics on stored data. Next, we cover the characteristics of Kafka which make it such a good fit for Stream processing and why they matter. Finally, we show a number of use cases which highlight how stream processing is being used to do real-time analytics at scale with very low latency.
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumarconfluent
Siphon is a highly available and reliable distributed pub/sub system built using Apache Kafka. It is used to publish, discover and subscribe to near real-time data streams for operational and product intelligence. Siphon is used as a “Databus” by a variety of producers and subscribers in Microsoft, and is compliant with security and privacy requirements. It has a built-in Auditing and Quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, the scale, and real world production experiences from operating the service in the Microsoft cloud environment.
At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility".
How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company?
How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos?
We'll talk about how Netflix equips its business intelligence and data engineers with:
the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems
the freedom to find and introduce right software for the job - even if it isn't used anywhere else in-house
the freedom to create and drop new tables in production without approval
the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool
the freedom to retire analytics and data processes whose value doesn't justify their support costs
Speaker Bios
Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace.
Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco.
Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers’ data architecture. In the TigerGraph database, Kafka Connect framework was used to build the native S3 data loader. In TigerGraph Cloud, we will be building native integration with many data sources such as Azure Blob Storage and Google Cloud Storage using Kafka as an integrated component for the Cloud Portal.
In this session, we will be discussing both architectures: 1. built-in Kafka Connect framework within TigerGraph database; 2. using Kafka cluster for cloud native integration with other popular data sources. Demo will be provided for both data streaming processes.
Presentation on dogfooding data at Lyft by Mark Grover and Arup Malakar on Oct 25, 2017 at Big Analytics Meetup (https://www.meetup.com/SF-Big-Analytics/events/243896328/)
Empowering Zillow’s Developers with Self-Service ETLDatabricks
As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal — some are for business-critical datasets, others are for routine data preparation, and others are for experimental transformations that allow data scientists to iterate quickly on their solutions.
To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor their data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended user, leverages internal services through its modular design, and empowers users to create their own ETL without having to worry about how the ETL is implemented.
Members of Zillow’s data engineering team discuss:
Why they created two separate user interfaces to meet the needs different user groups
What degree of abstraction from the orchestration, deployment, processing, and other ancillary tasks that chose for each user group
How they leveraged internal services and packages, including their Apache Spark package — Pipeler, to democratize the creation of high-quality, reliable pipelines within Zillow
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Correlate Log Data with Business Metrics Like a JediTrevor Parsons
The Logentries and Hosted Graphite integration allows you to connect two of your favorite Ops tools to easily extract important data from your log files, visualize them as metrics, and share them in Hosted Graphite dashboards.
• Integrate your systems to extract the metrics you need, from both your applications and log data.
• Set-up log metric dashboards based on common use cases (e.g. error tracking, performance, app usage).
• Get off the "complexity elevator" of hosting your own in-house logging or graphite solutions.
• Delight your team and organization with valuable metrics and performance insights.
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
http://flink-forward.org/kb_sessions/robust-stream-processing-with-apache-flink/
In this hands on talk and demonstration I’ll give a very short introduction to stream processing and then dive into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. We’ll focus on correctness and robustness in stream processing. During this live demo we’ll be developing a realtime analytics application and modifying it on the fly based on the topics we’re working though. We’ll exercise Flink’s unique features, demonstrate fault-recovery, clearly explain and demonstrate why Event Time is such an important concept in robust stateful stream processing and talk about and demonstrate the features you need in a stream processor in production. Some of the topics covered will be: – Stateful Stream Processing – Event Time vs. Processing Time – Fault tolerance – State management in the face of faults – Savepoints – Data re-processing – Planned downtime and upgrades
Streaming datasets for personalizationShriya Arora
Streaming applications have historically been complex to design and implement because of the significant infrastructure investment. However, recent active developments in various streaming platforms provide an easy transition to stream processing, and enable analytics applications/experiments to consume near real-time data without massive development cycles.In this session, we will present our experience on stream processing unbounded datasets in the personalization space. The datasets consisted of -- but were not limited to -- the stream of playback events that are used as feedback for all personalization algorithms. These datasets when ultimately consumed by our machine learning models, directly affect the customer’s personalized experience. We’ll talk about the experiments we did to compare Apache Spark and Apache Flink, and the challenges we faced.
Spline: Data Lineage For Spark Structured StreamingVaclav Kosar
Data lineage tracking is one of the significant problems that companies in highly regulated industries face. These companies are forced to have a good understanding of how data flows through their systems to comply with strict regulatory frameworks. Many of these organizations also utilize big and fast data technologies such as Hadoop, Apache Spark and Kafka. Spark has become one of the most popular engines for big data computing. In recent releases, Spark also provides the Structured Streaming component, which allows for real-time analysis and processing of streamed data from many sources. Spline is a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans in a lightweight, unobtrusive and easy to use manner.
Additionally, Spline offers a modern user interface that allows non-technical users to understand the logic of Apache Spark applications. In this presentation we cover the support of Spline for Structured Streaming and we demonstrate how data lineage can be captured for streaming applications.
Presented at Spark Summit London 2018
Elasticsearch has always been fast, but required structuring and indexing your data up front. We're changing that with the introduction of runtime fields, which enable you to extract, calculate, and transform fields at query time. They can be defined after data is indexed or provided with your query, enabling new cost/storage/performance tradeoffs, and letting analysts gradually define fields over time.
Activate 2019 - Search and relevance at scale for online classifiedsRoger Rafanell Mas
A high performing search service implies both having an effective search infrastructure and high search relevance.
Seeking for a fault tolerant, self-healing and cost-effective search infrastructure at scale, we built a platform based on Apache Solr search engine with light in-memory indexes, avoiding sharding and decreasing the overall infrastructure needs.
To populate the indexes, we use flexible ETL processes, keeping our product catalog and search indexes updated in a near real-time fashion and distributed across high-performant database engines.
We aim at getting a high search relevance precision and recall by applying query relaxation and boost solutions on top of the optimised platform.
https://www.activate-conf.com/speakers/detail/roger-rafanell
Sub-Second SQL Search, Aggregations and Joins with Kafka and Rockset | Dhruba...HostedbyConfluent
We often need to build applications that analyze Kafka data to unlock the most value from event streams, so how can organizations build these real-time analytics applications? In this talk, we examine an indexing approach that enables fast SQL analytics on data from Kafka, without data flattening or denormalization. Rockset is the real-time indexing database that builds an inverted index, a columnar index and a row index on all fields of your Kafka messages, including nested fields and arrays. This Converged Index accelerates various types of analytic queries–search, aggregations and joins–without the need to denormalize or transform data for performance reasons. With indexing delivering significant gains in query performance, we also need to index new data in a timely manner. We discuss several strategies used for efficient ingestion and indexing from Kafka, including rollups, write optimizations on the underlying RocksDB storage engine, and the disaggregation of ingest and query compute.
Telling the LivePerson Technology Story at Couchbase [SF] 2013LivePerson
As part of Couchbase[SF]2013, Ido Shilon, R&D Group Leader at LivePerson, discusses LivePerson's project to re-architect their LiveEngage platform backend, with a focus on LivePerson's decision to use NoSQL technologies, the challenges encountered, and the lessons learned.
Video: http://www.youtube.com/watch?v=rYKWFmJEHX0
Hybrid Kafka, Taking Real-time Analytics to the Business (Cody Irwin, Google ...HostedbyConfluent
Apache Kafka users who want to leverage Google Cloud Platform's (GCPs) data analytics platform and open source hosting capabilities can bridge their existing Kafka infrastructure on-premise or in other clouds to GCP using Confluent's replicator tool and managed Kafka service on GCP. Using actual customer examples and a reference architecture, we'll showcase how existing Kafka users can stream data to GCP and use it in popular tools like Apache Beam on Dataflow, BigQuery, Google Cloud Storage (GCS), Spark on Dataproc, and Tensorflow for data warehousing, data processing, data storage, and advanced analytics using AI and ML.
Kafka, Killer of Point-to-Point Integrations, Lucian Litaconfluent
With 60+ products and over 24% of the US GDP flowing through it, system integration is a tough problem for Intuit. Seasonality, scale, and massive peaks in products like TurboTax, QuickBooks, and Mint.com add extra layers of difficulty when building shared data services around transaction and user graphs, clickstream processing, a/b testing, and personalization. To reduce complexity and latency, we’ve implemented Kafka as the backbone across these data services. This allows us to asynchronously trigger relevant processing, elegantly scaling up and down as needed around peaks, all without the need for point-to-point integrations.
In this talk, we share what we’ve learned about Kafka at Intuit and describe our data services architecture. We found that Kafka is invaluable in achieving a scalable, clean architecture, allowing engineering teams to focus less on integration and more on product development.
Kafka and Stream Processing, Taking Analytics Real-time, Mike Spicerconfluent
Do you think that analytics are run on stored data sets? Think again, the combination of Apache Kafka and Stream Processing enables analytics on real-time data streams.
First, we give a brief overview of stream processing and how it differs from the request response model of analytics on stored data. Next, we cover the characteristics of Kafka which make it such a good fit for Stream processing and why they matter. Finally, we show a number of use cases which highlight how stream processing is being used to do real-time analytics at scale with very low latency.
Siphon - Near Real Time Databus Using Kafka, Eric Boyd, Nitin Kumarconfluent
Siphon is a highly available and reliable distributed pub/sub system built using Apache Kafka. It is used to publish, discover and subscribe to near real-time data streams for operational and product intelligence. Siphon is used as a “Databus” by a variety of producers and subscribers in Microsoft, and is compliant with security and privacy requirements. It has a built-in Auditing and Quality control. This session will provide an overview of the use of Kafka at Microsoft, and then deep dive into Siphon. We will describe an important business scenario and talk about the technical details of the system in the context of that scenario. We will also cover the design and implementation of the service, the scale, and real world production experiences from operating the service in the Microsoft cloud environment.
At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility".
How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company?
How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos?
We'll talk about how Netflix equips its business intelligence and data engineers with:
the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems
the freedom to find and introduce right software for the job - even if it isn't used anywhere else in-house
the freedom to create and drop new tables in production without approval
the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool
the freedom to retire analytics and data processes whose value doesn't justify their support costs
Speaker Bios
Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace.
Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco.
Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
How a distributed graph analytics platform uses Apache Kafka for data ingesti...HostedbyConfluent
Using Kafka to stream data into TigerGraph, a distributed graph database, is a common pattern in our customers’ data architecture. In the TigerGraph database, Kafka Connect framework was used to build the native S3 data loader. In TigerGraph Cloud, we will be building native integration with many data sources such as Azure Blob Storage and Google Cloud Storage using Kafka as an integrated component for the Cloud Portal.
In this session, we will be discussing both architectures: 1. built-in Kafka Connect framework within TigerGraph database; 2. using Kafka cluster for cloud native integration with other popular data sources. Demo will be provided for both data streaming processes.
Presentation on dogfooding data at Lyft by Mark Grover and Arup Malakar on Oct 25, 2017 at Big Analytics Meetup (https://www.meetup.com/SF-Big-Analytics/events/243896328/)
Empowering Zillow’s Developers with Self-Service ETLDatabricks
As the amount of data and the number of unique data sources within an organization grow, handling the volume of new pipeline requests becomes difficult. Not all new pipeline requests are created equal — some are for business-critical datasets, others are for routine data preparation, and others are for experimental transformations that allow data scientists to iterate quickly on their solutions.
To meet the growing demand for new data pipelines, Zillow created multiple self-service solutions that enable any team to build, maintain, and monitor their data pipelines. These tools abstract away the orchestration, deployment, and Apache Spark processing implementation from their respective users. In this talk, Zillow engineers discuss two internal platforms they created to address the specific needs of two distinct user groups: data analysts and data producers. Each platform addresses the use cases of its intended user, leverages internal services through its modular design, and empowers users to create their own ETL without having to worry about how the ETL is implemented.
Members of Zillow’s data engineering team discuss:
Why they created two separate user interfaces to meet the needs different user groups
What degree of abstraction from the orchestration, deployment, processing, and other ancillary tasks that chose for each user group
How they leveraged internal services and packages, including their Apache Spark package — Pipeler, to democratize the creation of high-quality, reliable pipelines within Zillow
Spark summit 2017- Transforming B2B sales with Spark powered sales intelligenceWei Di
B2B sales intelligence has become an integral part of LinkedIn’s business to help companies optimize resource allocation and design effective sales and marketing strategies. This new trend of data-driven approaches has “sparked” a new wave of AI and ML needs in companies large and small. Given the tremendous complexity that arises from the multitude of business needs across different verticals and product lines, Apache Spark, with its rich machine learning libraries, scalable data processing engine and developer-friendly APIs, has been proven to be a great fit for delivering such intelligence at scale.
See how Linkedin is utilizing Spark for building sales intelligence products. This session will introduce a comprehensive B2B intelligence system built on top of various open source stacks. The system puts advanced data science to work in a dynamic and complex scenario, in an easily controllable and interpretable way. Balancing flexibility and complexity, the system can deal with various problems in a unified manner and yield actionable insights to empower successful business. You will also learn about some impactful Spark-ML powered applications such as prospect prediction and prioritization, churn prediction, model interpretation, as well as challenges and lessons learned at LinkedIn while building such platform.
Correlate Log Data with Business Metrics Like a JediTrevor Parsons
The Logentries and Hosted Graphite integration allows you to connect two of your favorite Ops tools to easily extract important data from your log files, visualize them as metrics, and share them in Hosted Graphite dashboards.
• Integrate your systems to extract the metrics you need, from both your applications and log data.
• Set-up log metric dashboards based on common use cases (e.g. error tracking, performance, app usage).
• Get off the "complexity elevator" of hosting your own in-house logging or graphite solutions.
• Delight your team and organization with valuable metrics and performance insights.
Jamie Grier - Robust Stream Processing with Apache FlinkFlink Forward
http://flink-forward.org/kb_sessions/robust-stream-processing-with-apache-flink/
In this hands on talk and demonstration I’ll give a very short introduction to stream processing and then dive into writing code and demonstrating the features in Apache Flink that make truly robust stream processing possible. We’ll focus on correctness and robustness in stream processing. During this live demo we’ll be developing a realtime analytics application and modifying it on the fly based on the topics we’re working though. We’ll exercise Flink’s unique features, demonstrate fault-recovery, clearly explain and demonstrate why Event Time is such an important concept in robust stateful stream processing and talk about and demonstrate the features you need in a stream processor in production. Some of the topics covered will be: – Stateful Stream Processing – Event Time vs. Processing Time – Fault tolerance – State management in the face of faults – Savepoints – Data re-processing – Planned downtime and upgrades
Streaming datasets for personalizationShriya Arora
Streaming applications have historically been complex to design and implement because of the significant infrastructure investment. However, recent active developments in various streaming platforms provide an easy transition to stream processing, and enable analytics applications/experiments to consume near real-time data without massive development cycles.In this session, we will present our experience on stream processing unbounded datasets in the personalization space. The datasets consisted of -- but were not limited to -- the stream of playback events that are used as feedback for all personalization algorithms. These datasets when ultimately consumed by our machine learning models, directly affect the customer’s personalized experience. We’ll talk about the experiments we did to compare Apache Spark and Apache Flink, and the challenges we faced.
Spline: Data Lineage For Spark Structured StreamingVaclav Kosar
Data lineage tracking is one of the significant problems that companies in highly regulated industries face. These companies are forced to have a good understanding of how data flows through their systems to comply with strict regulatory frameworks. Many of these organizations also utilize big and fast data technologies such as Hadoop, Apache Spark and Kafka. Spark has become one of the most popular engines for big data computing. In recent releases, Spark also provides the Structured Streaming component, which allows for real-time analysis and processing of streamed data from many sources. Spline is a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans in a lightweight, unobtrusive and easy to use manner.
Additionally, Spline offers a modern user interface that allows non-technical users to understand the logic of Apache Spark applications. In this presentation we cover the support of Spline for Structured Streaming and we demonstrate how data lineage can be captured for streaming applications.
Presented at Spark Summit London 2018
Elasticsearch has always been fast, but required structuring and indexing your data up front. We're changing that with the introduction of runtime fields, which enable you to extract, calculate, and transform fields at query time. They can be defined after data is indexed or provided with your query, enabling new cost/storage/performance tradeoffs, and letting analysts gradually define fields over time.
Activate 2019 - Search and relevance at scale for online classifiedsRoger Rafanell Mas
A high performing search service implies both having an effective search infrastructure and high search relevance.
Seeking for a fault tolerant, self-healing and cost-effective search infrastructure at scale, we built a platform based on Apache Solr search engine with light in-memory indexes, avoiding sharding and decreasing the overall infrastructure needs.
To populate the indexes, we use flexible ETL processes, keeping our product catalog and search indexes updated in a near real-time fashion and distributed across high-performant database engines.
We aim at getting a high search relevance precision and recall by applying query relaxation and boost solutions on top of the optimised platform.
https://www.activate-conf.com/speakers/detail/roger-rafanell
The Evolution of Testing Methodology at AWS: From Status Quo to Formal Method...C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1MMYOsf.
Tim Rath explains how and why Amazon incorporated more powerful testing methodologies, ultimately leading them to the use of formal methods where TLA+ has become a cornerstone to our overall strategy. Filmed at qconsf.com.
Tim Rath has been building large scale distributed systems at Amazon for the past eleven years.
Behind the Wizard’s Curtain: Scalability and Security at Zuora (Subscribed13)Zuora, Inc.
Ever wonder what's in the Zuora cloud? Join us and learn how Zuora has built a scalable and secure cloud based subscription billing management service. Hear from scalability, security and operations engineers and have your questions answered.
(Mike Graham + Dan Carroll, Comcast) Kafka Summit SF 2018
Comcast manages over 2 million miles of fiber and coax, and over 40 million in home devices. This “outside plant” is subject to adverse conditions from severe weather to power grid outages to construction-related disruptions. Maintaining the health of this large and important infrastructure requires a distributed, scalable, reliable and fast information system capable of real-time processing and rapid analysis and response. Using Apache Kafka and the Kafka Streams Processor API, Comcast built an innovative new system for monitoring, problem analysis, metrics reporting and action response for the outside plant.
In this talk, you’ll learn how topic partitions, state stores, key mapping, source and sink topics and processors from the Kafka Streams Processor API work together to build a powerful dynamic system. We will dive into the details about the inner workings of the state store—how it is backed by a Kafka “changelog” topic, how it is scaled horizontally by partition and how the instances are rebuilt on startup or on processor failure. We will discuss how these state stores essentially become like materialized views in a SQL database but are updated incrementally as data flows through the system, and how this allows the developers to maintain the data in the optimal structures for performing the processing. The best part is that the data is readily available when needed by the processors. You will see how a REST API using Kafka Streams “interactive queries” can be used to retrieve the data in the state stores. We will explore the deployment and monitoring mechanisms used to deliver this system as a set of independently deployed components.
Using Deep Learning and Customized Solr Components to Improve search Relevanc...Lucidworks
Gain insight into the state-of-the-art deep learning algorithms being used to power e-commerce search at Target and how to customize Solr to blend multiple ML signals at a large scale.
Speakers:
Aashish Dattani, Lead Data Engineer, Target
Richard Wang, Principal AI Scientist, Target
Sunil Srinivasan, Lead Engineer, Target
Creating a Project Plan for a Data Warehouse Testing AssignmentRTTS
Learn how to create a project plan for a Data Warehouse testing assignment. Chris Thompson and Mike Calabrese, Senior Solution Architects and QuerySurge experts, provide great information, a demo and lots of humor.
This webinar was performed in conjunction with Test Guild.
To watch the video, go to:
https://youtu.be/_sNYZgL3rZY
A Practical Guide to Selecting a Stream Processing Technology confluent
Presented by Michael Noll, Product Manager, Confluent.
Why are there so many stream processing frameworks that each define their own terminology? Are the components of each comparable? Why do you need to know about spouts or DStreams just to process a simple sequence of records? Depending on your application’s requirements, you may not need a full framework at all.
Processing and understanding your data to create business value is the ultimate goal of a stream data platform. In this talk we will survey the stream processing landscape, the dimensions along which to evaluate stream processing technologies, and how they integrate with Apache Kafka. Particularly, we will learn how Kafka Streams, the built-in stream processing engine of Apache Kafka, compares to other stream processing systems that require a separate processing infrastructure.
Operationalizing Machine Learning—Managing Provenance from Raw Data to Predic...Databricks
Our team at Comcast is challenged with operationalizing predictive ML models to improve customer experience. Our goal is to eliminate bottlenecks in the process from model inception to deployment and monitoring.
Traditionally CI/CD manages code and infrastructure artifacts like container definitions. We want to extend it to support granular traceability enabling tracking of ML Models from use-case, to feature/attribute selection, development of versioned datasets, model training code, model evaluation artifacts, model prediction deployment containers, and sinks to which the predictions/outcomes are persisted to. Our framework stack enables us to track models from use-case to deployments, manage and evaluate multiple models simultaneously in the live yet dark mode and continue to monitor models in production against real-world outcomes using configurable policies.
The technologies/components which drive this vision are:
1. FeatureStore – Enables data scientists to reuse versioned features and review feature metrics by models. Self-Service capabilities allow all teams to onboard their events data into the feature store.
2. ModelRepository – Manages meta-data about models including pre-processing parameters (Ex. Scaling parameters for features), mapping to the features needed to execute the model, model discovery mechanisms, etc.
3. Spark on Alluxio – Alluxio provides the universal data plane on top of various under-stores (Ex. S3, HDFS, RDBMS). Apache Spark with its Data Sources API provides a unified query language which Data Scientist use to consume features to create training/validation/test datasets which are versioned and integrated into the full model pipeline using Ground-Context discussed next.
4. Ground-Context – This open-source vendor-neutral data context service enables full traceability from use-case, models, features, model to features mapping, versioned datasets, model training codebase, model deployment containers and prediction/outcome sinks. It integrates with the Feature-Store, Container Repository and Git to integrate data, code and run-time artifacts for CI/CD integration.
Salesforce API access is nice, but often not enough. In the Big Data era, you are often required to replicate Salesforce data for offline processing. Implisit requires such replication for its data entry engine to automatically enter emails, events, contacts, and leads for its users. In order to do so, Implisit maintains a daily sync of over one billion Salesforce data records, while using no more than a few hundred API calls per Salesforce Org. Join us as we share the suggested architecture of such a replication mechanism, the best practices we developed over time, and the pitfalls to avoid.
Similar to Elevation Query Extension: Introducing Subselects into Lucene Queries (20)
Search is the Tip of the Spear for Your B2B eCommerce StrategyLucidworks
With ecommerce experiencing explosive growth, it seems intuitive that the B2B segment of that ecosystem is mirroring the same trajectory. That said, B2B has very different needs when it comes to transacting with the same style of experiences that we see in B2C. For instance, B2B ecommerce is about precision findability, whereas B2C customers can convert at higher rates when they’re just browsing online. In order for the B2B buying experience to be successful, search needs to be tuned to meet the unique needs of the segment.
In this webinar with Forrester senior analyst Joe Cicman, you’ll learn:
-Which verticals in B2B will drive the most growth, and how machine-learning powered personalization tactics can be deployed to support those specific verticals
-Why an omnichannel selling approach must be deployed in order to see success in B2B
-How deploying content search capabilities will support a longer sales cycle at scale
-What the next steps are to support a robust B2B commerce strategy supported by new technology
Speakers
Joe Cicman, Senior Analyst, Forrester
Jenny Gomez, VP of Marketing, Lucidworks
Customer loyalty starts with quickly responding to your customer’s needs. When it comes to resolving open support cases, time is of the essence. Time spent searching for answers adds up and creates inefficiencies in resolving cases at scale. Relevant answers need to be a few clicks away and easily accessible for agents directly from their service console.
We will explore how Lucidworks’ Agent Insights application automatically connects agents with the correct answers and resources. You’ll learn how to:
-Configure a proactive widget in an agent’s case view page to access resources across third-party systems (such as Sharepoint, Confluence, JIRA, Zendesk, and ServiceNow).
-Easily set up query pipelines to autonomously route assets and resources that are relevant to the case-at-hand—directly to the right agent.
-Identify subject matter experts within your support data and access tribal knowledge with lightning-fast speed.
How Crate & Barrel Connects Shoppers with Relevant ProductsLucidworks
Lunch and Learn during Retail TouchPoints #RIC21 virtual event.
***
Crate & Barrel’s previous search solution couldn’t provide its shoppers with an online search and browse experience consistent with the customer-centric Crate & Barrel brand. Meanwhile, Crate & Barrel merchandisers spent the bulk of their time manually creating and maintaining search rules. The search experience impacted customer retention, loyalty, and revenue growth.
Join this lunch & learn for an interactive chat on how Crate & Barrel partnered with Lucidworks to:
-Improve search and browse by modernizing the technology stack with ML-based personalization and merchandising solutions
-Enhance the experience for both shoppers and merchandisers
-Explore signals to transform the omnichannel shopping experience
Questions? Visit https://lucidworks.com/contact/
Learn how to guide customers to relevant products using eCommerce search, hyper-personalisation, and recommendations in our ‘Best-In-Class Retail Product Discovery’ webinar.
Nowadays, shoppers want their online experience to be engaging, inspirational and fulfilling. They want to find what they’re looking for quickly and easily. If the sought after item isn’t available, they want the next best product or content surfaced to them. They want a website to understand their goals as though they were talking to a sales assistant in person, in-store.
In this webinar, we explore IMRG industry data insights and a best-in-class example of retail product discovery. You’ll learn:
- How AI can drive increased revenue through hyper-personalised experiences
- How user intent can be easily understood and results displayed immediately
- How merchandisers can be empowered to curate results and product placement – all without having to rely on IT.
Presented by:
Dave Hawkins, Principal Sales Engineer - Lucidworks
Matthew Walsh, Director of Data & Retail - IMRG
Connected Experiences Are Personalized ExperiencesLucidworks
Many companies claim personalization and omnichannel capabilities are top priorities. Few are able to deliver on those experiences.
For a recent Lucidworks-commissioned study, Forrester Consulting surveyed 350+ global business decision-makers to see what gets in the way of achieving these goals. They discovered that inefficient technology, lack of behavioral insights, and failure to tie initiatives to enterprise-wide goals are some of the most frequent blockers to personalization success.
Join guest speaker, Forrester VP and Principal Analyst, Brendan Witcher, and Lucidworks CEO, Will Hayes, to hear the results of the Forrester Consulting study, how to avoid “digital blindness,” and how to apply VoC data in real-time to delight customers with personalized experiences connected across every touchpoint.
In this webinar, you’ll learn:
- Why companies who utilize real-time customer signals report more effective personalization
- How to connect employees and customers in a shared experience through search and browse
- How Lucidworks clients Lenovo, Morgan Stanley and Red Hat fast-tracked improvements in conversion, engagement and customer satisfaction
Featuring
- Will Hayes, CEO, Lucidworks
- Brendan Witcher, VP, Principal Analyst, Forrester
Intelligent Insight Driven Policing with MC+A, Toronto Police Service and Luc...Lucidworks
Intelligent Policing. Leveraging Data to more effectively Serve Communities.
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
-The technology needs of an intelligent police force.
-How a Global Search improves an officer's interaction with existing data.
Featuring:
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
[Webinar] Intelligent Policing. Leveraging Data to more effectively Serve Com...Lucidworks
Policing in the next decade is anticipated to be very different from historical methods. More data driven, more focused on the intricacies of communities they serve and more open and collaborative to make informed recommendations a reality. Whether its social populations, NIBRS or organization improvement that’s the driver, the IT requirement is largely the same. Provide 360 access to large volumes of siloed data to gain a full 360 understanding of existing connections and patterns for improved insight and recommendation.
Join us for a round table discussion of how the Toronto Police Service is better serving their community through deploying a unified intelligent data platform.
Data innovation improves officers' engagement with existing data and streamlines investigation workflows by enhancing collaboration. This improved visibility into existing police data allows for a more intelligent and responsive police force.
In this webinar, we'll cover:
The technology needs of an intelligent police force.
How a Global Search improves an officer's interaction with existing data.
Featuring
-Simon Taylor, VP, Worldwide Channels & Alliances, Lucidworks
-Michael Cizmar, Managing Director, MC+A
-Ian Williams, Manager of Analytics & Innovation, Toronto Police Service
Accelerate The Path To Purchase With Product Discovery at Retail Innovation C...Lucidworks
Wish your conversion rates were higher? Can’t figure out how to efficiently and effectively serve all the visitors on your site? Embarrassed by the quality of your product discovery experience? The bar is high and the influx of online shopping over recent months has reminded us that the opportunities are real. We’re all deep in holiday prep, but let’s take a few minutes to think about January 2021 and beyond. How can we position ourselves for success with our customers and against our competition?
Grab your lunch and let’s dive into three strategies that need to be part of your 2021 roadmap. You don’t need an army to get there. But you do need to take action and capitalize on the shoppers abandoning the product discovery journey on your site.
In this session, attendees will find out how to:
-Take control of merchandising at scale;
-Implement hands-free search relevancy; and
-Address personalization challenges.
AI-Powered Linguistics and Search with Fusion and RosetteLucidworks
For a personalized search experience, search curation requires robust text interpretation, data enrichment, relevancy tuning and recommendations. In order to achieve this, language and entity identification are crucial.
For teams working on search applications, advanced language packages allow them to achieve greater recall without sacrificing precision.
Join us for a guided tour of our new Advanced Linguistics packages, available in Fusion, thanks to the technology partnership between Lucidworks and Basistech.
We’ll explore the application of language identification and entity extraction in the context of search, along with practical examples of personalizing search and enhancing entity extraction.
In this webinar, we’ll cover:
-How Fusion uses the Rosette Basic Linguistics and Entity Extraction packages
-Tips for improving language identification and treatment as well as data enrichment for personalization
-Speech2 demo modeling Active Recommendation
-Use Rosette’s packages with Fusion Pipelines to build custom entities for specific domain use cases
Featuring:
-Radu Miclaus, Director of Product, AI and Cloud, Lucidworks, Lucidworks
-Robert Lucarini, Senior Software Engineer, Lucidworks
-Nick Belanger, Solutions Engineer, Basis Technology
The Service Industry After COVID-19: The Soul of Service in a Virtual MomentLucidworks
Before COVID-19, almost 80% of the US workforce worked service in jobs that involve in-person interaction with strangers. Now, leaders of service organizations must reshape their offerings during the pandemic and prepare for whatever the new normal turns out to be. Our three panelists will share ideas for adapting their service businesses, now that closer-than-six-feet isn’t an option.
Join Lucidworks as we talk shop with 3 service business leaders, covering:
-Common impacts of the pandemic on service businesses (and what to do about them),
-How service teams can maintain a human touch across virtual channels, and
-Plans for the future, before and after the pandemic subsides.
Featuring
-Sara Nathan, President & CEO, AMIGOS
-Anthony Carruesco, Founder, AC Fly Fishing
-sara bradley, chef and proprietor, freight house
-Justin Sears, VP Product Marketing, Lucidworks
Webinar: Smart answers for employee and customer support after covid 19 - EuropeLucidworks
The COVID-19 pandemic has forced companies to support far more customers and employees through digital channels than ever before. Many are turning to chatbots to help meet increasing demand, but traditional rules-based approaches can’t keep up. Our new Smart Answers add-on to Lucidworks Fusion makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
Smart Answers for Employee and Customer Support After COVID-19Lucidworks
Watch our on-demand webinar showcasing Smart Answers on Lucidworks Fusion. This technology makes existing chatbots and virtual assistants more intelligent and more valuable to the people you serve.
In this webinar, we’ll cover off:
-How search and deep learning extend conversational frameworks for improved experiences
-How Smart Answers improves customer care, call deflection, and employee self-service
-A live demo of Smart Answers for multi-channel self-service support
Applying AI & Search in Europe - featuring 451 ResearchLucidworks
In the current climate, it’s now more important than ever to digitally enable your workforce and customers.
Hear from Simon Taylor, VP Global Partners & Alliances, Lucidworks and Matt Aslett, Research Vice President, 451 Research to get the inside scoop on how industry leaders in Europe are developing and executing their digital transformation strategies.
In this webinar, we’ll discuss:
The top challenges and aspirations European business and technology leaders are solving using AI and search technology
Which search and AI use cases are making the biggest impact in industries such as finance, healthcare, retail and energy in Europe
What technology buyers should look for when evaluating AI and search solutions
Webinar: 5 Must-Have Items You Need for Your 2020 Ecommerce StrategyLucidworks
In this webinar with 451 Research, you'll understand how retailers are using AI to predict customer intent and learn which key performance metrics are used by more than 120 online retailers in Lucidworks’ 2019 Retail Benchmark Survey.
In this webinar, you’ll learn:
● What trends and opportunities are facing the ecommerce industry in 2020
● Why search is the universal path to understanding customer intent
● How large online retailers apply AI to maximize the effectiveness of their personalization efforts
Where Search Meets Science and Style Meets Savings: Nordstrom Rack's Journey ...Lucidworks
Nordstrom Rack | Hautelook curates and serves customers a wide selection of on-trend apparel, accessories, and shoes at an everyday savings of up to 75 percent off regular prices. With over a million visitors shopping across different platforms every day, and a realization that customers have become accustomed to robust and personalized search interactions, Nordstrom Rack | Hautelook launched an initiative over a year ago to provide data science-driven digital experiences to their customers.
In this session, we’ll discuss Nordstrom Rack | Hautelook’s journey of operationalizing a hefty strategy, optimizing a fickle infrastructure, and rallying troops around a single vision of building an expansible machine-learning driven product discovery engine.
The audience will learn about:
-The key technical challenges and outcomes that come with onboarding a solution
-The lessons learned of creating and executing operational design
-The use of Lucidworks Fusion to plug custom data science models into search and browse applications to understand user intent and deliver personalized experiences
Apply Knowledge Graphs and Search for Real-World Decision IntelligenceLucidworks
Knowledge graphs and machine learning are on the rise as enterprises hunt for more effective ways to connect the dots between the data and the business world. With newer technologies, the digital workplace can dramatically improve employee engagement, data-driven decisions, and actions that serve tangible business objectives.
In this webinar, you will learn
-- Introduction to knowledge graphs and where they fit in the ML landscape
-- How breakthroughs in search affect your business
-- The key features to consider when choosing a data discovery platform
-- Best practices for adopting AI-powered search, with real-world examples
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Epistemic Interaction - tuning interfaces to provide information for AI support
Elevation Query Extension: Introducing Subselects into Lucene Queries
1. STAY CONNECTED
Twitter @activate_conf
Facebook @activateconf
#Activate19
Log in to wifi, follow Activate on social media,
and download the event app where you can
submit an evaluation after the session
WIFI NETWORK: Activate2019
PASSWORD: Lucidworks
DOWNLOAD THE ACTIVATE 2019 MOBILE APP
Search Activate2019 in the App/Play store
Or visit: http://crowd.cc/activate19
4. Speaker Slide
R O B E R T
K I R C H G E S S N E R
Search Technology Architect
Wolters Kluwer
E X P E R I E N C E
• Search Algorithms Development
• Content Analysis
• Entity Recognition
• Solr plugins / extensions
• Strong software development experience for about 14 years in different commercial projects
• Last 4 years working on search expertise, particularly with Apache Solr and cloud-based solution for this
including availability and scalability.
• Customers: Wolters Kluwer, TRAFIGURA, Daadkracht...
N A Z A R S E N I U K
Lead Software Engineer
EPAM
6. Some background
• Developing search applications for legal market since 2003
• Inhomogeneous, structured content, rich metadata (laws, cases, commentaries)
• Use of metadata for ranking is essential for good results
• Up to 30% of queries contain legal / other entities
• Relying on query cooking using entity recognition in the user input
• Combining with full text search and tuning the results becomes a challenge
7. Example
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
How to achieve?
8. Requirements
C O N T E N T S T R U C T U R E
• Handle entities in the user input properly: legal citations, locations, dates, names
– e.g. place the correct document cited in the query on the top
– given a book title place an entry document (table of contents) on the top
• Top (1-5) hits expected to be unambiguous
• Use the top slots efficiently (10-100 hits)
• Keep balance between numerous document types (legal cases) and relevant or promoted
document types
Generally more precise control of what is going on in the top 10
9. Possible solutions
• Boost factors on queries, terms, documents
• Sort fields
• Ranking functions
• Function queries
• Reranking (in Solr or application)
• Filtering
• Multiple requests
10. Works, but…
• Some are too complex
• Some are too slow
• Others are not reliable
• Missing a concept of subquery:
– tracking from which subquery a document is coming from
• Missing LIMIT as in SQL
11. Example continued
User input: § 123 BGB
Transformed to queries Q1, Q2, Q3, Q4
Expected output:
• § 123 BGB (law document)
• Legal commentary A to § 123 BGB (promoted content)
• Legal commentary B to § 123 BGB (promoted content)
• Some latest cases based on § 123 BGB (relevant content)
• Full text (or whatever needed)
Want the request look like: Q1 << Q2 << Q3 << Q4
12. Elevation query
Initial Idea / Specification
Given a list of queries Q1, Q2, …, QN produce a result fulfilling the conditions:
• All the documents of Qn are placed before the documents of Qm for m>n
• Each hit should occur in the leftmost possible subset
• No duplication of hits
• Meaningful scores
• Correct faceting
13. Elevation query
Additional requirements / expectations
• One request / one pass search
• Usable via some new syntax / parser support
• Implemented as plugin
Furthermore it should be possible to
• impose a limit on the results of each subquery
• provide a sort parameter for each subquery
16. Implementation status
• https://github.com/rokirx/solr-eq
• Working
– Collector logic / multiple queues
– Sort and limit parameter per subquery
– Parser support
• In testing
– Correct scoring
– Faceting
– Multiple sort fields per subquery
• Works with 6.4, 7.6, 8.0, 8.2
17. Case Study: Autosuggest
User Input tax
• Assumptions on the relevancy of completion:
– Highest priority if the term at the beginning and exact match, eg tax relief
– Lower priority exact match but term not at the beginnilng, eg income tax
– Lowest priority prefix match anywhere in the phrase, eg estate taxes
• Map this condition to queries:
– Term at the beginning of a phrase and exact match: ^tax$
– Exact match in the middle of a phrase: tax$
– Prefix match (edge n-gram): tax
18. Case Study: Autosuggest
User Input tax
• Resulting query: ^tax$ << tax$ << tax guarantees the specified behavior
• Additional benefit: optimize the performance by cancelling out subqueries
– If the exact hit count is not necessary
– And the minimum required number of hits in the preceeding queues is collected
– Stop fetching the docs from lower priority queue by cancelling them out of the collector/scorer
– Whitout missing out any relevant documents
19. Potential benefits
• Reduce the number of search requests
• Reduce the complexity of the architecture
• Additional dimension to control rank
• Pluggable, easy to evaluate
• Improve performance through runtime subquery cancellation
20. Summary
It is technically possible to implement a concept of subquery into Solr/Lucene
• Single request / one pass collection of results
• Individual limits on each subquery
• Individual sort parameters on each subquery
• Optimization if no total hits number needed
– cancel lower prioritized subqueries during evaluation without affecting top hits
• Plugin