The document discusses the future of search and analytics using streams of data from sources like the Internet of Things. It describes how search technologies can be used to process real-time streams of data by indexing the streams and querying them similar to how searches are currently done on stored data. Examples of searching streams are given, such as searching incoming news stories against stored search profiles to identify matches.
Turning search upside down with powerful open source search softwareCharlie Hull
Turning Search Upside Down - how Flax works with media monitoring companies to build powerful and scalable 'inverted search' systems, applying hundreds of thousands of stored queries to millions of documents in real time. Features Apache Lucene/Solr as a replacement for Autonomy IDOL and our Luwak library as a replacement for Autonomy Verity.
Finding the Bad Actor: Custom scoring & forensic name matching with Elastics...Charlie Hull
How we extended Lucene's SpanQuery and developed a new Elasticsearch query to allow Arachnys to search for names and adverse terms; also how we replicated the relevance scores used by a commercial service.
See some common myths, discover the various open source enterprise search packages available and see some case studies on how open source software has helped organisations build effective search.
Turning search upside down with powerful open source search softwareCharlie Hull
Turning Search Upside Down - how Flax works with media monitoring companies to build powerful and scalable 'inverted search' systems, applying hundreds of thousands of stored queries to millions of documents in real time. Features Apache Lucene/Solr as a replacement for Autonomy IDOL and our Luwak library as a replacement for Autonomy Verity.
Finding the Bad Actor: Custom scoring & forensic name matching with Elastics...Charlie Hull
How we extended Lucene's SpanQuery and developed a new Elasticsearch query to allow Arachnys to search for names and adverse terms; also how we replicated the relevance scores used by a commercial service.
See some common myths, discover the various open source enterprise search packages available and see some case studies on how open source software has helped organisations build effective search.
Given at Data Day Texas 2016.
Apache Spark has been hailed as a trail-blazing new tool for doing distributed data science. However, since it's so new, it can be difficult to set up and hard to use. In this talk, I'll discuss the journey I've had using Spark for data science at Bitly over the past year. I'll talk about the benefits of using Spark, the challenges I've had to overcome, the caveats for using a cutting-edge technology such as this, and my hopes for the Spark project as a whole.
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Amundsen: From discovering to security datamarkgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Un sujet d'apparence compliqué, la programmation réactive est au final relativement simple et peut s'adopter dans les projets de façon incrémentale. Dans cette présentation je reviendrai sur les bases du réactif, comment vous pouvez l'adopter et comment il peut changer votre façon de coder.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
Intro to H2O in Python - Data Science LASri Ambati
Erin LeDell's presentation on Intro to H2O Machine Learning in Python at Data Science LA meetup on 1.19.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Given at Data Day Texas 2016.
Apache Spark has been hailed as a trail-blazing new tool for doing distributed data science. However, since it's so new, it can be difficult to set up and hard to use. In this talk, I'll discuss the journey I've had using Spark for data science at Bitly over the past year. I'll talk about the benefits of using Spark, the challenges I've had to overcome, the caveats for using a cutting-edge technology such as this, and my hopes for the Spark project as a whole.
Data Science at Scale: Using Apache Spark for Data Science at BitlySarah Guido
Given at Data Day Seattle 2015.
Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.
Amundsen: From discovering to security datamarkgrover
Hear about how Lyft and Square are solving data discovery and data security challenges using a shared open source project - Amundsen.
Talk details and abstract:
https://www.datacouncil.ai/talks/amundsen-from-discovering-data-to-securing-data
Un sujet d'apparence compliqué, la programmation réactive est au final relativement simple et peut s'adopter dans les projets de façon incrémentale. Dans cette présentation je reviendrai sur les bases du réactif, comment vous pouvez l'adopter et comment il peut changer votre façon de coder.
Talk on Data Discovery and Metadata by Mark Grover from July 2019.
Goes into detail of the problem, build/buy/adopt analysis and Lyft's solution - Amundsen, along with thoughts on the future.
Intro to H2O in Python - Data Science LASri Ambati
Erin LeDell's presentation on Intro to H2O Machine Learning in Python at Data Science LA meetup on 1.19.16
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: More data beats algorithm improvement, scale trumps noise and sample size effects, can brute-force manual tasks.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
Splunk Announces Beta Version of Hunk: Splunk Analytics for Hadoop
New Software Product to Explore, Analyze and Visualize Data in Hadoop
HADOOP SUMMIT NORTH AMERICA 2013, SAN JOSE – June 26, 2013 - Splunk Inc. (NASDAQ: SPLK), the leading software platform for real-time operational intelligence, today announced the beta version of Hunk: Splunk® Analytics for Hadoop. Hunk (beta) is a new software product from Splunk that integrates exploration, analysis and visualization of data in Hadoop. Building upon Splunk’s years of experience with big data analytics technology deployed at thousands of customers, Hunk drives dramatic improvements in the speed and simplicity of interacting with and analyzing data in Hadoop without programming, costly integrations or forced data migrations. Watch the Hunk video to learn more.
Five Fabulous Sinks for Your Kafka Data. #3 will surprise you! (Rachel Pedres...confluent
Apache Kafka has come the modern central point for a fast and scalable streaming platform. Now, thanks to the open source explosion over the last decade, there are now numerous data stores available as sinks for Kafka-brokered data, from search to document stores, columnular DBs, time series DBs and more. While many claim they are the swiss army knife, in reality each is designed for specific types of data and analytics approaches. In this talk, we will cover the taxonomy of various data sinks, delve into each categories pros, cons and ideal use cases, so you can select the right ones and tie them together with Kafka into a well-considered architecture.
Apache spark empowering the real time data driven enterprise - StreamAnalytix...Impetus Technologies
Apache Spark is one of the most popular Big Data frameworks today. It is fast becoming the de facto technology choice for stream processing, real-time analytics, data science and machine learning applications at scale. It has moved well beyond the early-adopter phase, is supported by a vibrant open source community and is enjoying accelerated adoption in enterprises.
Join our guest speaker from Forrester Research, VP & Principal Analyst, Mike Gualtieri and StreamAnalytix, Product Head, Anand Venugopal for a discussion on the trends and directions defining the growing importance of Apache Spark for stream processing, machine learning and other advanced data analytics applications.
Join Cloudian, Hortonworks and 451 Research for a panel-style Q&A discussion about the latest trends and technology innovations in Big Data and Analytics. Matt Aslett, Data Platforms and Analytics Research Director at 451 Research, John Kreisa, Vice President of Strategic Marketing at Hortonworks, and Paul Turner, Chief Marketing Officer at Cloudian, will answer your toughest questions about data storage, data analytics, log data, sensor data and the Internet of Things. Bring your questions or just come and listen!
Big Data at the Speed of Business: Lessons Learned from Leading at the EdgeDataWorks Summit
How do you make big data accessible, usable and valuable for everyone? And mine your data for intelligence in minutes and hours, not weeks and months? What about getting real-time insights from your data, even before you persist and replicate it? In this talk, we’ll examine compelling, real-world examples that offer a blueprint for integrating big data technologies (Splunk, Hadoop, RDBMS, Cassandra, HBase), delivering rapid visibility and insights to IT professionals, data analysts and business users, and that accelerate the adoption of big data in the enterprise.
Research Automation for Data-Driven DiscoveryIan Foster
Talk presented at Workshop on Maximizing the Scientific Return of NASA Data. Makes the case that automation and outsourcing of data management tasks to cloud services is essential for effective data-driven discovery. Describes how the Globus research data management platform addresses this need.
Over 90% of today’s data has been generated in the last two years, and growth rates continue to climb. In this session, we’ll step through challenges and best practices with data capturing, how to derive meaningful insights to help predict the future, and common pitfalls in data analysis.
Come discover how integrated solutions involving Amazon S3, AWS Glue, Amazon Redshift, Amazon Athena, Amazon EMR, Amazon Kinesis, and Amazon Machine Learning/Deep Learning result in effective data systems for data scientists and business users, alike.
Big Data Hadoop ,Business Analytics& Data Warehousing Online Trainingsarala vanga
Big Data Hadoop Training & Certification online. Clear CCA175 & CCAH exams. 9 Real life Big Data projects. Led by industry experts.This online business analytics course introduces quantitative methods to analyze data and make better decisions.Our self-paced Data Warehousing training helps you master Data Warehousing Tools and concepts. The course also earns you a Data Warehousing ... Job Assistance. Enroll Now +1(210)503-7100 .For More Information : http://radiantits.com/.
What is Splunk? At the end of this session you’ll have a high-level understanding of the pieces that make up the Splunk Platform, how it works, and how it fits in the landscape of Big Data. You’ll see practical examples that differentiate Splunk while demonstrating how to gain quick time to value.
AI Pilot Review: The World’s First Virtual Assistant Marketing SuiteGoogle
AI Pilot Review: The World’s First Virtual Assistant Marketing Suite
👉👉 Click Here To Get More Info 👇👇
https://sumonreview.com/ai-pilot-review/
AI Pilot Review: Key Features
✅Deploy AI expert bots in Any Niche With Just A Click
✅With one keyword, generate complete funnels, websites, landing pages, and more.
✅More than 85 AI features are included in the AI pilot.
✅No setup or configuration; use your voice (like Siri) to do whatever you want.
✅You Can Use AI Pilot To Create your version of AI Pilot And Charge People For It…
✅ZERO Manual Work With AI Pilot. Never write, Design, Or Code Again.
✅ZERO Limits On Features Or Usages
✅Use Our AI-powered Traffic To Get Hundreds Of Customers
✅No Complicated Setup: Get Up And Running In 2 Minutes
✅99.99% Up-Time Guaranteed
✅30 Days Money-Back Guarantee
✅ZERO Upfront Cost
See My Other Reviews Article:
(1) TubeTrivia AI Review: https://sumonreview.com/tubetrivia-ai-review
(2) SocioWave Review: https://sumonreview.com/sociowave-review
(3) AI Partner & Profit Review: https://sumonreview.com/ai-partner-profit-review
(4) AI Ebook Suite Review: https://sumonreview.com/ai-ebook-suite-review
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk will chart the course of this pivotal journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
Developing Distributed High-performance Computing Capabilities of an Open Sci...Globus
COVID-19 had an unprecedented impact on scientific collaboration. The pandemic and its broad response from the scientific community has forged new relationships among public health practitioners, mathematical modelers, and scientific computing specialists, while revealing critical gaps in exploiting advanced computing systems to support urgent decision making. Informed by our team’s work in applying high-performance computing in support of public health decision makers during the COVID-19 pandemic, we present how Globus technologies are enabling the development of an open science platform for robust epidemic analysis, with the goal of collaborative, secure, distributed, on-demand, and fast time-to-solution analyses to support public health.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
We describe the deployment and use of Globus Compute for remote computation. This content is aimed at researchers who wish to compute on remote resources using a unified programming interface, as well as system administrators who will deploy and operate Globus Compute services on their research computing infrastructure.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
Enterprise Resource Planning System includes various modules that reduce any business's workload. Additionally, it organizes the workflows, which drives towards enhancing productivity. Here are a detailed explanation of the ERP modules. Going through the points will help you understand how the software is changing the work dynamics.
To know more details here: https://blogs.nyggs.com/nyggs/enterprise-resource-planning-erp-system-modules/
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Providing Globus Services to Users of JASMIN for Environmental Data AnalysisGlobus
JASMIN is the UK’s high-performance data analysis platform for environmental science, operated by STFC on behalf of the UK Natural Environment Research Council (NERC). In addition to its role in hosting the CEDA Archive (NERC’s long-term repository for climate, atmospheric science & Earth observation data in the UK), JASMIN provides a collaborative platform to a community of around 2,000 scientists in the UK and beyond, providing nearly 400 environmental science projects with working space, compute resources and tools to facilitate their work. High-performance data transfer into and out of JASMIN has always been a key feature, with many scientists bringing model outputs from supercomputers elsewhere in the UK, to analyse against observational or other model data in the CEDA Archive. A growing number of JASMIN users are now realising the benefits of using the Globus service to provide reliable and efficient data movement and other tasks in this and other contexts. Further use cases involve long-distance (intercontinental) transfers to and from JASMIN, and collecting results from a mobile atmospheric radar system, pushing data to JASMIN via a lightweight Globus deployment. We provide details of how Globus fits into our current infrastructure, our experience of the recent migration to GCSv5.4, and of our interest in developing use of the wider ecosystem of Globus services for the benefit of our user community.
Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...Globus
The U.S. Geological Survey (USGS) has made substantial investments in meeting evolving scientific, technical, and policy driven demands on storing, managing, and delivering data. As these demands continue to grow in complexity and scale, the USGS must continue to explore innovative solutions to improve its management, curation, sharing, delivering, and preservation approaches for large-scale research data. Supporting these needs, the USGS has partnered with the University of Chicago-Globus to research and develop advanced repository components and workflows leveraging its current investment in Globus. The primary outcome of this partnership includes the development of a prototype enterprise repository, driven by USGS Data Release requirements, through exploration and implementation of the entire suite of the Globus platform offerings, including Globus Flow, Globus Auth, Globus Transfer, and Globus Search. This presentation will provide insights into this research partnership, introduce the unique requirements and challenges being addressed and provide relevant project progress.
Understanding Globus Data Transfers with NetSageGlobus
NetSage is an open privacy-aware network measurement, analysis, and visualization service designed to help end-users visualize and reason about large data transfers. NetSage traditionally has used a combination of passive measurements, including SNMP and flow data, as well as active measurements, mainly perfSONAR, to provide longitudinal network performance data visualization. It has been deployed by dozens of networks world wide, and is supported domestically by the Engagement and Performance Operations Center (EPOC), NSF #2328479. We have recently expanded the NetSage data sources to include logs for Globus data transfers, following the same privacy-preserving approach as for Flow data. Using the logs for the Texas Advanced Computing Center (TACC) as an example, this talk will walk through several different example use cases that NetSage can answer, including: Who is using Globus to share data with my institution, and what kind of performance are they able to achieve? How many transfers has Globus supported for us? Which sites are we sharing the most data with, and how is that changing over time? How is my site using Globus to move data internally, and what kind of performance do we see for those transfers? What percentage of data transfers at my institution used Globus, and how did the overall data transfer performance compare to the Globus users?
Top Features to Include in Your Winzo Clone App for Business Growth (4).pptxrickgrimesss22
Discover the essential features to incorporate in your Winzo clone app to boost business growth, enhance user engagement, and drive revenue. Learn how to create a compelling gaming experience that stands out in the competitive market.
May Marketo Masterclass, London MUG May 22 2024.pdfAdele Miller
Can't make Adobe Summit in Vegas? No sweat because the EMEA Marketo Engage Champions are coming to London to share their Summit sessions, insights and more!
This is a MUG with a twist you don't want to miss.
Navigating the Metaverse: A Journey into Virtual Evolution"Donna Lenk
Join us for an exploration of the Metaverse's evolution, where innovation meets imagination. Discover new dimensions of virtual events, engage with thought-provoking discussions, and witness the transformative power of digital realms."
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I ...Juraj Vysvader
In 2015, I used to write extensions for Joomla, WordPress, phpBB3, etc and I didn't get rich from it but it did have 63K downloads (powered possible tens of thousands of websites).
First Steps with Globus Compute Multi-User EndpointsGlobus
In this presentation we will share our experiences around getting started with the Globus Compute multi-user endpoint. Working with the Pharmacology group at the University of Auckland, we have previously written an application using Globus Compute that can offload computationally expensive steps in the researcher's workflows, which they wish to manage from their familiar Windows environments, onto the NeSI (New Zealand eScience Infrastructure) cluster. Some of the challenges we have encountered were that each researcher had to set up and manage their own single-user globus compute endpoint and that the workloads had varying resource requirements (CPUs, memory and wall time) between different runs. We hope that the multi-user endpoint will help to address these challenges and share an update on our progress here.
How Recreation Management Software Can Streamline Your Operations.pptxwottaspaceseo
Recreation management software streamlines operations by automating key tasks such as scheduling, registration, and payment processing, reducing manual workload and errors. It provides centralized management of facilities, classes, and events, ensuring efficient resource allocation and facility usage. The software offers user-friendly online portals for easy access to bookings and program information, enhancing customer experience. Real-time reporting and data analytics deliver insights into attendance and preferences, aiding in strategic decision-making. Additionally, effective communication tools keep participants and staff informed with timely updates. Overall, recreation management software enhances efficiency, improves service delivery, and boosts customer satisfaction.
Enterprise Search Europe 2015: Fishing the big data streams - the future of search
1. Charlie Hull - Managing Director
20th
October 2015
Enterprise Search Europe
charlie@flax.co.uk
www.flax.co.uk/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
The Future of Search
Fishing The Streams of Big Data
2. Building open source search applications since 2001
Independent, honest advice and analysis
Expert design & development, Apache Solr committers
Test-driven relevancy and performance tuning
Custom training & mentoring for your staff
Flexible support up to 24/7/365 with SLAs
3. Building open source search applications since 2001
Independent, honest advice and analysis
Expert design & development, Apache Solr committers
Test-driven relevancy and performance tuning
Custom training & mentoring for your staff
Flexible support up to 24/7/365 with SLAs
Come and join the open source search community (tonight?)
6. Where are we now?
Analytics based on search
Trends & predictions
Two new ideas
But what's this all got to do with search?
The future of search
@FlaxSearch
7. What sort of enterprise projects do we see at Flax?
– Migrations
– Greenfield
– Speculative
Where are we now?
@FlaxSearch
8. What sort of enterprise projects do we see at Flax?
– Migrations
– Greenfield
– Speculative
Unsurprisingly most are using open source
Where are we now?
@FlaxSearch
9. What sort of enterprise projects do we see at Flax?
– Migrations
– Greenfield
– Speculative
Unsurprisingly most are using open source
….unless they're Sharepoint
Where are we now?
@FlaxSearch
10. Two main stacks
– Apache Lucene/Solr
• Lucidworks (Fusion, SiLK)
– Elasticsearch
• Elastic (ELK)
What's happening with open source?
@FlaxSearch
11. Two main stacks
– Apache Lucene/Solr
• Lucidworks (Fusion, SiLK)
– Elasticsearch
• Elastic (ELK)
Lots of other players (less for Elasticsearch)
What's happening with open source?
@FlaxSearch
12. Two main stacks
– Apache Lucene/Solr
• Lucidworks (Fusion, SiLK)
– Elasticsearch
• Elastic (ELK)
Lots of other players (less for Elasticsearch)
Focus on:
– Log file analysis
– Stability & scalability
– Visualisation
What's happening with open source?
@FlaxSearch
24. “The IoT has the potential to connect 10X as many (28 billion)
“things” to the Internet by 2020, ranging from bracelets to cars.”
Goldman Sachs1
Scary scale
@FlaxSearch
25. “The IoT has the potential to connect 10X as many (28 billion)
“things” to the Internet by 2020, ranging from bracelets to cars.”
Goldman Sachs1
“IoT threatens to generate massive amounts of input data from
sources that are globally distributed. Transferring the entirety of
that data to a single location for processing will not be
technically and economically viable” Gartner2
Scary scale
@FlaxSearch
26. “The IoT has the potential to connect 10X as many (28 billion)
“things” to the Internet by 2020, ranging from bracelets to cars.”
Goldman Sachs1
“IoT threatens to generate massive amounts of input data from
sources that are globally distributed. Transferring the entirety of
that data to a single location for processing will not be
technically and economically viable” Gartner2
“Streaming analytics is anything but a sleepy, rearview mirror
analysis of data. No, it is about knowing and acting on what’s
happening in your business at this very moment — now.”
Forrester3
Scary scale
@FlaxSearch
27. “We’re going to go from information in computers, like your
supercomputer in your pocket, to us being surrounded by
information. … The next information age will be an era of
unbounded malignant complexity. Because there’s a lot of
stuff. We have to get ready for that.”13
Autodesk research fellow Mickey McManus
Even more scary scale
@FlaxSearch
28. It will become increasingly difficult, if not impossible, to store
data for later processing
Some predictions
@FlaxSearch
29. It will become increasingly difficult, if not impossible, to store
data for later processing
It must therefore be processed as it appears, in real-time
(Real Time Analytics)
Some predictions
@FlaxSearch
30. It will become increasingly difficult, if not impossible, to store
data for later processing
It must therefore be processed as it appears, in real-time
(Real Time Analytics)
Much of this data will be unstructured, noisy and badly
formatted
Some predictions
@FlaxSearch
31. It will become increasingly difficult, if not impossible, to store
data for later processing
It must therefore be processed as it appears, in real-time
(Real Time Analytics)
Much of this data will be unstructured, noisy and badly
formatted
There are many exciting applications - if this is done right
Some predictions
@FlaxSearch
33. Unified Log
New ideas
4
5
“The Log: What every software engineer
should know about real-time data's unifying
abstraction” – Jay Kreps, LinkedIn (now
Confluent)6
@FlaxSearch
34. Everything is a stream
More new ideas
“...think of a database as an always-growing collection of
immutable facts. You can query it at some point in time —
but that’s still old, imperative style thinking. A more fruitful
approach is to take the streams of facts as they come in, and
functionally process them in real-time”.- Martin Kleppman,
LinkedIn 7
@FlaxSearch
35. Everything is a stream
More new ideas
8
“...think of a database as an always-growing collection of
immutable facts. You can query it at some point in time —
but that’s still old, imperative style thinking. A more fruitful
approach is to take the streams of facts as they come in, and
functionally process them in real-time”.- Martin Kleppman,
LinkedIn 7
@FlaxSearch
40. Hadoop MapReduce is for batch processing, not stream
processing
...although Storm etc. can run on HDFS
No elephant in the room?
@FlaxSearch
41. How can you currently process data in a stream?
– SQL like
– Regular Expressions
– Machine Learning
So what's this got to do
with Search?
@FlaxSearch
42. How can you currently process data in a stream?
– SQL like
– Regular Expressions
– Machine Learning
Why not use full-text search?
– Easier to create queries
– Great with unstructured data
– Handles noisy data
So what's this got to do
with Search?
@FlaxSearch
43. How can you currently process data in a stream?
– SQL like
– Regular Expressions
– Machine Learning
Why not use full-text search?
– Easier to create queries
– Great with unstructured data
– Handles noisy data
But you can do search already!
– But only near real time
So what's this got to do
with Search?
@FlaxSearch
44. What's a stream?
1 2 3 4 5 6 7 8 9 10 11….. …..
append
– “..an append-only, totally ordered sequence of
records (also called events or messages).”
Martin Kleppman, LinkedIn 9
@FlaxSearch
45. How do we search it?
1 2 3 4 5 6 7 8 9 10 11 12….. …..
append
?
Result
@FlaxSearch
46. How do we search it?
1 2 3 4 5 6 7 8 9 10 11 12….. …..
append
?
Result
Oops!
@FlaxSearch
47. Search for Media Monitoring
– Many complex stored profiles (searches)
– High volume of news stories every day
Here's something we're
doing already
@FlaxSearch
48. Search for Media Monitoring
– Many complex stored profiles (searches)
– High volume of news stories every day
The solution:
– Build an index of the stored profiles
– Turn each news story into a query
– Search the stored profiles to find which might match
– Use these candidates to search the news story
Here's something we're
doing already
@FlaxSearch
49. Search for Media Monitoring
– Many complex stored profiles (searches)
– High volume of news stories every day
The solution:
– Build an index of the stored profiles
– Turn each news story into a query
– Search the stored profiles to find which might match
– Use these candidates to search the news story
We call it 'search turned upside down'
– But it's effectively searching a stream
Here's something we're
doing already
@FlaxSearch
50. Like this...
(((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S";
OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR
";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH
SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT
MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";!
TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR
";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR
";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !WIRELESS";) W/48
((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR
";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR
";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR
";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR
";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL
YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE
PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR
";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MIS-SELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*";
OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR
";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR
";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*";
OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR
((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";))))
…and that's an easy one!
@FlaxSearch
51. Turning search upside down..
Docs
Query
QueryStored
Queries 1.
Pre
1 million profiles
Some 250k long
Complex rules
Doc 1 million news
items a day
”
@FlaxSearch
Query
Subset
~200
52. Turning search upside down..
Docs
Query
QueryStored
Queries 1.
Pre
Query
Subset
Result
1 million profiles
Some 250k long
Complex rules
~200
2.
Search
Doc 1 million news
items a day
“this news item is of
interest to my client”
@FlaxSearch
53. Turning search upside down..
Docs
Query
QueryStored
Queries 1.
Pre
Query
Subset
Result
1 million profiles
Some 250k long
Complex rules
~200
2.
Search
Doc 1 million news
items a day
“this news item is of
interest to my client”
@FlaxSearch
Stream
Stream
Stream
54. Flax solution (Luwak) scales to 1m queries over 1m items/day
We need to be faster:
– Network monitoring – up to 1m items/second
– Restaurant reservations/reviews – 840m messages/day
– IoT
So we built a proof of concept...
Searching streaming data at scale
@FlaxSearch
55. Real-time scalable search for streams
1 2 3 4 5 6 7
…..
…..A B C D E F G
….. …..
Which queries match?
N N N Y N….. …..
Documents
Queries
Parse
In-
memory
1-doc
index
Index of
queries
Matches
https://github.com/romseygeek/samza-luwak
@FlaxSearch
56. Build queries using a standard search box, facets etc.
Set them to run on a stream of data - matches appear as
another stream
What could you do with this?
@FlaxSearch
57. Build queries using a standard search box, facets etc.
Set them to run on a stream of data - matches appear as
another stream
Use cases
– Monitor a network
– Watch social media
– Check IoT for faults, errors, trends
– Look for patterns in customer interactions
– Distributed web search
– Combine with analytics & visualisations
What could you do with this?
@FlaxSearch
62. 1.Internet of Things - HorizonWatch 2015 Trend Report, Bill Chamberlin, IBM
2.Gartner Says the Internet of Things Will Transform the Data Center
http://www.gartner.com/newsroom/id/2684915
3.Big Data Forrester Wave 2014
4.Unified Logging Layer: Turning Data into Action, Kiyoto Tamura, Fluentd
5.Unified Log Processing, Alexander Dean
6.The Log: What every software engineer should know about real-time data's unifying
abstraction, Jay Kreps, LinkedIn/Confluent
7.Turning the database inside-out with Apache Samza, Martin Kleppman
8.Designing Data-Intensive Applications, Martin Kleppmann
9.Realtime Full-text Search with Luwak and Samza, Martin Kleppmann
10.Copyright Richard Masoner under a Creative Commons license
11.
12. Demonstrations by Mark Harwood of Elastic ()
13. Trillions: Thriving in the Emerging Information Ecology (Wiley 2012), Mickey McManus
References