Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Enterprise Search Europe 2015: Fishing the big data streams - the future of search

3,042 views

Published on

Keynote presentation from Enterprise Search Europe 2015

Published in: Software
  • Be the first to comment

Enterprise Search Europe 2015: Fishing the big data streams - the future of search

  1. 1. Charlie Hull - Managing Director 20th October 2015 Enterprise Search Europe charlie@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch The Future of Search Fishing The Streams of Big Data
  2. 2. Building open source search applications since 2001 Independent, honest advice and analysis Expert design & development, Apache Solr committers Test-driven relevancy and performance tuning Custom training & mentoring for your staff Flexible support up to 24/7/365 with SLAs
  3. 3. Building open source search applications since 2001 Independent, honest advice and analysis Expert design & development, Apache Solr committers Test-driven relevancy and performance tuning Custom training & mentoring for your staff Flexible support up to 24/7/365 with SLAs Come and join the open source search community (tonight?)
  4. 4. Fishing the streams... 10 @FlaxSearch
  5. 5. Where are we now? Analytics based on search Trends & predictions Two new ideas But what's this all got to do with search? The future of search @FlaxSearch
  6. 6. What sort of enterprise projects do we see at Flax? – Migrations – Greenfield – Speculative Where are we now? @FlaxSearch
  7. 7. What sort of enterprise projects do we see at Flax? – Migrations – Greenfield – Speculative Unsurprisingly most are using open source Where are we now? @FlaxSearch
  8. 8. What sort of enterprise projects do we see at Flax? – Migrations – Greenfield – Speculative Unsurprisingly most are using open source ….unless they're Sharepoint Where are we now? @FlaxSearch
  9. 9. Two main stacks – Apache Lucene/Solr • Lucidworks (Fusion, SiLK) – Elasticsearch • Elastic (ELK) What's happening with open source? @FlaxSearch
  10. 10. Two main stacks – Apache Lucene/Solr • Lucidworks (Fusion, SiLK) – Elasticsearch • Elastic (ELK) Lots of other players (less for Elasticsearch) What's happening with open source? @FlaxSearch
  11. 11. Two main stacks – Apache Lucene/Solr • Lucidworks (Fusion, SiLK) – Elasticsearch • Elastic (ELK) Lots of other players (less for Elasticsearch) Focus on: – Log file analysis – Stability & scalability – Visualisation What's happening with open source? @FlaxSearch
  12. 12. Analytics based on search Log files, messages, transactions @FlaxSearch
  13. 13. Analytics based on search Log files, messages, transactions @FlaxSearch
  14. 14. Analytics based on search Log files, messages, transactions @FlaxSearch
  15. 15. Analytics based on search Log files, messages, transactions @FlaxSearch
  16. 16. Analytics based on search Log files, messages, transactions @FlaxSearch
  17. 17. Analytics & visualisations 11 @FlaxSearch
  18. 18. Analytics & visualisations 11 @FlaxSearch
  19. 19. Big Data Internet of Things Cloud Semantic Search Mobile / Wearables Trends @FlaxSearch
  20. 20. Big Data Internet of Things Cloud Semantic Search Mobile / Wearables Trends @FlaxSearch
  21. 21. Charlie's All Purpose Big Data Graph Data Time @FlaxSearch
  22. 22. Charlie's All Purpose Big Data Graph Data Time @FlaxSearch
  23. 23. “The IoT has the potential to connect 10X as many (28 billion) “things” to the Internet by 2020, ranging from bracelets to cars.” Goldman Sachs1 Scary scale @FlaxSearch
  24. 24. “The IoT has the potential to connect 10X as many (28 billion) “things” to the Internet by 2020, ranging from bracelets to cars.” Goldman Sachs1 “IoT threatens to generate massive amounts of input data from sources that are globally distributed. Transferring the entirety of that data to a single location for processing will not be technically and economically viable” Gartner2 Scary scale @FlaxSearch
  25. 25. “The IoT has the potential to connect 10X as many (28 billion) “things” to the Internet by 2020, ranging from bracelets to cars.” Goldman Sachs1 “IoT threatens to generate massive amounts of input data from sources that are globally distributed. Transferring the entirety of that data to a single location for processing will not be technically and economically viable” Gartner2 “Streaming analytics is anything but a sleepy, rearview mirror analysis of data. No, it is about knowing and acting on what’s happening in your business at this very moment — now.” Forrester3 Scary scale @FlaxSearch
  26. 26. “We’re going to go from information in computers, like your supercomputer in your pocket, to us being surrounded by information. … The next information age will be an era of unbounded malignant complexity. Because there’s a lot of stuff. We have to get ready for that.”13 Autodesk research fellow Mickey McManus Even more scary scale @FlaxSearch
  27. 27. It will become increasingly difficult, if not impossible, to store data for later processing Some predictions @FlaxSearch
  28. 28. It will become increasingly difficult, if not impossible, to store data for later processing It must therefore be processed as it appears, in real-time (Real Time Analytics) Some predictions @FlaxSearch
  29. 29. It will become increasingly difficult, if not impossible, to store data for later processing It must therefore be processed as it appears, in real-time (Real Time Analytics) Much of this data will be unstructured, noisy and badly formatted Some predictions @FlaxSearch
  30. 30. It will become increasingly difficult, if not impossible, to store data for later processing It must therefore be processed as it appears, in real-time (Real Time Analytics) Much of this data will be unstructured, noisy and badly formatted There are many exciting applications - if this is done right Some predictions @FlaxSearch
  31. 31. Unified Log New ideas 4 5 @FlaxSearch
  32. 32. Unified Log New ideas 4 5 “The Log: What every software engineer should know about real-time data's unifying abstraction” – Jay Kreps, LinkedIn (now Confluent)6 @FlaxSearch
  33. 33. Everything is a stream More new ideas “...think of a database as an always-growing collection of immutable facts. You can query it at some point in time — but that’s still old, imperative style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real-time”.- Martin Kleppman, LinkedIn 7 @FlaxSearch
  34. 34. Everything is a stream More new ideas 8 “...think of a database as an always-growing collection of immutable facts. You can query it at some point in time — but that’s still old, imperative style thinking. A more fruitful approach is to take the streams of facts as they come in, and functionally process them in real-time”.- Martin Kleppman, LinkedIn 7 @FlaxSearch
  35. 35. Streaming data platforms New technologies @FlaxSearch
  36. 36. Streaming data platforms New technologies @FlaxSearch
  37. 37. Streaming data platforms New technologies @FlaxSearch
  38. 38. Streaming data platforms New technologies @FlaxSearch
  39. 39. Hadoop MapReduce is for batch processing, not stream processing ...although Storm etc. can run on HDFS No elephant in the room? @FlaxSearch
  40. 40. How can you currently process data in a stream? – SQL like – Regular Expressions – Machine Learning So what's this got to do with Search? @FlaxSearch
  41. 41. How can you currently process data in a stream? – SQL like – Regular Expressions – Machine Learning Why not use full-text search? – Easier to create queries – Great with unstructured data – Handles noisy data So what's this got to do with Search? @FlaxSearch
  42. 42. How can you currently process data in a stream? – SQL like – Regular Expressions – Machine Learning Why not use full-text search? – Easier to create queries – Great with unstructured data – Handles noisy data But you can do search already! – But only near real time So what's this got to do with Search? @FlaxSearch
  43. 43. What's a stream? 1 2 3 4 5 6 7 8 9 10 11….. ….. append – “..an append-only, totally ordered sequence of records (also called events or messages).” Martin Kleppman, LinkedIn 9 @FlaxSearch
  44. 44. How do we search it? 1 2 3 4 5 6 7 8 9 10 11 12….. ….. append ? Result @FlaxSearch
  45. 45. How do we search it? 1 2 3 4 5 6 7 8 9 10 11 12….. ….. append ? Result Oops! @FlaxSearch
  46. 46. Search for Media Monitoring – Many complex stored profiles (searches) – High volume of news stories every day Here's something we're doing already @FlaxSearch
  47. 47. Search for Media Monitoring – Many complex stored profiles (searches) – High volume of news stories every day The solution: – Build an index of the stored profiles – Turn each news story into a query – Search the stored profiles to find which might match – Use these candidates to search the news story Here's something we're doing already @FlaxSearch
  48. 48. Search for Media Monitoring – Many complex stored profiles (searches) – High volume of news stories every day The solution: – Build an index of the stored profiles – Turn each news story into a query – Search the stored profiles to find which might match – Use these candidates to search the news story We call it 'search turned upside down' – But it's effectively searching a stream Here's something we're doing already @FlaxSearch
  49. 49. Like this... (((";!MOBILE PHONE*"; OR ";PHONE MAST*"; OR ";HANDSET*"; OR ";CELL* PHONE*"; OR ";3G"; OR ";GPRS"; OR ";G.P.R.S"; OR ";!GENERAL !RADIO PACKET SERVICE*"; OR ";GSM"; OR ";G.S.M"; OR ";!GLOBAL SYSTEM FOR !MOBILE COMM*"; OR ";HSDPA"; OR ";H.S.D.P.A"; OR ";HIGH SPEED DOWNLINK !PACKET ACCESS"; OR ";HSUPA"; OR ";H.S.U.P.A"; OR ";HIGH SPEED !UPLINK !PACKET ACCESS"; OR ";UMTS"; OR ";U.M.T.S"; OR ";MVNO"; OR ";M.V.N.O"; OR ";SMS"; OR ";SHORT MESSAGE !SERVICE*"; OR ";MMS"; OR ";!MULTIMEDIA MESSAGE !SERVICE*"; OR ";!MOBILES"; OR ";!CELLPHONE*"; OR ";! TELECOM*"; OR ";!LANDLINE*"; OR ";!TELEPHONE*"; OR ";PHONE*"; OR ";!TELEKOM*"; OR ";TELCO*"; OR ";VODAFONE"; OR ";T-MOBILE"; OR ";TMOBILE"; OR ";!TELEFONICA"; OR ";BT"; OR ";!MOBILE USER*"; OR ";TEXT MESSAG*"; OR ";SMARTPHONE*"; OR ";!VIRGIN !MEDIA*"; OR ";CABLE & !WIRELESS"; OR ";CABLE AND !WIRELESS";) W/48 ((";PROFIT*"; OR ";LOSS*"; OR ";BAN"; OR ";BANNED"; OR ";PREMIUM RATE*"; OR ";FINANC*"; OR ";!REFINANC*"; OR ";OFFICE OF FAIR TRADING"; OR ";MERGER*"; OR ";!ACQUISIT*"; OR ";ACQUIR*"; OR ";TAKEOVER*"; OR ";BUYOUT*"; OR ";BUY-OUT*"; OR ";NEW PRODUCT*"; OR ";INVEST*"; OR ";SHARES"; OR ";MARKET*"; OR ";ACCOUNT*"; OR ";MONEY"; OR ";CASH*"; OR ";SECURIT*"; OR ";!ENTERPRIS*"; OR ";!BUSINESS*"; OR ";PRICE*"; OR ";JOINT*"; OR ";NEW VENTURE*"; OR ";PRICING"; OR ";COST*"; OR ";CHAIRM?N"; OR ";APPOINT*"; OR ";!EXECUTIVE"; OR ";SALE*"; OR ";SELL*"; OR ";FULL YEAR"; OR ";REGULAT*"; OR ";!DIRECTIVE*"; OR ";LAW"; OR ";LAWS"; OR ";!LEGISLAT*"; OR ";GREEN PAPER"; OR ";WHITE PAPER*"; OR ";!MEDIAWATCH"; OR ";MORAL*"; OR ";ETHIC*"; OR ";ADVERT*"; OR ";AD"; OR ";ADS"; OR ";MARKETING"; OR ";!COMPLAIN*"; OR ";MIS-SOLD"; OR ";MIS-SELL*"; OR ";SPONSOR"; OR ";COSTCUT*"; OR ";COST CUT*"; OR ";CUT* COST*"; OR ";FIBRE OPTIC*"; OR ";TAX"; OR ";TAXES"; OR ";TAXED"; OR ";EXPAND*"; OR ";!EXPANSION"; OR ";EMPLOY*"; OR ";STAFF"; OR ";WORKER*"; OR ";SPOKESM?N"; OR ";DEBUT"; OR ";BRAND*"; OR ";DIRECTOR*";) OR ((";FAIR"; OR ";UNFAIR"; OR ";%UNSCRUPULOUS"; OR ";NOT FAIR"; OR ";UNJUST*"; OR ";!PENALISE*";) W/12 (";CHARG*"; OR ";TARIFF*"; OR ";PRICE PLAN*"; OR ";GLOBAL";)))) AND NOT (";EXPRESS OFFER"; OR ";TIMES OFFER"; OR ";READER OFFER"; OR ((";CALLS COST";) W/6 (";FROM A LANDLINE"; OR ";FROM LANDLINE*"; OR ";BT LANDLINE*";)))) …and that's an easy one! @FlaxSearch
  50. 50. Turning search upside down.. Docs Query QueryStored Queries 1. Pre 1 million profiles Some 250k long Complex rules Doc 1 million news items a day ” @FlaxSearch Query Subset ~200
  51. 51. Turning search upside down.. Docs Query QueryStored Queries 1. Pre Query Subset Result 1 million profiles Some 250k long Complex rules ~200 2. Search Doc 1 million news items a day “this news item is of interest to my client” @FlaxSearch
  52. 52. Turning search upside down.. Docs Query QueryStored Queries 1. Pre Query Subset Result 1 million profiles Some 250k long Complex rules ~200 2. Search Doc 1 million news items a day “this news item is of interest to my client” @FlaxSearch Stream Stream Stream
  53. 53. Flax solution (Luwak) scales to 1m queries over 1m items/day We need to be faster: – Network monitoring – up to 1m items/second – Restaurant reservations/reviews – 840m messages/day – IoT So we built a proof of concept... Searching streaming data at scale @FlaxSearch
  54. 54. Real-time scalable search for streams 1 2 3 4 5 6 7 ….. …..A B C D E F G ….. ….. Which queries match? N N N Y N….. ….. Documents Queries Parse In- memory 1-doc index Index of queries Matches https://github.com/romseygeek/samza-luwak @FlaxSearch
  55. 55. Build queries using a standard search box, facets etc. Set them to run on a stream of data - matches appear as another stream What could you do with this? @FlaxSearch
  56. 56. Build queries using a standard search box, facets etc. Set them to run on a stream of data - matches appear as another stream Use cases – Monitor a network – Watch social media – Check IoT for faults, errors, trends – Look for patterns in customer interactions – Distributed web search – Combine with analytics & visualisations What could you do with this? @FlaxSearch
  57. 57. Big Data....is about to get a lot bigger In conclusion.... @FlaxSearch
  58. 58. Big Data....is about to get a lot bigger We'll need to react to it in real time In conclusion.... @FlaxSearch
  59. 59. Big Data....is about to get a lot bigger We'll need to react to it in real time Search can help! In conclusion.... @FlaxSearch
  60. 60. Thankyou! Any questions? charlie@flax.co.uk www.flax.co.uk/blog +44 (0) 8700 118334 Twitter: @FlaxSearch
  61. 61. 1.Internet of Things - HorizonWatch 2015 Trend Report, Bill Chamberlin, IBM 2.Gartner Says the Internet of Things Will Transform the Data Center http://www.gartner.com/newsroom/id/2684915 3.Big Data Forrester Wave 2014 4.Unified Logging Layer: Turning Data into Action, Kiyoto Tamura, Fluentd 5.Unified Log Processing, Alexander Dean 6.The Log: What every software engineer should know about real-time data's unifying abstraction, Jay Kreps, LinkedIn/Confluent 7.Turning the database inside-out with Apache Samza, Martin Kleppman 8.Designing Data-Intensive Applications, Martin Kleppmann 9.Realtime Full-text Search with Luwak and Samza, Martin Kleppmann 10.Copyright Richard Masoner under a Creative Commons license 11. 12. Demonstrations by Mark Harwood of Elastic () 13. Trillions: Thriving in the Emerging Information Ecology (Wiley 2012), Mickey McManus References

×