Enterprise Search Summit Keynote: A Big Data Architecture for Search

This presentation was given by Search Technologies' CEO Kamran Khan at the November 2013 Enterprise Search Summit / KMWorld in Washington DC. He discussed how modern search engines are being combined with powerful independent content processing pipelines and distributed processing technologies from big data to form new and exciting enterprise search architectures, delivering results that in the past were available only to the biggest companies with the deepest pockets. For more information visit http://www.searchtechnologies.com/.

  • First image) OK, you would think the Wikipedia definition would be enough; however, it says "so much data it is awkward." That is not very helpful, and wasn't a large amount of data awkward 20 years ago, too? So a look through a Google image search for the term "big data" should help; a picture is worth a thousand words.
    Second image) OK, a classic Venn diagram with the terms Volume (a lot of data), Velocity (it comes fast), and Variety (different types of data). We have all heard these three terms with big data; however, what type of data, and what do you do with it, to make it big data?
    Third image) OK, this definition graphic combines the wiki definition with the classic three terms, but we are still not really sure what it is.
    Fourth image) Hold it, I thought there were three V's; now there are four?
    Fifth image) No, five!
    Sixth image) Now it looks like a parking zone for all kinds of big numbers and buzz terms.
    Seventh image) And social network icons.
    Eighth image) Maybe it is just a black hole of numbers and terms.
    Still often leaving us confused.
  • OK, so if one of the aspects of this revolution involves "Big Data," I am required to give some explanation. Modern big data came out of companies such as Google and Yahoo needing to process their expanding log files, which contained data on their users' behavior and on ever-increasing numbers of web pages. We are talking about the ability to deal with massive numbers of data objects, and the ability to analyze them, utilizing multiple servers to rapidly achieve a result.
  • Slides 4 and 5 should be combined to have a cool animation of the big single log file object breaking up into a multitude of jobs, and then those jobs combining to provide a result. What Google, Yahoo, and others developed, now called Hadoop, is the ability to take a large single processing job, break it into many small jobs all doing the same thing on only a part of the data set, and then have these jobs report their results to a function that reduces them to a desired result, such as a report or an analytic display. I don't know why we took so long to get to this; bees and ants have been doing it since there were bees and ants!
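  The split-then-combine pattern described above is the map/reduce idea at the heart of Hadoop. A minimal sketch, using hypothetical query-log chunks and plain Python in place of a real cluster (in Hadoop, each map call would run on a different machine):

```python
from collections import Counter
from functools import reduce

# Hypothetical query-log records, pre-split into the chunks that
# separate workers would process in a real Hadoop job.
log_chunks = [
    ["hadoop", "search", "hadoop"],   # chunk handled by worker 1
    ["elastic", "search"],            # chunk handled by worker 2
    ["hadoop", "cloud"],              # chunk handled by worker 3
]

def map_count(chunk):
    # Map: each worker independently counts the terms in its own chunk.
    return Counter(chunk)

def reduce_counts(a, b):
    # Reduce: merge partial counts from all workers into one result.
    return a + b

partials = [map_count(c) for c in log_chunks]   # runs in parallel on a cluster
totals = reduce(reduce_counts, partials, Counter())
print(totals["hadoop"], totals["search"])  # 3 2
```

  The same shape scales from three lists to billions of log lines: only the number of map workers changes, not the code.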
  • As is typical with the next generation of any industry, a set of enabling technologies comes together at a point in time. We believe the three central enabling technologies are:
    - Hadoop, for reliable and scalable job processing
    - Elastic or cloud computing, to provide Hadoop with cost-effective computing and storage infrastructure
    - Modern statistical analysis, based on working with large, complete data sets rather than the old-style analysis designed for minimal and incomplete data sets.
  • Here is a traditional search architecture. In a traditional enterprise search system, much of this iterative improvement work is done in the indexing pipeline. This is where metadata is prepared, awkward headers and footers are removed, and content is normalized to help the relevancy algorithm compare dissimilar documents.
    With the traditional architecture, any meaningful change to the index pipeline requires a full re-index. As data sets grow, this becomes increasingly onerous. A typical enterprise search indexing rate is 3 documents per second. Getting the data from the repositories is almost always the bottleneck; they are not set up for mass bulk exports. So even with a modest amount of content, say 10 million documents, it takes weeks to re-index.
    I'm sure most of you already know how long reindexing times impede agility and solution quality. You want to use new metadata to support additional features, you want cleaner and more consistent content for better precision, you want to add some entity extraction to increase relevance; however, you delay because of the pain and expense until some disaster forces you to do it. Commonly used workarounds, such as developing systems based on a small sample of the data, have their own severe limitations, but there is no time to go into those here.
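  The pipeline work described above (boilerplate removal, normalization, metadata) is conceptually simple per document; a minimal sketch of one such stage, with a hypothetical header/footer pattern standing in for real site-specific rules:

```python
import re

def strip_boilerplate(text):
    # Remove a hypothetical repeated header and footer before indexing.
    return re.sub(r"^CONFIDENTIAL.*\n|\nPage \d+ of \d+$", "", text)

def normalize(text):
    # Collapse whitespace and lowercase so the relevancy algorithm
    # can compare dissimilar documents on equal terms.
    return " ".join(text.lower().split())

def index_record(doc_id, text):
    # One stage of a traditional index pipeline: clean, normalize,
    # and package the document before handing off to the engine.
    return {"id": doc_id, "body": normalize(strip_boilerplate(text))}

record = index_record("doc-1", "CONFIDENTIAL draft\nHadoop  And Search\nPage 1 of 9")
print(record["body"])  # hadoop and search
```

  The pain the note describes is not this per-document logic; it is that every change to it forces the entire corpus back through the bottlenecked repositories.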
  • By including Hadoop, with its file system designed for massive amounts of content and its automated, reliable distributed processing, running either on your existing servers or via elastic computing in a cloud, and by utilizing a strong independent content processing framework that can also take advantage of Hadoop, we now have an architecture that supports modern analytics, advanced content processing, and shorter re-indexing times. We believe this architecture can reduce the reindexing of 10 million documents, which often takes several weeks, to a day or two. All of this combines to provide an environment where you can effectively have a continuous improvement cycle and iterative development.
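  The "weeks to a day or two" claim can be sanity-checked with the figures already in these notes; the 30-way split below is an illustrative assumption, not a measured number:

```python
docs = 10_000_000        # the "modest" corpus size from the notes
rate = 3                 # docs/second, the typical single-stream rate cited
single_days = docs / rate / 86_400
print(round(single_days, 1))    # about 38.6 days of continuous indexing

# With Hadoop fanning the work out across, say, 30 parallel processing
# streams (an illustrative assumption), the same job compresses sharply:
parallel_days = single_days / 30
print(round(parallel_days, 1))  # about 1.3 days
```

  In practice the speedup depends on how far the source repositories themselves can be parallelized, which is why caching the content (as in the new architecture) matters as much as the processing.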
  • If you now include Hadoop in the architecture, there are some dramatic potential effects:
    1) You now have an integrated platform that handles the traditional documents of search along with other data that has customarily sat outside search solutions but is gaining in importance, such as log files that capture user behaviour and supplemental data such as dictionaries, taxonomies, and wiki data.
    2) A high-performance, elastic platform that supports cloud computing and functionality that was not feasible before; I will talk about those on the next slide.
    3) A platform that can reduce integration, development, and management costs.
    4) A platform that can make your organization more agile and your search architecture more scalable.
    5) And the ability to perform reindexing at a much faster rate than was ever possible.
  • Let me use the top two examples on this list as an illustration. Let me be clear: we are not saying these things were not possible before; it just was not practical for most organizations, due to past hardware, software, and programming resource constraints.
    Search and Match: this is the ability to do a search across the index and then do a precise matching of other documents based on the concepts within those documents. We are currently developing this for one of the largest staffing and recruiting agencies in the world: finding CVs to match job requirements, and job requirements to match CVs.
    Forward and reverse citations: many business and technical documents have links and other forms of citations to other documents. Your enterprise may not currently be as link-rich as the internet; however, that is changing, and if you are in certain businesses, such as medical research, insurance, legal, and financial, the ability to utilize citations in your search solutions is critical. We have developed a solution for a major patent company, a leader in intellectual property (IP) management, that utilizes forward and reverse patent citations to accomplish its mission.
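  Reverse citations are a good example of something that becomes easy once all documents sit on one processing platform: they are just the forward-citation map inverted. A minimal sketch with hypothetical patent IDs:

```python
from collections import defaultdict

# Hypothetical forward citations: each patent lists the patents it cites.
forward = {
    "US-001": ["US-100", "US-200"],
    "US-002": ["US-100"],
    "US-003": ["US-200"],
}

# Invert the forward map to get reverse citations ("who cites me?"),
# the kind of derived field a big-data pipeline can compute across the
# whole corpus and write back into the search index.
reverse = defaultdict(list)
for patent, cited in forward.items():
    for target in cited:
        reverse[target].append(patent)

print(dict(reverse))
# {'US-100': ['US-001', 'US-002'], 'US-200': ['US-001', 'US-003']}
```

  At internet or patent-office scale the inversion is the same logic expressed as a map/reduce job; the loop above is merely the single-machine shape of it.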
  • [This may be too over the top. Kam, you need to say the conclusion you are comfortable with; this is just something to work from.] We are very excited about the future of enterprise search. There may have been a bit of a stall in our industry in the last few years; however, we see a revolution coming. The next generation of search:
    - Powered by a Hadoop-based architecture
    - Supported by elastic cloud computing
    - Extended by new and exciting analysis techniques
    This will increase functionality and agility while lowering cost. We will be able to do what we only dreamed about in the past. It may still not be Captain Kirk talking to a cognitive computer; however, we will effectively utilize the massive amount of content we have at our disposal and provide a new level of support and the experience our users crave.
  • So, I'm out of time. Thanks for your attention. If you have any questions, please come and find our booth in the KMWorld hall.

    1. A Big Data Architecture for Search. Kamran Khan, CEO. "The expert in the search space"
    2. Search Technologies Overview: Ascot, UK; Karlsruhe, DE; Cincinnati, OH; Herndon, VA; San Diego, CA; San Jose, CR.
       • The leading IT services company dedicated to Enterprise Search & Search-based Applications
       • Implementation, Consulting, Managed Services
       • 120 employees and growing
       • Independent, working with all of the leading software vendors and open source alternatives
    3. 500+ Customers
    4. What Is Big Data?
    5. Where Did Modern Big Data Come From? [diagram: web servers producing content]
    6. What Is Big Data? [diagram: many log files]
    7. What Is Big Data?
       Too big for a single machine • Physically impossible for a single machine
       Data Aggregation & Analysis • Simply transforming data records is not enough • Must aggregate / "boil down" the data
       Batch Processing • Very long running jobs (not real-time)
       Message: Lots of Data  "Big Data"
    8. Enabling Technologies. Big Data for Search: Hadoop; Elastic / Cloud Computing; Modern Statistical Analysis
    9. What Is Big Data? [diagram: many content blocks feeding into Hadoop]
    10. A Traditional Integrated Architecture: does a lot of what we need for Enterprise Search. Content sources (SharePoint, File System, RDBMS, Employee Directory, etc.) feed Aspire connectors and an index pipeline into the search engine and search index. Limitations:
       • Limited support for modern analytics
       • Limited support for content processing
       • Re-indexing takes too long
       • Limits ability to do continuous improvement cycle
    11. Why Content Processing Is Important. Content sources feed Aspire connectors, content processing, and the index pipeline into the search engine and search index.
       • Powerful & complete content processing service
       • Clean and consistent data and metadata
       • Ability to supplement metadata
       • Support for continuous improvement cycle
       • Develop and maintain processing IP
       • Ability to easily migrate to new search engines
    12. A New Enterprise Search Architecture. Content sources feed Aspire connectors, content processing & tokenization, a secure cache, and analytics into the index pipeline, search engine, and search index. Docs, log files, supplemental data.
       • Integrated platform (docs, log files, and external data)
       • Reduced cost
       • Better agility and scalability
       • Fast reindexing
       • Expanded functionality
    13. Advanced Features & Analytics Enabled: Search and Match; Forward and Reverse Citation; Latent Semantic Analysis; More Precise Term Weighting Beyond TF/IDF; Near Duplicate Detection; Document Topic Tagging; Results ranking including popularity; Recommendations based on user behavior; Suggested queries based on user behavior
    14. In Summary: a new Big Data architecture for search will revolutionize enterprise search, providing better analytics and other functionality, content processing, agility, economics, and scalability. Big Data architectures will significantly move search forward.
    15. For further information: www.searchtechnologies.com
