Research Opportunities in India & Keyword Search Over Dynamic Categorized Information

Published in: Education, Technology, Business

  1. 1. Software Industry in India and Keyword Search Over Dynamic Categorized Information Manish Bhide [email_address]
  2. 2. My Background <ul><li>BE in CSE, 2000 from VRCE (now VNIT) </li></ul><ul><li>MTech in CSE, 2002 from IITB </li></ul><ul><li>Working with IBM India Research Lab since 2002 </li></ul><ul><li>Started part-time PhD at IITB in 2005 </li></ul>
  3. 3. Types of Software Companies (type of work) <ul><li>Services Companies </li></ul><ul><li>Product development companies </li></ul><ul><li>Research and Development Companies </li></ul><ul><li>Many companies do work that falls in all the above three categories </li></ul>
  4. 4. Services Companies <ul><li>Services companies work for other companies using the “outsourcing model” </li></ul><ul><ul><li>E.g., a bank wants to focus on its core business, and not worry about the software needed to run it </li></ul></ul><ul><ul><li>The IT part is outsourced to a services company. </li></ul></ul><ul><li>Services companies can do the following types of work </li></ul><ul><ul><li>Support </li></ul></ul><ul><ul><li>Product development </li></ul></ul><ul><li>Support </li></ul><ul><ul><li>L1: First level of contact (BPO companies) </li></ul></ul><ul><ul><li>L2: If the problem is not solved by L1, it is escalated to L2 </li></ul></ul><ul><ul><li>L3: If the problem is not solved by L2, it is escalated to L3 </li></ul></ul>
  5. 5. Services Companies <ul><li>Support work can be categorized as </li></ul><ul><ul><li>Business process outsourcing </li></ul></ul><ul><ul><li>Application Support & Maintenance (L2, L3) </li></ul></ul><ul><ul><li>L3 work involves bug fixing </li></ul></ul><ul><li>Support work might not be that great! </li></ul><ul><li>Product development in services companies </li></ul><ul><ul><li>The product companies outsource the development of the products to services companies </li></ul></ul><ul><ul><li>Conceptualization and design done by the product company </li></ul></ul><ul><ul><li>Development and testing done by services company </li></ul></ul>
  6. 6. Product Development <ul><li>Involves development of products </li></ul><ul><ul><li>Also involves testing of products – not great  </li></ul></ul><ul><li>Development part more exciting than services work </li></ul><ul><li>Pays better than services work (L1, L2, L3) </li></ul><ul><li>Quality of people hired by product companies is better than those hired by services companies </li></ul><ul><ul><li>People from CS@IIT do not apply to services companies </li></ul></ul>
  7. 7. Research and Development <ul><li>R&D is a misused word </li></ul><ul><li>Some companies try to put product development into R&D </li></ul><ul><li>True meaning: Involves conceptualization of new product ideas or enhancement to existing product ideas </li></ul><ul><ul><li>The concept of the relational “database” originated in IBM Research </li></ul></ul><ul><ul><li>Job role can be thought of as originating ideas and developing new products </li></ul></ul><ul><li>People hired typically hold a PhD or a master's degree in Computer Science from the IITs or the best universities worldwide </li></ul><ul><li>Hires the best and pays the best amongst all types of software companies </li></ul>
  8. 8. How it all fits together <ul><li>Consider an example of a bank </li></ul><ul><ul><li>It wants to focus on its core business – banking </li></ul></ul><ul><ul><li>The IT part is outsourced to “Services Companies” </li></ul></ul><ul><ul><li>Software used by the bank will consist of a database, a web server, etc. </li></ul></ul><ul><ul><li>The services company will use these software products to build a “solution” for the bank – banking software </li></ul></ul><ul><ul><li>Someone needs to build products such as the database and web server for the services companies to use – This is done by the product companies </li></ul></ul><ul><ul><li>Someone needs to identify the need for a new product – This is where R&D companies play a role </li></ul></ul>
  9. 9. Take-away for you… <ul><li>As far as possible, try to find a job in product development companies </li></ul><ul><ul><li>Getting into R&D right after BTech is difficult </li></ul></ul><ul><li>If interested in doing quality work, try to do an MTech/MS. </li></ul><ul><li>If still interested in further studies, register for a PhD </li></ul><ul><li>Caveat Emptor: </li></ul><ul><ul><li>Not everyone will get a job in product development </li></ul></ul><ul><ul><li>This is not the right time to try out all the above </li></ul></ul>
  10. 10. How to find a Job in good companies <ul><li>First and foremost: You need to be good academically! </li></ul><ul><li>Enhance your coding skills </li></ul><ul><li>Try to participate in coding contests such as: </li></ul><ul><ul><li>Google India Code Jam </li></ul></ul><ul><ul><li>International Online Programming Contest (organized by the IITs) </li></ul></ul><ul><li>Try to participate in open source software development </li></ul><ul><li>Try to do a summer internship in product companies </li></ul><ul><ul><li>Try to contact VNIT alumni in these companies to improve your chances </li></ul></ul><ul><li>Realize your potential! </li></ul><ul><ul><li>From my personal experience I believe that the top 10% of the folks in CS@VNIT are on par with those in IIT </li></ul></ul><ul><ul><li>The rest are better than most of the guys from other engineering colleges in India </li></ul></ul><ul><ul><li>The faculty in VNIT is amongst the best in India </li></ul></ul>
  11. 11. Keyword Search Over Dynamic Categorized Information Joint work with: Venkatesan Chakravarthy, Krithi Ramamritham and Prasan Roy
  12. 12. Motivating Example <ul><li>The prime ministerial candidate of a political party “PP” wants to assess the reaction of different voter categories to the party's manifesto </li></ul><ul><li>Current Approach: </li></ul><ul><ul><li>Keyword Query: “PP manifesto” </li></ul></ul><ul><ul><li>Results consist of a large number of blog posts </li></ul></ul><ul><ul><li>Cannot form a consolidated opinion </li></ul></ul><ul><li>Desirable Result: Most relevant categories </li></ul><ul><ul><li>Blogs about education issues </li></ul></ul><ul><ul><li>Blogs about tax rebates </li></ul></ul>
  13. 13. Motivating Example (contd..) <ul><li>Alternate Approach: Use traditional search, group results into categories </li></ul><ul><li>Problems: </li></ul><ul><ul><li>Difficult to assign labels to generated clusters </li></ul></ul><ul><ul><li>Unpredictability of generated results </li></ul></ul><ul><li>Solution: Categorized search (Faceted search) over pre-defined categories </li></ul>
  14. 14. Problem Statement <ul><li>CS* (Categorized Search) system supports top-K keyword search over categories </li></ul><ul><li>Input: keyword query $Q(t_1, t_2, \ldots, t_l)$; output: top-K categories </li></ul><ul><li>Information repository: data items $d_1, d_2, d_3, \ldots$ (blog posts), where $A(d_i)$ = attributes in the user profile and $T(d_i)$ = text of the blog </li></ul><ul><li>Categories $C_1, C_2, \ldots, C_N$, each defined by a predicate $p_c$ (e.g., a text classifier for “blog posts about educational issues”) </li></ul><ul><li>Example: the query “PP Manifesto” returns </li></ul><ul><ul><li>Blogs about educational issues </li></ul></ul><ul><ul><li>Blogs about tax rebates </li></ul></ul>
  15. 15. Scoring Function <ul><li>We use a standard tf-idf-based scoring function to compute the relevance of a category to a keyword query </li></ul><ul><li>Term Frequency: </li></ul><ul><li>Inverse Document Frequency: </li></ul><ul><li>Score: </li></ul>
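The formulas on this slide did not survive extraction. As a rough illustration only, the sketch below applies one common textbook tf-idf variant with each category treated as a "document". The exact normalization used by CS* is not shown here, and the names `category_tf`, `idf`, and `score`, as well as the sample posts, are hypothetical.

```python
import math
from collections import Counter


def category_tf(posts):
    """Relative frequency of each term across all posts of one category."""
    counts = Counter(term for post in posts for term in post.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def idf(term, tf_by_category):
    """log(N / number of categories containing the term); 0 if none contain it."""
    containing = sum(1 for tf in tf_by_category.values() if term in tf)
    return math.log(len(tf_by_category) / containing) if containing else 0.0


def score(category, query, tf_by_category):
    """Assumed combination: sum of tf * idf over the query keywords."""
    tf = tf_by_category[category]
    return sum(tf.get(t, 0.0) * idf(t, tf_by_category) for t in query)


# Toy repository: category name -> texts of the posts assigned to it.
categories = {
    "education": ["pp manifesto promises new schools", "school fees and the manifesto"],
    "tax": ["tax rebates in the pp manifesto"],
    "sports": ["cricket season starts"],
}
tf_by_category = {c: category_tf(posts) for c, posts in categories.items()}
query = ["manifesto", "schools"]
ranked = sorted(categories, key=lambda c: score(c, query, tf_by_category), reverse=True)
print(ranked)
```

With these toy posts, the education category ranks first because it matches both keywords, and the category with no matching terms scores zero.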
  16. 16. Computing Top-K Categories <ul><li>Scoring Function: </li></ul><ul><li>Use stored meta-data to compute Score(c,Q) values </li></ul>[Figure: a keyword query $Q(t_1, t_2, \ldots, t_l)$ is answered with the top-K categories using meta-data maintained for the categories $C_1, \ldots, C_N$ (each defined by a predicate $p_c$) over the information repository of data items $d_1, \ldots, d_N$, each with $A(d_i)$ and $T(d_i)$.]
  17. 17. Naïve Approach: Update-all Strategy <ul><li>Refresh all the categories when a new data-item is added </li></ul><ul><ul><li>Evaluate $p_c$ of each category with respect to the data item </li></ul></ul><ul><ul><li>Update meta-data for those categories whose $p_c$ evaluates to true </li></ul></ul><ul><li>$p_c$ can be a text classifier or could involve expensive joins </li></ul><ul><ul><li>High-value customer: transactions of more than 10K in the last 15 days </li></ul></ul><ul><li>If one $p_c$ evaluation takes 25 milliseconds, evaluating 1000 categories for a single data item takes 25 seconds! </li></ul><ul><li>While one data item is being processed, more data-items could be added </li></ul><ul><ul><li>As per a 2006 estimate, 13 blog posts are created per second </li></ul></ul><ul><li>Meta-data will become stale, affecting quality of results </li></ul>Need for an intelligent selective update strategy!
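The arithmetic behind the staleness argument can be checked with a few lines. The sketch below uses only the numbers quoted on the slide and assumes the predicates are evaluated one after another on a single worker; the function name `update_all_backlog` is hypothetical.

```python
def update_all_backlog(eval_ms, num_categories, arrivals_per_sec):
    """Return (seconds to process one item, items that arrive in that time)."""
    # Update-all evaluates every category predicate for each new data item.
    seconds_per_item = eval_ms * num_categories / 1000.0
    return seconds_per_item, seconds_per_item * arrivals_per_sec


# 25 ms per predicate, 1000 categories, 13 new blog posts per second.
seconds_per_item, arrived = update_all_backlog(25, 1000, 13)
print(seconds_per_item, arrived)
```

Under these assumptions, 325 new posts arrive while a single post is being processed, so the backlog grows without bound.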
  18. 18. CS* Approach: Selective update of categories with selective data <ul><li>Identify a sub-set of categories (of size ImpCat ) that are deemed important </li></ul><ul><li>Identify a sub-set of data-items (of size ImpData ) that can provide maximum impact in terms of update to meta-data </li></ul><ul><li>Refresh important categories using the sub-set of data-items </li></ul><ul><li>CS* consists of two components: </li></ul><ul><ul><li>Meta-data refresher </li></ul></ul><ul><ul><li>Query Answering Module </li></ul></ul>
  19. 19. Overview <ul><li>Motivation, Problem Statement, Naïve Strategy </li></ul><ul><li>Statistics used by CS* </li></ul><ul><li>Meta-Data Refresher </li></ul><ul><li>Query Answering Module </li></ul><ul><li>Experimental Evaluation </li></ul><ul><li>Conclusions </li></ul>
  20. 20. Statistics Maintained by CS* <ul><li>Contiguous Refreshing: </li></ul><ul><ul><li>CS* refreshes a category in a contiguous manner </li></ul></ul><ul><ul><li>When the statistics of a category are refreshed using data item $d_i$, they are also refreshed using all the data items added before $d_i$ </li></ul></ul><ul><li>Last Refresh Time $rt(c)$: </li></ul><ul><ul><li>Largest time-step up to which the statistics of $c$ have been refreshed </li></ul></ul>[Figure: timeline of data items $d_1, \ldots, d_9$ arriving at time-steps $s_1, \ldots, s_9$; category $C_i$ (predicate $p_c$) has been refreshed through $d_6$, so $rt(C_i) = s_6$ and $tf_{s_6}(C_i, t)$ is available.]
  21. 21. Estimating approximate tf <ul><li>Need to find tf at the current time $s^*$: $tf_{s^*}(c,t)$ </li></ul><ul><ul><li>Use the principle of locality </li></ul></ul><ul><ul><li>Find the rate of change of term frequency $\Delta(c,t)$, an estimate of the change in term frequency per data item </li></ul></ul><ul><ul><li>$\Delta(c,t)$ is updated whenever $c$ is refreshed </li></ul></ul>[Figure: timeline of data items $d_1, \ldots, d_8, d^*$ at time-steps $s_1, \ldots, s_8, s^*$; category $C_i$ (predicate $p_c$) was last refreshed at $rt(C_i) = s_6$, so $tf_{s_6}(C_i, t)$ is available, and $s^*$ is the current time.] <ul><li>Estimated term frequency $tf^{est}_{s^*}$ calculated as </li></ul>
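The estimation formula itself is missing from the extracted slide. Given that $\Delta(c,t)$ is described as the change in term frequency per data item, a natural reading is linear extrapolation from the last refresh: $tf^{est}_{s^*}(c,t) = tf_{rt(c)}(c,t) + \Delta(c,t) \times (s^* - rt(c))$. The sketch below implements that reading. It is an assumption, not a confirmed CS* formula; in particular, the rule used to update $\Delta$ and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    """Statistics for one (category, term) pair; field names are hypothetical."""
    tf: float = 0.0     # term frequency as of the last refresh, tf_{rt(c)}(c, t)
    rt: int = 0         # last refresh time-step rt(c)
    delta: float = 0.0  # estimated change in tf per data item, Delta(c, t)

    def refresh(self, new_tf, new_rt):
        """Contiguous refresh: statistics now cover every data item up to new_rt."""
        if new_rt <= self.rt:
            raise ValueError("a refresh must move rt(c) forward")
        # Assumed update rule: average change per data item since the last refresh.
        self.delta = (new_tf - self.tf) / (new_rt - self.rt)
        self.tf, self.rt = new_tf, new_rt

    def estimate(self, now):
        """Linear extrapolation from rt(c) to the current time-step s*."""
        return self.tf + self.delta * (now - self.rt)


stats = TermStats()
stats.refresh(new_tf=12.0, new_rt=6)   # refreshed through d_6, so rt(c) = s_6
estimate_at_9 = stats.estimate(now=9)  # d_7..d_9 have not been processed yet
print(stats.delta, estimate_at_9)
```

Here the term count is treated as a raw frequency; a normalized frequency would extrapolate in the same way.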
  22. 22. Overview <ul><li>Motivation, Problem Statement, Naïve Strategy </li></ul><ul><li>Statistics used by CS* </li></ul><ul><li>Meta-Data Refresher </li></ul><ul><li>Query Answering Module </li></ul><ul><li>Experimental Evaluation </li></ul><ul><li>Conclusions </li></ul>
  23. 23. Determining Important Categories <ul><li>What categories will be important? </li></ul><ul><ul><li>Categories which will be useful for answering queries in the future </li></ul></ul><ul><li>What queries are likely to be asked in the future? </li></ul><ul><ul><li>Need to predict the queries </li></ul></ul><ul><li>What categories will be useful for answering those queries? </li></ul><ul><ul><li>Look at history and find the categories used for answering queries in the past </li></ul></ul><ul><li>How to compute the benefit of a set of data items? </li></ul><ul><ul><li>How many categories can be refreshed using the data items? </li></ul></ul><ul><ul><li>What is the importance of those categories? </li></ul></ul>Importance is a measure of the likelihood of the category being used to answer a query in the future
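One simple way to turn query history into an importance value is to count how often each category appeared in past answers. The sketch below does exactly that. The slide does not say how CS* computes importance, so this frequency estimate and the name `importance_from_history` are illustrative assumptions.

```python
from collections import Counter


def importance_from_history(past_results):
    """Fraction of past queries whose top-K answer included each category."""
    # set(result) ensures a category counts at most once per query.
    counts = Counter(c for result in past_results for c in set(result))
    n = len(past_results)
    return {c: k / n for c, k in counts.items()} if n else {}


# Each inner list is the top-K answer returned for one past query.
history = [["education", "tax"], ["tax", "health"], ["tax"], ["education"]]
importance = importance_from_history(history)
print(importance)
```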
  24. 24. Range Selection Problem <ul><li>Input: </li></ul><ul><ul><li>Sequence of categories $c_1, c_2, \ldots, c_N$ </li></ul></ul><ul><ul><li>Width ImpData </li></ul></ul><ul><li>Output: Set of data items such that </li></ul><ul><ul><ul><li>Total number of data items selected is at most ImpData </li></ul></ul></ul><ul><ul><ul><li>Total benefit is maximized </li></ul></ul></ul><ul><li>We use a dynamic programming algorithm to solve this problem </li></ul><ul><li>Details are in the paper </li></ul>
  25. 25. Overview <ul><li>Motivation, Problem Statement, Naïve Strategy </li></ul><ul><li>Statistics used by CS* </li></ul><ul><li>Meta-Data Refresher </li></ul><ul><li>Query Answering Module </li></ul><ul><li>Experimental Evaluation </li></ul><ul><li>Conclusions </li></ul>
  26. 26. Query Answering Module <ul><li>Given keyword query $Q = \{t_1, t_2, \ldots, t_l\}$, use $tf^{est}$ and $idf^{est}$ to find the top-K categories using the scoring function: </li></ul><ul><li>Naïve approach: Compute the score for all categories containing any one of the keywords in $Q$, and return the top-K categories </li></ul><ul><li>We use the threshold algorithm (TA) to do this efficiently </li></ul><ul><ul><li>TA solves the problem of finding the top-ranked objects amongst a set of objects using a scoring function consisting of multiple components </li></ul></ul><ul><ul><li>TA requires input objects to be sorted on each of the components </li></ul></ul><ul><li>In our setup, the score of a category is a combination of the tf.idf scores for each keyword $t_i$ </li></ul>
  27. 27. Query Answering Module <ul><li>Algorithm Overview </li></ul><ul><ul><li>Set up $l$ ordered lists, one for each keyword </li></ul></ul><ul><ul><li>List for keyword $t_i$ provides ordering of categories based on $tf^{est}_{s^*} \times idf^{est}_{s^*}$ for $t_i$ </li></ul></ul><ul><ul><li>Lists are merged using TA to get top-K categories </li></ul></ul>[Figure: $l$ lists of categories, one per keyword $t_i$, each sorted by $tf^{est}_{s^*}(\cdot, t_i) \times idf^{est}_{s^*}(t_i)$, are fed to TA, which outputs the categories ranked by $Score^{est}_{s^*}(\cdot, Q)$.]
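The merging step can be sketched with a generic threshold algorithm in the style of Fagin et al. The sketch assumes that the per-keyword scores are non-negative, that a category missing from a list scores 0 there, and that the total score is the plain sum of the per-keyword scores. How CS* actually combines the scores is not shown on the slide, and the function name and sample scores are hypothetical.

```python
import heapq


def threshold_algorithm(lists, k):
    """Top-k categories by summed score; `lists` has one {category: score} per keyword."""
    # Sorted access: every list in descending order of its per-keyword score.
    sorted_lists = [sorted(d.items(), key=lambda kv: kv[1], reverse=True) for d in lists]
    seen = {}  # category -> total score, filled in by random access
    max_depth = max((len(s) for s in sorted_lists), default=0)
    for depth in range(max_depth):
        threshold = 0.0
        for s in sorted_lists:
            if depth < len(s):
                category, partial = s[depth]
                threshold += partial
                if category not in seen:
                    # Random access: look the category up in every list.
                    seen[category] = sum(d.get(category, 0.0) for d in lists)
            # An exhausted list contributes 0: unseen categories score 0 there.
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen category can score above the threshold, so stop early.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])


# Hypothetical tf x idf scores for two keywords.
lists = [
    {"C3": 0.9, "C1": 0.8, "C9": 0.3, "C2": 0.1},
    {"C5": 0.7, "C2": 0.6, "C6": 0.5, "C1": 0.4},
]
top = threshold_algorithm(lists, k=2)
print([category for category, _ in top])
```

In this example the algorithm stops after reading three entries from each list, without scanning the last row, because the two best totals seen so far already meet the threshold.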
  28. 28. Query Answering Module <ul><li>Recall formula for $tf^{est}_{s^*}$: </li></ul><ul><li>Maintaining a sorted list as per $tf^{est}_{s^*}$ is not easy </li></ul><ul><ul><li>Dependent on </li></ul></ul><ul><ul><li>Function of time $s^*$: ordering changes with time </li></ul></ul><ul><ul><li>Problem solved by using another level of threshold algorithm </li></ul></ul>
  29. 29. Conclusion <ul><li>First to identify the problem of keyword search over categorized dynamic data </li></ul><ul><li>Developed the CS* system consisting of two components: </li></ul><ul><ul><li>Query Answering Module: Two-level threshold algorithm </li></ul></ul><ul><ul><li>Meta-data Refresher: Formulated an interval selection problem and proposed a dynamic programming solution </li></ul></ul><ul><li>Provides accuracy in excess of 90% using 57% fewer resources than the Update-All Strategy </li></ul>
  30. 30. Thank You & Questions!
