Presentation given at SemTech 2010 detailing the Calit2 Research Intelligence system for faculty expertise profiles and our experience with semantics in this space.



  1. The Research Intelligence Project. California Institute for Telecommunications and Information Technology (Calit2). Jerry Sheehan, Chief of Staff. June 25th, 2010. SemTech 2010
  2. The Research Intelligence Project Outline: Our Problem; Research Intel Tools; The Semantic Data Evolution; Future Directions; Concluding Thoughts
  3. My Bias: Prefer Found Elsewhere. Image Courtesy of Matt Jones, Creative Commons License, Flickr (blackbeltjones)
  4. Topic I: Our Problem
  5. Who Are We?
  6. What Do We Do?
  7. The Standard “Completed” Faculty Profile: Dr. H’s Dr H. drh@edu
  8. Different Way to Think About Our Problem. Image Courtesy of Scott Granneman, Creative Commons License, Flickr (rsgranne)
  9. Topic II: Tools We Developed
  10. How Research Universities Look At Their Business Data. Image Courtesy of HA! Designers, Creative Commons License, Flickr (artbyheather)
  11. How We Could Look At Our Data. Logo Design by Kyle Bowen,
  12. Research Intelligence Platform Development. Years: 2005 2006 2007 2008 2009 2010. Phases: Idea; Proof of Concept; Alpha/Beta for Calit2; Beta for Others; Production for Campus; New Domains. # of Users: 250 300 480 900 Faculty 71 Companies 460
  13. Research Intelligence Development in Web History Timeline
  14. 2005: Topic Modeling of Researchers. Initial Site Developed by David Newman with Direction from Padhraic Smyth, University of California, Irvine
  15. 2005: The Topic Modeling Proof of Concept
  16. Conceptual Challenges with 2005 Model: NLP Algorithm; Human Intervention; Discipline Bias
  17. The Folksonomy vs Taxonomy Debate: Felis Bengalensis (Bengal Cat). Taxonomy: •Kingdom: Animalia •Phylum: Chordata •Class: Mammalia •Order: Carnivora •Family: Felidae •Genus: Felis •Species: Bengalensis. Folksonomy: •Cat •Bengal Cat •F6 •Leopard •Hybrid •Nikita
  18. Manual Tagging Experiment • A three-person team examined one university-affiliated web page for each affiliated faculty member and associated a minimum of three keywords with each person. • No controlled vocabulary, but rather a narrative question to focus manual tagging: What type of research does this person primarily do? • Created SQL database of all UCSD-affiliated academic researchers.
  19. Unfiltered Tags: Automated Extraction. 1. ucsd (157); 2. email (117); 3. university of california san diego (112); 4. sdsc (55); 5. contact (50); 6. california san diego (47); 7. professor (44); 8. university of california (44); 9. computer science (36); 10. mail (36); 11. edu (34); 12. wireless (31); 13. telecommunications (31); 14. california institute (28); 15. photonics (27); 16. physics (26); 17. signal processing (23); 18. visualization (22); 19. computer engineering (22); 20. bioinformatics (21); 21. capsule bio (21); 22. nanotechnology (19); 23. uc san diego (19); 24. sensors (18); 25. scripps institution of oceanography (18); 26. information technology (17); 27. ucsd faculty (17); 28. structural engineering (16); 29. associate professor (16); 30. electrical engineering (16); 31. department of computer science (16); 32. cse (16); 33. responsphere (16); 34. computational biology (15); 35. adjunct professor (15); 36. algorithms (15); 37. nsf (14); 38. networking (14); 39. digital signal processing (14); 40. geophysics (14); 41. (14); 42. california institutes (14); 43. information technology staff (14); 44. cwc (13); 45. san diego supercomputer center (13); 46. biology (13); 47. cognitive science (13); 48. information theory (13); 49. optical networking (13); 50. mit (13)
  20. Filtered Tags: Automated Extraction. 1. wireless (31); 2. telecommunications (31); 3. photonics (27); 4. physics (26); 5. signal processing (23); 6. visualization (22); 7. computer engineering (22); 8. bioinformatics (21); 9. nanotechnology (19); 10. sensors (18); 11. information technology (17); 12. structural engineering (16); 13. electrical engineering (16); 14. responsphere (16); 15. computational biology (15); 16. algorithms (15); 17. nsf (14); 18. networking (14); 19. digital signal processing (14); 20. geophysics (14); 21. (14); 22. cwc (13); 23. san diego supercomputer center (13); 24. biology (13); 25. cognitive science (13); 26. information theory (13); 27. optical networking (13); 28. computer (13); 29. san diego supercomputer (13); 30. supercomputing (12); 31. communications (12); 32. embedded systems (12); 33. semiconductors (11); 34. networks (11); 35. biochemistry (11); 36. pharmacology (11); 37. systems biology (11); 38. chemistry (11); 39. neural networks (11); 40. computer vision (11); 41. http (11); 42. journal of geophysical research (11); 43. music (10); 44. integrated circuits (10); 45. vlsi (10); 46. information storage (10); 47. artificial intelligence (10); 48. engineering (10); 49. engineering university (10); 50. rescue (10)
  21. The Archimedes Project 2006
  22. Importance of Value Propositions. Image: Norman Rockwell for Tom Sawyer and Huck Finn, 1935
  23. What Researchers are Interested In. TreeMap, Federal Funding, May 2010, Data and Visualization Calit2
  24. Really Good Government
  25. Federal Funding Opportunities 2006
  26. Federal Funding Opportunities Production Workflow
  27. Research Intelligence 2007: Faculty and Funding Keywords (606 Grants, 5700 Tags)
  28. Research Intelligence 2007 Workflow
  29. Research Intelligence Campus 2009
  30. Campus RI: Integrated Researcher Metadata
  31. Research Intelligence: The 2009 Semantic Engine. Diagram figures: 900 Users; 5400 Keywords; 70,000 Documents. Diagram labels: Keywords; Relevancy; Topics; Tags; Semantics; Linked Data
  32. Community Research Intelligence: New Application Thrust 2010
  33. Topic III: Semantic Data Evolution
  34. Research Intelligence View of Semantic Data Evolution. Chart of Complexity over Time (2005 2008 2009 2010): Closed NLP Text Mining; Few Open APIs for NLP; Initial Open APIs Semantic Services; Initial Open Linked Data Repositories
  35. Research Intelligence: The Data, Grant Abstract. NSF Solicitation: Software Infrastructure for Sustained Innovation. Computation is accepted as the third pillar supporting innovation and discovery in science and engineering and is central to NSF's future vision of Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)[1]. Software is an integral part of the computation paradigm and a primary modality for realizing the CF21 vision. Scientific discovery and innovation are advancing fundamentally new pathways opened by development of increasingly sophisticated software. Software is also directly responsible for increased scientific productivity and significant enhancement of researchers' capabilities. In order to nurture, accelerate and sustain this critical mode of scientific progress, NSF is establishing a new program, Software Infrastructure for Sustained Innovation (SI2), with the overarching goal of transforming innovations in research and education into sustained software resources that are an integral part of the cyberinfrastructure. SI2 is a long-term investment focused on catalyzing new thinking, paradigms, and practices in using software to understand natural, human, and engineered systems. SI2's intent is to foster a pervasive cyberinfrastructure to help researchers address problems of unprecedented scale, complexity, resolution, and accuracy by integrating computation, data, networking and experiments in novel ways. It is NSF's expectation that SI2 investment will result in robust, reliable, usable and sustainable software infrastructure that is critical to the CF21 vision and will transform science and engineering. It is expected that SI2 will generate and nurture the multidisciplinary processes required to support the entire software lifecycle and will result in the development of sustainable software communities.
SI2 envisions vibrant partnerships among academia, government laboratories and industry for the development and stewardship of a sustainable software infrastructure that can enhance productivity and accelerate innovation in science and engineering. The goal of the SI2 program is to create a software ecosystem that includes all levels of the software stack and scales from individual or small groups of software innovators to large hubs of software excellence. The program addresses all aspects of CI, from embedded sensor systems and instruments, to desktops and high-end data and computing systems, to major instruments and facilities. The SI2 program envisions three classes of awards: 1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need, encapsulating innovation in science and engineering. The effort targeted by a SSE award is up to a level roughly comparable to: summer support for two investigators with complementary expertise; two graduate students; and their collective research needs (e.g. materials, supplies, travel) for three years. 2. Scientific Software Integration (SSI): SSI awards target larger groups of PIs organized around common research problems as well as common software infrastructure, and will result in a sustainable community software framework. The effort targeted by a SSI award is up to a level roughly comparable to: summer support for three to four investigators with complementary expertise; three to four graduate students; one or two senior personnel (including post-doctoral researchers, software developers, and staff); and their collective research needs (e.g., materials, supplies, travel) for three to five years. The integrative contributions of the SSI team should clearly be greater than the sum of the contributions of each individual member of the team. 3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term community-wide hubs of software excellence. These hubs will provide expertise, processes, resources and implementation mechanism to transform computational science and engineering innovations and community software into robust and sustained tools for enabling science and engineering. S2I2 proposals will bring together multidisciplinary teams of domains scientists and engineers, computer scientists and software engineers, technologists and educators.
The FY 2010 SI2 competition will be limited to SSE and SSI awards. The solicitation in FY 2011, and in subsequent years, will outline funding opportunities for all three classes of awards (SSE, SSI and S2I2), subject to availability of funds.[1]
  36. Keyword Extraction Across Sources. Columns: Term, Human, Yahoo, KEA, Calais, Alchemy, OAmplify. Terms: •Common Software Infrastructure •Community Software •Cyberinfrastructure •Embedded Sensor •Engineering •Hubs of Scientific Innovation •Innovation •NSF •Scientific •Scientific Discovery •Scientific Software •Scientific Software Integration •Scientific Software Innovation Institutes •SI2 •Software •Software Developers •Software Ecosystem •Software Elements •Software Engineers •Software Infrastructure •Software Innovators •Software Lifecycle •Software Stack •SSI •Sustainable Software •Sustained Tool •Vision. Bottom row: 12 3 9 15 20 10
  37. Semantic Structure Returned by Open Calais. Industry Terms: •Community Software •Software Lifecycle •Sustainable Software Communities •Usable and Sustainable Software Infrastructure •Software Infrastructure •Software Stack •Software Developers •Sustainable Community Software Framework •Sustained Software Resources •Software Ecosystem •Software Excellence •Embedded Sensor Systems •Software Elements •Sustainable Software Infrastructure. Social Tags: •Cyberinfrastructure •E-Science •Computing •Computer Software •Innovation •Software Engineer •Technology •Science •Technology_Internet. Organization: •National Science Foundation. URL: • 2010/nsf10015/nsf10015.jsp
  38. Semantic Structure Returned by Alchemy API. Tags: •pillar supporting innovation •overarching goal •primary modality •graduate students •program envisions •scientific discovery •researchers address problems •robust software elements •21st century science •collective research •scientific progress •scientific software elements •common research problems •common software infrastructure •scientific software innovation •scientific software integration •community software •complementary expertise •si2's intent •computation paradigm •small groups •cyberinfrastructure framework •software elements •entire software lifecycle •software excellence •envisions vibrant partnerships •software infrastructure •innovation computation •software innovators •innovations •software resources •long-term community-wide hubs •sophisticated software •nsf's expectation •sse award •nsf's future vision •ssi awards •pervasive cyberinfrastructure •ssi team •summer support •sustainable community software •sustainable software communities •sustainable software infrastructure. Company: •Scientific productivity •Scientific Software. Field Terminology: •Software •Software Stack •Software Developers •Software Ecosystems •Software Engineers. Organization: •NSF •SSI. Category: •Science and Technology
  39. Semantic Modeling Challenge Even with XML/DTD. HTTP://
  40. Grants.Gov Technical Support Doesn’t Like Data Questions. HTTP://
  41. Open Calais Faculty Linked Data Results. Columns: Tag, Type, Linked Data, Relevancy. •National Science Foundation (Organization, 52%) •Software Excellence (Industry Term, 34%) •Sustained Software Resources (Industry Term, 32%) •Usable and Sustainable Software Infrastructure (Industry Term, 31%) •Software Lifecycle (Industry Term, 30%) •Sustainable Software Communities (Industry Term, 29%) •Sustainable Software Infrastructure (Industry Term, 27%) •Software Stack (Industry Term, 24%) •Software Innovators (Industry Term, 21%). Linked Data URL shown: http://-1/61a1eb6d-196d-3493-ad6c-8ea0b85ce421.html
  42. Open Calais Linked Data Examples: National Science Foundation, Software Excellence
  43. Zemanta Linked Data Examples from Grant Abstract
  44. Linking to Freebase Via API from Grant Abstract
  45. Are Faculty Yet Data Objects? Depends on Their Popularity
  46. My Boss is 32 Triples
  47. Faculty Web Page
  48. Open Calais Faculty Linked Data Example. Columns: Tag, Type, Linked Data, Relevancy. •Lo Research Group (Company, 50%) •California Institute (Facility, 46%) •(tag not preserved) (EmailAddress, 31%) •California Institute for Telecommunications (Organization, 31%) •858-xxx-xxxx (PhoneNumber, 31%) •858-xxx-xxxx (PhoneNumber, 31%) •Information Technology (Technology, 31%) •optoelectronic devices (Industry Term, 29%) •International Business Machines (Company, 6%). Linked Data URL shown: http://-1/babf08c8-1f57-3b99-b020-7e0dd8eaf1fc.html
  49. Open Calais Linked Data Examples: Calit2
  50. Open Calais Linked Data Examples: IBM
  51. Zemanta Linked Data Results. Columns: Tag, Linked Data, Confidence. •Integrated Circuits (0.65): wikipedia: Integrated circuit •UC Berkeley (0.64): geolocation: University of California, Berkeley; homepage: University of California, Berkeley; wikipedia: University of California, Berkeley •Information Technology (0.63): wikipedia: Information technology •Calit2 (0.60): geolocation: California Institute for Telecommunications and Information Technology; wikipedia: California Institute for Telecommunications and Information Technology •Almaden Research Center (0.59): geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center •Age related Macular Degeneration (0.59): wikipedia: Macular degeneration •Minimally Invasive Surgery (0.58): wikipedia: Invasiveness of surgical procedures •Cancer (0.57) •Cornell (0.57): geolocation: Cornell University; homepage: Cornell University; wikipedia: Cornell University; youtube: Cornell University •Fluorescence Activated Cell Sorter (0.57): wikipedia: Flow cytometry
  52. Zemanta Linked Data Examples: Integrated Circuit, Calit2
  53. The Linked Data Cloud
  54. Linked Data and a Wikipedia Base. Wikipedia: How Accurate? Source: Jeremy Hsu, “Wikipedia: How Accurate is it?” November 2009, Live Science,
  55. Is It A Problem? John S., Is a Possible Assassin of, John K. SOURCE: USA Today, November 29, 2005
  56. Maybe Not? How Important is Validity to Researchers? SOURCE: PHARMANEWS.EU, January 23, 2009
  57. Topic IV: Future Directions?
  58. Life Sciences Example
  59. SciVal From Elsevier
  60. SciVal Terms And Conditions
  61. The Value of Your Data
  62. Beginning to Have Data Portability Policies...for Sites
  63. Future Direction for Semantic Academic Communities? HTTP://
  64. VIVO Ontology. HTTP://
  65. Emerging/Growing Semantic Catalog
  66. Example: DOE Awards Semantic Catalog
  67. Topic V: Conclusions
  68. Words of Wisdom, John Wooley: “One of the Most Important Things I Learned is What Not to Pay Attention To”
  69. Is the Semantic Web and Linked Data This? Image Courtesy of Alan Vernon, Creative Commons License, Flickr (alanvernon)
  70. Or Is the Semantic Web and Linked Data This? Image Courtesy of Vince Huang, Creative Commons License, Flickr (vincehuang)
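
Slides 19 and 20 show the automatically extracted tags before and after filtering, but the deck does not say how the filter worked. A minimal sketch of one plausible approach, a stop list of institutional and contact-page terms, is given here. The `STOP_TAGS` set and the `filter_tags` helper are hypothetical names chosen for illustration; the sample counts are copied from slide 19.

```python
# Hypothetical stop list built from institutional and contact-page terms that
# appear on slide 19 but not on slide 20; the deck does not publish the real filter.
STOP_TAGS = {
    "ucsd", "email", "university of california san diego", "sdsc", "contact",
    "california san diego", "professor", "university of california", "mail", "edu",
}


def filter_tags(tag_counts, stop_tags=STOP_TAGS):
    """Drop stop-listed tags; return (tag, count) pairs, most frequent first."""
    kept = [(tag, n) for tag, n in tag_counts.items() if tag.lower() not in stop_tags]
    # Sort by descending count, then alphabetically so ties are deterministic.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# A few counts copied from slide 19.
unfiltered = {
    "ucsd": 157, "email": 117, "professor": 44,
    "wireless": 31, "telecommunications": 31, "photonics": 27,
}
filtered = filter_tags(unfiltered)
print(filtered)
```

A fixed stop list is easy to audit but has to be maintained by hand, which may be why terms such as "http" and "nsf" still appear in the filtered list on slide 20.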
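
Slide 36 compares the keywords that different extraction services return for the same grant abstract. One simple way to quantify such a comparison is set overlap after case normalization; the sketch below computes a Jaccard score between two services. This is an illustrative measure, not one the deck states it used. The helper names `normalize` and `jaccard` are hypothetical, and the two term lists are subsets of the Open Calais industry terms (slide 37) and Alchemy API tags (slide 38).

```python
def normalize(terms):
    """Lower-case and strip terms so casing differences between services do not matter."""
    return {t.strip().lower() for t in terms}


def jaccard(a, b):
    """Jaccard overlap of two term collections after normalization."""
    a, b = normalize(a), normalize(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Subset of Open Calais industry terms from slide 37.
calais = ["Community Software", "Software Lifecycle", "Software Stack",
          "Software Excellence", "Software Elements", "Sustainable Software Infrastructure"]
# Subset of Alchemy API tags from slide 38.
alchemy = ["community software", "software elements", "software excellence",
           "software innovators", "sustainable software infrastructure",
           "entire software lifecycle"]

shared = sorted(normalize(calais) & normalize(alchemy))
score = jaccard(calais, alchemy)
print(shared, score)
```

Exact matching treats "Software Lifecycle" and "entire software lifecycle" as different terms, so a score like this understates agreement between services that phrase the same concept differently.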