SemTech2010

Presentation made at SemTech2010 detailing the Calit2 Research Intelligence system for faculty expertise profile and our experience with semantics in this space.

Transcript

  • 1. The Research Intelligence Project California Institute for Telecommunications and Information Technology (Calit2) Jerry Sheehan, Chief of Staff June 25th, 2010 SemTech 2010
  • 2. The Research Intelligence Project. Outline: Our Problem; The Research Intel Tools; Semantic Data Evolution; Future Directions; Concluding Thoughts
  • 3. My Bias: Prefer Found Elsewhere. Image Courtesy of Matt Jones, Creative Commons License, Flickr (blackbeltjones)
  • 4. Topic I: Our Problem
  • 5. Who Are We?
  • 6. What Do We Do?
  • 7. The Standard “Completed” Faculty Profile Dr. H’s Dr H. drh@edu
  • 8. Different Way to Think About Our Problem. Image Courtesy of Scott Granneman, Creative Commons License, Flickr (rsgranne)
  • 9. Topic II: Tools We Developed
  • 10. How Research Universities Look At Their Business Data. Image Courtesy of HA! Designers, Creative Commons License, Flickr (artbyheather)
  • 11. How We Could Look At Our Data. Logo Design by Kyle Bowen, http://www.educause.edu/Community/MemDir/Profiles/KyleBowen/58744
  • 12. Research Intelligence Platform Development. Years: 2005, 2006, 2007, 2008, 2009, 2010. Phases: Idea; Proof of Concept; Alpha/Beta for Calit2; Beta for Others; Production for Campus; New Domains. # of Users: 250 300 480 900 Faculty 71 Companies 460
  • 13. Research Intelligence Development in Web History Timeline
  • 14. 2005: Topic Modeling of Researchers. Initial Site Developed by David Newman with Direction from Padhraic Smyth, University of California, Irvine
  • 15. 2005: The Topic Modeling Proof of Concept. http://datalab-1.ics.uci.edu/calit2/
  • 16. Conceptual Challenges with 2005 Model: NLP Algorithm; Human Intervention; Discipline Bias
  • 17. The Folksonomy vs Taxonomy Debate: Felis Bengalensis (Bengal Cat)
    Taxonomy: Kingdom: Animalia; Phylum: Chordata; Class: Mammalia; Order: Carnivora; Family: Felidae; Genus: Felis; Species: Bengalensis
    Folksonomy: Cat; Bengal Cat; F6; Leopard; Hybrid; Nikita
  • 18. Manual Tagging Experiment
    • Three-person team examined one university-affiliated web page for affiliated faculty and associated a minimum of three keywords with each person.
    • No controlled vocabulary but rather a narrative question to focus manual tagging.
    • What type of research does this person primarily do?
    • Created SQL Database of all UCSD-affiliated academic researchers.
  • 19. Unfiltered Tags: Automated Extraction
    1. ucsd (157); 2. email (117); 3. university of california san diego (112); 4. sdsc (55); 5. contact (50); 6. california san diego (47); 7. professor (44); 8. university of california (44); 9. computer science (36); 10. mail (36);
    11. edu (34); 12. wireless (31); 13. telecommunications (31); 14. california institute (28); 15. photonics (27); 16. physics (26); 17. signal processing (23); 18. visualization (22); 19. computer engineering (22); 20. bioinformatics (21);
    21. capsule bio (21); 22. nanotechnology (19); 23. uc san diego (19); 24. sensors (18); 25. scripps institution of oceanography (18); 26. information technology (17); 27. ucsd faculty (17); 28. structural engineering (16); 29. associate professor (16); 30. electrical engineering (16);
    31. department of computer science (16); 32. cse (16); 33. responsphere (16); 34. computational biology (15); 35. adjunct professor (15); 36. algorithms (15); 37. nsf (14); 38. networking (14); 39. digital signal processing (14); 40. geophysics (14);
    41. (14); 42. california institutes (14); 43. information technology staff (14); 44. cwc (13); 45. san diego supercomputer center (13); 46. biology (13); 47. cognitive science (13); 48. information theory (13); 49. optical networking (13); 50. mit (13)
  • 20. Filtered Tags: Automated Extraction
    1. wireless (31); 2. telecommunications (31); 3. photonics (27); 4. physics (26); 5. signal processing (23); 6. visualization (22); 7. computer engineering (22); 8. bioinformatics (21); 9. nanotechnology (19); 10. sensors (18);
    11. information technology (17); 12. structural engineering (16); 13. electrical engineering (16); 14. responsphere (16); 15. computational biology (15); 16. algorithms (15); 17. nsf (14); 18. networking (14); 19. digital signal processing (14); 20. geophysics (14);
    21. (14); 22. cwc (13); 23. san diego supercomputer center (13); 24. biology (13); 25. cognitive science (13); 26. information theory (13); 27. optical networking (13); 28. computer (13); 29. san diego supercomputer (13); 30. supercomputing (12);
    31. communications (12); 32. embedded systems (12); 33. semiconductors (11); 34. networks (11); 35. biochemistry (11); 36. pharmacology (11); 37. systems biology (11); 38. chemistry (11); 39. neural networks (11); 40. computer vision (11);
    41. http (11); 42. journal of geophysical research (11); 43. music (10); 44. integrated circuits (10); 45. vlsi (10); 46. information storage (10); 47. artificial intelligence (10); 48. engineering (10); 49. engineering university (10); 50. rescue (10)
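The step from slide 19 to slide 20 can be sketched as a stop-list filter. The slides do not say how the filtering was actually done, so this is only a minimal Python sketch under that assumption. `STOP_TAGS` is a hypothetical list inferred from tags that appear on slide 19 but not on slide 20, and the sample counts are the first sixteen entries of slide 19.

```python
# Hypothetical stop list: institutional names, titles, and contact-page
# debris that appear on slide 19 but are absent from slide 20.
STOP_TAGS = {
    "ucsd", "email", "university of california san diego", "sdsc",
    "contact", "california san diego", "professor",
    "university of california", "computer science", "mail", "edu",
    "california institute",
}


def filter_tags(tag_counts, stop_tags=STOP_TAGS):
    """Drop stop-listed tags and return the rest, highest count first."""
    kept = [(tag, n) for tag, n in tag_counts
            if tag.strip().lower() not in stop_tags]
    # sorted() is stable, so tags with equal counts keep their input order.
    return sorted(kept, key=lambda pair: -pair[1])


# First sixteen entries of the unfiltered list on slide 19.
unfiltered = [
    ("ucsd", 157), ("email", 117),
    ("university of california san diego", 112), ("sdsc", 55),
    ("contact", 50), ("california san diego", 47), ("professor", 44),
    ("university of california", 44), ("computer science", 36),
    ("mail", 36), ("edu", 34), ("wireless", 31),
    ("telecommunications", 31), ("california institute", 28),
    ("photonics", 27), ("physics", 26),
]

filtered = filter_tags(unfiltered)
for rank, (tag, n) in enumerate(filtered, start=1):
    print(f"{rank}. {tag} ({n})")
```

With this sample the surviving tags come out in the same order as the top of slide 20, which suggests, but does not confirm, that a filter of this kind was used.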
  • 21. The Archimedes Project 2006
  • 22. Importance of Value Propositions. Image: Norman Rockwell for Tom Sawyer and Huck Finn, 1935
  • 23. What Researchers are Interested In. TreeMap, Federal Funding, May 2010, Data and Visualization Calit2
  • 24. Really Good Government
  • 25. Federal Funding Opportunities 2006
  • 26. Federal Funding Opportunities Production Workflow
  • 27. Research Intelligence 2007: Faculty and Funding Keywords. 606 Grants, 5700 Tags
  • 28. Research Intelligence 2007 Workflow
  • 29. Research Intelligence Campus 2009. http://ric.ucsd.edu
  • 30. Campus RI: Integrated Researcher Metadata. http://ric.ucsd.edu
  • 31. Research Intelligence: The 2009 Semantic Engine. Keywords Relevancy 900 Users Keywords Topics 5400 Keywords 70,000 Documents Semantics, Linked Data Tags Keywords Semantics Keywords Semantics
  • 32. Community Research Intelligence: New Application Thrust 2010
  • 33. Topic III: Semantic Data Evolution
  • 34. Research Intelligence View of Semantic Data Evolution. Axes: Complexity vs. Time. Stages in order of increasing complexity over time (2005, 2008, 2009, 2010): Closed NLP Text Mining; Few Open APIs for NLP; Initial Open APIs, Semantic Services; Initial Open Linked Data Repositories
  • 35. Research Intelligence: The Data, Grant Abstract. NSF Solicitation: Software Infrastructure for Sustained Innovation
    Computation is accepted as the third pillar supporting innovation and discovery in science and engineering and is central to NSF's future vision of Cyberinfrastructure Framework for 21st Century Science and Engineering (CF21)[1]. Software is an integral part of the computation paradigm and a primary modality for realizing the CF21 vision. Scientific discovery and innovation are advancing fundamentally new pathways opened by development of increasingly sophisticated software. Software is also directly responsible for increased scientific productivity and significant enhancement of researchers' capabilities. In order to nurture, accelerate and sustain this critical mode of scientific progress, NSF is establishing a new program, Software Infrastructure for Sustained Innovation (SI2), with the overarching goal of transforming innovations in research and education into sustained software resources that are an integral part of the cyberinfrastructure. SI2 is a long-term investment focused on catalyzing new thinking, paradigms, and practices in using software to understand natural, human, and engineered systems. SI2's intent is to foster a pervasive cyberinfrastructure to help researchers address problems of unprecedented scale, complexity, resolution, and accuracy by integrating computation, data, networking and experiments in novel ways. It is NSF's expectation that SI2 investment will result in robust, reliable, usable and sustainable software infrastructure that is critical to the CF21 vision and will transform science and engineering. It is expected that SI2 will generate and nurture the multidisciplinary processes required to support the entire software lifecycle and will result in the development of sustainable software communities.
    SI2 envisions vibrant partnerships among academia, government laboratories and industry for the development and stewardship of a sustainable software infrastructure that can enhance productivity and accelerate innovation in science and engineering. The goal of the SI2 program is to create a software ecosystem that includes all levels of the software stack and scales from individual or small groups of software innovators to large hubs of software excellence. The program addresses all aspects of CI, from embedded sensor systems and instruments, to desktops and high-end data and computing systems, to major instruments and facilities.
    The SI2 program envisions three classes of awards:
    1. Scientific Software Elements (SSE): SSE awards target small groups that will create and deploy robust software elements for which there is a demonstrated need, encapsulating innovation in science and engineering. The effort targeted by a SSE award is up to a level roughly comparable to: summer support for two investigators with complementary expertise; two graduate students; and their collective research needs (e.g. materials, supplies, travel) for three years.
    2. Scientific Software Integration (SSI): SSI awards target larger groups of PIs organized around common research problems as well as common software infrastructure, and will result in a sustainable community software framework. The effort targeted by a SSI award is up to a level roughly comparable to: summer support for three to four investigators with complementary expertise; three to four graduate students; one or two senior personnel (including post-doctoral researchers, software developers, and staff); and their collective research needs (e.g., materials, supplies, travel) for three to five years. The integrative contributions of the SSI team should clearly be greater than the sum of the contributions of each individual member of the team.
    3. Scientific Software Innovation Institutes (S2I2): S2I2 awards will focus on the establishment of long-term community-wide hubs of software excellence. These hubs will provide expertise, processes, resources and implementation mechanism to transform computational science and engineering innovations and community software into robust and sustained tools for enabling science and engineering. S2I2 proposals will bring together multidisciplinary teams of domains scientists and engineers, computer scientists and software engineers, technologists and educators.
    The FY 2010 SI2 competition will be limited to SSE and SSI awards. The solicitation in FY 2011, and in subsequent years, will outline funding opportunities for all three classes of awards (SSE, SSI and S2I2), subject to availability of funds.
    [1] http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
  • 36. Keyword Extraction Across Sources
    Columns: Term | Human | Yahoo | KEA | Calais | Alchemy | OAmplify
    Terms: Common Software Infrastructure; Community Software; Cyberinfrastructure; Embedded Sensor; Engineering; Hubs of Scientific Innovation; Innovation; NSF; Scientific; Scientific Discovery; Scientific Software; Scientific Software Integration; Scientific Software Innovation Institutes; SI2; Software; Software Developers; Software Ecosystem; Software Elements; Software Engineers; Software Infrastructure; Software Innovators; Software Lifecycle; Software Stack; SSI; Sustainable Software; Sustained Tool; Vision
    Column totals: Human 12; Yahoo 3; KEA 9; Calais 15; Alchemy 20; OAmplify 10
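One way to compare sources in a matrix like the one on slide 36 is to measure how much each extractor's output overlaps a human-chosen list. The following is a minimal Python sketch, assuming case-insensitive exact matching. `human_terms` is a hypothetical stand-in, since the slide's per-term marks are not recoverable from the transcript, and `calais_terms` is a small sample of the Industry Terms listed on slide 37. The function name `overlap` is illustrative and not part of any of the services named.

```python
def normalize(term):
    """Lower-case a term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def overlap(reference, candidate):
    """Return (shared terms, precision, recall) of candidate vs. reference."""
    ref = {normalize(t) for t in reference}
    cand = {normalize(t) for t in candidate}
    shared = ref & cand
    precision = len(shared) / len(cand) if cand else 0.0
    recall = len(shared) / len(ref) if ref else 0.0
    return sorted(shared), precision, recall


# Hypothetical human-selected keywords (illustrative only).
human_terms = ["Cyberinfrastructure", "Software Infrastructure",
               "Scientific Software", "SI2"]

# Sample of the Open Calais Industry Terms shown on slide 37.
calais_terms = ["Community Software", "Software Lifecycle",
                "Software Infrastructure", "Software Stack"]

shared, precision, recall = overlap(human_terms, calais_terms)
print(shared, precision, recall)
```

Exact matching is deliberately strict; a real comparison would probably need stemming or partial matching so that variants such as "Software Ecosystem" and "Software Ecosystems" count as the same term.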
  • 37. Semantic Structure Returned by Open Calais. http://www.opencalais.com/
    Industry Terms: Community Software; Software Lifecycle; Sustainable Software Communities; Usable and Sustainable Software Infrastructure; Software Infrastructure; Software Stack; Software Developers; Sustainable Community Software Framework; Sustained Software Resources; Software Ecosystem; Software Excellence; Embedded Sensor Systems; Software Elements; Sustainable Software Infrastructure
    Social Tags: Cyberinfrastructure; E-Science; Computing; Computer Software; Innovation; Software Engineer; Technology; Science; Technology_Internet
    Organization: National Science Foundation
    URL: http://www.nsf.gov/pubs/2010/nsf10015/nsf10015.jsp
  • 38. Semantic Structure Returned by Alchemy API. http://www.openamplify.com/
    Tags: pillar supporting innovation; overarching goal; primary modality; graduate students; program envisions; scientific discovery; researchers address problems; robust software elements; 21st century science; collective research; scientific progress; scientific software elements; common research problems; common software infrastructure; scientific software innovation; scientific software integration; community software; complementary expertise; si2's intent; computation paradigm; small groups; cyberinfrastructure framework; software elements; entire software lifecycle; software excellence; envisions vibrant partnerships; software infrastructure; innovation computation; software innovators; innovations; software resources; long-term community-wide hubs; sophisticated software; nsf's expectation; sse award; nsf's future vision; ssi awards; pervasive cyberinfrastructure; ssi team; summer support; sustainable community software; sustainable software communities; sustainable software infrastructure
    Company: Scientific productivity; Scientific Software
    Field Terminology: Software; Software Stack; Software Developers; Software Ecosystems; Software Engineers
    Organization: NSF; SSI
    Category: Science and Technology
  • 39. Semantic Modeling Challenge Even with XML/DTD. HTTP://vivoweb.org
  • 40. Grants.Gov Technical Support Doesn’t Like Data Questions. HTTP://vivoweb.org
  • 41. Open Calais Faculty Linked Data Results
    Tag | Type | Linked Data | Relevancy
    National Science Foundation | Organization | http://d.opencalais.com/genericHasher-1/f7d1451f-915f-31bc-8194-b9794401ea2d.html | 52%
    Software Excellence | Industry Term | http://d.opencalais.com/genericHasher-1/3da6f84d-cff9-3eec-8fce-99ea792e370c.html | 34%
    Sustained Software Resources | Industry Term | http://d.opencalais.com/genericHasher-1/61a1eb6d-196d-3493-ad6c-8ea0b85ce421.html | 32%
    Usable and Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/9e6fe116-e562-3753-9b93-8f938095a715.html | 31%
    Software Lifecycle | Industry Term | http://d.opencalais.com/genericHasher-1/9c7876e1-a85f-307c-8b38-163c129f19f7.html | 30%
    Sustainable Software Communities | Industry Term | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 29%
    Sustainable Software Infrastructure | Industry Term | http://d.opencalais.com/genericHasher-1/4be05ead-30cd-3c3a-bd88-5dbb8427acc9.html | 27%
    Software Stack | Industry Term | http://d.opencalais.com/genericHasher-1/c22ad2e5-bd08-3083-9dc5-14945fb77010.html | 24%
    Software Innovators | Industry Term | http://d.opencalais.com/genericHasher-1/eba4d676-5aa8-3b1e-83dc-c4bd91b4d0f4.html | 21%
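A relevancy column like the one on slide 41 lends itself to a simple cutoff before tags are attached to a profile. The following is a minimal Python sketch using the tags and percentages from that slide. The 30% threshold and the function name `keep_relevant` are assumptions for illustration; the slides do not state whether or where a cutoff was applied.

```python
# (tag, type, relevancy in percent) as listed on slide 41.
calais_results = [
    ("National Science Foundation", "Organization", 52),
    ("Software Excellence", "Industry Term", 34),
    ("Sustained Software Resources", "Industry Term", 32),
    ("Usable and Sustainable Software Infrastructure", "Industry Term", 31),
    ("Software Lifecycle", "Industry Term", 30),
    ("Sustainable Software Communities", "Industry Term", 29),
    ("Sustainable Software Infrastructure", "Industry Term", 27),
    ("Software Stack", "Industry Term", 24),
    ("Software Innovators", "Industry Term", 21),
]


def keep_relevant(results, min_relevancy=30):
    """Keep rows at or above the (assumed) cutoff, most relevant first."""
    kept = [row for row in results if row[2] >= min_relevancy]
    return sorted(kept, key=lambda row: row[2], reverse=True)


strong = keep_relevant(calais_results)
for tag, kind, relevancy in strong:
    print(f"{relevancy}%  {kind:<14} {tag}")
```

Relevancy is stored as an integer percentage so that the boundary comparison is exact; raising or lowering `min_relevancy` trades coverage against noise.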
  • 42. Open Calais Linked Data Examples: National Science Foundation; Software Excellence
  • 43. Zemanta Linked Data Examples from Grant Abstract
  • 44. Linking to Freebase Via API from Grant Abstract
  • 45. Are Faculty Yet Data Objects? Depends on Their Popularity
  • 46. My Boss is 32 Triples
  • 47. Faculty Web Page
  • 48. Open Calais Faculty Linked Data Example
    Tag | Type | Linked Data | Relevancy
    Lo Research Group | Company | http://d.opencalais.com/comphash-1/2cf74602-005c-3d32-a184-4bc49ef2d5f2.html | 50%
    California Institute | Facility | http://d.opencalais.com/genericHasher-1/37ab20cd-0681-3775-bf97-7583b4ec1434.html | 46%
    X@ece.ucsd.edu | EmailAddress | http://d.opencalais.com/genericHasher-1/babf08c8-1f57-3b99-b020-7e0dd8eaf1fc.html | 31%
    California Institute for Telecommunications | Organization | http://d.opencalais.com/genericHasher-1/6a1fba6f-cf57-300b-94fc-f36d027c8ff0.html | 31%
    858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/e8e3ad15-ace3-3616-be5a-ae9038bc0678.html | 31%
    858-xxx-xxxx | PhoneNumber | http://d.opencalais.com/genericHasher-1/5228ac30-2bf5-397e-bc1a-04275a3f5045.html | 31%
    Information Technology | Technology | http://d.opencalais.com/genericHasher-1/a0f02cf0-dc13-3b0f-a139-5509b026bd96.html | 31%
    optoelectronic devices | Industry Term | http://d.opencalais.com/genericHasher-1/7f81f0c9-b94f-3959-b35b-67be2f703ab4.html | 29%
    International Business Machines | Company | http://d.opencalais.com/er/company/ralg-tr1r/9e3f6c34-aa6b-3a3b-b221-a07aa7933633.html | 6%
  • 49. Open Calais Linked Data Examples: Calit2
  • 50. Open Calais Linked Data Examples: IBM
  • 51. Zemanta Linked Data Results
    Tag | Linked Data | Confidence
    Integrated Circuits | wikipedia: Integrated circuit | 0.65
    UC Berkeley | geolocation: University of California, Berkeley; homepage: University of California, Berkeley; wikipedia: University of California, Berkeley | 0.64
    Information Technology | wikipedia: Information technology | 0.63
    Calit2 | geolocation: California Institute for Telecommunications and Information Technology; wikipedia: California Institute for Telecommunications and Information Technology | 0.60
    Almaden Research Center | geolocation: IBM Almaden Research Center; wikipedia: IBM Almaden Research Center | 0.59
    Age related Macular Degeneration | wikipedia: Macular degeneration | 0.59
    Minimally Invasive Surgery | wikipedia: Invasiveness of surgical procedures | 0.58
    Cancer | http://en.wikipedia.org/wiki/Cancer | 0.57
    Cornell | geolocation: Cornell University; homepage: Cornell University; wikipedia: Cornell University; youtube: Cornell University | 0.57
    Fluorescence Activated Cell Sorter | wikipedia: Flow cytometry | 0.57
  • 52. Zemanta Linked Data Examples: Integrated Circuit; Calit2
  • 53. The Linked Data Cloud
  • 54. Linked Data and a Wikipedia Base. Wikipedia: How Accurate? Source: Jeremy Hsu, “Wikipedia: How Accurate is it?” November 2009, Live Science, http://www.livescience.com/technology/091106-ttr-wikipedia.html#comments
  • 55. Is It A Problem? John S., Is a Possible Assassin of, John K. SOURCE: USA Today, November 29, 2005
  • 56. Maybe Not? How Important is Validity to Researchers? SOURCE: PHARMANEWS.EU, January 23, 2009
  • 57. Topic IV: Future Directions?
  • 58. Life Sciences Example. http://www.collexis.com/
  • 59. SciVal From Elsevier. http://www.scival.com/
  • 60. SciVal Terms And Conditions. http://www.scival.com/terms-and-conditions
  • 61. The Value of Your Data. http://www.turbulence.org/Works/swipe/calculator.html
  • 62. Beginning to Have Data Portability Policies...for Sites. http://portabilitypolicy.org:80/sample-policies.html
  • 63. Future Direction for Semantic Academic Communities? HTTP://vivoweb.org
  • 64. VIVO Ontology. HTTP://vivoweb.org
  • 65. Emerging/Growing Semantic Catalog. http://www.data.gov/semantic/catalog
  • 66. Example: DOE Awards Semantic Catalog. http://www.data.gov/semantic/catalog
  • 67. Topic V: Conclusions
  • 68. Words of Wisdom, John Wooley: “One of the Most Important Things I Learned is What Not to Pay Attention To”
  • 69. Is the Semantic Web and Linked Data This? Image Courtesy of Alan Vernon, Creative Commons License, Flickr (alanvernon)
  • 70. Is the Semantic Web and Linked Data or This? Image Courtesy of Vince Huang, Creative Commons License, Flickr (vincehuang)