Six Month Progress Report
Farzaneh Sarafraz
14 August 2008

In this report
● What I have learnt
● What are the gaps in my understanding
● Outputs so far
● Reflection on supervision mode
● Plan outline until December 2008

1. What I have learnt – general
● General
    – Settled down in a new environment
    – Learnt some of the regulations and how things work in
        ● The country
        ● The city
        ● The university
        ● The faculty
        ● The school

What I have learnt – less general
● Less general
    – Thesis and paper writing theory and practice
        ● Specifically through the CS7100 seminar
    – LaTeX
    – Coding infrastructure
        ● Warmed up!
    – Database handling
    – Administration / web applications
● Specific
    – Biological text mining theory

Biological text mining
● Biological text mining theory
    – Main problems
    – Main challenges
    – Main approaches
    – Communities
    – Events, papers, journals, competitions, etc.
        ● 40+ papers in my CiteULike account
● Biological text mining hands-on
    – Tools, techniques, and resources
    – i2b2
    – HIV

Biological text mining theory
● Main problems
    – Information retrieval
    – Information extraction
        ● Relation extraction
    – Shallow parsing / chunking
    – POS tagging
    – Word sense disambiguation
    – Term variation

Biological text mining theory (cont.)
● Main problems (cont.)
    – Named entity recognition
        ● Dictionary based
        ● Rule based
        ● Machine learning (HMM: Zhou et al.)
        ● Hybrid
    – Evaluation
        ● Precision, recall, F-score
        ● Sensitivity and specificity
        ● Not always possible due to the lack of
            – Test corpora
            – Common domains, techniques, goals

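The evaluation measures listed above can all be derived from the four confusion-matrix counts. The following is a minimal Python sketch; the function name `evaluate` and the counts are hypothetical and chosen only for illustration, they are not results from this report.

```python
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute common evaluation measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall and sensitivity are the same quantity: TP / (TP + FN).
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
scores = evaluate(tp=80, fp=40, fn=20, tn=60)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Specificity needs a count of true negatives, which is often undefined in extraction tasks; that is one reason precision, recall and F-score are usually reported instead.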
Biological text mining theory (cont.)
● Main challenges
    – Deal with the sublanguage of biology
    – Build scalable and robust systems
    – Present the results in meaningful and informative ways to the biologist
    – Deal with interdisciplinary aspects
        ● Biology – chemistry – medicine
            – Different views / information needs
        ● Specific field (biomedicine) – linguistics – computation and data mining

Main Challenges (cont.)
● Specific field (biomedicine) – linguistics – computation and data mining
    – The text is not necessarily written to be comprehensible by automatic techniques
    – The language is dramatically different from that of e.g. newswire
    – Terminology, new and coined terms, usage ambiguity
    – Non-algorithmic, irrational patterns in NL

Resources
● I am aware of / I am using existing resources
    – Literature repositories/search engines
        ● PubMed, MEDLINE, BioMed
        ● Google
    – Parsers
        ● Stanford Parser
        ● GeniaTagger
    – Terminological resources
        ● Gene Ontology
        ● EMBL-EBI
        ● MeSH thesaurus
        ● UMLS
        ● Gene Synonym Finder, SBO, ...

Resources (cont.)
● Existing resources (cont.)
    – Lexical resources
    – Web services
        ● Entrez
        ● Taverna
        ● SBO

Resources (cont.)
● I am partially developing tools for
    – Named entity recognition
    – Relation extraction
● I am fully tackling
        ● PPI mining
        ● Word sense disambiguation
        ● Nominalization
    – I may have to tackle in future
        ● Contradiction, negation, contrasts
        ● Temporal text mining

2. What I still need to learn – Specific
● There may be gaps I am unaware of
● Less wheel reinvention
    – Use other software
        ● LingPipe, NLTK, Weka, RASP, ABNER, PIE, BIOINFER, MALLET, Julielab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni)
    – Use other methods/approaches
        ● Machine learning
        ● Dynamic programming
    – CL / bio text mining theory algorithms
        ● Viterbi, HMM, NN, SVM, GA, CRF, ...

2. What I still need to learn – Specific
● Make a resources list on our web page?
    – Similar to the Stanford – outdated – repository

What I still need to learn – Less general
● News of the field
● Areas/opportunities for research
    – Michael Phelps analogy
● Developing skills for a CV
    – Ways to prove I have the skills I already have
● Presenting results
    – Reasons, occasions, methods
    – Writing
● Other workshops by the faculty

What I still need to learn – General
● Writing, writing, writing
    – Binge writing vs. snacking
    – Write as you go
        ● Closer to the final output
        ● Paper-based dissertation? Something to consider.
    – Review, get feedback, rewrite
    – A pedantic editor

What I still need to learn – General (cont.)
● Stronger coding infrastructure
    – More reusable libraries
    – Config files
    – One-click approach
● Optimisation
    – Code
    – Database
        ● Query optimisation
        ● Database optimisation
    – Server
        ● Load balancing
            – Multi-threading
            – Multi-processor

3. Outputs so far
● Written
    – Background work survey
        ● Mid April 2008
        ● 5 pages (approx. 1000 words)
        ● Feedback from supervisor
        ● Was never written up
    – Writing sample for CS7100 seminar
        ● June 2008
        ● Same document as above, revised and rewritten
        ● 12 pages, 2215 words
        ● Feedback from Jim Miles and peer students

HIV
● Understanding of the problem and the goals
● Presenting the given/wanted as tables/code/query
● Building code infrastructure
    – Database tables
    – Utility libraries
    – Version control system
    – 1500+ lines of documented, reusable code

HIV summary
● Goal: to reproduce a human-produced table
● Each row has the following main columns
    – HIV GPN (protein name, acc, and gene ID)
    – Human GPN (protein name, acc, and gene ID)
    – A relation (interaction) between the two
    – A description of the interaction
    – The PMIDs that the interaction has been reported in
● The raw input: the full abstracts

HIV results
● HIV and human GPN names
    – Most were mapped to their entities
    – 1237 out of 50416 currently unmapped (about 2.5%)
● Interaction verbs
    – Interesting verbs and stems identified
    – The stems were found in the text
        ● Working on stems, so including nominals, etc.
● Terms extracted from the interaction descriptions in the original data

Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID=28;
    ● 18 variations

    CD4+ T        T4 (CD)       CD4+T
    CD4-, T       T4(CD)        T (CD4)
    T CD4         CD4 (T)       CD4+ (T)
    CD4(+) T      CD4(+)T       CD4(T)
    CD4 T         CD4+-T        CD(4+) T
    T4+ (CD)      CD4(+)-T      CD4- T

Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID=28 OR termID=17;
    ● 28 variations

    CD4+ T        T4(CD)        CD4+ (T)        CD4(+) T cell
    CD4-, T       CD4 (T)       CD4(T)          CD4 T-cell
    T CD4         CD4(+)T       CD(4+) T        CD4(+) T-cell
    CD4(+) T      CD4+-T        CD4- T          CD4(+)T cell
    CD4 T         CD4(+)-T      CD4+ T cell     CD4+-T-cell
    T4+ (CD)      CD4+T         CD4-, T cell    CD4(+)-T-cell
    T4 (CD)       T (CD4)       CD4+ T-cell     CD4 T cell

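The two queries above can be tried out against an in-memory SQLite database. This is a sketch only: the report does not show the schema of `index_description_term` or name the database engine, so the columns (`termID`, `mention`, `pmid`), the helper `variations`, and the sample rows below are assumptions made for illustration. The helper uses `IN (...)`, which is equivalent to chaining the `termID` tests with `OR`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: one row per mention of a description term in an abstract.
conn.execute(
    "CREATE TABLE index_description_term "
    "(termID INTEGER, mention TEXT, pmid INTEGER)"
)
rows = [
    (28, "CD4+ T", 1),
    (28, "CD4+ T", 2),        # same surface form seen in a second abstract
    (28, "CD4(+) T", 3),
    (28, "T4 (CD)", 4),
    (17, "CD4+ T cell", 5),
    (17, "CD4 T-cell", 6),
    (5, "gp120", 7),          # unrelated term, must not be returned
]
conn.executemany("INSERT INTO index_description_term VALUES (?, ?, ?)", rows)


def variations(term_ids):
    """Return the distinct surface forms recorded for the given term IDs."""
    placeholders = ", ".join("?" for _ in term_ids)
    query = (
        "SELECT DISTINCT mention FROM index_description_term "
        f"WHERE termID IN ({placeholders}) ORDER BY mention"
    )
    return [row[0] for row in conn.execute(query, list(term_ids))]


single = variations([28])
combined = variations([28, 17])
print(len(single), single)
print(len(combined), combined)
```

Merging two term IDs that denote the same concept enlarges the set of surface forms, which is what the jump from 18 to 28 variations in the slides illustrates.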
HIV results
● POS tagging with GeniaTagger
● Parsing with Stanford Parser
    – Haven't used this data yet
● Working with sentences as units
● Normalising terms
● Tables of synonyms
● Tables of verb stems and terms
● Indexes with mention/offset pairs

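As an illustration of term normalisation and of an index with mention/offset pairs, the sketch below maps surface variants to one key and records each mention with its character offset. It is not the code from this project: the normalisation rule (lower-casing and dropping everything except letters and digits), the regular expression `MENTION`, and the sample sentence are assumptions chosen to keep the example small.

```python
import re
from collections import defaultdict


def normalise(mention: str) -> str:
    """Lower-case a mention and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", mention.lower())


# Hypothetical pattern covering a few of the CD4 T-cell spellings.
MENTION = re.compile(r"CD4\s?\(?\+?\)?[\s-]?T(?:[\s-]cell)?")


def build_index(sentence: str) -> dict:
    """Map each normalised term to its (mention, character offset) pairs."""
    index = defaultdict(list)
    for match in MENTION.finditer(sentence):
        index[normalise(match.group())].append((match.group(), match.start()))
    return dict(index)


sentence = "gp120 binds CD4+ T cells, and CD4(+)T-cell depletion follows."
index = build_index(sentence)
for term, mentions in index.items():
    print(term, mentions)
```

Keeping the original mention next to its offset means the normalised key can be used for lookup while the exact text span stays recoverable.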
HIV results
● Looking for sentences that share all these properties with any of the goal table rows
    – A human-HIV pair of GPN
    – A verb phrase containing a word with the same stem as the interaction verb
    – Any description term(s)
● Very high recall (few false negatives)
● Not-so-high precision (numerous false positives)
● Optimisation for more complicated queries

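The matching rule above can be sketched as a conjunction of three tests on each sentence. This is an illustrative simplification and not the project's implementation: the `GoalRow` fields, the plain substring and prefix tests, and the sample sentences are assumptions, whereas the real pipeline works from indexed mentions and stems.

```python
from dataclasses import dataclass


@dataclass
class GoalRow:
    hiv_names: list          # synonyms of the HIV gene/protein name
    human_names: list        # synonyms of the human gene/protein name
    verb_stem: str           # stem of the interaction verb, e.g. "bind"
    description_terms: list  # terms taken from the interaction description


def matches(sentence: str, row: GoalRow) -> bool:
    """True if the sentence has the GPN pair, the verb stem and a term."""
    text = sentence.lower()
    has_pair = any(n.lower() in text for n in row.hiv_names) and any(
        n.lower() in text for n in row.human_names
    )
    # Crude stem test: any word that starts with the stem ("binds", "binding").
    has_verb = any(word.startswith(row.verb_stem) for word in text.split())
    has_term = any(t.lower() in text for t in row.description_terms)
    return has_pair and has_verb and has_term


row = GoalRow(["gp120"], ["CD4"], "bind", ["T cell"])
sentences = [
    "HIV gp120 binds CD4 on the T cell surface.",
    "Binding of gp120 was not observed without CD4 T cell help.",
    "CD4 counts fell during infection.",
]
hits = [s for s in sentences if matches(s, row)]
print(hits)
```

The second sentence is negated, yet it passes all three tests. Loose matching of this kind misses few true interactions but lets through many false positives, which is consistent with high recall and lower precision.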
HIV next steps
● Compare with other PPI mining and GPN recognition tools
● Find optimum parameters
● Presentable results
● Integrate with the interaction ontology
● Evaluate, compare, present, get feedback
● Apply to new papers
● Apply to new organisms
● Evaluate, compare, present, get feedback...

Supervision
● Good points
    – Moving away from theory to tackling real problems very quickly
    – Micromanagement while I am free to manage my own time and other preferences
    – Planning ahead, causing commitment
    – Providing common sense, insight, and savvy

Supervision – good points (cont.)
– Providing good starting points while not ruling out my own ideas
– Good meeting frequency
    ● Group meetings?
– General support
– Addressing my needs
    ● Financial
    ● Research interests and preferences

Supervision
● Could be improved
    – Minutes were not always thorough
    – Same for task lists
    – We could have an agenda for the meetings
        ● I write a list of the things that I want to discuss each session
        ● Like the one I had for this report – could have been there when I presented my 3-week plan
    – Same for TEAM meetings and HIV meetings
● I hope we keep tackling real problems in future

Plan
● End of August
    – Presenting HIV output to the group
    – Writing HIV results
● Sep
    – Moving to new accommodation (11-20 Sep.)
    – Moving on HIV
        ● Applying the ontology
        ● Mining new corpora
        ● Generalising?

Plan
● Oct
    – Writing up HIV
    – Possible publication
    – Ideas for PhD research
● Nov
    – Finalise MPhil vs. PhD
    – Finalise PhD research area
    – Work on end of year report
● Dec
    – Write up EOY report
    – EOY Viva

References
● Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
● Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
● Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and ChewLim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics. Vol. 20 no. 7. Pp. 1178-1190.

More Related Content

Viewers also liked

Health care special interest-i2b2
Health care  special interest-i2b2Health care  special interest-i2b2
Health care special interest-i2b2farzanehs
 
Nacsa úJ 4.1 Jav.
Nacsa úJ 4.1 Jav.Nacsa úJ 4.1 Jav.
Nacsa úJ 4.1 Jav.
tuddyke
 
BioNLP09 Winners
BioNLP09 WinnersBioNLP09 Winners
BioNLP09 Winnersfarzanehs
 
Tinsleys 7 Accomplishments
Tinsleys 7 AccomplishmentsTinsleys 7 Accomplishments
Tinsleys 7 Accomplishments
Tinsley10
 
Susan Gray
Susan GraySusan Gray
Susan Gray
smgray
 
the_life_cycle_of_a_wireframe
the_life_cycle_of_a_wireframethe_life_cycle_of_a_wireframe
the_life_cycle_of_a_wireframeguest7ae38dee
 
Olivia Contradictions
Olivia ContradictionsOlivia Contradictions
Olivia Contradictionsfarzanehs
 

Viewers also liked (12)

Health care special interest-i2b2
Health care  special interest-i2b2Health care  special interest-i2b2
Health care special interest-i2b2
 
Nacsa úJ 4.1 Jav.
Nacsa úJ 4.1 Jav.Nacsa úJ 4.1 Jav.
Nacsa úJ 4.1 Jav.
 
Edu
EduEdu
Edu
 
BioNLP09 Winners
BioNLP09 WinnersBioNLP09 Winners
BioNLP09 Winners
 
Crf
CrfCrf
Crf
 
Tinsleys 7 Accomplishments
Tinsleys 7 AccomplishmentsTinsleys 7 Accomplishments
Tinsleys 7 Accomplishments
 
Bionlp09
Bionlp09Bionlp09
Bionlp09
 
Susan Gray
Susan GraySusan Gray
Susan Gray
 
the_life_cycle_of_a_wireframe
the_life_cycle_of_a_wireframethe_life_cycle_of_a_wireframe
the_life_cycle_of_a_wireframe
 
Defense
DefenseDefense
Defense
 
Olivia Contradictions
Olivia ContradictionsOlivia Contradictions
Olivia Contradictions
 
Ambiguity
AmbiguityAmbiguity
Ambiguity
 

Similar to Six Month

Question Classifier
Question ClassifierQuestion Classifier
Question Classifier
Jennifer Lee
 
Trust in Recommender Systems a historical overview and recent developments
Trust in Recommender Systems
a historical overview and recent developmentsTrust in Recommender Systems
a historical overview and recent developments
Trust in Recommender Systems a historical overview and recent developments
Paolo Massa
 
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
katherncarlyle
 
Exploring Data Visualization
Exploring Data VisualizationExploring Data Visualization
Exploring Data Visualization
Jim Jenkins
 
Paul Henning Krogh A New Dawn For E Collaboration In Science
Paul Henning Krogh   A New Dawn For E Collaboration In SciencePaul Henning Krogh   A New Dawn For E Collaboration In Science
Paul Henning Krogh A New Dawn For E Collaboration In Science
Vincenzo Barone
 
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Frank van Harmelen
 
Common Qualitative Research Designs and What They’re Good For
Common Qualitative Research Designs and What They’re Good ForCommon Qualitative Research Designs and What They’re Good For
Common Qualitative Research Designs and What They’re Good For
Statistics Solutions
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013Ken Mwai
 
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...eswcsummerschool
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigm
Pradeep Redddy Raamana
 
ALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional MetadataALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional Metadata
Martin Memmel
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
Stephen Withington
 
Caspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve RenkinCaspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve Renkin
DigitalPreservationEurope
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
c.titus.brown
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
Jameel Syed
 
Knowledge base system appl. p 1,2-ver1
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1
Taymoor Nazmy
 
Oxford DTP - Sansone curation tools - Dec 2014
Oxford DTP - Sansone curation tools - Dec 2014Oxford DTP - Sansone curation tools - Dec 2014
Oxford DTP - Sansone curation tools - Dec 2014
Susanna-Assunta Sansone
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
Elsevier
 
Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automation
benosteen
 
Information Resoruces For Chemists Nov08
Information Resoruces For Chemists Nov08Information Resoruces For Chemists Nov08
Information Resoruces For Chemists Nov08
Gaz Johnson
 

Similar to Six Month (20)

Question Classifier
Question ClassifierQuestion Classifier
Question Classifier
 
Trust in Recommender Systems a historical overview and recent developments
Trust in Recommender Systems
a historical overview and recent developmentsTrust in Recommender Systems
a historical overview and recent developments
Trust in Recommender Systems a historical overview and recent developments
 
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
!#$&()&#+,$)!#$$&())• +,-.$0$12,#-34-$#3.docx
 
Exploring Data Visualization
Exploring Data VisualizationExploring Data Visualization
Exploring Data Visualization
 
Paul Henning Krogh A New Dawn For E Collaboration In Science
Paul Henning Krogh   A New Dawn For E Collaboration In SciencePaul Henning Krogh   A New Dawn For E Collaboration In Science
Paul Henning Krogh A New Dawn For E Collaboration In Science
 
Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...Semantic Web research anno 2006:main streams, popular falacies, current statu...
Semantic Web research anno 2006:main streams, popular falacies, current statu...
 
Common Qualitative Research Designs and What They’re Good For
Common Qualitative Research Designs and What They’re Good ForCommon Qualitative Research Designs and What They’re Good For
Common Qualitative Research Designs and What They’re Good For
 
Just the basics_strata_2013
Just the basics_strata_2013Just the basics_strata_2013
Just the basics_strata_2013
 
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
ESWC SS 2013 - Wednesday Tutorial Marko Grobelnik: Introduction to Big Data A...
 
Research data management for medical data with pyradigm
Research data management for medical data with pyradigmResearch data management for medical data with pyradigm
Research data management for medical data with pyradigm
 
ALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional MetadataALOE - Combining User Generated Content and Traditional Metadata
ALOE - Combining User Generated Content and Traditional Metadata
 
Getting to Know Your Data with R
Getting to Know Your Data with RGetting to Know Your Data with R
Getting to Know Your Data with R
 
Caspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve RenkinCaspar Preservation Methodology Steve Renkin
Caspar Preservation Methodology Steve Renkin
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Data Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake FansData Science Provenance: From Drug Discovery to Fake Fans
Data Science Provenance: From Drug Discovery to Fake Fans
 
Knowledge base system appl. p 1,2-ver1
Knowledge base system appl.  p 1,2-ver1Knowledge base system appl.  p 1,2-ver1
Knowledge base system appl. p 1,2-ver1
 
Oxford DTP - Sansone curation tools - Dec 2014
Oxford DTP - Sansone curation tools - Dec 2014Oxford DTP - Sansone curation tools - Dec 2014
Oxford DTP - Sansone curation tools - Dec 2014
 
Semi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific TablesSemi-automated Exploration and Extraction of Data in Scientific Tables
Semi-automated Exploration and Extraction of Data in Scientific Tables
 
Text-mining and Automation
Text-mining and AutomationText-mining and Automation
Text-mining and Automation
 
Information Resoruces For Chemists Nov08
Information Resoruces For Chemists Nov08Information Resoruces For Chemists Nov08
Information Resoruces For Chemists Nov08
 

Recently uploaded

Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 

Recently uploaded (20)

Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Epistemic Interaction - tuning interfaces to provide information for AI support

Six Month

  • 1. Six Month Progress Report Farzaneh Sarafraz 14 August 2008
  • 2. In this report
      ● What I have learnt
      ● What are the gaps in my understanding
      ● Outputs so far
      ● Reflection on supervision mode
      ● Plan outline until December 2008
  • 3. 1. What I have learnt – general
      ● General
          – Settled down in a new environment
          – Learnt some of the regulations and how things work in
              ● The country
              ● The city
              ● The university
              ● The faculty
              ● The school
  • 4. What I have learnt – less general
      ● Less general
          – Thesis and paper writing theory and practice
              ● Specifically through the CS7100 seminar
          – LaTeX
          – Coding infrastructure
              ● Warmed up!
          – Database handling
          – Administration / web applications
      ● Specific
          – Biological text mining theory
  • 5. Biological text mining
      ● Biological text mining theory
          – Main problems
          – Main challenges
          – Main approaches
          – Communities
          – Events, papers, journals, competitions, etc.
              ● 40+ papers in my CiteULike account
      ● Biological text mining hands-on
          – Tools, techniques, and resources
          – i2b2
          – HIV
  • 6. Biological text mining theory
      ● Main problems
          – Information retrieval
          – Information extraction
              ● Relation extraction
          – Shallow parsing / chunking
          – POS tagging
          – Word sense disambiguation
          – Term variation
  • 7. Biological text mining theory (cont.)
      ● Main problems (cont.)
          – Named entity recognition
              ● Dictionary based
              ● Rule based
              ● Machine learning (HMM: Zhou et al.)
              ● Hybrid
          – Evaluation
              ● Precision, recall, F-score
              ● Sensitivity and specificity
              ● Not always possible due to the lack of
                  – Test corpora
                  – Common domains, techniques, goals
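
The evaluation measures named on this slide can be sketched from raw counts as follows. This is only an illustration: the function name and the counts in the usage line are made up and are not results from this project.

```python
def evaluation_metrics(tp, fp, fn, tn=0):
    """Compute standard evaluation measures from raw counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. Recall is the same quantity as sensitivity.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is the balanced F1: harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "specificity": specificity,
    }


# Illustrative counts only (not measured on any corpus).
example = evaluation_metrics(tp=8, fp=2, fn=2, tn=8)
```

Specificity needs true negatives, which are often not well defined in extraction tasks; that is one reason precision, recall and F-score are the more common choice in text mining.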
  • 8. Biological text mining theory (cont.)
      ● Main challenges
          – Deal with sublanguage of biology
          – Build scalable and robust systems
          – Present the results in meaningful and informative ways to the biologist
          – Deal with interdisciplinary aspects
              ● Biology – chemistry – medicine
                  – Different views / information needs
              ● Specific field (biomedicine) – linguistics – computation and data mining
  • 9. Main Challenges (cont.)
      ● Specific field (biomedicine) – linguistics – computation and data mining
          – The text is not necessarily written to be comprehensible by automatic techniques
          – The language is dramatically different from that of e.g. newswire
          – Terminology, new and coined terms, usage ambiguity
          – Non-algorithmic, irrational patterns in NL
  • 10. Resources
      ● I am aware of / I am using existing resources
          – Literature repositories/search engines
              ● PubMed, MEDLINE, BioMed
              ● Google
          – Parsers
              ● Stanford Parser
              ● GeniaTagger
          – Terminological resources
              ● Gene Ontology
              ● EMBL-EBI
              ● MeSH thesaurus
              ● UMLS
              ● Gene Synonym Finder, SBO, ...
  • 11. Resources (cont.)
      ● Existing resources (cont.)
          – Lexical resources
          – Web services
              ● Entrez
              ● Taverna
              ● SBO
  • 12. Resources (cont.)
      ● I am partially developing tools for
          – Named entity recognition
          – Relation extraction
      ● I am fully tackling
          – PPI mining
          – Word sense disambiguation
          – Nominalization
      ● I may have to tackle in future
          – Contradiction, negation, contrasts
          – Temporal text mining
  • 13. 2. What I still need to learn – Specific
      ● There may be gaps I am unaware of
      ● Less of wheel reinvention
          – Use other software
              ● LingPipe, NLTK, Weka, RASP, ABNER, PIE, BIOINFER, MALLET, Julielab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni)
          – Use other methods/approaches
              ● Machine learning
              ● Dynamic programming
          – CL / Bio text mining theory algorithms
              ● Viterbi, HMM, NN, SVM, GA, CRF, ...
  • 14. 2. What I still need to learn – Specific
      ● Make a resources list on our web page?
          – Similar to the Stanford – outdated – repository
  • 15. What I still need to learn – Less general
      ● News of the field
      ● Areas/opportunities for research
          – Michael Phelps analogy
      ● Developing skills for a CV
          – Ways to prove I have the skills I already have
      ● Presenting results
          – Reasons, occasions, methods
          – Writing
      ● Other workshops by the faculty
  • 16. What I still need to learn – General
      ● Writing, writing, writing
          – Binge writing vs. snacking
          – Write as you go
              ● Closer to the final output
              ● Paper-based dissertation? Something to consider.
          – Review, get feedback, rewrite
          – A pedantic editor
  • 17. What I still need to learn – General (cont.)
      ● Stronger coding infrastructure
          – More reusable libraries
          – Config files
          – One-click approach
      ● Optimisation
          – Code
          – Database
              ● Query optimisation
              ● Database optimisation
          – Server
              ● Load balancing
          – Multi-threading
          – Multi-processor
  • 18. 3. Outputs so far
      ● Written
          – Background work survey
              ● Mid April 2008
              ● 5 pages (approx. 1000 words)
              ● Feedback from supervisor
              ● Was never written up
          – Writing sample for CS7100 seminar
              ● June 2008
              ● Same document as above, revised and rewritten
              ● 12 pages, 2215 words
              ● Feedback from Jim Miles and peer students
  • 19. HIV
      ● Understanding of the problem and the goals
      ● Presenting the given/wanted as tables/code/query
      ● Building code infrastructure
          – Database tables
          – Utility libraries
          – Version control system
          – 1500+ lines of documented, reusable code
  • 20. HIV summary
      ● Goal: to reproduce a human-produced table
      ● Each row has the following main columns
          – HIV GPN (protein name, acc, and gene ID)
          – Human GPN (protein name, acc, and gene ID)
          – A relation (interaction) between the two
          – A description of the interaction
          – The PMIDs that the interaction has been reported in
      ● The raw input: the full abstracts
  • 21. HIV results
      ● HIV and human GPN names
          – Most were mapped to their entities
          – 1237 out of 50416 currently unmapped (2%)
      ● Interaction verbs
          – Interesting verbs and stems identified
          – The stems were found in the text
              ● Working on stems, so including nominals, etc.
      ● Terms extracted from the interaction descriptions in the original data
  • 22. Example
      ● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28;
      ● 18 variations (separated here by semicolons)
          CD4+ T; T4 (CD); CD4+T; CD4-, T; T4(CD); T (CD4);
          T CD4; CD4 (T); CD4+ (T); CD4(+) T; CD4(+)T; CD4(T);
          CD4 T; CD4+-T; CD(4+) T; T4+ (CD); CD4(+)-T; CD4- T
  • 23. Example
      ● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28 OR termID = 17;
      ● 28 variations (separated here by semicolons)
          CD4+ T; T4(CD); CD4+ (T); CD4(+) T cell; CD4-, T; CD4 (T); CD4(T);
          CD4 T-cell; T CD4; CD4(+)T; CD(4+) T; CD4(+) T-cell; CD4(+) T; CD4+-T;
          CD4- T; CD4(+)T cell; CD4 T; CD4(+)-T; CD4+ T cell; CD4+-T-cell; T4+ (CD);
          CD4+T; CD4-, T cell; CD4(+)-T-cell; T4 (CD); T (CD4); CD4+ T-cell; CD4 T cell
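
The two example queries can be tried against a toy table. The sketch below uses an in-memory SQLite database from Python. The table and column names come from the queries on the slides, but the column types, the helper `distinct_mentions`, the handful of rows and the extra termID 3 are assumptions for illustration only. `IN (...)` is used as an equivalent of the `OR` in the second query.

```python
import sqlite3


def distinct_mentions(conn, term_ids):
    """Return the distinct surface forms (mentions) recorded for the given term IDs."""
    placeholders = ", ".join("?" for _ in term_ids)
    query = (
        "SELECT DISTINCT mention FROM index_description_term "
        f"WHERE termID IN ({placeholders}) ORDER BY mention"
    )
    return [row[0] for row in conn.execute(query, list(term_ids))]


# Toy data: the schema is assumed, the real tables are not shown on the slides.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_description_term (mention TEXT, termID INTEGER)")
conn.executemany(
    "INSERT INTO index_description_term VALUES (?, ?)",
    [
        ("CD4+ T", 28),
        ("CD4+ T", 28),        # same mention seen twice, DISTINCT collapses it
        ("T4 (CD)", 28),
        ("CD4+ T cell", 17),
        ("CD4 T-cell", 17),
        ("gp120", 3),          # unrelated term, must not be returned
    ],
)

only_28 = distinct_mentions(conn, [28])
both = distinct_mentions(conn, [28, 17])
```

Passing the IDs as parameters rather than formatting them into the string keeps the query reusable for any set of term IDs.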
  • 24. HIV results
      ● POS tagging with GeniaTagger
      ● Parsing with Stanford parser
          – Haven't used this data yet
      ● Working with sentences as units
      ● Normalising terms
      ● Tables of synonyms
      ● Tables of verb stems and terms
      ● Indexes with mention/offset pairs
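
An index of mention/offset pairs can be sketched as below. This is a minimal illustration using exact string matching. The function name, the example sentence and the choice of character offsets are assumptions, and the real indexes are database tables rather than Python dictionaries.

```python
import re
from collections import defaultdict


def build_mention_index(sentence, mentions):
    """Map each mention to the character offsets at which it starts in the sentence."""
    index = defaultdict(list)
    for mention in mentions:
        # re.escape is needed because mentions such as "CD4+ T" contain regex metacharacters.
        for match in re.finditer(re.escape(mention), sentence):
            index[mention].append(match.start())
    return dict(index)


# Made-up sentence for illustration.
sentence = "Tat binds CD4+ T cells; CD4+ T depletion follows."
offsets = build_mention_index(sentence, ["CD4+ T", "Tat"])
```

Storing offsets rather than just a yes/no flag makes it possible to check later whether two mentions fall in the same sentence or phrase.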
  • 25. HIV results
      ● Looking for sentences that share all these properties with any of the goal table rows
          – A human-HIV pair of GPN
          – A verb phrase containing a word with the same stem as the interaction verb
          – Any description term(s)
      ● Very high recall (few false negatives)
      ● Not-so-high precision (numerous false positives)
      ● Optimisation for more complicated queries
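
The three-property sentence filter can be sketched as follows. Everything here is a simplified assumption: `crude_stem` is a hypothetical stand-in for a real stemmer, the row dictionary keys are invented, the verb check looks at any token rather than a parsed verb phrase, and the sentences are made up.

```python
import re


def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase and strip one common suffix."""
    word = word.lower()
    for suffix in ("ation", "ing", "ion", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def matches_row(sentence, row):
    """True if the sentence has the GPN pair, the verb stem, and a description term."""
    lowered = sentence.lower()
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    # Naive substring matching: a deliberate simplification that favours recall.
    has_pair = row["hiv_gpn"].lower() in lowered and row["human_gpn"].lower() in lowered
    # Comparing stems means nominal forms such as "binding" also match the verb "bind".
    has_verb = crude_stem(row["verb"]) in {crude_stem(t) for t in tokens}
    has_term = any(term.lower() in lowered for term in row["terms"])
    return has_pair and has_verb and has_term


# Invented goal-table row and sentences, for illustration only.
row = {"hiv_gpn": "Tat", "human_gpn": "CD4", "verb": "bind", "terms": ["T cell"]}
hit_verb = matches_row("Tat binds CD4 on the T cell surface.", row)
hit_nominal = matches_row("Binding of Tat to CD4 was observed on the T cell.", row)
miss = matches_row("Tat was detected in T cell cultures.", row)
```

Loose matching of this kind rarely misses a relevant sentence but accepts many irrelevant ones, which is consistent with the high recall and lower precision reported on this slide.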
  • 26. HIV next steps
      ● Compare with other PPI mining and GPN recognition tools
      ● Find optimum parameters
      ● Presentable results
      ● Integrate with the interaction ontology
      ● Evaluate, compare, present, get feedback
      ● Apply to new papers
      ● Apply to new organisms
      ● Evaluate, compare, present, get feedback...
  • 27. Supervision
      ● Good points
          – Moving away from theory to tackling real problems very quickly
          – Micromanagement while I am free to manage my own time and other preferences
          – Planning ahead, causing commitment
          – Providing common sense, insight, and savvy
  • 28. Supervision – good points (cont.)
          – Providing good starting points while not ruling out my own ideas
          – Good meeting frequency
              ● Group meetings?
          – General support
          – Addressing my needs
              ● Financial
              ● Research interests and preferences
  • 29. Supervision
      ● Could be improved
          – Minutes were not always thorough
          – Same for task lists
          – We could have an agenda for the meetings
              ● I write a list of the things that I want to discuss each session
              ● Like the one I had for this report; it could have been there when I presented my 3-week plan
          – Same for TEAM meetings and HIV meetings
      ● I hope we keep tackling real problems in future
  • 30. Plan
      ● End of August
          – Presenting HIV output to the group
          – Writing HIV results
      ● Sep
          – Moving to new accommodation (11-20 Sep.)
          – Moving on HIV
              ● Applying the ontology
              ● Mining new corpora
              ● Generalising?
  • 31. Plan
      ● Oct
          – Writing up HIV
          – Possible publication
          – Ideas for PhD research
      ● Nov
          – Finalise MPhil vs. PhD
          – Finalise PhD research area
          – Work on end of year report
      ● Dec
          – Write up EOY report
          – EOY Viva
  • 32. References
      ● Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
      ● Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
      ● Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and ChewLim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics. Vol. 20, no. 7, pp. 1178-1190.