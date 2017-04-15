Research Methods in Natural Language Processing Pham Quang Nhat Minh FPT Technology Research Institute FPT University minh...
Research Methods in Natural Language Processing

  1. Research Methods in Natural Language Processing
      Pham Quang Nhat Minh, FPT Technology Research Institute, FPT University
      minhpqn2@fe.edu.vn
      April 16, 2017
  2. Objectives of the lecture
      - Introduce research know-how and practices
      - Focus on the NLP, Machine Learning, and Data Science fields
      - Share my research experience in NLP
  3. Table of Contents
      1. What are empirical research methods for computer science?
      2. How to choose a good research topic?
      3. How to read a scientific paper?
      4. How to work with your advisor
      5. Doing research in the NLP field
         - What is NLP?
         - What is it like doing research in NLP?
         - How to do research in NLP?
         - How to choose NLP papers to read?
      6. Coding practices for NLP/Machine Learning research work
      7. My research stories
  4. Acknowledgements
      Much of the content in this lecture comes from the documents in the references:
      - (Alon, 2009) How To Choose a Good Scientific Problem
      - (Wilson et al., 2012) Best Practices for Scientific Computing
      - Paul Cohen: Empirical Methods for AI & CS
      - Other documents and blogs
  6. What does "empirical" mean?
      - Relying on observations, data, and experiments
      - Empirical work should complement theoretical work
        - Theories often have holes (e.g., how big is the constant term?)
        - Theories are suggested by observations
        - Theories are tested by observations
        - Conversely, theories direct our empirical attention
      - In addition, empirical means "wanting to understand the behaviour of complex systems"
        - In NLP, we may want to understand how features are correlated
  7. Why we need empirical methods
      - Theory-based science need not be all theorems
      - We do not know how a theory behaves under different conditions
        - Different data sets, different domains
  8. Empirical methods in CS/AI
      - Observe the data
      - Construct hypotheses
      - Test them with empirical experiments
      - Refine the hypotheses and modelling assumptions
  9. Kinds of data analysis
      - Exploratory data analysis (EDA): looking for patterns in data
      - Statistical inference from sample data
        - Testing hypotheses
        - Estimating parameters
      - Building mathematical models of datasets
      - Machine learning, data mining, ...
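To make "testing hypotheses" concrete, here is a minimal sketch of a paired approximate randomization (sign-flip permutation) test, which is one common way to compare two systems on the same test items. The lecture does not name a specific test, so the choice of test, the function name `permutation_test`, and its parameters are assumptions for illustration.

```python
import random


def permutation_test(scores_a, scores_b, num_rounds=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-item scores.

    Returns an estimated p-value for the null hypothesis that the two
    systems are interchangeable on each test item.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs two score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    at_least_as_extreme = 0
    for _ in range(num_rounds):
        # Under the null hypothesis, the sign of each difference is arbitrary.
        shuffled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(shuffled) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (num_rounds + 1)
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; it says nothing about whether the difference is large enough to matter.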
  10. Tools for data analysis
      - R programming language
      - Python: numpy, scipy, pandas, and matplotlib for data visualization
      - My biased opinions:
        - Statisticians like R; computer scientists often use Python
        - Python is much easier to learn than R
  11. Exercises
      - Install R: https://www.r-project.org
      - Download the data file ex1data1.txt from: http://tinyurl.com/m7bpp8d
      - The data file has two columns:
        - First column: the population of a city
        - Second column: the profit of a food truck in that city
      - In the R terminal, try the plot code (the x axis is the first column, so it carries the population label):
            df <- read.table("./ex1data1.txt", sep=",", header=FALSE)
            plot(df[,1], df[,2], xlab="Population of City in 10,000s", ylab="Profit in $10,000s")
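For readers who prefer Python, the same first look at a two-column file can be sketched with the standard library only. This is an illustrative sketch, not part of the original exercise: the helper names `read_two_columns` and `pearson` are made up here, and the four sample rows are invented stand-ins in the same comma-separated layout as ex1data1.txt.

```python
import csv
import io
import math


def read_two_columns(handle):
    """Read comma-separated (population, profit) rows into two float lists."""
    xs, ys = [], []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        xs.append(float(row[0]))
        ys.append(float(row[1]))
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented rows for illustration only. For the real file, replace the
# StringIO object with: open("ex1data1.txt", newline="")
sample = io.StringIO("1.0,2.0\n2.0,4.1\n3.0,5.9\n4.0,8.2\n")
population, profit = read_two_columns(sample)
correlation = pearson(population, profit)
```

With matplotlib installed, `plt.scatter(population, profit)` would give the Python counterpart of the R `plot` call.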
  12. R for data visualization
  14. Why do we need to choose a good research topic?
      - The "garbage in, garbage out" principle
      - You may work on a research topic for years
        - 1 year for a master thesis
        - 3 years or more for a Ph.D. dissertation
      - It is painful to work on something you find uninteresting
        - Lack of passion, motivation, and ideas
        - Much frustration and bitterness
  15. What is a good research topic? (Alon, 2009)
      Two dimensions of problem choice:
      - Feasibility: whether a problem is hard or easy
        - We can measure feasibility as the expected time to complete the project
        - Feasibility is a function of the skills of the students/researchers and of the technology in the lab
      - Interest: the increase in knowledge expected from the project
  16. Two-dimensional space of problem choice (1)
      Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
  17. Two-dimensional space of problem choice (2)
      Figure: The Feasibility-Interest Diagram for Choosing a Project (Alon, 2009)
  18. What is a good research topic?
      - Do many people care about the topic?
        - The research community, your supervisors, industry demand
      - Are you really interested in the topic?
        - The topic should be interesting to you, not only to others
        - Good signs: "ideas and questions that come back again and again to your mind for months or years."
  19. How to choose a good research topic: step by step
      - Choose the broad (general) topic
        - E.g., Machine Translation
      - Draw a hierarchy of research topics, starting from the broad topic
      - Review the literature to look for gaps in previous work
      - Choose the focused topic
        - E.g., Phrase-based Machine Translation
      - Find gaps in previous work on the focused topic
      - Form research questions within the focused topic
      - From the research questions, formulate the research problem
  20. Finding a research problem
      - Take your time to choose a good research topic
        - (Alon, 2009) Rule for new Ph.D. students and postdocs: "Do not commit to a problem before 3 months have elapsed"
        - For master students, take 1-2 months to choose the research topic before you start the research project
      - Join projects in your laboratory
        - Many thesis ideas come from projects you are involved in
  21. Developing your research ideas
      Where do research ideas come from?
      - Observations
        - Observing data, analysing data, discovering patterns in data
      - Reading papers, attending conferences, listening to talks
      - Techniques and methods from other disciplines and fields
      - Imagination
      - Suggestions from your advisor
  22. Reading papers, attending conferences
      - Choose good and relevant papers. Consider:
        - The impact factor of the journal. In the NLP field, choose papers from top conferences and journals (ACL/NAACL/EMNLP/COLING)
        - The Top 10 NLP Conferences: http://www.junglelightspeed.com/the-top-10-nlp-conferences
        - The reputations of the authors and their organizations
      - Do not only read: criticize the papers and find the gaps
  23. Techniques, methods from other fields
      - Broaden your view and your problem-solving methodologies by regularly reading articles in other fields
      - An example is the image captioning task
        - It needs techniques from both computer vision and NLP
  24. What happens after we choose a problem? (Alon, 2009)
  26. Two types of reading
      - Fast reading
        - Get and understand the basic ideas of the paper
        - Know which problem the paper attacks and how it solves it
        - Put the paper in the "big picture" of the field
        - Know how the paper differs from previous work
        - We mostly do fast reading when we survey the literature and choose a broad topic
      - Deep reading
        - Understand the details of the presented methods
        - Try to understand how the proposed method works
        - Criticize the paper and find its limitations
        - If you were the authors, how would you solve the problem? Could you propose alternative methods?
        - We mostly do deep reading when we look for a focused topic
  27. How to read a scientific paper (1)
      Michael J. Hanson. Efficient Readings of Papers in Science and Technology: http://tinyurl.com/qdebynz
  28. How to read a scientific paper (2)
      - Decide what to read
        - Read the title and abstract
        - Read it, file it, or skip it
      - Read for breadth
        - What did they do?
        - Skim the introduction, headings, graphics, definitions, conclusions, and bibliography
        - Consider the credibility
        - How useful is it?
        - Decide whether to go on
  29. How to read a scientific paper (3)
      - Read in depth
        - How did they do it?
        - Challenge their arguments
        - Examine assumptions
        - Examine methods
        - Examine statistics
        - Examine reasoning and conclusions
        - How can I apply their approach to my work?
      - Take notes
        - Make notes as you read
        - Highlight major points
        - Note new terms and definitions
        - Summarize tables and graphs
        - Write a summary
  30. Homework
      Choose one scientific article that you want to read in depth. Read it, take notes, and explain the ideas and methods presented in the paper to other students in a simple way.
      Note: you should be able to answer the following 3 questions.
      - What problem does the paper attack?
      - How does the paper differ from other existing papers?
      - What are the interesting points of the presented methods?
  32. Some basic rules
      - Assume your advisor is very busy, so it is up to you to follow up with her/him
        - Schedule meetings in advance and ask for them
      - Keep regular meetings with your advisor
        - Usually weekly
      - Do not just do what your advisor tells you to do
        - Rule of thumb: finish all your assigned tasks before working on your own ideas
  33. How to write a progress/status report
      Michael Ernst. Writing a progress/status report: http://tinyurl.com/zp7cdvt
      - Quote the previous week's plan
        - This helps you determine whether you accomplished your goals
      - State this week's progress
        - What you have accomplished, what you learned, what difficulties you overcame, what difficulties are still blocking you, your new ideas for research directions or projects, etc.
      - Give the next week's plan
        - A good format is a bulleted list
        - Try to make each goal measurable: there should be no ambiguity as to whether you were able to finish it
        - It's good to include longer-term goals as well
  34. Communicate with your advisor
      - Prepare some slides (3-4 slides) to make the discussion concrete
        - Send the materials at least 24 hours before the meeting day
        - Arrange the meeting in advance
      - Your advisor is not always right
        - You actually know more about your work than she/he does
        - If you have data, evidence, or proofs, do not hesitate to debate
      - Do not say "I guess" or "I think" when you explain something
        - Use data, evidence, and references instead
  36. What is Natural Language Processing?
      - A field of computer science, artificial intelligence, and computational linguistics
      - Its goal is to get computers to perform useful tasks involving human languages
        - Human-machine communication
        - Improving human-human communication, e.g., Machine Translation
        - Extracting information from texts
  37. Why is NLP interesting?
      - Language is involved in many human activities
        - Reading, writing, speaking, listening
      - Voice can be used as a user interface in many applications
        - Remote controls, virtual assistants like Siri, ...
      - NLP is used to extract insights from massive amounts of textual data
        - E.g., hypotheses from medical and health reports
      - NLP has many applications
      - NLP is hard!
  38. NLP problems
      - Fundamental problems
        - Word Segmentation
        - Part-of-speech tagging
        - Syntactic Analysis
        - Semantic Analysis
      - Application problems
        - Information Retrieval
        - Information Extraction
        - Question Answering
        - Text Summarization
        - Machine Translation
  39. What is it like doing research in NLP?
      - Empirical methods are used heavily in NLP
        - Relying on observations, data, experiments
      - The work consists of many loops of experiments
        - Identify the problem → Create ideas → Test the best idea → Analyse results → Identify the problem → Create ideas → ···
  40. What is it like doing research in NLP?
      - Many ideas do not work
        - Even so, we need to analyse the results to understand why they do not work, so that we can come up with new ideas
        - Try the next idea
      - Failures occur more often than successes
        - Try to increase the number of experiments
        - (No. of successes) = (No. of experiments) × (Success rate)
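The formula above is simple enough to check by hand, but a tiny sketch shows why the number of experiments is the lever you control. The numbers are hypothetical, chosen only for illustration, and `expected_successes` is a name introduced here.

```python
def expected_successes(num_experiments, success_rate):
    """(No. of successes) = (No. of experiments) x (Success rate)."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return num_experiments * success_rate


# Hypothetical numbers: same success rate, five times as many experiments.
slow_workflow = expected_successes(20, 0.1)
fast_workflow = expected_successes(100, 0.1)
```

With the success rate held fixed, running five times as many experiments gives five times as many expected successes, which is the argument for fast, automated experiment loops.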
  41. The typical working day of an NLP researcher
      - Observe data and analyse data/results (a lot)
      - Discuss ideas with colleagues
      - Do experiments (run programs) to test ideas
      - Read papers to keep up to date with mainstream research
      - Investigate new NLP/Machine Learning tools and libraries (less regularly)
  42. How to learn NLP?
      - Research starts from learning
      - Learn/review the background:
        - Probability and Statistics
        - Basic math (linear algebra, calculus)
        - Machine Learning
        - Programming
      - Read NLP textbooks
        - Jurafsky, D., & Martin, J.H. Speech and Language Processing: an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition.
        - Manning, C.D., & Schütze, H. Foundations of Statistical Natural Language Processing.
  43. How to learn NLP: Get your hands dirty
      - Practice with programming exercises:
        - 100 NLP drill exercises: https://github.com/minhpqn/nlp_100_drill_exercises
        - NLP Programming Tutorial, by Graham Neubig: http://www.phontron.com/teaching.php
      - Compete in Kaggle data science challenges (kaggle.com)
  44. Finding an NLP research problem
      - All the principles in the section "How to choose a good research topic" apply
      - Look for ideas from related fields
        - Linguistics
        - Machine learning: the mainstream in the NLP field is applying machine learning methods to NLP problems
        - Computer vision
      - Look at data
        - It is actually my daily task
  45. Basic rules to choose NLP papers
      READ:
      - Papers in top conferences and journals in NLP and other related fields (ACL/EMNLP/NAACL/EACL/COLING/CoNLL/...)
      - Workshops that focus on an NLP sub-field
      - Short papers at top conferences
      - PhD dissertations from top institutions/advisors
      - Papers with many citations
      - Textbooks from leading researchers
      For more information, see: The Top 10 NLP Conferences (http://www.junglelightspeed.com/the-top-10-nlp-conferences/)
  47. Why is coding important in NLP/ML research?
      - Much (most) NLP/ML research work is empirical
        - We need to do data analysis and run experiments to test our ideas
        - So we have to write programs
      - Even theorists should program, too
        - "Implementing your own algorithm is a good way of checking your work. If you aren't implementing your algorithm, arguably you're skipping a key step in checking your results." (Michael Mitzenmacher, http://mybiasedcoin.blogspot.com/2008/11/bugs.html)
  48. Why do we care about coding practices in NLP research?
      Bad coding practices cause problems:
      - You find errors in the experimental results right before the paper submission deadline
      - You cannot understand your own code after some months
      - You deleted intermediate results, so you cannot verify the code
      - You do not know any technique for verifying experimental results
      - You did not test the code, and then used the untested code for experiments
      - You spend a long time refactoring the code
      - You cannot get back the version that generated the best results
      - ...
  49. Why do we care about coding practices in NLP research?
      - Good coding practices speed up our research work
      - Recall that: (No. of successes) = (No. of experiments) × (Success rate)
  50. Best Practices for Scientific Computing (Wilson et al., 2012)
      - 1- Write programs for people, not computers
        - Readers of the code should not need to remember too much
        - Make the code easy to read: names should be consistent, distinctive, and meaningful
        - Break the coding work down into one-hour-long tasks
      - 2- Automate repetitive tasks
        - Scientists should rely on the computer to repeat tasks
        - Use a script to run your programs!
        - Use a build tool to automate scientific workflows
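As one way to "use a script to run your programs", the sketch below loops over a parameter grid and writes one CSV row per run, so no run depends on manual typing. Everything named here is hypothetical: `run_experiment` is a placeholder for your real training/evaluation code, and the parameter names are invented.

```python
import csv
import io
import itertools


def run_experiment(learning_rate, num_features):
    """Placeholder for a real run; returns a deterministic toy score."""
    return round(learning_rate * num_features, 4)


def run_grid(grid, out_handle):
    """Run every parameter combination and write one CSV row per run."""
    names = sorted(grid)  # fixed column order, independent of dict order
    writer = csv.writer(out_handle)
    writer.writerow(names + ["score"])
    num_runs = 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        writer.writerow(list(values) + [run_experiment(**params)])
        num_runs += 1
    return num_runs


# Hypothetical grid; in real use, pass a file opened with
# open("results.csv", "w", newline="") instead of a StringIO buffer.
buffer = io.StringIO()
total_runs = run_grid(
    {"learning_rate": [0.01, 0.1], "num_features": [100, 1000]}, buffer
)
```

Because the whole sweep is one function call, rerunning it after a bug fix costs nothing, and the CSV output can go straight into a plotting script.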
  51. Best Practices for Scientific Computing
      - 3- Use the computer to record history
        - Unique identifiers and version numbers for raw data records
        - Unique identifiers and version numbers for programs and libraries
        - The values of the parameters used to generate any given output
        - The names and version numbers of the programs used to generate those outputs
      - 4- Make incremental changes
        - Scientists cannot know what their programs should do next until the current version has produced some results
        - Work in small steps with frequent feedback and correction!
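A minimal sketch of "recording history" is to write a small metadata record next to every output. The field names, the helper names, and the placeholder commit identifier below are assumptions made for illustration; adapt them to whatever your project actually tracks.

```python
import hashlib
import json
import platform
import sys
import time


def data_fingerprint(data_bytes):
    """SHA-256 digest that identifies the exact bytes of the input data."""
    return hashlib.sha256(data_bytes).hexdigest()


def build_run_record(params, data_bytes, code_version):
    """Collect what is needed to trace an output back to its inputs."""
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "python_version": platform.python_version(),
        "command_line": sys.argv,
        "code_version": code_version,  # e.g. a version-control commit id
        "data_sha256": data_fingerprint(data_bytes),
        "params": params,
    }


# Toy inputs: a made-up parameter, a made-up corpus, a placeholder commit id.
record = build_run_record(
    {"smoothing": 1.0}, b"toy corpus\n", "hypothetical-commit-id"
)
serialized = json.dumps(record, indent=2, sort_keys=True)
```

Saving `serialized` as, say, a `.json` file beside each result file makes it possible to answer "which code, data, and parameters produced this number?" months later.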
  52. Best Practices for Scientific Computing
      - 5- Use a version control system: Git, Mercurial, Subversion
        - Push code to GitHub or Bitbucket
        - Everything that has been created manually should be put in version control
      - 6- Do not repeat yourself (or others)
        - At small scale, code should be modularized rather than copied and pasted
        - At large scale, scientific programmers should re-use code instead of rewriting it
  53. Best Practices for Scientific Computing
      - 7- Plan for mistakes
        - Write and run tests
          - Unit test: check the correctness of each single software unit
          - Integration test: check that units of code work correctly when combined
          - Regression test: rerun pre-existing tests after changes to the code to make sure that it has not regressed
        - Use an off-the-shelf unit testing library
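As an illustration of using an off-the-shelf unit testing library, here is a small sketch with Python's built-in `unittest` module. The `tokenize` function is a deliberately simple stand-in invented for this example, not a recommended tokenizer.

```python
import unittest


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class TokenizeTest(unittest.TestCase):
    def test_splits_on_whitespace(self):
        self.assertEqual(tokenize("Hello  NLP world"), ["hello", "nlp", "world"])

    def test_empty_string_gives_no_tokens(self):
        self.assertEqual(tokenize(""), [])


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TokenizeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

In a real project the test class would live in its own file and be run with `python -m unittest`; rerunning the same tests after every change is what turns them into regression tests.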
  54. Best Practices for Scientific Computing
      - 8- Optimize software only after it works correctly
        - Use a profiler to identify bottlenecks
      - Write code in the highest-level language possible
        - Python is the recommended language for research
        - Use a low-level programming language only when you are sure that the performance boost is needed
        - Use the highest-level programming language for rapid prototyping
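To show what "use a profiler to identify bottlenecks" can look like in Python, the sketch below profiles a deliberately inefficient function with the standard `cProfile` and `pstats` modules. The function `slow_bigram_count` and the helper `profile_call` are invented for this example.

```python
import cProfile
import io
import pstats


def slow_bigram_count(tokens):
    """Count bigrams in a deliberately quadratic way (list.count per bigram)."""
    bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
    return {bigram: bigrams.count(bigram) for bigram in set(bigrams)}


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries only
    return result, report.getvalue()
```

On a large token list the report would show most of the time inside `list.count`, which points to replacing it with `collections.Counter` rather than rewriting everything in a lower-level language.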
  55. Best Practices for Scientific Computing
      - 9- Document design and purpose, not mechanics
        - Document interfaces and reasons, not implementations
        - Do not write comments like this:
              i = i + 1  # Increment the variable 'i' by one.
        - Refactor the code instead of explaining how it works
        - Embed the documentation for a piece of software in that software
        - Use software to generate documentation
      - 10- Collaborate
        - Use pre-merge code reviews
        - Use an issue tracking tool
  56. Coding practices for NLP/ML research
      - All the general practices apply to NLP/ML research
      - Separate a process into small processes
        - Use pipelines in Unix/Linux
      - Make use of tools in experiments
        - Linux commands
        - NLP/ML tools
        - Libraries (json, nltk, matplotlib, scikit-learn, ...)
        - Algorithms
      - E.g., show word frequencies for a text file (this assumes one token per line, in the first tab-separated column):
            cat file_name.txt | cut -f1 | sort | uniq -c | sort -nr
      - Visualize experimental results and make demos of your research results
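The same frequency count can be sketched in Python with `collections.Counter`, which is handy when the counts feed into further analysis. As with the shell pipeline, this assumes one token per line in the first tab-separated column; unlike `cut`, this sketch skips empty lines. The function name is made up for the example.

```python
from collections import Counter


def first_field_frequencies(lines):
    """Count the first tab-separated field of every non-empty line.

    Returns (token, count) pairs sorted by decreasing count.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line:
            counts[line.split("\t")[0]] += 1
    return counts.most_common()


# Toy input; with a real file, pass an open file object instead of a list.
frequencies = first_field_frequencies(
    ["the\tDT\n", "cat\tNN\n", "the\tDT\n", "\n"]
)
```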
  57. Tools for visualizing research results
      - Tables (Microsoft Excel, HTML)
      - Charts (gnuplot, matplotlib, R)
      - Graphs (graphviz, Gephi, D3.js)
      - Texts (Microsoft Excel, HTML, brat: http://brat.nlplab.org/)
      - Code (google-code-prettify: https://github.com/google/code-prettify, Pygments: http://pygments.org/)
      - Demos (HTML, JavaScript, CSS, ...)
  58. Optimize code only after your ideas work
      - "Make it work. Make it right. Make it fast." (Kent Beck)
      - "Premature optimization is the root of all evil (or at least most of it) in programming." (Donald Knuth)
      - In NLP, always start with a simple, quick-and-dirty working version
        - E.g., bag-of-words features and the Naive Bayes algorithm for text classification tasks
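A "simple and dirty working version" of the bag-of-words plus Naive Bayes baseline fits in a few dozen lines of standard-library Python. This is a minimal sketch under stated assumptions: whitespace tokenization, multinomial Naive Bayes with add-one smoothing, and a made-up four-document training set. In practice a library such as scikit-learn would be the usual choice.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesBaseline:
    """Multinomial Naive Bayes over bag-of-words counts, add-one smoothing."""

    def fit(self, documents, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so drop them.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy training data, for illustration only.
model = NaiveBayesBaseline().fit(
    ["good great film", "great acting", "bad boring film", "boring plot"],
    ["pos", "pos", "neg", "neg"],
)
```

Once a baseline like this runs end to end, every later idea can be measured against it, which is the point of making it work before making it fast.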
  60. My profile
      - 6/2006: B.Sc. in Information Technology from University of Engineering and Technology, Vietnam National University, Hanoi
      - 3/2010: M.Sc. in Information Science from Japan Advanced Institute of Science and Technology
      - 3/2013: Ph.D. in Information Science from Japan Advanced Institute of Science and Technology
  61. Master program at JAIST
      - JAIST is a public graduate institute in Japan
        - Homepage: https://www.jaist.ac.jp/english
      - Three schools
        - Information Science
        - Knowledge Science
        - Material Science
      - All courses have an English version
        - You can study in English
  62. Master program at JAIST
      - Two-year full-time master program
      - First year:
        - Students are temporarily assigned to a laboratory and select their official lab after 3 months
        - The first year is mainly for taking courses and choosing the master research topic
        - Write the research proposal for the master thesis at the end of the first year
      - Second year:
        - Finish all remaining course work
        - Work on the master research project
        - Look for jobs (students who do not pursue a Ph.D.)
  63. How did I finish my master?
      - Six months before entering the master program
        - Took a Japanese course
        - Reviewed the background
        - Read NLP textbooks
      - First year:
        - Finished all course work
        - Joined a research project in my laboratory
        - Chose the research topic
      - Second year:
        - Did research
        - Attended one international conference
        - Thesis defense
  64. How I chose my master thesis topic
      - I did not even know how to choose a research topic (crying)
        - You should know how to choose
      - I was assigned the topic by my co-advisor
        - The topic was sentence insertion
        - I proposed a method to improve on the previous results
  65. Sentence insertion task
      - Task: automatically update a Wikipedia article by inserting new information into it
      - I proposed using word clusters to capture the meaning of words
  66. Research projects at FPT Technology Research Institute
      - NLP problems in chatbot development
        - Intent classification
        - Named entity recognition
        - FAQ generation from chat history and manuals
      Figure source: stanfy.com (http://tinyurl.com/mdfsa6h)
  67. Summary
      - Empirical research methods rely on observations, data, and experiments
      - Two dimensions of problem choice: Feasibility and Interest
      - Research starts from learning
        - Reading is very important in research
      - NLP research involves a lot of data analysis
      - Coding practices for NLP/ML research
  68. Checklist for your master thesis
      1. Is your work reproducible?
         - Package your code so that a single script can regenerate the results automatically
         - Freeze the final version
      2. Is your proposed method new?
      3. Did you revise your thesis many times?
         - Ask your advisors and friends to proofread it
      4. Did you understand previous work?
      5. Do you think you can pass the master thesis defense?
  69. Advice for your master thesis
      - Take time to choose your master research topic
        - Work on a research problem that you are interested in
      - Start soon
      - Follow up with your advisor
      - Spend time on regular literature review (reading papers)
      - Commit at least 2-3 hours per day to your master research
      - Look at your data before you start doing anything
      - Follow "best" coding practices for research
      - Use version control
        - Version everything that is manually created
      - Back up your work to the cloud
  70. References
      - Alon, U. (2009). How to choose a good scientific problem. Molecular Cell, 35(6), 726-728.
      - Aruliah, D.A., Brown, C.T., Davis, M., Guy, R.T., Hong, N.P., Haddock, S.H., Huff, K., Mitchell, I.M., Plumbley, M.D., Waugh, B., White, E.P., Wilson, G., & Wilson, P. (2014). Best Practices for Scientific Computing. PLoS Biology.
      - Ali Eslami. Patterns for Research in Machine Learning. http://arkitus.com/patterns-for-research-in-machine-learning

