The document provides an introduction to natural language processing (NLP) with R. It outlines topics including foundational NLP frameworks, working with text in R, regular expressions, n-gram models, and morphological analysis. Regular expressions are discussed as a pattern-matching device, along with their theoretical connection to finite state automata. N-gram models are introduced for recognizing and generating language based on the probabilities of word sequences. Morphological analysis is demonstrated by building a lexicon and applying regular expressions to extract agentive nouns.
Text analytics in Python and R with examples from Tobacco Control - Ben Healey
Ben has been doing data sciencey work since 1999 for organisations in the banking, retailing, health and education industries. He is currently on contracts with Pharmac and Aspire2025 (a Tobacco Control research collaboration) where, happily, he gets to use his data-wrangling powers for good.
This presentation focuses on analysing text, with Tobacco Control as the context. Examples include monitoring mentions of NZ's smokefree goal by politicians and examining media uptake of BATNZ's Agree/Disagree PR campaign. It covers common obstacles during data extraction, cleaning and analysis, along with the key Python and R packages you can use to help clear them.
Natural Language Processing in R (rNLP) - fridolin.wild
The introductory slides of a workshop given to the doctoral school at the Institute of Business Informatics of the Goethe University Frankfurt. The tutorials are available on http://crunch.kmi.open.ac.uk/w/index.php/Tutorials
Babar: Knowledge Recognition, Extraction and Representation - Pierre de Lacaze
Babar is a research project in the field of Artificial Intelligence. It aims to bridge Neural AI and Symbolic AI, and as such is implemented in three different programming languages: Clojure, Python and CLOS.
The Clojure component (Clobar) implements the graphical user interface to Babar. Examples of the Clojure Hiccup library and of interfacing Clojure to Javascript will be presented. The Python module (Pybar) implements the web crawling and scraping and the Neural Network aspects of Babar. The Word Embedding and LSTM (Long Short-Term Memory) components of Pybar will be described in detail. Finally, the Common Lisp module (Lispbar) implements the Symbolic AI aspect of Babar, which includes an English Language Parser and Semantic Networks implemented as an in-memory Hypergraph.
We will present each of these components and target individual aspects with code examples. Specifically, we will first present the web development and Neural Network components. Then the English Language Parser will be examined in detail. We will also present the knowledge extraction aspect and bridge this with the Neural Network component.
Ultimately we will argue that what can be termed "Neural AI" and "Symbolic AI" are not at odds with each other but rather complement each other. In summary, Artificial Intelligence is not a question of "brain" or "mind", but rather a question of "brain" and "mind".
This talk will cover various aspects of Logic Programming. We examine Logic Programming in the contexts of Programming Languages, Mathematical Logic and Machine Learning.
We will start with an introduction to Prolog and metaprogramming in Prolog. We will also discuss how miniKanren and Core.Logic differ from Prolog while maintaining the paradigms of logic programming.
We will then cover the Unification Algorithm in depth and examine the mathematical motivations, which are rooted in Skolem Normal Form. We will describe the process of converting a statement in first order logic to clausal form logic. We will also discuss the applications of the Unification Algorithm to automated theorem proving and type inference.
Finally we will look at the role of Prolog in the context of Machine Learning, known as Inductive Logic Programming (ILP). In that context we will briefly review Decision Tree Learning and its relationship to ILP. We will then examine Sequential Covering Algorithms for learning clauses in Propositional Calculus and then the more general FOIL algorithm for learning sets of Horn clauses in First Order Predicate Calculus. Examples will be given in both Common Lisp and Clojure for these algorithms.
Pierre de Lacaze has over 20 years’ experience with Lisp and AI based technologies. He holds a Bachelor of Science in Applied Mathematics and Computer Science and a Master’s Degree in Computer Science. He is the president of LispNYC.org
Introduction to the basics of Python programming (part 3) - Pedro Rodrigues
This is the 3rd part of a multi-part series that teaches the basics of Python programming. It covers list and dict comprehensions, functions, modules and packages.
Text mining and social network analysis of twitter data, part 1 - Johan Blomme
Twitter is one of the most popular social networks, through which millions of users share information and express views and opinions. The rapid growth of internet data drives the mining of the huge amounts of unstructured data generated, in order to uncover insights from it.
In the first part of this paper we explore different text mining tools. We collect tweets containing the "#MachineLearning" hashtag, prepare the data and run a series of diagnostics to mine the text contained in the tweets. We also examine topic modeling, which allows us to estimate the similarity between documents in a larger corpus.
How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects, but working with text is different from working with structured data. One representation of text that has worked well for many text mining and machine learning applications is the term frequency-inverse document frequency (TF-IDF) vector. In spite of the long-winded name, this method is easy to understand, performs well in many applications, and has been implemented in commonly used data science tools. This presentation will introduce TF-IDF and show examples of how to use TF-IDF for document classification and measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev - Databricks
Learning over images and understanding the quality of content play an important role at Pinterest. This talk will present a Spark based system responsible for detecting near (and far) duplicate images. The system is used to improve the accuracy of recommendations and search results across a number of production surfaces at Pinterest.
At the core of the pipeline is a Spark implementation of batch LSH (locality sensitive hashing) search capable of comparing billions of items on a daily basis. This implementation replaced an older (MR/Solr/OpenCV) system, increasing throughput by 13x and decreasing runtime by 8x. A generalized Spark Batch LSH is now used outside of the image similarity context by a number of consumers. Inverted index compression using variable byte encoding, dictionary encoding, and primitives packing are some examples of what allows this implementation to scale. The second part of this talk will detail training and integration of a Tensorflow neural net with Spark, used in the candidate selection step of the system. By directly leveraging vectorization in a Spark context we can reduce the latency of the predictions and increase the throughput.
Overall, this talk will cover a scalable Spark image processing and prediction pipeline.
Functions, Exceptions, Modules and Files
Functions: Difference between a Function and a Method, Defining a Function, Calling a Function, Returning Results from a Function, Returning Multiple Values from a Function, Functions are First Class Objects, Pass by Object Reference, Formal and Actual Arguments, Positional Arguments, Keyword Arguments, Default Arguments, Variable Length Arguments, Local and Global Variables, The Global Keyword, Passing a Group of Elements to a Function, Recursive Functions, Anonymous Functions or Lambdas (Using Lambdas with filter() Function, Using Lambdas with map() Function, Using Lambdas with reduce() Function), Function Decorators, Generators, Structured Programming, Creating our Own Modules in Python, The Special Variable __name__
Exceptions: Errors in a Python Program (Compile-Time Errors, Runtime Errors, Logical Errors), Exceptions, Exception Handling, Types of Exceptions, The Except Block, The assert Statement, User-Defined Exceptions, Logging the Exceptions
Files: Files, Types of Files in Python, Opening a File, Closing a File, Working with Text Files Containing Strings, Knowing Whether a File Exists or Not, Working with Binary Files, The with Statement, Pickle in Python, The seek() and tell() Methods, Random Accessing of Binary Files, Random Accessing of Binary Files using mmap, Zipping and Unzipping Files, Working with Directories, Running Other Programs from a Python Program
Kaggle Top 1% Solution: Predicting Housing Prices in Moscow - Vivian S. Zhang
This project was completed by students who graduated from the NYC Data Science Academy 12-week Data Science Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
Watch the project presentation: https://youtu.be/W530d2ZdbJE
Ranked #15 out of 3,274 teams on Kaggle Team Members - Brandy Freitas, Chase Edge and Grant Webb
Given 4 years of housing price data in a foreign market, predicting the following year's prices should be pretty straightforward, right? But what if, in that last year of data, the country's stock market, the value of its currency, and the price of its number 1 export all dropped by nearly 50%? And on top of all that, the country was slapped with economic sanctions by the EU and the US. This was Moscow in 2014 and, as you can see, it was anything but straightforward.
We overcame these challenges and, in two weeks of working together, achieved a top 1% ranking on Kaggle. Our success is a product of our in-depth data cleaning, our feature engineering and our approach to modeling. With a focus on interpretability and simplicity, we began modeling with linear regression and decision trees, which gave us a better understanding of the data. We then utilized more complicated models such as random forests and XGBoost, which ultimately produced our top submission.
Twitter: @NycDataSci
Learn with our NYC Data Science Program (weekend courses for working professionals and a 12-week full-time program for those advancing their careers into Data Science).
Our next 12-Week Data Science Bootcamp starts in June. (Deadline to apply is May 1st; all decisions will be made by May 15th.)
====================================
Max Kuhn, Director of Nonclinical Statistics at Pfizer, is also the author of Applied Predictive Modeling.
He will join us and share his experience with Data Mining with R.
Max is a nonclinical statistician who has been applying predictive models in the diagnostic and pharmaceutical industries for over 15 years. He is the author and maintainer of a number of predictive modeling packages, including caret, C50, Cubist and AppliedPredictiveModeling. He blogs about the practice of modeling on his website at http://appliedpredictivemodeling.com/blog
---------------------------------------------------------
His Feb 18th course is open for RSVP at NYC Data Science Academy.
Syllabus
Predictive Modeling using R
Description
This class will get attendees up to speed in predictive modeling using the R programming language. The goal of the course is to understand the general predictive modeling process and how it can be implemented in R. A selection of important models (e.g. tree-based models, support vector machines) will be described in an intuitive manner to illustrate the process of training and evaluating models.
Prerequisites:
Attendees should have a working knowledge of basic R data structures (e.g. data frames, factors, etc.) and language fundamentals such as functions and subsetting data. Understanding of the content contained in Appendix B, sections B1 through B8, of Applied Predictive Modeling (free PDF from the publisher [1]) should suffice.
Outline:
- An introduction to predictive modeling
- R and predictive modeling: the good and bad
- Illustrative example
- Measuring performance
- Data splitting and resampling
- Data pre-processing
- Classification trees
- Boosted trees
- Support vector machines
If time allows, the following topics will also be covered
- Parallel processing
- Comparing models
- Feature selection
- Common pitfalls
Materials:
Attendees will be provided with a copy of Applied Predictive Modeling[2] as well as course notes, code and raw data. Participants will be able to reproduce the examples described in the workshop.
Attendees should have a computer with a relatively recent version of R installed.
About the Instructor:
More about Max's work:
[1] http://rd.springer.com/content/pdf/bbm%3A978-1-4614-6849-3%2F1.pdf
[2] http://appliedpredictivemodeling.com
Data Science is concerned with the analysis of large amounts of data. When the volume of data is really large, it requires the use of cooperating, distributed machines. The most popular method of doing this is Hadoop, a collection of programs to perform computations on connected machines in a cluster. Hadoop began life as an open-source implementation of MapReduce, an idea first developed and implemented by Google for its own clusters. Though Hadoop's MapReduce is Java-based, and quite complex, this talk focuses on the "streaming" facility, which allows Python programmers to use MapReduce in a clean and simple way. We will present the core ideas of MapReduce and show you how to implement a MapReduce computation using Python streaming. The presentation will also include an overview of the various components of the Hadoop "ecosystem."
NYC Data Science Academy is excited to welcome Sam Kamin, who will be presenting an Introduction to Hadoop for Python Programmers as well as a discussion of MapReduce with Streaming Python.
Sam Kamin was a professor in the University of Illinois Computer Science Department. His research was in programming languages, high-performance computing, and educational technology. He taught a wide variety of courses, and served as the Director of Undergraduate Programs. He retired as Emeritus Associate Professor, and worked at Google until taking his current position as VP of Data Engineering in NYC Data Science Academy.
--------------------------------------
Our fall 12-Week Data Science bootcamp starts on Sept 21st, 2015. Apply now to get a spot!
If you are hiring Data Scientists, call us at (1)888-752-7585 or reach info@nycdatascience.com to share your openings and set up interviews with our excellent students.
Hack session for NYTimes Dialect Map Visualization (developed with R Shiny) - Vivian S. Zhang
Data Science Academy, Hack session, NY Times, Dialect Map, Data Science by R, Vivian S. Zhang; see www.nycdatascience.com for more details. Joint work by the data scientist team of SupStat Inc., a New York-based data analytics and visualization consulting firm.
A Hybrid Recommender with Yelp Challenge Data - Vivian S. Zhang
Developed by Chao Shi, Sam O'Mullane, Sean Kickham, Reza Rad and Andrew Rubino
Watch the project presentation: https://youtu.be/gkKGnnBenyk
This project was completed by students from NYC Data Science Academy's 12-Week Bootcamp. Learn more about the bootcamp: http://nycdatascience.com/data-science-bootcamp/
People make decisions on where to eat based on friends’ recommendations. Since they know you, their suggestions matter more than those of strangers.
For the capstone project, we built a hybrid Yelp recommendation system that can provide individualized recommendations based on your friends' reviews on the social network. We built the machine learning models using Spark, and set up a Flask-Kafka-RDS-Databricks pipeline that allows a continuous stream of user requests.
During the presentation, we will talk about the development framework and technical implementation of the pipeline.
Read on their project posts and code:
https://blog.nycdatascience.com/student-works/capstone/yelp-recommender-part-1/
https://blog.nycdatascience.com/student-works/yelp-recommender-part-2/
This project was completed by Scott Dobbins and Rachel Kogan, who enrolled in the NYC Data Science Academy's 12-Week Data Science Bootcamp. Learn more about the program: http://nycdatascience.com/data-science-bootcamp/
Given that both Wikipedia and comments sections of most websites are freely open to anyone to edit at any time, how has Wikipedia managed to remain such a useful resource while most comments sections are ridden with vandalism, ads, and other counterproductive user behavior?
We believe the answer is two-fold: 1) Wikipedia has an army of bots that quickly identify and revert vandalism so that the worst edits are usually never seen by people and the site generally maintains itself in a well-kempt state, and 2) Wikipedia has a strong community of administrators and other contributors who routinely clean the site’s flagged contents.
Vandalism is relatively easy to flag, though a few clever edits manage to stay on the site for a long time. What about site content problems that are more subjective, like bias? Wikipedia users do routinely manually flag pages with point-of-view (POV) issues, though with millions of pages and no machine-based approaches, the site can only manage to confidently maintain neutrality on the more well-trafficked pages.
Here we propose a solution to some of the more intractable content issues for Wikipedia and other sites using Natural Language Processing (NLP) and machine learning approaches. The sheer quantity of data managed by Wikipedia and similar sites requires distributed computing approaches, so we show here how Apache Spark can upgrade common algorithms to run on massive data sets.
Using Machine Learning to aid Journalism at the New York Times - Vivian S. Zhang
This talk was presented to NYC Open Data Meetup Group on Nov 11, 2014.
Speaker:
Daeil Kim is currently a data scientist at the Times and is finishing up his Ph.D. at Brown University on work related to developing scalable inference algorithms for Bayesian Nonparametric models. His work at the Times spans a variety of problems related to the company's business interests and audience development, as well as developing tools to aid journalism.
Topic:
This talk will focus mostly on how machine learning can help with problems that crop up in journalism. We'll begin by talking about using popular supervised learning algorithms such as regularized Logistic Regression to assist a journalist's work in uncovering insights into a story regarding the recall of Takata airbags in cars. Afterwards, we'll look at using topic modeling to deal with large document dumps generated from FOIA (Freedom of Information Act) requests, and at Refinery, a simple web-based tool that eases the implementation of such tasks. Finally, if there is time, we will go over how topic models have been extended to assist in the problem of designing an efficient recommendation engine for text-based content.
Winning data science competitions, presented by Owen Zhang - Vivian S. Zhang
Meetup event hosted by NYC Open Data Meetup, NYC Data Science Academy. Speaker: Owen Zhang. Event Info: http://www.meetup.com/NYC-Open-Data/events/219370251/
- Basics: BEAM Ecosystem and Erlang Programming Language
- Functional Programming: How to Make Your Code Beautiful
- Concurrency: You need to be concurrent to survive in the parallel world
- Fault Tolerance: Keep calm and let it crash
- Soft Real-time: Accept the reality, be real, be yourself
- Software Architecture: How to look nice in a bigger picture
NLTK: Natural Language Processing made easy - outsider2
Natural Language Toolkit (NLTK), an open-source library that simplifies the implementation of Natural Language Processing (NLP) in Python, is introduced. It is useful for getting started with NLP and also for research and teaching.
This document lists the reasons why our past alumni chose NYC Data Science Academy over other programs.
Machine Learning Bootcamp is our flagship program and is well received by our community.
---------------------------------------------------------------
Come join our meet-up and learn how easily you can use R for advanced Machine Learning. In this meet-up, we will demonstrate how to understand and use XGBoost for Kaggle competitions. Tong is in Canada and will do a remote session with us through Google Hangout.
---------------------------------------------------------------
Speaker Bio:
Tong is a data scientist at Supstat Inc and also a master's student in Data Mining. He has been an active R programmer and developer for 5 years. He is the author of the XGBoost R package, one of the most popular and contest-winning tools on kaggle.com nowadays.
Pre-requisites (if any): R / Calculus
Preparation: A laptop with R installed. Windows users might need to have RTools installed as well.
Agenda:
Introduction to XGBoost
Real World Application
Model Specification
Parameter Introduction
Advanced Features
Kaggle Winning Solution
Event arrangement:
6:45pm Doors open. Come early to network, grab a beer and settle in.
7:00-9:00pm XgBoost Demo
Reference:
https://github.com/dmlc/xgboost
NYC Open Data 2015: Advanced scikit-learn (expanded) - Vivian S. Zhang
Scikit-learn is a machine learning library in Python that has become a valuable tool for many data science practitioners.
This talk will cover some of the more advanced aspects of scikit-learn, such as building complex machine learning pipelines, model evaluation, parameter search, and out-of-core learning.
Apart from metrics for model evaluation, we will cover how to evaluate model complexity and how to tune parameters with grid search and randomized parameter search, including their trade-offs. We will also cover out-of-core text feature processing via feature hashing.
---------------------------------------------------------
Andreas is an Assistant Research Scientist at the NYU Center for Data Science, building a group to work on open source software for data science. Previously he worked as a Machine Learning Scientist at Amazon, working on computer vision and forecasting problems. He is one of the core developers of the scikit-learn machine learning library, and maintained it for several years.
Material will be posted here:
https://github.com/amueller/pydata-nyc-advanced-sklearn
Blog:
peekaboo-vision.blogspot.com
Twitter:
https://twitter.com/t3kcit
R003 Laila restaurant sanitation report (NYC Data Science Academy, Data Scienc... - Vivian S. Zhang
NYC Data Science Academy, Data Science by R Intensive Beginner level, R003 student, Laila, presented on restaurant sanitation report using NYC Open Data Set, see her blog post at http://nycdatascience.com/2014/05/pizza-everyone-loves-pizza/
R003 Jiten South Park episode popularity analysis (NYC Data Science Academy, D... - Vivian S. Zhang
NYC Data Science Academy, Data Science by R Intensive Beginner level, R003 student, Jiten, presented how he scraped the dataset and analyzed South Park episode popularity.
Introducing Natural Language Processing (NLP) with R
1. Introducing NLP with R
Charlie Redmon | SupStat Analytics
Copyright Supstat Inc. All Rights Reserved
2. Outline
· Introduction to NLP
· Foundational Frameworks
· Working with text in R
· Regular Expressions
  - As a pattern-matching device
  - Theoretical connection with finite state automata
  - Application in morphological analysis
· N-gram models
  - Recognizing language
  - Generating language
· Further reading
3. What is NLP?
· Natural Language Processing
  - Briefly: building models to facilitate human-computer interaction through language
  - We say natural language here to distinguish languages like English, Hungarian, and Bengali from computer languages and other invented communication systems (e.g. Morse code)
· Major sub-disciplines:
  - Speech Recognition/Synthesis
  - Computational Morphology (word structure)
  - Lexical Semantics (word meaning)
  - Computational Syntax (phrase/sentence structure)
  - Compositional Semantics (phrase/sentence meaning)
  - Information Retrieval
4. Why R?
· R has powerful text processing capabilities
· Many useful NLP-related packages
· Many of the more sophisticated procedures in NLP generalize to statistical models, which is where R really excels
5. Foundational NLP Frameworks
· Turing
  - Turing Machine: Finite State Automaton, Finite State Transducer
· Kleene
  - Regular Expressions
· Chomsky
  - Regular Languages and their relation to natural languages
· Markov
  - N-gram models
  - HMMs
· Shannon
  - Information Theory
  - Noisy Channel, Entropy models
6. The Workflow
1. Import and manipulate text in R
2. Create data structures facilitating NLP operations
3. Model implementation:
   · Morphological parsing
   · N-gram parsing
   · N-gram language generation
   · ...
7. Importing text into R
· Primary importing functions: scan(), readLines()

monty_text = scan('data/grail.txt', what="character", sep="", quote="")
monty_text[1:6]

[1] "SCENE" "1:" "[wind]" "[clop" "clop" "clop]"

malayalam_text = scan('data/mathrubhumi_2014-10_full.txt',
                      what="character", sep="", quote="")
malayalam_text[15:20]

[1] "#Date:" "01-10-2014"
[3] "#----------------------------------------" "അേമരിkയിെലtിയ"
[5] "+പധാനമ+nി" "നേര+nേമാദി"

· Why might this data structure be a problem for many natural language structures?
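One answer: with sep="", scan() returns one element per whitespace-delimited token, so no pattern can match across a token boundary. A minimal sketch of the fix, collapsing the token vector into a single string (the same paste() step the Hindi example below applies), which the sentence-level segmentation on the following slides assumes:

#collapse the token vector into one string so that patterns
#spanning whitespace (sentences, phrases) can match
monty_text = paste(monty_text, collapse=" ")
length(monty_text)
#[1] 1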
9. Regular Expressions

SYMBOL  MEANING                            EXAMPLE
[]      Disjunction (set)                  /[Gg]oogle/ = Google, google
?       0 or 1 of the preceding char       /savou?r/ = savor, savour
*       0 or more of the preceding char    /hey!*/ = hey, hey!, hey!!, ...
\       Escape character                   /hey\?/ = hey?
+       1 or more of the preceding char    /a+h/ = ah, aah, aaah, ...
{n,m}   n to m repetitions                 /a{1,4}h{1,3}/ = aahh, ahhh, ...
.       Wildcard (any character)           /#.*/ = #rstats, #uofl, ...
()      Grouping (conjunction)             /(ha)+/ = ha, haha, hahaha, ...
[^ ]    NOT (negates bracketed chars)      /[^#.*]/ = everything but #, ., *
10. Regular Expressions

SYMBOL  MEANING                            EXAMPLE
[x-y]   Match characters from 'x' to 'y'   /[A-Z][1-9]/ = A1, Q8, X5, ...
\w      Word character (alphanumeric)      /\w's/ = that's, Jerry's, ...
\W      Non-word character
\d      Digit character (0-9)              /\d{3}/ = 137, 254, ...
\D      Non-digit character
\s      Whitespace                         /\w+\s+\w+/ = I am, I am, ...
\S      Non-whitespace
\b      Word boundary                      /\bthe\b/ = the, not then
\B      Non-word boundary
^       Beginning of line                  /^[a-z]/ = non-capitalized beginning
$       End of line                        /#.*$/ = hashtags at end of line
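One practical note when moving these patterns into R: inside an R string literal the backslash itself must be escaped, so the regex \d{3} is written "\\d{3}". A minimal sketch (the example strings are illustrative, not from the slides):

#three consecutive digits: regex \d{3}, written "\\d{3}" in R
grepl("\\d{3}", c("room 137", "no digits here"), perl=TRUE)
#[1]  TRUE FALSE

#word boundary: matches 'the' but not 'then'
grepl("\\bthe\\b", c("the knight", "then"), perl=TRUE)
#[1]  TRUE FALSE

#extract a hashtag at the end of a line
sub("^.*(#\\S+)$", "\\1", "learning regex #rstats", perl=TRUE)
#[1] "#rstats"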
11. Manual segmentation
The advantage of having all the text in a single element is that we can now split the text into different-sized segments for different kinds of natural language tasks.

#sentence level
pattern = "(?<=[.?!])\\s+"
monty_sentences = strsplit(monty_text, split=pattern, perl=T)
monty_sentences = unlist(monty_sentences)
monty_sentences[5:8]

[1] "King of the Britons, defeator of the Saxons, sovereign of all England!"
[2] "SOLDIER #1: Pull the other one!"
[3] "ARTHUR: I am, ..."
[4] "and this is my trusty servant Patsy."
12. Manual segmentation
Of course, depending on the language you're working with, you might have different definitions of sentence boundaries. For example, Hindi uses what's called a danda marker, ।, in place of a period.

hindi_text = scan('data/hindustan_full.txt', what="character", sep="")
hindi_text = paste(hindi_text, collapse=" ")
pattern = "(?<=[।?!])\\s+"
hindi_sentences = strsplit(hindi_text, split=pattern, perl=T)
hindi_sentences = unlist(hindi_sentences)
hindi_sentences[5:8]

[1] "व"# मन# को लोकसभा चuनाव . करारी हार का सामना करना पड़ा था और उसका खाता भी नह9 खuल पाया था।"
[2] "लोकसभा चuनाव . भाजपा और िशव#ना > कuछ छोA दलo D साथ िमलकर 48 . # 42 सीAE जीत9।"
[3] "महाराFG . िशव#ना अब तक भाजपा D बड़e भाई की भLिमका iनभाती रही थी।"
[4] "इन दोनo D बीच उस वOत अलगाव Qआ S जब भाजपा TU . नVU मोदी D >तWXव . पLणZ बQमत D साथ स[ासीन S।"
13. Manual segmentation
We can also split the original text according to word boundaries.

#word level
pattern = "[()\\[\\]\":;,.?!-]*\\s+[()\\[\\]\":;,.?!-]*"
monty_words = strsplit(monty_text, split=pattern, perl=T)
monty_words = unlist(monty_words)
monty_words[5:30]

[1] "clop" "clop" "KING" "ARTHUR" "Whoa" "there" "clop" "clop"
[9] "clop" "SOLDIER" "#1" "Halt" "Who" "goes" "there" "ARTHUR"
[17] "It" "is" "I" "Arthur" "son" "of" "Uther" "Pendragon"
[25] "from" "the"
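A rougher alternative (not from the slides, just a sketch) is to split on runs of non-word characters; it is shorter, but also breaks apart tokens like "#1" and contractions like "I'm":

#split on runs of anything that is not a word character
rough_words = unlist(strsplit(monty_text, "\\W+", perl=T))
rough_words = rough_words[rough_words != ""]  #drop any empty leading token
rough_words[1:6]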
14. Building a Lexicon
For many NLP tasks it is useful to have a dictionary, or lexicon, of the language you're working with. Other researchers may have already built a text-formatted lexicon of the language you're using, but nevertheless it's useful to see how we might build one.

#convert all words to lowercase
monty_words = tolower(monty_words)
monty_words[1:9]

[1] "scene" "1" "wind" "clop" "clop" "clop" "king" "arthur" "whoa"

#convert vector of tokens to set of unique words
monty_lexicon = unique(monty_words)
monty_lexicon[1:8]

[1] "scene" "1" "wind" "clop" "king" "arthur" "whoa" "there"
15. Building a Lexicon

length(monty_words)

[1] 11213

length(monty_lexicon)

[1] 1889
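The same token vector also gives corpus frequencies essentially for free, which is often the next thing a lexicon is used for. A minimal sketch:

#rank words by how often they occur in the corpus
monty_freq = sort(table(monty_words), decreasing=TRUE)
monty_freq[1:5]  #the top entries are typically function words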
16. Morphological Analysis
Now that we have our lexicon we can start to model the internal structure of the words in our corpus. Formally, morphological rules can be modeled as an FSA. Here's a simple example from Jurafsky and Martin (2000):
[figure: finite-state automaton example from Jurafsky and Martin, not reproduced]
17. Morphological Analysis
But since it has already been proven that all regular expressions can be modeled as FSAs, and vice versa, we can use the grep utilities in R to handle this process. First let's see if we can extract all the agentive nouns (e.g. builder, worker, shopper, etc.).

monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T)
monty_agents[1:30]

[1] "soldier" "uther" "other" "master" "together" "winter"
[7] "plover" "warmer" "matter" "order" "creeper" "under"
[13] "cart-master" "customer" "better" "over" "bother" "ever"
[19] "officer" "her" "water" "power" "mer" "villager"
[25] "whether" "cider" "e'er" "prisoner" "shelter" "wiper"

· This isn't exactly what we want. How can we improve our results?
18. Morphological Analysis
Take advantage of the lexicon: keep only those -er words whose stem is itself a word we have seen.

monty_agents = grep('.+er$', monty_lexicon, perl=T, value=T)
new_monty_agents = character(0)
for (i in 1:length(monty_agents)) {
  word = monty_agents[i]
  stem_end = nchar(word) - 2  #strip the final "er"
  stem = substr(word, 1, stem_end)
  if (is.element(stem, monty_lexicon)) {
    new_monty_agents[i] = word
  }
}
new_monty_agents = new_monty_agents[!is.na(new_monty_agents)]
new_monty_agents

[1] "warmer" "creeper" "longer" "nearer" "higher" "killer" "bleeder" "keeper"
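The same filter can be written without the explicit loop; a vectorized sketch of the identical logic:

#strip the final "er" from each candidate, then keep the candidates
#whose stem also appears in the lexicon
stems = substr(monty_agents, 1, nchar(monty_agents) - 2)
new_monty_agents = monty_agents[stems %in% monty_lexicon]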
19. Malayalam FSA
[figure: finite-state automaton for Malayalam, not reproduced]
20. N-gram Models
· Based on the Markov model
· At their heart, n-grams answer the question: "What is the likelihood of one word (or character, phrase, sentence...) following another word or sequence of words?"
· The kernel equation:

  P(w_n | w_1^{n-1}) ≈ P(w_n | w_{n-N+1}^{n-1})

  where N is the N in N-gram (i.e. the number of words used to build the grammar)
· For example, if we have the string, "We are the Knights who say, 'Ni!'", in the trigram model we're moving along the string asking: P(Knights | are the), P(who | the Knights), ...
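Concretely, the conditional probability is estimated from counts, e.g. for a bigram model P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1}). A minimal sketch over the word vector built earlier (the helper function is illustrative, not part of any package):

#MLE bigram probability: count(w1 w2) / count(w1)
bigram_prob = function(w1, w2, words) {
  pairs = paste(head(words, -1), tail(words, -1))  #consecutive word pairs
  sum(pairs == paste(w1, w2)) / sum(words == w1)
}
bigram_prob("trusty", "servant", monty_words)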
21. N-gram Models

library(ngram)
monty_bigram = ngram(monty_text, n=2)
get.ngrams(monty_bigram)[1:10]

[1] "cannot tell," "away. Just" "not 'is'." "bowels unplugged,"
[5] "well, Arthur," "[twang] Wayy!" "HERBERT: B--" "no. Until"
[9] "trade. I" "down, fell"

monty_trigram = ngram(monty_text, n=3)
get.ngrams(monty_trigram)[1:10]

[1] "a good spanking!" "Oooh! GALAHAD: My" "is the capital" "to you no"
[5] "Who's that then?" "you get back." "no arms left." "want... a shrubbery!"
[9] "Shut up! Um," "to a successful"
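The ngram package can also report how often each n-gram occurs: get.phrasetable() returns the n-grams with their counts and proportions, sorted by frequency. A brief sketch:

#most frequent bigrams in the corpus
head(get.phrasetable(monty_bigram), 5)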
22. N-gram Models

print(monty_bigram, full=TRUE)

cannot tell,
suffice {1} |
away. Just
ignore {1} |
not 'is'.
HEAD {1} | You {2} | Not {1} |
bowels unplugged,
And {1} |
well, Arthur,
for {1} |
[twang] Wayy!
[twang] {1} |
23. N-gram Models

print(monty_trigram, full=TRUE)

a good spanking!
GIRLS: {1} |
Oooh! GALAHAD: My
God! {1} |
is the capital
of {1} |
to you no
more, {1} |
Who's that then?
CART-MASTER: {1} |
you get back.
GUARD {1} |
24. N-gram Models

babble(monty_bigram, 8)

[1] "must go too. OFFICER #1: Back. Right away. "

babble(monty_bigram, 8)

[1] "I'll do you up a treat mate! GALAHAD: "

babble(monty_bigram, 8)

[1] "from just stop him entering the room. GUARD "
25. N-gram Models

babble(monty_trigram, 8)

[1] "were still no nearer the Grail. Meanwhile, King "

babble(monty_trigram, 8)

[1] "the Britons. BEDEVERE: My liege! I would be "

babble(monty_trigram, 8)

[1] "Shh! VILLAGER #2: Wood! BEDEVERE: So, why do "
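babble() samples at random, so each call produces a different sequence; the function also accepts a seed argument for reproducible output (a sketch, assuming the current ngram package API):

#fix the seed so the generated babble is reproducible across runs
babble(monty_trigram, genlen=8, seed=10)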
26. Further Reading
· Jurafsky and Martin (2008), Speech and Language Processing
· Manning, Raghavan, and Schütze (2008), Introduction to Information Retrieval
· Gries (2009), Quantitative Corpus Linguistics with R