NLP & Bigdata: Motivation and Action

1. NLP & Bigdata: Motivation and Action
Sarath P R, sarath.amrita@gmail.com
IIIT-MK Thiruvananthapuram, November 09, 2013

2. About me
I work as Technical Lead - Bigdata. I like to develop software applications for good reasons. I am an independent Data Journalist at DScribe.IN. I hold a Masters in Computer Science. I like to travel and meet people.

3. Agenda
Introduction
Full text Search and Index
Document Clustering
Representing Data
Stanford NLP
R and Weka
Social Media and Sentiment Analysis
Introduction to Bigdata
Current Trends
Conclusion

4. Introduction
Sorry! No definitions copied here for NLP. In case you need a definition, tell me. Otherwise we will 'see' now what NLP is.

6. Introduction - 2 minutes Targit Video
Watch the Targit video here: http://youtu.be/32KE0rbGZ9c

7. So what is he (the Targit CTO) saying?
“Calling your system, and getting delivered an analysis is right around the corner.”
Go to Targit's website, http://targit.com. You will see a lion standing on the front page. They say “Targit is a courage Company”.
That was all about motivation. No hidden agenda to promote Targit!

12. Introduction - Innovation
What we just saw is one aspect of NLP. What is it? It is speech recognition and analytics. And what did they do? It is innovation!

17. Introduction - Search Engines & Information Retrieval
Tell me your opinion. The question follows: is Google an NLP company? Yes, they are. The biggest one!
So how does Google work? I mean the search engine. Where do they bring the search results from? The answer is three things: crawler, index and algorithms.
Now we will look at a few NLP, machine learning and analytics topics in detail.

24. Full text Search and Inverted Index
In information retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database.
When the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching.
The indexing stage scans the text of all the documents and builds a list of search terms, called an index.
In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

27. Inverted index
It is the most popular data structure used in document retrieval systems. It is similar to the index in the back of a book. It is used on a large scale, for example in search engines.

28. Inverted index
[Figure: building an inverted index.] Reference: http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

29. Index vs Inverted Index
Index: a forward index (or just index) is the list of documents, and which words appear in them.
Inverted index: the inverted index is the list of words, and the documents in which they appear.

31. Exercise
Have a look at the table below.
Document | Words
Doc 1 | talk, iiitmk, campus, nlp
Doc 2 | algorithm, bigdata, nlp
Doc 3 | researchers, talk
What kind of index is it? Create an inverted index from this forward index.

35. Answer
Inverted index:
Word | Documents
talk | Doc 1, Doc 3
iiitmk | Doc 1
campus | Doc 1
nlp | Doc 1, Doc 2
algorithm | Doc 2
bigdata | Doc 2
researchers | Doc 3
Search: what results would a search query like 'nlp talk' deliver?
Result: Doc 1 contains both terms; Doc 2 (nlp) and Doc 3 (talk) each contain one of them.
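The exercise can be sketched in a few lines of Python. This is only an illustration, not how Lucene or any real engine stores its index; the function names `invert` and `search` and the "or"/"and" query modes are assumptions made for the example.

```python
from collections import defaultdict

# Forward index from the exercise: document -> words it contains.
forward_index = {
    "Doc 1": ["talk", "iiitmk", "campus", "nlp"],
    "Doc 2": ["algorithm", "bigdata", "nlp"],
    "Doc 3": ["researchers", "talk"],
}

def invert(forward):
    """Build word -> sorted list of documents containing that word."""
    inverted = defaultdict(set)
    for doc, words in forward.items():
        for word in words:
            inverted[word].add(doc)
    return {word: sorted(docs) for word, docs in inverted.items()}

def search(inverted, query, mode="or"):
    """Answer a query by looking only at the inverted index.

    mode="or" returns documents containing any query term,
    mode="and" returns documents containing every query term.
    """
    postings = [set(inverted.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    combine = set.union if mode == "or" else set.intersection
    return sorted(combine(*postings))

inverted_index = invert(forward_index)
print(inverted_index["nlp"])                      # ['Doc 1', 'Doc 2']
print(search(inverted_index, "nlp talk", "or"))   # ['Doc 1', 'Doc 2', 'Doc 3']
print(search(inverted_index, "nlp talk", "and"))  # ['Doc 1']
```

Note that the search never touches the document text, only the index, which is the point of splitting the work into an indexing stage and a search stage.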

38. Apache Lucene Demo
Which tool should we try for indexing and searching?
Apache Lucene is a full-featured text search engine library. It is written entirely in Java and is open source. It offers scalable, high-performance indexing and powerful, accurate and efficient search algorithms.
Interesting features of Lucene Core:
Allows simultaneous update and searching
Powerful query types like phrase queries, wildcard queries, range queries etc.
Fielded searching (e.g. title, author, contents)

40. Document Clustering
Definition: the process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters.
Clustering is applicable in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Clustering is an example of unsupervised learning in machine learning. Cluster analysis can be achieved by various algorithms.

42. The Library Example
Reference: I found this example in the book Mahout in Action by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman.
Inside the library: a library has thousands of books, and the books are not arranged in any particular order.
Brainstorm! Will you enjoy finding a book you want there? If not, give me some solutions.

47. Solutions
What about sorting the books alphabetically by title? Yes, for readers searching for a book by title, that will help.
What if someone is looking for books on some general subject, for example health? Grouping books by topic will be more useful in this case.
But how would you even begin this grouping? You will start reading books one by one and group them! Good work :-)

53. Steps in Clustering
Clustering involves the following:
An algorithm, the method used to group the books together.
A notion of both similarity and dissimilarity. In the library example we relied on our assessment of which books belonged in an existing stack and which should start a new one.
A stopping condition. In the library example, this might have been the point beyond which books can't be stacked anymore, or when the stacks are already quite dissimilar.

57. K-Means Algorithm
Let's see an algorithm first, and after that how to automate the grouping of books in the library example.
K-Means: k-means clustering aims to partition n observations into k clusters. It takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low.
Cluster similarity is measured with respect to the mean value of the objects in a cluster, which can be viewed as the cluster's centroid.

59. K-Means Example
Reference: Teknomo, Kardi. K-Means Clustering Tutorials. http://people.revoledu.com/kardi/tutorial/kMean
Data:
Object | Attribute 1 (X): weight index | Attribute 2 (Y): pH
Medicine A | 1 | 1
Medicine B | 2 | 1
Medicine C | 4 | 3
Medicine D | 5 | 4
Problem: we have 4 objects, each having 2 attributes. We also know beforehand that these objects belong to two groups of medicine (cluster 1 and cluster 2). The problem now is to determine which medicines belong to cluster 1 and which medicines belong to cluster 2.

61. Steps in K-means
Iterate until stable (i.e. no object moves group):
1 Determine the centroid coordinates
2 Determine the distance of each object to the centroids
3 Group the objects based on minimum distance (find the closest centroid)
Each medicine represents one point with two features (X, Y). We can represent it as a coordinate in a feature space.

63. Euclidean distance
Each clustering problem is basically based on a distance between points. Euclidean distance is the most commonly used distance measure.
Mathematically, the Euclidean distance between points with coordinates (x1, y1) and (x2, y2) is $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.

65. Iteration 0
Initial value of centroids: take medicine A and medicine B as the first centroids. Let c1 and c2 denote the coordinates of the centroids; then c1 = (1,1) and c2 = (2,1).
Objects-centroids distance: calculate the distance from each cluster centroid to each object. The distance matrix using Euclidean distance at iteration 0 is
$$D^0 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 1 & 0 & 2.83 & 4.24 \end{pmatrix}$$

67. Iteration 0
Each column in the distance matrix corresponds to one object (A, B, C, D). The first row of the distance matrix is the distance of each object to the first centroid, and the second row is the distance of each object to the second centroid.
For example, the distance from medicine C = (4, 3) to the first centroid c1 = (1,1) is $\sqrt{(4-1)^2 + (3-1)^2} = \sqrt{13} \approx 3.61$. Similarly, its distance to the second centroid c2 = (2,1) is $\sqrt{(4-2)^2 + (3-1)^2} = \sqrt{8} \approx 2.83$.
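The iteration-0 distance matrix can be checked with a short Python sketch. The helper name `euclidean` and the rounding to two decimals are choices made for this illustration only.

```python
import math

# Medicine data from the example: name -> (weight index, pH).
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}
centroids = [(1, 1), (2, 1)]  # c1 = medicine A, c2 = medicine B at iteration 0

def euclidean(p, q):
    """Euclidean distance between two points in the X-Y plane."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# One row per centroid, one column per object (A, B, C, D).
distance_matrix = [
    [round(euclidean(c, p), 2) for p in points.values()]
    for c in centroids
]
for row in distance_matrix:
    print(row)
# [0.0, 1.0, 3.61, 5.0]
# [1.0, 0.0, 2.83, 4.24]
```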

68. Iteration 0
Objects clustering: we assign each object to a group based on the minimum distance. Thus medicine A is assigned to group 1, and medicines B, C and D to group 2.
Group matrix: an element of the group matrix is 1 if and only if the object (column) is assigned to that group (row).
$$G^0 = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \end{pmatrix}$$

70. Iteration 1
Determine new centroids: compute the new centroid of each group based on its new members. Group 1 has only one member, so its centroid remains c1 = (1,1). Group 2 now has three members, so its centroid is the average coordinate of the three members: $c_2 = \left(\frac{2+4+5}{3}, \frac{1+3+4}{3}\right) = \left(\frac{11}{3}, \frac{8}{3}\right)$.

72. Iteration 1
Objects-centroids distance: compute the distance of all objects to the new centroids. The distance matrix at iteration 1 is
$$D^1 = \begin{pmatrix} 0 & 1 & 3.61 & 5 \\ 3.14 & 2.36 & 0.47 & 1.89 \end{pmatrix}$$

73. Iteration 1
Objects clustering: again we assign each object based on the minimum distance. Based on the new distance matrix, we move medicine B to group 1 while all the other objects stay where they are.
Group matrix at iteration 1:
$$G^1 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

75. Iteration 2
Determine new centroids: compute the new centroid of each group based on its new members. Group 1 and group 2 both have two members, thus the new centroids are $c_1 = \left(\frac{1+2}{2}, \frac{1+1}{2}\right) = (1.5, 1)$ and $c_2 = \left(\frac{4+5}{2}, \frac{3+4}{2}\right) = (4.5, 3.5)$.

77. Iteration 2
Objects-centroids distance: the distance matrix at iteration 2 is
$$D^2 = \begin{pmatrix} 0.5 & 0.5 & 3.20 & 4.61 \\ 4.30 & 3.54 & 0.71 & 0.71 \end{pmatrix}$$

78. Iteration 2
Objects clustering: again we assign each object based on the minimum distance.
Group matrix at iteration 2:
$$G^2 = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

80. Results
We obtain the result that G2 = G1. Comparing the grouping of the last iteration with this iteration reveals that the objects do not move group anymore. Thus the k-means computation has reached stability and no more iterations are needed. We take the final grouping as the result.
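The whole walkthrough can be reproduced with a minimal k-means sketch in plain Python. It follows the three steps from the slides (centroids, distances, grouping) and stops when no object changes group. The function names, the 0-based group numbers and the handling of an empty cluster are assumptions of this sketch, not part of the tutorial.

```python
import math

# Medicine data from the example: name -> (weight index, pH).
points = {"A": (1, 1), "B": (2, 1), "C": (4, 3), "D": (5, 4)}

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign(points, centroids):
    """Map each object to the index of its closest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for name, p in points.items()
    }

def update(points, groups, centroids):
    """Recompute each centroid as the mean of its current members."""
    new_centroids = []
    for i, old in enumerate(centroids):
        members = [points[n] for n, g in groups.items() if g == i]
        if not members:  # assumption: an empty cluster keeps its old centroid
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    groups = None
    for iteration in range(max_iter):
        new_groups = assign(points, centroids)
        if new_groups == groups:  # stable: no object moved group
            return groups, centroids, iteration
        groups = new_groups
        centroids = update(points, groups, centroids)
    return groups, centroids, max_iter

# Start from medicines A and B as the initial centroids, as in the slides.
groups, centroids, iterations = kmeans(points, [(1, 1), (2, 1)])
print(groups)      # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(centroids)   # [(1.5, 1.0), (4.5, 3.5)]
print(iterations)  # 2
```

Group index 0 corresponds to group 1 in the slides and index 1 to group 2. The run stops at iteration 2, where the grouping equals that of iteration 1.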

81. Document Representations
X-Y plane example: in the previous example the measure of similarity (or similarity metric) for the points was the Euclidean distance between two points in the X-Y plane.
Library example: the library example had no such clear, mathematical measure. We relied entirely on our own judgment to decide how similar two books were.

83. Document Representations
Brainstorm! We need a metric that can be implemented on a computer.
One possible metric could be based on the number of words common to two books' titles. So “Harry Potter: The Philosopher’s Stone” and “Harry Potter: The Prisoner of Azkaban” have three words in common: “Harry”, “Potter” and “The”.
But even though the book “The Lord of the Rings: The Two Towers” is similar to the Harry Potter series, this measure of similarity doesn't capture that.
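A quick Python sketch of the title-overlap idea shows both its appeal and its weakness. The function name `common_words` and the tokenization (lowercase, punctuation stripped, apostrophes kept) are assumptions made for this illustration.

```python
def common_words(title_a, title_b):
    """Return the set of words two titles share, ignoring case and punctuation."""
    def words(title):
        cleaned = "".join(
            ch if ch.isalnum() or ch == "'" else " " for ch in title.lower()
        )
        return set(cleaned.split())
    return words(title_a) & words(title_b)

hp1 = "Harry Potter: The Philosopher's Stone"
hp3 = "Harry Potter: The Prisoner of Azkaban"
lotr = "The Lord of the Rings: The Two Towers"

print(sorted(common_words(hp1, hp3)))   # ['harry', 'potter', 'the']
print(sorted(common_words(hp1, lotr)))  # ['the']
```

The only word the Harry Potter and Lord of the Rings titles share is "the", so this metric sees almost no similarity between them.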

88. Document Representations
Another solution: we could assemble word counts for each book, and when the counts are close for many words, judge the books similar.
But words like “a”, “an”, and “the” cannot contribute much to the similarity, because they occur frequently in both books.
We could use numeric weights in the computation, and apply low weights to these words to reduce their effect on the similarity value. Once we give a weight value to each word in a book, we can easily find the similarity of two books.

94. Document Representations
What if one book is 300 pages long and the other 1000 pages long? We have to ensure that the weight of a word is relative to the length of the text. We will see a method called TF-IDF shortly.

97. Document Representations
Task! Explore the following distance measures:
1 Squared Euclidean distance measure
2 Manhattan distance measure
3 Cosine distance measure
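As a starting point for the task, here is a minimal Python sketch of the three measures using their standard textbook definitions. The function names are chosen for this illustration, and the sample points are medicines C and D from the k-means example.

```python
import math

def squared_euclidean(p, q):
    """Sum of squared coordinate differences (Euclidean distance without the root)."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def cosine_distance(p, q):
    """1 - cos(angle between p and q); not defined for zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

c, d = (4, 3), (5, 4)  # medicines C and D
print(squared_euclidean(c, d))  # 2
print(manhattan(c, d))          # 2
print(cosine_distance(c, d))    # close to 0: the two vectors point almost the same way
```

Cosine distance depends only on direction, not on length, which is one reason it is often preferred for comparing documents of different sizes.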

98. Document Representations
Representing data as vectors: in mathematics, a vector is simply a point in space.
We found how books can be clustered together based on their similarity in words. In reality, clustering can be applied to any kind of object, provided we can distinguish similar and dissimilar items.
Clustering of anything via algorithms starts with representing the object in a way that can be read by computers. It is quite practical to think of objects in terms of their measurable features or attributes.

99. Document Representations
Say we want to cluster a bunch of apples. [Figure taken from Mahout in Action.]

100. Document Representations
A small, round, red apple is more similar to a small, round, green one than to a large, ovoid green one.
The process of vectorization starts with assigning each feature to a dimension. Let's say weight is feature (dimension) 0, color is 1, and size is 2. So the vector of a small round red apple looks like [0: 100 gram, 1: red, 2: small].
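To turn such a description into numbers a computer can compare, each feature value needs a numeric encoding. The sketch below uses made-up encodings; the `COLOR` and `SIZE` tables and the function `vectorize` are hypothetical and only illustrate the dimension assignment described above.

```python
# Hypothetical numeric encodings; the slides do not give actual values.
COLOR = {"red": 0.0, "green": 1.0}
SIZE = {"small": 1.0, "large": 3.0}

def vectorize(weight_grams, color, size):
    """Dimension 0: weight, dimension 1: color, dimension 2: size."""
    return [float(weight_grams), COLOR[color], SIZE[size]]

print(vectorize(100, "red", "small"))    # [100.0, 0.0, 1.0]
print(vectorize(110, "green", "small"))  # [110.0, 1.0, 1.0]
```

How the features are scaled matters: with these encodings a 10 gram weight difference outweighs any color difference, so in practice the dimensions are usually normalized or weighted.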

104. Document Representations
A set of apples of different weights, sizes and colors converted to vectors. [Figure taken from Mahout in Action.]

105. Document Representations
Improving weighting with TF-IDF: we found how books can be clustered together based on their similarity in words. Term frequency - inverse document frequency (TF-IDF) weighting is a widely used improvement on simple term frequency weighting.
Instead of simply using term frequency as the values in the vector, this value is multiplied by the inverse of the term's document frequency:
IDF = log(N/n), where N = total number of documents and n = number of documents that contain the term
TF-IDF = TF * IDF
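The formula can be tried out on a tiny corpus. This is a minimal sketch: the three "books" are invented toy data, and TF is taken here as the term count divided by the document length (one common choice, which also makes weights relative to text length). Other TF variants exist.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already split into words.
docs = {
    "book1": "the wizard and the wand".split(),
    "book2": "the ring and the tower".split(),
    "book3": "the wizard school".split(),
}

def tf_idf(docs):
    """Return {document: {term: TF * IDF}} with IDF = log(N / n)."""
    n_docs = len(docs)     # N: total number of documents
    doc_freq = Counter()   # n: number of documents containing each term
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    weights = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights

weights = tf_idf(docs)
print(weights["book1"]["the"])             # 0.0, because "the" occurs in every document
print(round(weights["book1"]["wand"], 3))  # 0.22
```

A word like "the" that appears in every document gets IDF = log(1) = 0 and drops out of the comparison, while a rare word like "wand" keeps a high weight.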

106. Stanford NLP
NLP toolkit: the Stanford NLP group provides NLP toolkits for various major computational linguistics problems. Written in Java. Open source.

107. Stanford NLP
Stanford Named Entity Recognizer: named-entity recognition (NER) techniques locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations etc.
Consider the following text: "Hello Jona, I am in Indian Institute at Trivandrum". What are the entities in this?
NER demo.
Stanford NER is also known as CRFClassifier. Conditional Random Field (CRF) sequence models are used for structured prediction.

112. Social Media and Sentiment Analysis
Twitter: Twitter streaming demo.
Sentiment analysis: sentiment analysis is one of the hottest research areas in computer science today. A basic task in sentiment analysis is to classify the polarity of a given text at the document, sentence, or aspect level, that is, whether the opinion expressed in a document, a sentence, or an entity feature or aspect is positive, negative, or neutral.

114. Social Media and Sentiment Analysis
Movie review: let's see a tweet on a recently released movie.
“Wow #Krish3 looks more exciting than Superman n Spider-Man for sure ! The Roshans have made a truly world class super hero film, again!”
These snippets of text are a gold mine for companies and individuals who want to monitor their reputation and get timely feedback about their products and actions.

117. Social Media and Sentiment Analysis
Document-level sentiment analysis: the main approach for document-level sentiment analysis is supervised learning. The system learns a classification model from the training data. Common classification algorithms such as SVM, Naive Bayes, Logistic Regression etc. can be used. New documents are then tagged with their sentiment classes.
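As an illustration of the supervised approach, here is a small multinomial Naive Bayes classifier written from scratch in Python. The four training sentences are invented, hand-labelled toy data, and the function names `fit` and `predict` are choices made for this sketch. A real system would need far more data and would normally use an existing library.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, hand-labelled training data.
train = [
    ("truly world class super hero film", "positive"),
    ("looks more exciting than expected", "positive"),
    ("boring plot and weak acting", "negative"),
    ("a dull and disappointing film", "negative"),
]

def fit(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in text.lower().split():
            if token not in vocab:  # words never seen in training are skipped
                continue
            # Laplace (add-one) smoothed log likelihood of the word given the class.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict(model, "exciting super hero film"))  # positive
print(predict(model, "dull and boring"))           # negative
```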

118. Bigdata
Introduction to Bigdata: big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

119. Bigdata
3 Vs of Bigdata:
Volume: ever-growing data of all types
Velocity: for time-sensitive processes such as catching fraud, intrusion detection etc., the speed at which data arrives is a characteristic of bigdata
Variety: any type of data, structured and unstructured, such as text, sensor data, audio, video, click streams, log files and more

120. Bigdata
Tools and technologies: Hadoop, NoSQL, Spark, D3

121. Bigdata
A few interesting areas: Internet of Things, Data Journalism

122. Conclusion
Questions?

123. References
Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman, Mahout in Action, Manning Publications
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques
Teknomo, Kardi, K-Means Clustering Tutorials
A first take at building an inverted index, http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html

124. Thanks
