BarCamp 2011 Concept Extraction


This is the presentation I gave at BarCamp Nashville on October 15, 2011. Missing are the notes and narration. Go to NectarineImp.com to get the full presentation; search for bcn11.

Published in: Technology, News & Politics
  • Back around 2008, when I was still working in the intelligence field, I had an opportunity to speak to 2nd Intelligence Battalion, USMC, in North Carolina. The subject was applying a new system to their operations. This system was an advanced discovery system designed to reveal and connect dots in a way search could not. I launched into a presentation similar to this one, only the beginning was more math-heavy, because this was cool stuff. Soon the Colonel held up his hand. It wasn’t that he had a question; his hand was more of a motion to just shut me up. He then put it like this: “Son, let me simplify things for you. We are the United States Marine Corps. We put Warheads on Foreheads. Will that system help us accomplish that goal?” Stunned, I hesitated a second but then said: “Yes… yes it will!” “Proceed with that part of the demonstration then.” And thus I have reduced this presentation, with that philosophy in mind, to help get you to the point where you can understand these methods and begin your own investigation into them. We will present the context of history and needs, with some math and code, to show how one particular text analytics task, extracting concept maps, is done at the per-document level.
  • In 1939, the emergence of an aggressive Germany and the recent discovery of nuclear fission led Albert Einstein, Leo Szilard, and Enrico Fermi to write a letter to FDR about the threat. The result was a crash program to develop nuclear weapons, known as The Manhattan Project. The end result was new munitions that broke all the rules, one of them 10,000 times more powerful than any previous weapon.
  • In 1945, the first Science Advisor to the President, Vannevar Bush, wrote an essay on information retrieval whose vision was fulfilled entirely by 2007 with the emergence of the smartphone. In it he postulated that until then all scientific achievement had been applied to the physical world; in the future it would be applied to information and understanding.
  • By late 2001, the Government was on a new crash program that dwarfed the original Manhattan Project. It has no name, as it is a series of efforts along different lines of inquiry. Its goal was to improve data mining. The emergence of discovery tools ushered in a new age with a new vision that stands on the shoulders of Vannevar Bush’s work.
  • The King of Spain in 1231 was Ferdinand III of Castile. The difference between search and discovery is this: with search, you are looking for something specific that you know some attributes of. For example, you have lost your keys and you wish to retrieve them. You know what they look like, where they are supposed to be, where they likely are, and, if not there, where they might have fallen or who might have them. With discovery, you have a mass of data and you don’t know what is in it, but you know what is important to you. Discovery tools help you “surf the data” until you find what is important to you. Just as an explorer landing on new land has no map, you must mount an expedition and mark rivers, mountains, and civilizations as you find them. A Google search to determine whether Iran had pre-knowledge of 9/11 would only work if someone had already drawn such a conclusion and published their findings. There are several reports of this related to a US court case. Prior to that revelation, was it possible, using search over open sources, to determine whether Iran had such knowledge?
  • You aren’t reading English; you are decoding symbols back to what you believe are their true meanings. In fact you are skipping English and going straight to connecting them to concepts you already know. Computers don’t really understand English; they understand data structures of concepts. When you read the word “car” you connect it to a concept structure in your mind. Your concept structure is huge for something like a car: you know about so many things connected to cars, their parts, how to drive them, laws associated with them, things you’ve done in them, and what you think of particular cars you love and ones you hate. The word is not the concept. The word is a label for the concept. Most analytics you will start with won’t deal with concepts but with labels and their discovered inherent relationships. This is enough. As a human, your mind has a vast encyclopedia of concepts that respond to the stimulus of labels. Language is a compact way of processing ideas and modifying concepts, quickly.
  • Machine understanding of language had not seen much success by the early ’90s. However, as machines increased in power, following Moore’s Law, the hard problems of the ’80s gave way to the brute force of new hardware and new mathematics. In the past decade, ever more powerful computing machines have led to ever more powerful machine understanding. This new decade is full of tremendous promise for what might be possible with machine understanding.
  • Here is our plaintext that we will use as input.
  • The code is simple and linear. The reason the NLTK rocks is that all of the major steps have been put into the API; your work is focused on your particular data mining needs. Here we take plaintext, detect sentences, and tokenize them. Finally we do Part of Speech tagging. You learned this process as a child in school: words were given classes such as noun, adjective, predicate, and so forth, and you learned how to diagram a sentence at a young age. These classes continue to be very useful to data miners and for machine learning and message understanding. While not shown, noise reduction is extremely crucial to good text analytics. The effort put into it is directly visible in the final output. Every effort must be made to reduce noise.
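The noise reduction step mentioned above can be as simple as lowercasing, stripping punctuation, and dropping stopwords. A minimal sketch in plain Python; the tiny stopword list and the function name here are illustrative, and a real pipeline would use a fuller list such as NLTK's stopword corpus:

```python
import re

# Illustrative stopword list; real pipelines use a much larger one
# (e.g. nltk.corpus.stopwords.words("english")).
STOPWORDS = {"i", "the", "a", "an", "of", "to", "that", "and", "is", "are"}

def reduce_noise(tokens):
    """Lowercase each token, strip non-letter characters, drop stopwords."""
    cleaned = []
    for tok in tokens:
        tok = re.sub(r"[^a-z]", "", tok.lower())
        if tok and tok not in STOPWORDS:
            cleaned.append(tok)
    return cleaned

print(reduce_noise(["I", "believe", "that", "banking",
                    "institutions", "are", "dangerous."]))
# ['believe', 'banking', 'institutions', 'dangerous']
```

Run before POS tagging or after, depending on whether the tagger needs the function words for context; dropping them too early can hurt tag accuracy.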
  • Here is the first stage.
  • Here is the second stage. Note that each token is now paired with its part of speech. This is the beginning. Now that we can directly access the term, the sentence, and the part of speech, we have what we need. Another pass, not shown here, can detect logical groupings of parts of speech and, by a process called chunking, can make noun phrases. If you have a document and you want to discover its inherent meaning, what exactly are you looking for? Meaning comes from relationships. It is the relationships between its core concepts that produce the meaning of the document. These concepts are in the document in the form of nouns. Now that we can find them, we can use them. Categorization requires feature-level understanding: the algorithm needs identifiable, measurable attributes of the text to work with. Just starting with POS we have identifiable, meaningful features, and we can tell which features are more prominent by their frequency.
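The chunking pass described above can be approximated in plain Python by collecting runs of adjective and noun tags into noun phrases. This is a hand-rolled simplification of what NLTK's RegexpParser does with a grammar such as "NP: {<JJ>*<NN.*>+}"; the function name and sample tags are my own:

```python
def chunk_noun_phrases(tagged):
    """Collect runs of JJ/NN* tokens that contain at least one noun."""
    phrases, run = [], []
    for word, tag in tagged + [("", ".")]:  # sentinel to flush the last run
        if tag == "JJ" or tag.startswith("NN"):
            run.append((word, tag))
        else:
            if any(t.startswith("NN") for _, t in run):
                phrases.append(" ".join(w for w, _ in run))
            run = []
    return phrases

sample = [("I", "PRP"), ("believe", "VBP"), ("that", "IN"),
          ("banking", "NN"), ("institutions", "NNS"), ("are", "VBP"),
          ("more", "RBR"), ("dangerous", "JJ"), ("to", "TO"),
          ("our", "PRP$"), ("liberties", "NNS"), ("than", "IN"),
          ("standing", "NN"), ("armies", "NNP")]
print(chunk_noun_phrases(sample))
# ['banking institutions', 'liberties', 'standing armies']
```

A run of adjectives with no noun (like "dangerous" above) is discarded, since a noun phrase needs a head noun.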
  • The corpus I worked with had the following size distribution. The reason it is not a smooth graph is that I ranked the files by their Coefficient of Variation, also known as the Relative Standard Deviation. I use RSD as the name for the result of this calculation from here on. There is a 0.88 correlation between RSD and file size.
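The RSD itself is just the standard deviation of the noun frequencies divided by their mean. A sketch using Python's statistics module, with made-up frequency sets to show the contrast:

```python
from statistics import mean, stdev

def rsd(frequencies):
    """Coefficient of Variation: standard deviation over the mean."""
    return stdev(frequencies) / mean(frequencies)

# Hypothetical noun-frequency counts. One dominant concept gives a
# high RSD (a good candidate for a concept map); flat counts give a
# low RSD (not enough signal to tell what the file is about).
peaked = [40, 3, 2, 2, 1, 1, 1]
flat = [3, 3, 2, 3, 2, 3, 2]
print(rsd(peaked) > rsd(flat))  # True: the peaked file ranks higher
```

Ranking files by this number is what produces the three zones in the next slide: high-RSD files have a few dominant nouns and tag well, low-RSD files do not.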
  • This concept map was derived from the SAMSUNG SCX-5935FN USER'S GUIDE. Concept maps are just the beginning. I first used them to boost social tagging efforts; in a sense, the AI was using the author’s words as input for the initial set of tags. This particular effort saved hundreds of man-hours of work. It also ensured that files were already clustering when the system was initiated and loaded. This happens every time a file is loaded into that system: tags are created automatically, meaning that even files that are casually loaded are analyzed. Every time a file is modified it is re-analyzed and the tags are updated with any new insertions.

    1. Automatic Concept Map Extraction. A Presentation by Peter Mancini
    2. In The Beginning… It was 1939. “In the course of the last four months it has been made probable - through the work of Joliot in France as well as Fermi and Szilard in America - that it may become possible to set up a nuclear chain reaction in a large mass of uranium, by which vast amounts of power and large quantities of new radium-like elements would be generated.”
    3. The Manhattan Project • Ended an age based upon chemistry • Thrust the world into a more dangerous balance • Put an end to the ability to wage global war • Forced new thought on how to manage information
    4. The Information Age • Vannevar Bush, 1945 • Essay: As We May Think • First Science Advisor to the President • Predicted devices similar to smartphones • Wanted the Knowledge of the Ages in the hands of everyone.
    5. Information Age 2.0 • The 9/11 attacks showed that even with massive information retrieval capabilities we were still vulnerable. • We needed to connect the dots (known unknowns). • We also needed to find dots we didn’t know about (unknown unknowns). • In Decision Management Theory these are called Unanticipated Decision Variables.
    6. The Age of Discovery • The difference between 1999 and 2011 is the emergence of discovery tools • Discovery is different from Search • Discovery can help along every leg of a problem that needs to be solved • Search can help you find who was the King of Spain in 1231 (try Googling it from your phone or netbook) • Discovery can help you determine if Iran had pre-knowledge of 9/11 (try Googling that) • Data Mining is the New Manhattan Project
    7. If you can read this you are doing what a computer has to do to read text
    8. Extracting Meaning • Meaning is represented by the way concepts relate to each other • We know what the concepts are because we can detect the nouns in a document • The meaning of the document comes from the concepts it speaks to • Here is how we do this using the Natural Language Toolkit
    9. Extracting Meaning. “I believe that banking institutions are more dangerous to our liberties than standing armies. If the American people ever allow private banks to control the issue of their currency, first by inflation, then by deflation, the banks and corporations that will grow up around the banks will deprive the people of all property until their children wake up homeless on the continent their fathers conquered. The issuing power should be taken from the banks and restored to the people, to whom it properly belongs.” Thomas Jefferson
    10. Extracting Meaning

        import nltk, pprint
        from nltk.tokenize import PunktWordTokenizer

        paragraph = nltk.sent_tokenize(plaintext)
        tokenizer = PunktWordTokenizer()
        tokenizedSentences = [tokenizer.tokenize(sentence) for sentence in paragraph]
        #
        # Noise Reduction Goes Here
        #
        POSTaggedSentences = [nltk.pos_tag(sentence) for sentence in tokenizedSentences]
    11. Plaintext Tokenized: [I, believe, that, banking, institutions, are, more, dangerous, to, our, liberties, than, standing, armies.]
    12. Plaintext POS Tagged: • [(I, PRP), (believe, VBP), (that, IN), (banking, NN), (institutions, NNS), (are, VBP), (more, RBR), (dangerous, JJ), (to, TO), (our, PRP$), (liberties, NNS), (than, IN), (standing, NN), (armies., NNP)]
    13. The Coefficient of Variation • Not every file is useful – some don’t have enough information in them to determine what is most important. • The coefficient is derived by taking the frequency of each noun, computing the standard deviation of the set, and dividing it by the mean. The higher the number, the more variation there is. • We can determine which files to tag based upon this analysis – the ones with higher variation are better for determining what they are about.
    14. Three zones represent an arbitrary decision on which files produced good Concept Maps, OK ones, and finally poor ones. The lower the RSD, the worse the map. 1/3rd of the files were tossed, but the other 2/3rds contained 97% of the accumulated data.
    15. What is this Concept Map About? • control panel • Machine Setup • scanner glass • document feeder • printer driver • Mac OS X • SyncThru Web Service • Macintosh • icon • media • Prints • Report
    16. Questions and Answers • All slides available at NectarineImp.com • Additional inquiry can be sent to: