Generating Dynamic Social Networks from Large Scale Unstructured Data
Enterprise Software to Make Sense of Really Junky Data



          Tim Estes - CEO, Digital Reasoning
 What We’ll Discuss

• What is a social network?
   • The web of relationships between entities that influences actions

• Why does it matter?
   • To reference Aesop: “You are known by the company you keep.”

• What’s required to build one algorithmically?
   • What’s similar, what’s the same, what’s connected
 What’s similar?

We use patented algorithms for deducing related terms from the data…

Bush                  White House            Nashville       Justin Timberlake  Britney Spears
--------------------  ---------------------  --------------  -----------------  --------------
president bush        house                  tenn            miley cyrus        britney spears
president george w    gov                    the predators   pussycat dolls     the album
administration        white                  predators       bob dylan          x factor
bush administration   clinton                oakland         nine inch nails    my friends
george                the administration     milwaukee       rock star          mtv
george w              president-elect        st louis        the timberwolves   madonna
george bush           barack obama           carolina        sean preston       lady gaga
brown                 barack                 a season        lanarkshire        singer
american              president george w     baltimore       ticket prices      a student
clinton                                      kentucky        nme
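
The slide does not describe the patented algorithms, so the following is only a generic illustration of how related terms can be derived from raw text: a minimal sketch of distributional similarity, assuming whitespace tokenization and a fixed context window. The function names (context_vectors, cosine, related_terms) and the window size are hypothetical and are not part of Synthesys.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(docs, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()  # assumption: naive whitespace tokenization
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def related_terms(term, vectors, top_n=5):
    """Rank other terms by how similar their contexts are to the contexts of `term`."""
    target = vectors.get(term, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

Terms that appear in the same kinds of contexts score close to 1.0 even if they never appear together, which is one plausible way lists like the ones above could arise from unlabeled text.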
 What’s the same?

Concept resolution:
  Roll up similar things into groups of the same (again, algorithmically)




                               Example: Tony Blair
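
As a rough illustration of rolling up mentions into groups of the same entity, here is a minimal sketch that merges name variants with a union-find structure. The matching rule (one mention's tokens are a subset of the other's after dropping titles) and the TITLES list are assumptions made for this example, not the method used in the product.

```python
# Hypothetical title words to ignore when comparing mentions.
TITLES = {"mr", "mrs", "ms", "pm", "prime", "minister", "president"}


def normalize(mention):
    """Lowercase, strip periods, and drop title words; return the remaining token set."""
    tokens = mention.lower().replace(".", "").split()
    return frozenset(t for t in tokens if t not in TITLES)


def resolve(mentions):
    """Group mentions whose normalized token sets contain one another."""
    parent = list(range(len(mentions)))

    def find(i):
        # Find the group representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    keys = [normalize(m) for m in mentions]
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            a, b = keys[i], keys[j]
            if a and b and (a <= b or b <= a):
                parent[find(j)] = find(i)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), []).append(mention)
    return list(groups.values())
```

With the input ["Tony Blair", "Mr. Blair", "Prime Minister Tony Blair", "Gordon Brown"], the first three mentions land in one group and "Gordon Brown" in another. A subset rule like this over-merges people who share a surname, which is why real systems also weigh context.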
 What’s connected?
Link analysis:
 Show who and what are connected (again, you guessed it, algorithmically)
                      Terrorist Leader Connections
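
One simple way to picture link analysis is a co-occurrence graph: two entities are linked whenever they appear in the same document, and the link weight counts how often that happens. The sketch below assumes entities have already been extracted per document; build_links and neighbors are hypothetical names, and co-occurrence counting stands in for whatever linking logic the platform actually uses.

```python
from collections import Counter
from itertools import combinations


def build_links(doc_entities):
    """Count, for every unordered entity pair, the documents in which both appear."""
    edges = Counter()
    for entities in doc_entities:
        # sorted(set(...)) gives each pair one canonical (a, b) key per document.
        for a, b in combinations(sorted(set(entities)), 2):
            edges[(a, b)] += 1
    return edges


def neighbors(entity, edges):
    """Return the entities linked to `entity`, strongest link first."""
    linked = Counter()
    for (a, b), weight in edges.items():
        if a == entity:
            linked[b] += weight
        elif b == entity:
            linked[a] += weight
    return linked.most_common()
```

For example, build_links([["A", "B"], ["A", "B", "C"], ["B", "C"]]) links A and B with weight 2, and neighbors("A", edges) then lists B ahead of C.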
Let’s Put an Idea to the Test...

 With powerful analytics can you remove some or
  most of the need for a priori structure in designing
  and understanding social networks or other
  quasi-ontological schemas?
  YES

 and

 Can you also do it with messy unstructured data?
  YES
But first...
       Why do we (Digital Reasoning) care?
Because it’s what we do for a living.
                      We make sense of the senseless.
 Our customers have critical needs
 - Digital Reasoning works primarily in the Defense and Intelligence
   Community making sense of noisy, unstructured data and turning it
   into usable entity-centric systems supporting mission critical
   intelligence.
 The data is big and bad
 - Little structure in content, topics all over the place, and totally different
   ontologies/schemas across the community.
 The times we live in create urgencies
 - We care because the better and faster we are at making sense of this
   kind of data, the safer our country is.
Why did we take a data-centric, deployed software model?

 Unique Environments
 - Given who our customers are... we can’t host their data. No one can.
   The solution had to be a pure deployed software model.
 Meaning in Hard to Reach Places
 - The data is basically a bunch of pieces that don’t want to be connected.
   People that don’t want to be found.
 Result?
 - Imagine trying to turn that kind of data in that type of architecture from a
   bunch of loose communication into a social network that has patterns of
   life, weightings of influence, and projections of probable future actions...
Here’s what it looks like in an architecture…
Now let’s show what can be learned with a little application of
      Entity-Oriented Analytics to a bunch of web data.
Test Case

 Web Blog+Wikipedia data (collected by Fetch)
  -   6M Blog URLs collected over 1Yr +
  -   16M unique blog messages
  -   no unifying theme, topic or author
  -   tricky to get “good” big data from the open web. Ended up using 0.5% of that
      original source. 1TB became 4GB.
 No a priori structure, sparse metadata, nearly all meaning emerges
  from analysis
 Let’s see what we can find out...
Examining connections related to “Carl Icahn”



The data shows connections to and from Carl Icahn by:
• people
• periodicals
• topics
• companies

On closer examination the data tells us:
Carl Icahn “is backing” a startup company that “would build” products
related to Barack Obama
Let’s examine what connections we find to “Egypt”



Egypt is identified as a location, as an organization (country) and as an
unassigned entity with all related connections
On closer examination we see
interesting connections in the
blogs for Egypt, Cairo, Issues
and the phrase “powder keg”.
If we drill down into the actual
blog entry we see the context of
the connections
How about connections to “Steve Jobs”?

The entities and connections in the blog data are vast – which is not
surprising. The large amount of authors and topics reflect the popularity
of Steve Jobs as a blog subject
• Topics
• Authors

One connection is interesting: “Steve Jobs” to “Walt Mossberg” to “Kindle”
Synthesys shows the reason for connection as “pricing”
Clicking on this word we see the context of the connection
Demo Platform

 Synthesys Platform Beta
  elastic
  user-driven
  entity-oriented-analytics on demand
Observations

 New innovations will be algorithmic and focused on turning hard-
  to-use data into dynamic, evolving knowledge that can automate
  machine execution
 Architectures/solutions will have to accommodate customers that
  don’t want to move their data to a Public Cloud
 It is a true statement... “If you can connect the dots, you can
  connect the people”
So why should You care?

 Because there is a lot of data that doesn’t belong on a shared grid.
  Such as Top Secret data, Sensitive Corporate Data, and Personal
  Data.
 Because people may want to own (Personal Computing model)
  vs. rent (Mainframe model) analytics
 Because you may not want to convert your data to fit the model of
  the hosted solution or map to their ontology to get the answers
  you need.
To learn more…

 See us at:
 - Strata Science Fair (Wed evening 6:45PM)
 - Digital Reasoning Booth #305
 - www.digitalreasoning.com
Questions?



Automated Understanding, Trusted Decisions, True Intelligence
