Discovering Emerging Tech through Graph Analysis - Henry Hwangbo @ GraphConnect Chicago 2013


With the torrent of data available on the Internet, it has become increasingly difficult to separate the signal from the noise. We set out on a journey with a simple directive: figure out a way to discover emerging technology trends. Through a series of experiments, trials, and pivots, we found our answer in the power of graph databases. We built our "Emerging Tech Radar" with graph databases at the center of the discovery platform. Using a mix of NoSQL databases and open source libraries, we built a scalable information digestion platform that touches on multiple topics such as NLP, named entity extraction, data cleansing, Cypher queries, multiple visualizations, and polyglot persistence.

Published in: Technology

  1. 1. Discovering Emerging Technology Through Graph Analysis. GraphConnect | Chicago, June 2013
  2. 2. About: Director of PwC's Emerging Tech Lab
  3. 3. What is the Emerging Tech Lab? We build stuff to help people get smart about applying technology to solve problems
     ● Founded 3 years ago to identify and experiment with new technologies relevant to, but not widely adopted by, the Enterprise
     ● Focuses on rapid prototyping & MVP build-outs for both tactical internal projects and more creative, exploratory ideas
     ● Permanent core team, but operates a rotational program for staff to provide them an opportunity for hands-on technical experience, learning agile & lean principles, and exposure to a startup-like environment
  4. 4. The Challenge
  5. 5. It usually starts with an idea… "Build a platform to help discover emerging technologies."
  6. 6. …followed by some pretty mock-ups… …to raise expectations.
  7. 7. Envisioning success
     ● What are some emerging technologies?
     ● How are they being used to solve real problems?
     ● Who is talking about them?
     ● Who are the players?
     ● Are there related technologies?
     ● Get up to speed quickly
     ● Discover related topics
     ● Understand what is trending
     ● Find interesting applications
     ● See what's possible
  8. 8. What makes technology "emerging"?
     ● Cannot already be mainstream technology
     ● Needs to be more than a single event to be an emerging trend
     ● Must be growing in popularity, but not yet popular
     ● "Technology" could be a thing (e.g. nanotubes), but also an aggregation or application of technologies (e.g. cloud computing, quantified self)
  9. 9. The Journey
  10. 10. Initial design [diagram labels: Data Feeds (RSS); Pull & Store Raw Data; MongoDB; Analyze; Visualize; Source?; Postgres]
  11. 11. Breaking ground [stages: Extract Text; Understand Text; Discover Insights]
     ● Natural Language Processing
     ● Named Entity Recognition
     ● ???
     ● ???
     ● ???
     ● ???
     ● ???
  12. 12. A bit more clarity [diagram labels: Data Feeds (RSS); Pull & Store Raw Data; MongoDB; Analyze; Visualize; Source?; 3rd Party APIs; Tag & Update; Postgres]
  13. 13. Digging a little deeper [stages: Extract Text; Understand Text; Discover Insights]
     ● Natural Language Processing
     ● Named Entity Recognition
     ● Collocation?
     ● K-means clustering?
     ● Information Ontology?
     ● ???
     ● ???
  14. 14. The Eureka moment... …took a bit longer than it should have. Graphs are everywhere
  15. 15. Final design [diagram labels: Data Feeds (RSS); Pull & Store Raw Data; MongoDB; Analyze; Visualize; Source; 3rd Party API; Tag & Update; Neo4j; Postgres]
  16. 16. Lesson #1 - Graph data modeling is iterative …but more flexible than traditional data modeling
     What should be a node, relationship, or a property? Depends on:
     ● What will you search on?
     ● How do you start your searches?
     ● How much data do you expect to have? What data?
     Expect to change your graph based on:
     ● Experimentation
     ● Query syntax available to extract and aggregate graph data
     ● Query performance
     TIP: Plan to reload your graph many times - save the raw data, start small, use batch loading until you get it right
  17. 17. Modeling the data [diagram: DOC nodes linked to P, C, K, T, and O nodes; relationship labels: TAGGED_AS, RELATES_TO, REFERS_TO, CONTAINS]
     Documents are described by their entities, concepts, and keywords through relationships. This means:
     ● Documents are related to other documents through shared entities, concepts, and keywords
     ● Concepts and entities are related to each other through shared documents
     ● Incoming relationships measure the # of referring documents
     Simple, yet powerful
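
The two "related through shared ..." bullets and the incoming-relationship count can be illustrated with a small in-memory sketch. This is plain Python, not the Neo4j model from the talk; the document IDs, tag names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (document, tag) pairs; each pair stands for one relationship
# from a document node to an entity, concept, or keyword node.
EDGES = [
    ("doc1", "cloud computing"),
    ("doc1", "Google"),
    ("doc2", "cloud computing"),
    ("doc3", "Google"),
    ("doc3", "nanotubes"),
]

def build_index(edges):
    """Return (tags per document, documents per tag) as dicts of sets."""
    tags_of = defaultdict(set)
    docs_of = defaultdict(set)
    for doc, tag in edges:
        tags_of[doc].add(tag)
        docs_of[tag].add(doc)
    return tags_of, docs_of

def related_documents(doc, tags_of, docs_of):
    """Documents sharing at least one tag with `doc`, mapped to the shared-tag count."""
    shared = defaultdict(int)
    for tag in tags_of[doc]:
        for other in docs_of[tag]:
            if other != doc:
                shared[other] += 1
    return dict(shared)

def incoming_count(tag, docs_of):
    """Number of referring documents, i.e. the tag node's incoming relationships."""
    return len(docs_of[tag])

tags_of, docs_of = build_index(EDGES)
print(related_documents("doc1", tags_of, docs_of))
print(incoming_count("cloud computing", docs_of))
```

In a graph database the same questions become short traversals from a starting node, which is the point the slide makes with "simple, yet powerful".
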
  18. 18. Lesson #2 - Connections are important
     Highly connected data creates richer graphs and increases potential for discovering greater insights, BUT unnecessary connections can create noise & extra work
     ● Don't create artificial connections, but clean up data before importing when it makes sense (e.g. networking, networks, network)
     ● Prevent duplication, which can impact your insights based on aggregation (e.g. # of relationships) or certain patterns
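
As a rough illustration of the "networking, networks, network" clean-up, the sketch below folds keyword variants into one key before import so they would become a single node. The suffix rule is a deliberately crude stand-in (it is not the Porter algorithm), and the function names are hypothetical.

```python
from collections import Counter

def normalize_keyword(raw):
    """Lowercase, trim, and strip a few common suffixes (NOT the Porter algorithm)."""
    word = raw.strip().lower()
    for suffix in ("ing", "s"):
        # Keep a minimum stem length so short words such as "bus" stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def merge_keywords(raw_keywords):
    """Count occurrences per normalized keyword so variants share one node."""
    return Counter(normalize_keyword(k) for k in raw_keywords)

counts = merge_keywords(["networking", "Networks ", "network", "graph"])
print(counts)
```

Without this step the three spellings would be three separate nodes, each with a third of the incoming relationships, which would distort any popularity measure based on relationship counts.
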
  19. 19. Keeping it clean (Techniques and their Graph Benefits)
     Text extraction with readability scoring
     ● Better named entity extraction
     ● Improve neighbor relevance
     ● Minimize invalid nodes & relationships
     Similarity Hashing
     ● Improve validity of relationships
     ● Increase graph connectedness
     Porter Stemming
     ● Improve graph connectedness
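
The slide names similarity hashing without saying which algorithm was used. SimHash is one common choice, so the sketch below is an assumption rather than the team's implementation: it fingerprints a text so that near-duplicate documents tend to produce fingerprints that differ in only a few bits.

```python
import hashlib

BITS = 64  # fingerprint width; matches the 8 digest bytes used below

def simhash(text):
    """Compute a 64-bit SimHash fingerprint from whitespace-separated tokens."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

original = simhash("graph databases help discover emerging technology trends")
reposted = simhash("Graph databases help discover emerging technology trends")
print(hamming_distance(original, reposted))
```

A de-duplication pass would treat two documents as the same when the distance falls under a chosen threshold; the threshold itself has to be tuned on real data.
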
  20. 20. Lesson #3 - Understand Cypher
     ● Cypher experimentation opens up the possible
     ● SQL users will be at home - tabular results, similar syntax
     ● Start without parameters, check with Neo4j shell, move to parameterized queries for security & performance (caching)
     ● Don't forget Lucene syntax
     ● Continues to evolve for the better - check new release changes
     ● Let Cypher do the work
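
The point about parameterized queries and caching can be sketched with a toy cache keyed by query text. Nothing here talks to Neo4j: `PlanCache` is a hypothetical stand-in, the index name `keywords` is made up, and the query strings assume the legacy `START` index-lookup form and `{name}` parameter placeholder of Cypher as it existed around 2013.

```python
class PlanCache:
    """Toy stand-in for a query-plan cache keyed by query text (NOT Neo4j code)."""

    def __init__(self):
        self.plans = {}
        self.misses = 0

    def plan_for(self, query_text):
        if query_text not in self.plans:
            self.misses += 1
            self.plans[query_text] = "plan #%d" % self.misses
        return self.plans[query_text]

def inlined_query(name):
    # Value spliced into the text: every distinct name yields a new query string,
    # and unescaped input could change the meaning of the query.
    return 'START k=node:keywords(name="%s") RETURN k' % name

PARAMETERIZED = "START k=node:keywords(name={name}) RETURN k"

def parameterized_query(name):
    # The text stays constant; the value travels separately as a parameter.
    return PARAMETERIZED, {"name": name}

inlined_cache = PlanCache()
param_cache = PlanCache()
for name in ["nanotubes", "cloud computing", "quantified self"]:
    inlined_cache.plan_for(inlined_query(name))
    text, params = parameterized_query(name)
    param_cache.plan_for(text)

print(inlined_cache.misses, param_cache.misses)  # prints: 3 1
```

Three distinct inlined strings cause three cache misses, while the single parameterized text is planned once and reused.
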
  21. 21. Useful Cypher Syntax
     START with an index
     MATCH defines your universe
     WHERE filters it down
     WITH combines multiple statements
     HAS checks if a property exists
     AS lets you name your return values
     IN checks against an array
     COLLECT aggregates into an array
     ORDER just like SQL
     LIMIT for performance
  22. 22. Prototype highlights
     ● 4 people & 4 months (first version)
     ● Data Stores - Neo4j, MongoDB, Redis, Postgres
     ● Visuals - D3.js, Vivagraph.js, Twitter Bootstrap
     ● Key Languages/Libraries - Ruby, Rails, Cypher, Knockout.js, Amplify.js, HTML5, CSS3, jQuery, Neography gem, Resque gem
     ● 3rd Party - Alchemy, OpenCalais, RSS feeds, Wikipedia
     ● Concepts - natural language processing, named entity extraction, text cleansing & de-duplication (map/reduce), similarity hashing, large-scale information retrieval
     ● 1M+ nodes, 3M+ relationships, 6M+ properties after 6 months
  23. 23. Emerging Tech Radar Demo
  24. 24. Tag Cloud / Search [diagram: DOC nodes linked to C and K nodes]
     ● Index keywords and search across keywords (tip: use Lucene syntax)
     ● Identify documents with strong relationships to keywords
     ● Locate concepts with strongest relationships to relevant documents
     ● Popularity based on number of incoming relationships
  25. 25. Emerging Index / Popularity / Doc List [diagram: C and O nodes linked to DOC(E) and DOC(NE) nodes; example: Cloud computing (Concept) and Google (Org)]
     ● Strong relationships with documents shared between concepts to filter and rank relevant content
     ● Ratio and strength of relationships to quantify emerging index
     ● Popularity based on number of incoming relationships of each type of document (emerging versus non-emerging)
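
The slide does not give the formula behind the emerging index, so the following is only one plausible reading of "ratio and strength of relationships": the share of a concept's total relationship strength that comes from emerging documents. The data, field layout, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical relationships into one concept node:
# (document id, document type, relationship strength).
LINKS = [
    ("doc1", "emerging", 2.0),
    ("doc2", "emerging", 1.0),
    ("doc3", "non_emerging", 1.0),
]

def popularity_by_type(links):
    """Count incoming relationships per document type."""
    return Counter(doc_type for _, doc_type, _ in links)

def emerging_index(links):
    """Share of total relationship strength contributed by emerging documents."""
    total = sum(strength for _, _, strength in links)
    if total == 0:
        return 0.0
    emerging = sum(s for _, doc_type, s in links if doc_type == "emerging")
    return emerging / total

print(popularity_by_type(LINKS))
print(emerging_index(LINKS))  # 3.0 / 4.0 = 0.75
```

A value near 1.0 would mean the concept is discussed almost only in emerging sources, and a value near 0.0 that it is mostly covered by mainstream ones.
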
  26. 26. Node Graph [diagram: C, K, and O nodes linked through shared DOC nodes]
     ● Existing relationships with documents shared between concepts to filter relevant neighbors
     ● Strength of relationships based on # and weight for ranking relevance (color)
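
Ranking neighbors by "# and weight" might look like the sketch below: neighbors of a concept are the concepts that appear in the same documents, ordered first by the number of shared documents and then by summed weight. The ordering rule, the integer weights, and all names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical weighted relationships: (document, concept, weight).
LINKS = [
    ("doc1", "cloud computing", 3),
    ("doc1", "virtualization", 2),
    ("doc2", "cloud computing", 1),
    ("doc2", "virtualization", 1),
    ("doc2", "big data", 4),
    ("doc3", "big data", 2),
]

def rank_neighbors(concept, links):
    """Return (neighbor, shared document count, summed weight), strongest first."""
    by_doc = defaultdict(dict)
    for doc, name, weight in links:
        by_doc[doc][name] = weight

    scores = defaultdict(lambda: [0, 0])  # neighbor -> [shared docs, total weight]
    for concepts in by_doc.values():
        if concept not in concepts:
            continue  # only documents that mention the starting concept count
        for other, weight in concepts.items():
            if other != concept:
                scores[other][0] += 1
                scores[other][1] += weight

    ranked = [(name, count, total) for name, (count, total) in scores.items()]
    return sorted(ranked, key=lambda item: (-item[1], -item[2], item[0]))

print(rank_neighbors("cloud computing", LINKS))
```

Here "virtualization" outranks "big data" because it shares two documents with the starting concept, even though "big data" has the larger single weight.
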
  27. 27. The Takeaway
  28. 28. Final Thoughts
     ● Graphs make it simple to generate complex insights - you don't need to be a data scientist
     ● Graphs are a natural fit for anything connected...which is most things (e.g. social media, internet of things, sensor data)
     ● Experimentation is the best way to learn the power of graphs
     ● Make graph databases a first class citizen in your technology toolkit - many things can be solved better with a graph
     The best way to discover emerging technologies is to try them out
  29. 29. Thanks for Listening - Q & A. Special thanks to Max De Marzi for his neography gem and ongoing advice, suggestions, and troubleshooting