
Memex - PyData Seattle

Slides from Continuum Analytics Data Scientist Katrina Riehl's Memex talk at PyData Seattle.

Published in: Data & Analytics

  1. Memex: Mining the Deep Web. Katrina Riehl, PhD, Data Scientist, Continuum Analytics. July 25, 2015. © 2015 Continuum Analytics - Confidential & Proprietary
  2. Explaining the Deep Web
  3. When you ask the internet a question, who is answering?
  4. (no transcribed text)
  5. (no transcribed text)
  6. (no transcribed text)
  7. An Introduction to DARPA Memex
  8. What is MEMEX?
     • Today's web searches use a centralized, one-size-fits-all approach that searches the Internet with the same set of tools for all queries. While that model has been wildly successful commercially, it does not work well for many government use cases.
     • DARPA launched the Memex program in September 2014.
     • Memex seeks to develop software that advances online search capabilities:
       • Creation of a new domain-specific indexing and search paradigm: content discovery, information extraction, information retrieval, user collaboration
       • Extension of current search capabilities to the deep web, the dark web, and nontraditional (e.g. multimedia) content
  9. (no transcribed text)
  10. Memex Search Domains
      • Human/Labor Trafficking
      • Weapons
      • Material Research Science
      • Financial Fraud
      • Counterfeit Electronics
      • Patent Trolling
      • Child Exploitation
  11. http://opencatalog.darpa.mil
  12. Large Scale Data Analytics: An Overview of the Ecosystem
  13. Diagram labels: BI - DB; DM/Stats/ML; Scientific Computing; Distributed Systems; Numba; bcolz; RHadoop
  14. The Analytics Pipeline
  15. Analytics Pipeline
      • Web Crawlers & Scrapers
      • Entity Extractors
      • Indexers
      • Visual Analytics
  16. Memex Explorer
      • Pluggable Framework for Crawling & Data Discovery
      • Django Web Application
      • Elasticsearch Index
      • Bokeh Visualizations for Crawling Stats
      • Kibana Dashboards for Initial Data Exploration
      • Apache Nutch Crawler
      • NYU ACHE Crawler
      • NYU Domain Discovery Tool
  17. (no transcribed text)
  18. (no transcribed text)
  19. (no transcribed text)
  20. (no transcribed text)
  21. (no transcribed text)
  22. Diagram labels, in three groups:
      • Data Storage: csv, HDF5, bcolz, DataFrame, HDFS, json
      • Abstract expressions: selection, filter, group by, join, column-wise
      • Computational backend: Pandas, Streaming Python, Spark, MongoDB, SQLAlchemy
  23. (no transcribed text)
  24. (no transcribed text)
  25. Data Analysis
  26. Topic Modeling
  27. Topic Modeling
  28. Topic Modeling
  29. (no transcribed text)
  30. Questions? Thank you!!
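
The four pipeline stages listed on the Analytics Pipeline slide (crawl and scrape, extract entities, index, summarize for visualization) can be illustrated with a toy sketch. This is not the Memex Explorer code. It uses only the Python standard library, and everything in it is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for crawler output, the two regular expressions stand in for real entity extractors, and a plain dictionary stands in for a search index such as Elasticsearch.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical stand-in for crawler output: URL -> raw HTML.
PAGES = {
    "http://example.com/a": "<html><body><p>Call 206-555-0100 or mail sales@example.com</p></body></html>",
    "http://example.com/b": "<html><body><p>Contact sales@example.com today</p></body></html>",
}


class TextExtractor(HTMLParser):
    """Scraper stage: collect the text nodes of an HTML page."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def scrape(html):
    """Return the visible text of a page with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Entity-extractor stage: illustrative patterns only, not production-grade.
ENTITY_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def extract_entities(text):
    """Map each entity kind to the list of matches found in the text."""
    return {kind: pattern.findall(text) for kind, pattern in ENTITY_PATTERNS.items()}


def build_index(pages):
    """Indexer stage: map (kind, value) to the set of URLs that mention it."""
    index = defaultdict(set)
    for url, html in pages.items():
        for kind, values in extract_entities(scrape(html)).items():
            for value in values:
                index[(kind, value)].add(url)
    return index


def entity_counts(index):
    """Summary for a chart or dashboard: number of pages per entity."""
    return Counter({key: len(urls) for key, urls in index.items()})


index = build_index(PAGES)
counts = entity_counts(index)
for (kind, value), n_pages in counts.most_common():
    print(f"{kind:5s} {value:20s} pages={n_pages}")
```

Keeping each stage as a separate function mirrors the "pluggable" idea from the Memex Explorer slide: a different crawler, extractor, or index could be swapped in without touching the other stages.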
