Scaling Document Clustering in the Cloud. Robert Gillen, Computer Science Research. Cloud Futures 2011.
Overview: Introduction to Piranha; Existing Limitations; Current Solution Tracks; Early Results & Future Work
Challenge – What to do with mounds of data? What is in there? Are there any threats? What am I missing? How do I connect the “dots”? How do I find the relevant information I need?
Can’t See the Forest for the Trees: Traditionally, search methods are used to find information at high volume levels. But those methods won’t get you here easily.
Piranha: Ability to search AND analyze; Organize documents based on content; Identify similar & dissimilar documents; Identify duplicate and near-duplicate data; Incorporate new data as it becomes available. 2007 R&D 100 Award winner (awards are based on each achievement's technical significance, uniqueness, and usefulness compared to competing projects and technologies).
Keyword Methods: Vector Space Model
Document 1: The Army needs sensor technology to help find improvised explosive devices
Document 2: ORNL has developed sensor technology for homeland defense
Document 3: Mitre has won a contract to develop homeland defense sensors for explosive devices
Term List: Army, Sensor, Technology, Help, Find, Improvise, Explosive, Device, ORNL, develop, homeland, Defense, Mitre, won, contract
Weight Terms: Term Frequency – Inverse Document Frequency (TF-IDF)
An index into the document list
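To make the vector space model concrete, here is a minimal Python sketch that builds TF-IDF vectors for the three example documents. It is an illustration, not Piranha's implementation: the stop-word list, the plain `tf * log(N / df)` weighting, and the function names are all assumptions, and no stemming is applied (so "sensor" and "sensors" stay distinct, whereas the slide's term list appears to be stemmed).

```python
import math
from collections import Counter

# The three example documents from the slide.
DOCS = [
    "The Army needs sensor technology to help find improvised explosive devices",
    "ORNL has developed sensor technology for homeland defense",
    "Mitre has won a contract to develop homeland defense sensors for explosive devices",
]

# Illustrative stop-word list (an assumption; the slide does not give one).
STOP_WORDS = {"the", "to", "a", "for", "has", "needs"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words. No stemming."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return (sorted term list, one TF-IDF vector per document)."""
    tokenized = [tokenize(d) for d in docs]
    terms = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in terms}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # Weight = raw term count times log(N / document frequency).
        vectors.append([counts[t] * math.log(n_docs / df[t]) for t in terms])
    return terms, vectors


terms, vectors = tfidf_vectors(DOCS)
```

Each document becomes one row of weights over the shared term list, so terms that occur in only one document (such as "army") receive the largest weight and terms shared across documents receive less.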
Textual Clustering: Vector Space Model; Cluster Analysis; Similarity Matrix (D1, D2, D3; documents to documents); Most similar documents; Euclidean distance; Time Complexity: TFIDF, O(n^2 log n)
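The similarity-matrix step can be sketched as follows: compute the Euclidean distance between every pair of document vectors, then pick the closest pair. The vectors below are made-up stand-ins for D1, D2, and D3, and the function names are hypothetical. Note that the matrix holds n(n-1)/2 distinct distances, which is one reason this style of clustering becomes memory bound as n grows.

```python
import math
from itertools import combinations


def euclidean(u, v):
    """Euclidean distance between two equal-length weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(vectors):
    """Symmetric n x n matrix of pairwise document distances."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = euclidean(vectors[i], vectors[j])
        dist[i][j] = dist[j][i] = d
    return dist


def closest_pair(dist):
    """Indices (i, j), i < j, of the two most similar documents."""
    n = len(dist)
    return min(combinations(range(n), 2), key=lambda p: dist[p[0]][p[1]])


# Toy weight vectors standing in for D1, D2, D3 (made-up values).
vectors = [
    [1.0, 0.0, 2.0],
    [1.0, 0.5, 2.0],
    [0.0, 3.0, 0.0],
]
dist = distance_matrix(vectors)
pair = closest_pair(dist)
```

An agglomerative clusterer would merge the closest pair and repeat; this sketch stops at the first merge candidate.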
Example: Sign of the Crescent [1]. 41 short intelligence reports about a multi-prong terrorist attack. Example: Report Date: 1 April, 2003. FBI: Abdul Ramazi is the owner of the Select Gourmet Foods shop in Springfield Mall, Springfield, VA. [Phone number 703-659-2317]. First Union National Bank lists Select Gourmet Foods as holding account number 1070173749003. Six checks totaling $35,000 have been deposited in this account in the past four months and are recorded as having been drawn on accounts at the Pyramid Bank of Cairo, Egypt and the Central Bank of Dubai, United Arab Emirates. Both of these banks have just been listed as possible conduits in money laundering schemes. [1] Intelligence Analysis Case Study by F. J. Hughes, Joint Military Intelligence College
Piranha Cluster View: Report Date: 1 April, 2003. FBI: Abdul Ramazi is the owner of the Select Gourmet Foods shop in Springfield Mall, Springfield, VA. [Phone number 703-659-2317]. First Union National Bank lists Select Gourmet Foods as holding account number 1070173749003. Six checks totaling $35,000 have been deposited in this account in the past four months and are recorded as having been drawn on accounts at the Pyramid Bank of Cairo, Egypt and the Central Bank of Dubai, United Arab Emirates. Both of these banks have just been listed as possible conduits in money laundering schemes.
Existing Issues: Memory bound; Prior distribution approaches were troublesome; Extant need to process larger document sets
Current Solution Tracks: Traditional HPC (Jaguar): ORNL has unique capabilities in this space. Cloud: New approaches may broaden the reach of the tool; Less-specialized hardware requirements; More-accessible programming/extensibility model; Ability to utilize core features of cloud platforms to provide key functionality
Design Tenets: Utilize cloud primitives wherever possible; Build “environmentally aware” algorithms, i.e. algorithms that are aware of the environment in which they are running; Dynamically fit the platform to the problem; Design for use in disparate environments
Cloud Scaling Approach [diagram: nodes R, C1, C2, C3, C4] (Patent Pending)
Cloud Scaling Approach [diagram: nodes R, C1, C2, C4, C3a, C3b; label QC1C2] (Patent Pending)
Pending Issues: How frequently to check for memory pressure; Work unit size (how many documents at a time); Moving from a single machine to a distributed model introduces I/O delay (by definition): ~60K docs, increase of 2:30; bad case, 50 min/million docs
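The interaction between memory-pressure checks and work unit size can be illustrated with a small feedback loop. This is a hedged sketch, not the talk's (patent-pending) method: the thresholds, the halve-or-double policy, and the names `next_work_unit_size`, `process`, and `memory_probe` are all hypothetical. The probe is passed in as a callable so that a platform-specific implementation can be substituted.

```python
def next_work_unit_size(current, used_fraction, high=0.85, low=0.50,
                        min_size=100, max_size=10_000):
    """Shrink the batch under memory pressure, grow it when memory is free."""
    if used_fraction >= high:
        return max(min_size, current // 2)
    if used_fraction <= low:
        return min(max_size, current * 2)
    return current


def process(documents, handle_batch, memory_probe, start_size=1_000):
    """Process documents in batches, checking memory once per batch.

    memory_probe is a caller-supplied function returning the fraction of
    memory in use (0.0 to 1.0). Returns the list of batch sizes used.
    """
    size = start_size
    position = 0
    sizes = []
    while position < len(documents):
        batch = documents[position:position + size]
        handle_batch(batch)
        position += len(batch)
        sizes.append(len(batch))
        # One check per work unit; how often to check is an open question.
        size = next_work_unit_size(size, memory_probe())
    return sizes
```

Checking once per work unit ties the two open questions together: a smaller unit means more frequent checks but also more per-batch overhead.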
Vector Creation/Serialization (local)
Vector Creation/Serialization (cloud)
Real-Time Environment Monitoring (Patent Pending)
Real-Time Environment Monitoring
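Fault Tolerance [diagram: nodes C1, C2, C3, C4; queues C1C2, C1C3, C3C4] (Patent Pending)
Fault Tolerance [diagram: same layout as the previous slide] (Patent Pending)
Fault Tolerance [diagram: nodes C1, C2, C4; queues C1C2, C1C3, C3C4; node C3 no longer shown] (Patent Pending)
Fault Tolerance [diagram: nodes C1, C2, C4, C5; queues C1C2, C1C3, C3C4; node C5 added] (Patent Pending)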
Fault Tolerance: Queues provide isolation for fault tolerance; Two-phase queues are key to success; Regular serialization of node state is key, yet how often remains in question; Not possible without programmable infrastructure provided by the cloud (Patent Pending)
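One way to read "two-phase queues" is the get-then-delete pattern with a visibility timeout that cloud queue services such as Azure Queue storage and Amazon SQS provide: a worker first leases a message, which hides it from other workers, and deletes it only after the work succeeds. That reading is an assumption, and the class below is an in-memory stand-in written for illustration, not a wrapper around any real service API. Time is passed in explicitly so the behaviour is deterministic.

```python
import itertools


class TwoPhaseQueue:
    """In-memory stand-in for a cloud queue with visibility timeouts."""

    def __init__(self, visibility_timeout=30.0):
        self._timeout = visibility_timeout
        self._ids = itertools.count(1)
        # Maps message id -> (payload, time at which it becomes visible).
        self._messages = {}

    def put(self, payload, now=0.0):
        """Enqueue a work item; it is visible immediately."""
        msg_id = next(self._ids)
        self._messages[msg_id] = (payload, now)
        return msg_id

    def get(self, now):
        """Phase 1: lease the oldest visible message, hiding it for a while."""
        for msg_id, (payload, visible_at) in self._messages.items():
            if visible_at <= now:
                self._messages[msg_id] = (payload, now + self._timeout)
                return msg_id, payload
        return None

    def delete(self, msg_id):
        """Phase 2: acknowledge completed work by removing the message."""
        self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


queue = TwoPhaseQueue(visibility_timeout=30.0)
queue.put("docs 0-999", now=0.0)
first = queue.get(now=1.0)     # a worker leases the item
hidden = queue.get(now=2.0)    # nothing visible while the lease is held
retry = queue.get(now=40.0)    # the lease expired (worker presumed dead)
queue.delete(retry[0])         # the replacement worker finishes and acks
```

If a node fails between the two phases, the message simply reappears after the timeout, so a replacement node can pick up the work without any coordination with the failed one.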
Running in Different Environments: Same core algorithm (C++ code) runs in Azure, Amazon, and on Jaguar (recompiled); “Scaffolding” code is cloud/Jaguar specific; Patterns used (Repository, etc.) to abstract differences between various vendor storage repositories; “Scaling” easier in Azure; Raw control/access easier in Amazon
Early Results & Future Work: File packing?; Scale vs. stability vs. speed; Tuning the work unit size (Patent Pending)
Questions? Rob Gillen, gillenre@ornl.gov, @argodev
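The Repository pattern mentioned above can be sketched in a few lines. The talk's core code is C++; Python is used here only for brevity, and every name below (`VectorRepository`, `InMemoryRepository`, `LocalDirectoryRepository`, `store_vector`, `load_vector`) is hypothetical. A real deployment would add one implementation per vendor storage service behind the same interface.

```python
import json
from abc import ABC, abstractmethod
from pathlib import Path


class VectorRepository(ABC):
    """Storage interface the core algorithm depends on."""

    @abstractmethod
    def save(self, key: str, data: bytes) -> None:
        """Persist data under key."""

    @abstractmethod
    def load(self, key: str) -> bytes:
        """Return the data previously saved under key."""


class InMemoryRepository(VectorRepository):
    """Keeps everything in a dict; handy for tests."""

    def __init__(self):
        self._items = {}

    def save(self, key, data):
        self._items[key] = data

    def load(self, key):
        return self._items[key]


class LocalDirectoryRepository(VectorRepository):
    """Stores each item as a file in a directory on local disk."""

    def __init__(self, root):
        self._root = Path(root)
        self._root.mkdir(parents=True, exist_ok=True)

    def save(self, key, data):
        (self._root / key).write_bytes(data)

    def load(self, key):
        return (self._root / key).read_bytes()


def store_vector(repo, doc_id, vector):
    """Serialize a document vector as JSON and hand it to any repository."""
    repo.save(doc_id, json.dumps(vector).encode("utf-8"))


def load_vector(repo, doc_id):
    """Read a document vector back from any repository."""
    return json.loads(repo.load(doc_id).decode("utf-8"))
```

Because `store_vector` and `load_vector` see only the abstract interface, the same calling code works unchanged whether the backing store is memory, local disk, or a cloud blob service.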
