An Exploration of 3 Very Different
ML Solutions Running on Accumulo
By Gadalia O’Bryan and Aaron Cordova, presented by Don Miner
Talk Outline
• Introduction
• Koverse Accumulo Table Structures
• Supply Chain Risk
• Cyber Monitoring
• Forensic Document Search
• Questions
©Koverse
Introduction
• Our customers have an appetite for building very diverse ML solutions on Accumulo
• These solutions require varying interaction patterns with Accumulo
• We have found we are able to support these use cases using the same set of Accumulo table structures
Koverse Accumulo Table Structures
Record Table: Objectives
• Store records under a unique ID
• Optimized for reading newly written records in
time order
• Bucket ID is prepended to distribute the newly
written records evenly across tablet servers
• Also supports fetching records that match
query criteria after consulting index table
Record Table: Key Components
The Record ID is made up of three parts: Bucket ID, Dataset ID, and Timestamp.

Bucket ID   Dataset ID   Timestamp
000         Dataset A    1539458936
000         Dataset A    1539458937
000         Dataset B    1539458931
000         Dataset B    1539458933
001         Dataset A    1539458932
001         Dataset A    1539458935
001         Dataset B    1539458930
001         Dataset B    1539458932
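The key layout above can be sketched in a few lines of Python. This is only an illustration, not Koverse's implementation: the function name `make_record_id`, the underscore separator (taken from the record IDs shown in the index table slides), and the zero-padded widths are assumptions. It shows that a plain lexicographic sort groups keys by bucket, then dataset, then time.

```python
def make_record_id(bucket: int, dataset_id: str, timestamp: int) -> str:
    """Build a record ID of the assumed form <bucket>_<dataset>_<timestamp>.

    Zero-padding keeps lexicographic order equal to numeric order.
    """
    return f"{bucket:03d}_{dataset_id}_{timestamp:010d}"


# Rows from the slide, in the order the slide lists them.
rows = [
    (0, "Dataset_A", 1539458936),
    (0, "Dataset_A", 1539458937),
    (0, "Dataset_B", 1539458931),
    (0, "Dataset_B", 1539458933),
    (1, "Dataset_A", 1539458932),
    (1, "Dataset_A", 1539458935),
    (1, "Dataset_B", 1539458930),
    (1, "Dataset_B", 1539458932),
]

# Feed the rows in reverse to show that sorting alone restores the layout.
record_ids = sorted(make_record_id(*row) for row in reversed(rows))
for record_id in record_ids:
    print(record_id)
```

Because the bucket comes first in the key, records written at the same moment land in different parts of the sorted key space, which is what spreads writes across tablet servers.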
Record Table: Organization
[Figure: the table is divided into buckets (Bucket 00, Bucket 01); each bucket holds Dataset A and Dataset B, and the records within each dataset are time ordered.]
Record Table: Ingest
[Figure: the same bucket and dataset layout (Bucket 00, Bucket 01; Dataset A, Dataset B). Writes go to the ends of many buckets.]
Record Table: Bulk Reading
[Figure: the same bucket and dataset layout (Bucket 00, Bucket 01; Dataset A, Dataset B). Sequential reads from many buckets; process newest batches incrementally.]
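One way to picture incremental bulk reading is as one sequential range per bucket, starting just after the last timestamp already processed. The sketch below models the sorted table as a Python list and uses `bisect` in place of a real range scan. The names `make_key`, `read_new`, and `MAX_TS`, and the idea of a single `last_seen` checkpoint, are assumptions made for illustration.

```python
from bisect import bisect_right

MAX_TS = 9_999_999_999  # assumed upper bound for 10-digit epoch seconds


def make_key(bucket: int, dataset_id: str, timestamp: int) -> str:
    """Assumed key form <bucket>_<dataset>_<timestamp>, zero-padded."""
    return f"{bucket:03d}_{dataset_id}_{timestamp:010d}"


# The sorted list stands in for the record table (rows from the slides).
TABLE = sorted(
    make_key(b, d, t)
    for b, d, t in [
        (0, "Dataset_A", 1539458936), (0, "Dataset_A", 1539458937),
        (0, "Dataset_B", 1539458931), (0, "Dataset_B", 1539458933),
        (1, "Dataset_A", 1539458932), (1, "Dataset_A", 1539458935),
        (1, "Dataset_B", 1539458930), (1, "Dataset_B", 1539458932),
    ]
)


def read_new(table, buckets, dataset_id, last_seen):
    """Return keys of one dataset written after `last_seen`.

    Each bucket contributes one contiguous (sequential) slice.
    """
    new_keys = []
    for bucket in buckets:
        start = bisect_right(table, make_key(bucket, dataset_id, last_seen))
        end = bisect_right(table, make_key(bucket, dataset_id, MAX_TS))
        new_keys.extend(table[start:end])
    return new_keys


new_batch = read_new(TABLE, [0, 1], "Dataset_A", 1539458935)
print(new_batch)
```

Each bucket is read as a single contiguous slice, so a job that remembers its checkpoint only touches the tail of each bucket on the next run.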
Index Table: Objectives
• Store value-to-record-ID pairs
• Support point queries and range queries on values in specific fields, or in ‘any’ field
• Set intersection is done by the query client, comparing sorted record IDs for each criterion, so the matching record IDs do not need to fit in memory
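The client-side set intersection described above can be sketched as a merge over two sorted streams of record IDs. This is a generic illustration rather than the Koverse query client: `intersect_sorted` is a hypothetical name, and it assumes each input yields unique IDs in ascending order.

```python
from typing import Iterable, Iterator


def intersect_sorted(a: Iterable[str], b: Iterable[str]) -> Iterator[str]:
    """Yield IDs present in both sorted streams, holding one item per stream."""
    it_a, it_b = iter(a), iter(b)
    x = next(it_a, None)
    y = next(it_b, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x = next(it_a, None)
            y = next(it_b, None)
        elif x < y:
            x = next(it_a, None)  # advance the stream that is behind
        else:
            y = next(it_b, None)


common = list(intersect_sorted(["001", "003", "007"], ["003", "004", "007", "009"]))
print(common)
```

Only the current item of each stream is held at any time, which is why the full sets of matching record IDs never have to fit in memory.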
Index Table
Each index entry consists of a Dataset ID, Field, Type, Value, and Record ID.

Dataset ID   Field      Type   Value   Record ID
Dataset A    $any       N      45      000_Dataset_A_1539458936
Dataset A    $any       N      63      003_Dataset_A_1539458929
Dataset A    $any       S      Bob     000_Dataset_A_1539458936
Dataset A    Age        N      45      000_Dataset_A_1539458936
Dataset A    Age        N      63      003_Dataset_A_1539458929
Dataset A    Name       S      Bob     000_Dataset_A_1539458936
Dataset B    $any       S      B0001   001_Dataset_B_1539458920
Dataset B    Order ID   S      B0001   001_Dataset_B_1539458920
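As a rough sketch of how entries like these could be derived from a record, the function below emits one entry under the field name and one under `$any` for every field value. The function name `index_entries`, the tuple layout, and the rule that numbers get type `N` and everything else gets `S` are assumptions inferred from the table, not a documented Koverse API.

```python
def index_entries(dataset_id, record_id, record):
    """Return (dataset, field, type, value, record_id) tuples for one record."""
    entries = []
    for field, value in record.items():
        type_code = "N" if isinstance(value, (int, float)) else "S"
        # Every value is indexed under its own field and under '$any'.
        for indexed_field in ("$any", field):
            entries.append((dataset_id, indexed_field, type_code, value, record_id))
    return entries


entries = sorted(
    index_entries("Dataset A", "000_Dataset_A_1539458936", {"Name": "Bob", "Age": 45})
)
for entry in entries:
    print(entry)
```

Sorting the tuples reproduces the order of the matching rows in the table, since `$any` sorts before the named fields.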
Index Table: Querying
Example query, evaluated against the index entries above:

SELECT * FROM TABLE WHERE
AGE > 40 AND AGE < 70 AND $any CONTAINS ‘Bob’
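To make the query concrete, the sketch below evaluates it against the sample index entries held in a Python list. It is a simplification: it assumes the query targets Dataset A, treats CONTAINS as an exact term match, and uses in-memory sets rather than the streaming comparison the slides describe. `INDEX` and `lookup` are hypothetical names.

```python
INDEX = [
    ("Dataset A", "$any", "N", 45, "000_Dataset_A_1539458936"),
    ("Dataset A", "$any", "N", 63, "003_Dataset_A_1539458929"),
    ("Dataset A", "$any", "S", "Bob", "000_Dataset_A_1539458936"),
    ("Dataset A", "Age", "N", 45, "000_Dataset_A_1539458936"),
    ("Dataset A", "Age", "N", 63, "003_Dataset_A_1539458929"),
    ("Dataset A", "Name", "S", "Bob", "000_Dataset_A_1539458936"),
    ("Dataset B", "$any", "S", "B0001", "001_Dataset_B_1539458920"),
    ("Dataset B", "Order ID", "S", "B0001", "001_Dataset_B_1539458920"),
]


def lookup(dataset, field, type_code, predicate):
    """Record IDs of index entries matching one query criterion."""
    return {
        rid
        for ds, f, t, value, rid in INDEX
        if ds == dataset and f == field and t == type_code and predicate(value)
    }


age_ids = lookup("Dataset A", "Age", "N", lambda v: 40 < v < 70)
bob_ids = lookup("Dataset A", "$any", "S", lambda v: v == "Bob")
matches = sorted(age_ids & bob_ids)
print(matches)
```

Both Age entries fall inside the range, but only one of those records also has an `$any` entry for Bob, so a single record ID survives the intersection.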
Index Table: Querying
[Figure: for the query above, the matching Record IDs found in the Index Table drive a Batch Scan that fetches the records from the Record Table (Bucket 00, Bucket 01; Dataset A, Dataset B).]
Composite Indexes
• Entries in the index table are for a single value found in a field
• Record IDs that have a particular value are stored together in sorted order
• Spanning a range of values means the Record IDs are no longer totally ordered, so we can’t do streaming set intersection

Value   Record ID
Bat     002
Bat     004
Bat     005
Bask    001
Bask    009

For Bat, the Record IDs are sorted. For Ba*, the Record IDs are NOT sorted.
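The ordering problem can be checked directly on the sample entries. In this small sketch the helper names `record_ids` and `is_sorted` are hypothetical, and the entries are sorted by (value, record ID) the way a sorted key-value table would hold them, which puts Bask ahead of Bat.

```python
# Index entries as (value, record_id), kept in sorted key order.
entries = sorted([
    ("Bat", "002"), ("Bat", "004"), ("Bat", "005"),
    ("Bask", "001"), ("Bask", "009"),
])


def record_ids(prefix):
    """Record IDs in scan order for all values starting with `prefix`."""
    return [rid for value, rid in entries if value.startswith(prefix)]


def is_sorted(ids):
    return all(a <= b for a, b in zip(ids, ids[1:]))


print(record_ids("Bat"), is_sorted(record_ids("Bat")))
print(record_ids("Ba"), is_sorted(record_ids("Ba")))
```

A scan for the single value Bat returns IDs in ascending order, while a scan over the prefix Ba concatenates two sorted runs, and the combined sequence is no longer ascending.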
Composite Indexes
• Composite indexes allow us to query for multiple ranges of values
• E.g., querying for points that fall in a specific latitude-longitude box, or querying for a time range and a range of port numbers
• Basically, this enables apps to query a multi-dimensional range
Composite Indexes
Data workers specify which fields to include in a composite index.

Record ID   Age   Height
001         14    67
002         23    72
003         13    60
004         64    68

Composite Value (interleaved bytes)   Record ID
1647                                  001
2732                                  002
1630                                  003
6648                                  004
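The composite values in the table can be reproduced by interleaving the decimal digits of Age and Height. The slide describes the real encoding as interleaved bytes, so the sketch below, with its hypothetical `interleave` function and fixed two-digit width, is only a digit-level illustration of the same idea.

```python
def interleave(*fields: int, width: int = 2) -> str:
    """Interleave the zero-padded decimal digits of the given fields."""
    digits = [f"{value:0{width}d}" for value in fields]
    # zip(*digits) walks the fields position by position: first digits, then second digits.
    return "".join("".join(column) for column in zip(*digits))


records = {"001": (14, 67), "002": (23, 72), "003": (13, 60), "004": (64, 68)}
composite_index = {rid: interleave(age, height) for rid, (age, height) in records.items()}
for rid, composite in composite_index.items():
    print(composite, rid)
```

Interleaving places the most significant digits of every field at the front of the key, so records that are close in all dimensions tend to sort near each other.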
Composite Index
Age > 30 AND Age < 50 AND height > 50 AND height < 70
[Figure: plot with Age and Height axes showing the query region, with results marked as True Positives and False Positives.]
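One plausible way to answer such a query with an interleaved key is to scan the single key range between the interleaved lower and upper corners of the box, then filter out the false positives. The slides do not spell out the scan strategy, so the approach, the sample records, and the names `interleave` and `box_query` below are all assumptions. The strict bounds of the query are rewritten as the inclusive integer bounds 31 to 49 and 51 to 69.

```python
from bisect import bisect_left, bisect_right


def interleave(age: int, height: int, width: int = 2) -> str:
    """Interleave zero-padded decimal digits of age and height."""
    a, h = f"{age:0{width}d}", f"{height:0{width}d}"
    return "".join(x + y for x, y in zip(a, h))


# Hypothetical records: record_id -> (age, height).
RECORDS = {
    "101": (40, 60), "102": (39, 90), "103": (64, 68),
    "104": (14, 67), "105": (45, 55), "106": (41, 5),
}
INDEX = sorted((interleave(a, h), rid) for rid, (a, h) in RECORDS.items())
KEYS = [key for key, _ in INDEX]


def box_query(age_lo, age_hi, h_lo, h_hi):
    """Scan one composite-key range, then split candidates by an exact check."""
    lo, hi = interleave(age_lo, h_lo), interleave(age_hi, h_hi)
    candidates = INDEX[bisect_left(KEYS, lo):bisect_right(KEYS, hi)]
    true_pos, false_pos = [], []
    for _, rid in candidates:
        age, height = RECORDS[rid]
        inside = age_lo <= age <= age_hi and h_lo <= height <= h_hi
        (true_pos if inside else false_pos).append(rid)
    return true_pos, false_pos


true_pos, false_pos = box_query(31, 49, 51, 69)
print("true positives:", true_pos)
print("false positives:", false_pos)
```

Every record inside the box has a composite key within the scanned range, because the interleaved key never decreases when either field grows. The range also sweeps in some records outside the box, which is why the exact check is still needed.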
Supply Chain Risk
Supply Chain Risk Use Case
• PricewaterhouseCoopers works with clients to evaluate risk in their supply chain
• E.g., vendors with unethical business practices, based out of sanctioned countries, history of tainted products, etc.
• Until now, analysis was very manual
• Could only evaluate each vendor every few years
• Not enough bandwidth to evaluate vendors’ vendors, or their vendors’ vendors
Automated ‘Know Your Vendor’ Solution
• Automatically evaluate vendor risk on a daily basis
• Chain through arbitrary levels of vendors in the supply chain
• Incorporate social media ML text analysis
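Chaining through vendors’ vendors can be pictured as a breadth-first walk over a supplier graph. The sketch below is a generic illustration with made-up company names. The `suppliers` mapping and the function `vendors_by_level` are hypothetical and are not taken from the solution described in the talk.

```python
from collections import deque


def vendors_by_level(suppliers, root, max_depth=None):
    """Breadth-first walk of the supplier graph, returning {vendor: level}."""
    levels = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        company, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the requested level
        for vendor in suppliers.get(company, []):
            if vendor not in seen:  # guards against cycles in the chain
                seen.add(vendor)
                levels[vendor] = depth + 1
                queue.append((vendor, depth + 1))
    return levels


# Hypothetical supply chain: V3 also sells back to the client (a cycle).
SUPPLIERS = {"Client": ["V1", "V2"], "V1": ["V3"], "V2": [], "V3": ["V4", "Client"]}
all_levels = vendors_by_level(SUPPLIERS, "Client")
print(all_levels)
```

Tracking visited companies keeps the walk finite even when the supply chain loops back on itself, and `max_depth` limits how many levels of vendors are evaluated.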
Accumulo Table Features Leveraged
• Storage of various record schemas from Excel files, databases, web services, and social media
• Incremental batch processing to refresh results on a daily basis
Cyber Monitoring
Managed Cyber Security Services Use Case
• Fully managed service provided by a multinational cybersecurity company
• Threat monitoring, detection and mitigation
• The use of Accumulo allowed the company to scale their application, which had previously been built on PostgreSQL
• Security features of Accumulo allow the managed service to be multi-tenant
Accumulo Table Features Leveraged
• Streaming writes of cyber logs using Accumulo BatchWriters
• Bulk threat detection analytics on time-windowed event data
• Aggressive use of indexing, including composite indexes, to enable scalable log search on single terms and multiple ranges
Forensic Document Search
Forensic Document Search Use Case
• Investigation team at a large pharmaceutical company
• Analysts need to search for and retrieve all relevant documents related to a case
• Many users access the application on their mobile devices
• Documents come from personal laptops, databases, email attachments, shared drives, SharePoint, etc.
• OCR allows for search on evidence photos that contain text
Accumulo Table Features Leveraged
• Storage of various record schemas resulting from differing document formats and metadata
• Indexing of all terms from document text to enable term search
• Incremental batch NLP analytics on raw document records
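Indexing all terms from document text can be sketched as tokenizing the text and emitting one index entry per distinct term. The tokenizer (lowercase alphanumeric runs), the function name `term_entries`, and the sample dataset and record ID are assumptions for illustration. The entry layout mirrors the index table shown earlier.

```python
import re


def term_entries(dataset_id, record_id, field, text):
    """Return one (dataset, field, 'S', term, record_id) entry per distinct term."""
    terms = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    return [(dataset_id, field, "S", term, record_id) for term in terms]


# Hypothetical document record.
doc_entries = term_entries("Docs", "000_Docs_1539458936", "body", "Invoice for Bob, invoice #42")
for entry in doc_entries:
    print(entry)
```

Each distinct term becomes its own entry, so a term search is a point lookup on the value and returns the record IDs of the documents that contain it.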
Questions?
