Much of today's big data processing deals with structured data, yet by some reports up to 80% of big data is unstructured. For more than a decade, RavenPack has focused on deriving value from unstructured data. Here we'll look at how we migrated to a cloud platform and what advantages this gives us. We'll also consider some options we have now that were almost entirely impractical before.
Using the cloud to process unstructured big data by Jason Cornez.
1. May 21, 2016
Using the Cloud to Process
Unstructured Big Data
J on the Beach, Malaga, Spain
RavenPack: Mapping the World’s
Big Data for Financial Applications
Jason Cornez ‒ CTO
jcornez@ravenpack.com
2. ravenpack.com | info@ravenpack.com | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90
• RavenPack delivers big data analytics to financial professionals
• Top hedge funds and investment banks use RavenPack
for trading and risk management
• Patented, proprietary technology and award-winning research
• Archive of more than 300 million documents, spanning the past 20 years
RavenPack processes hundreds of thousands of documents each day.
We produce machine readable analytics for each document in real time.
Expected processing time for a typical document is less than 250ms.
RavenPack at a Glance
3.
• Classification Overview
• Realtime Classification: Classic vs Cloud
• Historical Classification: Classic vs Cloud
• New Challenges: Spot Instances and The Weather
• New Opportunities
Contents
4.
Extract meaning from Unstructured Text
• Tokenization
• Entity Detection
• Attribute Tagging
• Event Detection
• Consolidation
A stream-based Classification Framework allows us to add new classifiers into a stream of
documents. As much as possible, classifiers use separate threads to run in parallel.
Classification Overview
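As a rough illustration of the stream-based framework described on this slide, the sketch below runs a list of classifiers over each document in separate threads and merges their output. The toy `tokenize` and `detect_entities` functions, the tiny dictionary, and the record layout are assumptions made for the example, not RavenPack's actual classifiers. The sketch also treats the classifiers as independent, whereas a real pipeline may have stages that depend on earlier ones.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List

# A classifier takes the document text and returns its annotations.
Classifier = Callable[[str], Dict[str, object]]


def tokenize(text: str) -> Dict[str, object]:
    """Toy tokenizer (assumption): split on whitespace."""
    return {"tokens": text.split()}


def detect_entities(text: str) -> Dict[str, object]:
    """Toy entity detector (assumption): match a tiny hard-coded dictionary."""
    known = {"Oracle", "Spain", "Apple"}
    return {"entities": sorted(w for w in set(text.split()) if w in known)}


def classify_stream(
    docs: Iterable[str], classifiers: List[Classifier]
) -> Iterator[Dict[str, object]]:
    """Run every classifier on each document in its own thread, then merge."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        for text in docs:
            futures = [pool.submit(c, text) for c in classifiers]
            record: Dict[str, object] = {"text": text}
            for future in futures:
                record.update(future.result())  # consolidation step
            yield record


results = list(
    classify_stream(["Oracle opens office in Spain"], [tokenize, detect_entities])
)
```

Adding a new classifier in this sketch means appending one more function to the list passed to `classify_stream`.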
5.
• Dictionary of nearly 400,000 entities
• Point-in-time aware
• Rules per entity type
• Extensive entity relationship modeling
• Supports metadata and other hints
• Equivalent terms and stop words
We support: company (Oracle Corp.), organization (European Union), geo-political place
(Spain), currency (US Dollar), nationality (Spanish), people (Barack Obama), commodity
(Crude Oil), position (CEO, President), team (Real Madrid), product (iPhone 6S), and more.
Entity Detection
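One way to picture a "point-in-time aware" dictionary is a lookup where each name is only valid for a date range, so the same string can resolve to different entities depending on the document date. The sketch below assumes a hypothetical `NameRecord` layout and made-up entity IDs and names; the slide does not describe the real schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class NameRecord:
    entity_id: str
    name: str
    valid_from: date
    valid_to: Optional[date]  # None means the name is still in use


# Hypothetical records: ENT-1 was renamed in 2010, and ENT-2 later
# adopted the old name.
RECORDS = [
    NameRecord("ENT-1", "Acme Widgets", date(1990, 1, 1), date(2009, 12, 31)),
    NameRecord("ENT-1", "Acme Global", date(2010, 1, 1), None),
    NameRecord("ENT-2", "Acme Widgets", date(2012, 6, 1), None),
]


def resolve(name: str, as_of: date) -> List[str]:
    """Return the entity IDs that carried `name` on the date `as_of`."""
    return [
        r.entity_id
        for r in RECORDS
        if r.name == name
        and r.valid_from <= as_of
        and (r.valid_to is None or as_of <= r.valid_to)
    ]


old_match = resolve("Acme Widgets", date(2005, 3, 1))  # document from 2005
new_match = resolve("Acme Widgets", date(2015, 3, 1))  # document from 2015
```

With these records, a 2005 document mentioning "Acme Widgets" resolves to ENT-1, while a 2015 document resolves to ENT-2. This matters when classifying a 20-year archive.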
6.
Example: People Detection
• Many people share the same or similar names
• Many people hold various positions at employers across time
• People have one or more nationalities
• People are related to other people
Melanie Griffith files for divorce from Banderas
Mai And Banderas Star In The New The King Of Fighters XIV Trailer
After year out, Tim Cook joins competitive Oregon State running back battle
Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store
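The two "Tim Cook" headlines above show why a name alone is not enough. A minimal sketch of context-based disambiguation, assuming each candidate carries a small set of hint words (employer, position, related terms), could look like this. The candidate IDs, hint sets, and the simple overlap score are illustrative assumptions, not the rules RavenPack uses.

```python
from typing import Dict, Optional, Set

# Hypothetical candidates for one shared name, each with context hints.
CANDIDATES: Dict[str, Dict[str, Set[str]]] = {
    "Tim Cook": {
        "tim-cook-executive": {"apple", "ceo", "ipad"},
        "tim-cook-athlete": {"oregon", "state", "running", "back"},
    }
}


def disambiguate(name: str, headline: str) -> Optional[str]:
    """Pick the candidate whose hints overlap most with the headline words."""
    words = {w.strip(".,!?").lower() for w in headline.split()}
    best_id: Optional[str] = None
    best_score = 0
    for candidate_id, hints in CANDIDATES.get(name, {}).items():
        score = len(hints & words)
        if score > best_score:
            best_id, best_score = candidate_id, score
    return best_id  # None when no hint matches at all


athlete = disambiguate(
    "Tim Cook",
    "After year out, Tim Cook joins competitive Oregon State running back battle",
)
executive = disambiguate(
    "Tim Cook",
    "Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store",
)
```

Returning `None` when nothing matches keeps the sketch from guessing. A production system would presumably also use positions over time, nationalities, and relationships, as the bullets above suggest.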
8.
Cloud Model using AWS
• CloudFormation to model the Stack
• Unlimited, Distributed Storage
• Easy redundancy, failover and backup
Use Case: Realtime Classification
[Diagram: Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis; components: RT Feed, Snapshots, Classifiers, Collectors]
9.
• Lose central RDBMS → Lose transactions
• S3 great for documents, but no index
• DynamoDB great for index, but...
    - Must manage throughput
    - No foreign keys or integrity constraints
    - Eventual consistency
• Redshift amazing for OLAP, but not OLTP
    - So use Kinesis to stream and then batch
• Schema-free is a myth
Applications are more flexible and scalable, but also more complex.
Cloud Migration Challenges
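The "stream and then batch" idea can be sketched without any AWS calls: records arrive one at a time, are buffered, and are handed to a bulk loader once the buffer is large enough or old enough. The `MicroBatcher` class, its thresholds, and the `flush_fn` callback are assumptions for illustration. In a real setup the callback might write a file to S3 and trigger a bulk load into the warehouse.

```python
import time
from typing import Callable, Dict, List

Record = Dict[str, object]


class MicroBatcher:
    """Buffer streamed records and hand them to `flush_fn` in batches."""

    def __init__(
        self,
        flush_fn: Callable[[List[Record]], None],
        max_records: int = 500,
        max_age_s: float = 60.0,
    ) -> None:
        self._flush_fn = flush_fn
        self._max_records = max_records
        self._max_age_s = max_age_s
        self._buffer: List[Record] = []
        self._started = 0.0

    def add(self, record: Record) -> None:
        """Add one record; flush when the batch is full or too old.

        The age check only runs when a record arrives (there is no timer).
        """
        if not self._buffer:
            self._started = time.monotonic()
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_records
        too_old = time.monotonic() - self._started >= self._max_age_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything, as one batch."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._flush_fn(batch)


batches: List[List[Record]] = []
batcher = MicroBatcher(batches.append, max_records=3)
for i in range(7):
    batcher.add({"doc_id": i})
batcher.flush()  # send the final partial batch
batch_sizes = [len(b) for b in batches]
```

Batching this way turns many small writes into a few large loads, which suits an OLAP store better than row-at-a-time inserts.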
10.
Classic Model
• Same Limited Set of Servers, Same RDBMS
• Can affect Realtime System, Backups
• Full archive, 4-6 Classifiers → 6 weeks!
Use Case: History Classification
[Diagram: RDBMS, Files, Classifiers]
11.
Cloud Model using AWS
• Servers on Demand, Distributed Storage
• Independent of Realtime System
• Full archive, 100 Classifiers → 3 days!
Use Case: History Classification
[Diagram: Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon Redshift; components: Availability Zones, Classifiers, Coordinator]
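To put the "6 weeks" and "3 days" figures side by side, the short calculation below converts them into aggregate throughput. It assumes the archive size quoted earlier in the deck (more than 300 million documents, so these are lower bounds) and treats a week as seven full days of processing.

```python
ARCHIVE_DOCS = 300_000_000  # "more than 300 million documents" (lower bound)
SECONDS_PER_DAY = 86_400


def docs_per_second(total_docs: int, days: float) -> float:
    """Average aggregate throughput needed to finish in `days` days."""
    return total_docs / (days * SECONDS_PER_DAY)


classic_rate = docs_per_second(ARCHIVE_DOCS, 6 * 7)  # classic model: 6 weeks
cloud_rate = docs_per_second(ARCHIVE_DOCS, 3)        # cloud model: 3 days
speedup = cloud_rate / classic_rate

print(f"classic: {classic_rate:.0f} docs/s")
print(f"cloud:   {cloud_rate:.0f} docs/s")
print(f"speedup: {speedup:.0f}x")
```

That works out to roughly 83 versus 1,157 documents per second, a 14x difference in wall-clock time. The cloud run also applied far more classifiers (100 versus 4-6), so the gain in total work done is larger than the elapsed-time ratio alone suggests.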
12.
Classic Model - Clear skies!
• Well-known resources
• Predictable workload
• Predictable behavior
• Stable Behavior
We have full control over the resources.
We expect a service to be started seldom
and to run for a long time without interruption.
The Weather
13.
Cloud Model - Spot Instances
• Bid for unused capacity
• Save money, control costs
• Great for jobs with no specific deadline
• Possible to bid above on-demand rates
Typically pay 1/2 to 1/10 the “on-demand” rates.
We use spot instances for our historical
classification runs.
The Weather
14.
Cloud Model - Warning! Uncertain Conditions
• Someone else’s resources
• Unpredictable behavior
• Easy to move the spot market
We have no control over the resources or who
else might be using them. We expect a server
can be killed with little notice.
The Weather
15.
Cloud Model - Warning! Uncertain Conditions
• Do work in multiple zones
• Optimize image startup
• Group work into well-defined chunks
• Use on-demand instances for co-ordination
Expect inclement weather and be prepared for it!
Dealing with Bad Weather
[Diagram: Availability Zone]
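Two of the bullets above, grouping work into well-defined chunks and keeping coordination on on-demand instances, can be sketched as a small in-memory coordinator. It leases chunks to workers and puts a chunk back in the queue if its worker disappears. The `Coordinator` class, the worker IDs, and the chunk representation are assumptions for illustration. A real coordinator would also need persistence and a way to detect lost workers, such as heartbeats or termination notices.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Chunk = Tuple[int, int]  # half-open range [start, end) of document offsets


def make_chunks(total_docs: int, chunk_size: int) -> List[Chunk]:
    """Split the archive into fixed-size, independently processable chunks."""
    return [
        (start, min(start + chunk_size, total_docs))
        for start in range(0, total_docs, chunk_size)
    ]


class Coordinator:
    """Tracks which chunks are pending, leased to a worker, or done."""

    def __init__(self, chunks: List[Chunk]) -> None:
        self.pending: Deque[Chunk] = deque(chunks)
        self.leased: Dict[str, Chunk] = {}
        self.done: List[Chunk] = []

    def lease(self, worker_id: str) -> Optional[Chunk]:
        """Give the worker its next chunk, or None if it is busy or none remain."""
        if worker_id in self.leased or not self.pending:
            return None
        chunk = self.pending.popleft()
        self.leased[worker_id] = chunk
        return chunk

    def complete(self, worker_id: str) -> None:
        """Mark the worker's current chunk as finished."""
        self.done.append(self.leased.pop(worker_id))

    def worker_lost(self, worker_id: str) -> None:
        """A spot instance was reclaimed: return its chunk to the queue."""
        chunk = self.leased.pop(worker_id, None)
        if chunk is not None:
            self.pending.append(chunk)


coordinator = Coordinator(make_chunks(total_docs=10, chunk_size=4))
coordinator.lease("spot-a")
coordinator.lease("spot-b")
coordinator.worker_lost("spot-a")  # spot-a is killed mid-chunk
coordinator.complete("spot-b")
while coordinator.lease("spot-c") is not None:
    coordinator.complete("spot-c")
```

Because a lost worker costs at most one chunk of repeated work, smaller chunks limit the damage from an interruption at the price of more coordination overhead.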
16.
Download a Custom “Slice” of Analytics Data
• Provide a Web-API and Web Service
• Let client specify parameters
    - Data Set and Time Range
    - Entities and Events
    - Filters
• Leverage Amazon Redshift and S3
• Compression and Multiple Output Formats
Opportunity: Self-Service Data
[Diagram: Amazon S3, Amazon Redshift, Amazon EC2, Amazon API Gateway]
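A self-service "slice" request boils down to turning client parameters into a safe query. The sketch below assumes a hypothetical `SliceRequest` shape, hypothetical table and column names, and `%s` placeholders in the style used by several PostgreSQL-compatible Python drivers. None of these details are taken from the talk.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

# Hypothetical data set names, whitelisted because table names
# cannot be passed as bound parameters.
ALLOWED_DATASETS = {"analytics_equities", "analytics_macro"}


@dataclass
class SliceRequest:
    dataset: str
    start: date
    end: date
    entity_ids: List[str] = field(default_factory=list)
    event_types: List[str] = field(default_factory=list)


def build_query(req: SliceRequest) -> Tuple[str, list]:
    """Translate a client request into SQL text plus bound parameters."""
    if req.dataset not in ALLOWED_DATASETS:
        raise ValueError(f"unknown data set: {req.dataset!r}")
    if req.start > req.end:
        raise ValueError("start date must not be after end date")

    sql = f"SELECT * FROM {req.dataset} WHERE event_date BETWEEN %s AND %s"
    params: list = [req.start, req.end]

    # Optional filters: only add a clause when the client supplied values.
    for column, values in (
        ("entity_id", req.entity_ids),
        ("event_type", req.event_types),
    ):
        if values:
            placeholders = ", ".join(["%s"] * len(values))
            sql += f" AND {column} IN ({placeholders})"
            params.extend(values)
    return sql, params


example_sql, example_params = build_query(
    SliceRequest(
        "analytics_equities",
        date(2015, 1, 1),
        date(2015, 12, 31),
        entity_ids=["E1", "E2"],
    )
)
```

Keeping client values in the parameter list rather than in the SQL string is the main design point: the web service can accept arbitrary filters without exposing the warehouse to injection.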
17.
• Let Clients upload Proprietary Content
to a Private and Secure VPC
• Provision Computing and Storage Resources
on a Per Project Basis
• View Private Analytics in Isolation or Alongside
Standard RavenPack Analytic DataSets
• Everything Goes Away when Project Completes
Opportunity: The RavenPack Cloud
[Diagram: Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon Redshift, Amazon EC2, AWS CloudFormation, Amazon CloudSearch]
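Per-project provisioning that "goes away" at the end maps naturally onto one CloudFormation stack per project: create the stack when the project starts, delete it when the project completes. The sketch below only generates a minimal template as JSON. The resource names, the tag key, and the choice of one VPC plus one bucket are assumptions for illustration, not the actual RavenPack Cloud design.

```python
import json
from typing import Dict


def project_template(project_id: str, cidr: str = "10.0.0.0/16") -> Dict:
    """Build a minimal CloudFormation template for one isolated project."""
    tags = [{"Key": "project", "Value": project_id}]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Isolated resources for project {project_id}",
        "Resources": {
            # Private network for the client's proprietary content.
            "ProjectVpc": {
                "Type": "AWS::EC2::VPC",
                "Properties": {"CidrBlock": cidr, "Tags": tags},
            },
            # Storage for uploaded documents and resulting analytics.
            "ProjectBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"Tags": tags},
            },
        },
    }


template = project_template("demo-project")
template_json = json.dumps(template, indent=2)
```

Tagging every resource with the project ID keeps costs attributable per project, and tying all resources to one stack gives a single place to tear everything down.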
18. May 21, 2016
Using the Cloud to Process
Unstructured Big Data
J on the Beach, Malaga, Spain
Thank you! Gracias!
Jason Cornez ‒ CTO
jcornez@ravenpack.com