The Internet in a Database
A Cassandra Use Case
Data on the Web
DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
● 48 billion pages on the Internet
● 56 million GB of data
● Incredibly powerful connections
● 70% of useful data is unstructured
● User generated data + facts
Too Much Data…
Search
● Modern search engines
○ Unstructured data
○ Unconnected data
○ Unnormalized data
● Goals
○ Collect vast amounts of data through web crawling
○ Normalize and deduplicate data
○ Make it searchable and meaningful
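The "normalize and deduplicate" goal can be sketched roughly as below. This is a minimal illustration, not Datafiniti's actual pipeline: the record fields (`name`, `address`) and the `normalize_key` helper are hypothetical, and real normalization would need many more rules.

```python
import re

def normalize_key(record):
    """Build a comparison key from the hypothetical 'name' and 'address' fields."""
    parts = []
    for field in ("name", "address"):
        value = record.get(field, "")
        # Lowercase, drop punctuation, then collapse runs of whitespace.
        value = re.sub(r"[^\w\s]", "", value.lower())
        parts.append(" ".join(value.split()))
    return "|".join(parts)

def deduplicate(records):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for record in records:
        seen.setdefault(normalize_key(record), record)
    return list(seen.values())

# Made-up crawl output: the first two rows describe the same place.
crawled = [
    {"name": "Joe's Pizza", "address": "12 Main St."},
    {"name": "JOES PIZZA ", "address": "12  Main St"},
    {"name": "Joe's Pizza", "address": "99 Oak Ave"},
]
unique = deduplicate(crawled)
```

Here the first two records collapse into one because they share a normalized key, while the third survives as a separate location.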
Needs
● Speed
● Scale
● Adaptable
The Solution: Cassandra
● Very fast
○ Log-structured storage
● Easily scalable
○ Decentralized rings
● Completely adaptable
○ Schema-less key/value store
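The "decentralized rings" point refers to how Cassandra places each row key on a token ring and lets the node that owns that token range store it, with no central coordinator. The toy sketch below illustrates the idea only. It assumes MD5-derived tokens and one token per node, and the node names are hypothetical; real clusters use a configurable partitioner and typically many tokens per node.

```python
import bisect
import hashlib

class TokenRing:
    """Toy token ring: each node owns the key range ending at its token."""

    def __init__(self, nodes):
        # One token per node, derived from the node name (an assumption).
        self._ring = sorted((self._token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    @staticmethod
    def _token(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First node whose token is >= the key's token, wrapping around.
        index = bisect.bisect_left(self._tokens, self._token(key))
        return self._ring[index % len(self._ring)][1]

ring = TokenRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("location:12345")
```

Any client can compute the owner of a key from the ring alone, which is why nodes can be added without a central lookup service.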
…Almost
● Useful searching was missing
○ Secondary indexes not flexible
○ No free text searches
○ No (reasonable) range queries
What We Needed
● Pros: Full control over indexing
● Cons: Not scalable
Putting It All Together
● Reasons to go with DSE (DataStax Enterprise)
○ Combines Cassandra and Solr
○ Constant refinements and integrations
○ Support
Our Stack
[Diagram: Web Crawling, Normalization, Load Balancing, and three Cassandra + Solr nodes]
Cassandra / Solr Setup
● 3 column families / 3 cores
○ Locations
○ Products
○ People
● 73,114,909 records
Data: Locations
● 29,818,644 records
● Interesting data
○ Reviews
○ Revenue
○ Contact information
● Businesses vs. Locations
○ Unique key
○ Location specific user data
Data: Products
● 18,470,005 records
● Interesting data
○ Categories
○ Price
○ Reviews
● Challenges
○ Too many unique keys
Data: People
● 24,826,260 records
● Interesting data
○ Work History
○ Education History
○ Location
● Challenges
○ Normalization
○ Identification
Challenges
● Memory
● Speed
● Space
● Representation
Challenges: Memory
● Multi-minute garbage collection
● Exponential increase in frequency
● Virtual memory confusion
● Solr + Cassandra
● Heap Size vs Buffer Cache
● Bash Scripts
Challenges: Speed
● Upgrade
○ Better memory management
○ Smaller index size
● Reduce index size
● Future: Solaris
Challenges: Speed
● Providing a real-time service
● Issues
○ Solr not inherently real time
○ Search speeds
○ I/O
Challenges: Speed
● Solr Solution: DSE integration leverages
○ Cassandra's speed
○ Cassandra's caches
○ Cassandra's distribution
○ Solr caches less useful
Challenges: Speed
● Search complexity solution
○ Text vs String indexing
○ Uniqueness vs Flexibility
○ Leveraging Cassandra
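The "Text vs String indexing" trade-off can be illustrated with a toy model. In Solr, a string field is indexed as one exact term, while a text field is passed through an analyzer and split into tokens. The sketch below uses plain Python dictionaries, not Solr itself, and the lowercase-and-split analyzer and sample documents are simplifying assumptions.

```python
from collections import defaultdict

def build_string_index(docs, field):
    """Exact-match index: the whole field value is a single term."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        index[doc[field]].add(doc_id)
    return index

def build_text_index(docs, field):
    """Tokenized index: lowercase the value and split on whitespace."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc[field].lower().split():
            index[token].add(doc_id)
    return index

# Made-up product names for illustration.
docs = {
    1: {"name": "Blue Widget Deluxe"},
    2: {"name": "Red Widget"},
}
string_index = build_string_index(docs, "name")
text_index = build_text_index(docs, "name")

exact_hits = string_index.get("Red Widget", set())  # whole value matches
partial_hits = string_index.get("widget", set())    # no partial matching
token_hits = text_index.get("widget", set())        # matches both documents
```

The string index holds one term per document and supports exact, unique lookups; the text index is more flexible for search but holds more terms, which is the size cost the later "index as little as possible" advice points at.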
Challenges: Speed
● I/O Solution
○ Cassandra's built in mapping
○ Increase disk access speeds (SSDs)
■ Not cost effective
○ Future: Solaris
Challenges: Space
● Field corruption
○ Caused by improper encoding
○ Exponential growth
○ Fills up Solr index
● Locate, inspect & remove corrupt records
Challenges: Space
● Solr index issue
○ No compression (vs Cassandra)
○ Must adjust indexing
● Key things to keep in mind
○ Size of fields
○ Scale vs Flexibility
○ Index as little as possible
Challenges: Representation
● Cassandra is flat
● Actual data is not flat
○ Reviews
○ Price information
● Many different output formats
○ CSV, JSON, XML, etc.
Challenges: Representation
● Solution: Flatten when possible
○ E.g. Address object -> Separate fields
● Internal subgroup representation
○ Composite keys (Occasionally)
■ Known subgroups
■ Non multiple subgroups
○ Dynamic fields
■ Composite field + Dynamic tag
■ E.g. review.text_<tag>
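The flattening approach can be sketched as follows. This is an illustration under assumptions, not the authors' code: nested objects become dotted field names, and each item in a repeated subgroup gets a dynamic tag suffix in the spirit of `review.text_<tag>`. Using the list position as the tag, and the sample field names, are hypothetical choices.

```python
def flatten(record):
    """Flatten nested dicts and lists of dicts into a single-level dict.

    Nested dicts become dotted names (address.city). Each item in a list
    of dicts gets a dynamic tag suffix (review.text_0, review.text_1).
    Using the list position as the tag is an assumption for illustration.
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            for tag, item in enumerate(value):
                for sub_key, sub_value in item.items():
                    flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            flat[key] = value
    return flat

# Made-up nested record.
nested = {
    "name": "Joe's Pizza",
    "address": {"street": "12 Main St", "city": "Austin"},
    "review": [
        {"text": "Great crust", "rating": 5},
        {"text": "Slow service", "rating": 2},
    ],
}
flat = flatten(nested)
```

The result is a flat set of columns such as `address.city`, `review.text_0`, and `review.rating_1`, which fits a flat column family while keeping each review's fields grouped by tag.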
Challenges: Representation
● Robust and adaptable conversion package
● JSON -> Internal
○ Solr returns JSON
● Internal -> CSV, JSON, XML
○ User defined views
○ Specify field groupings
○ Specify partitioning
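A conversion layer of this kind might look roughly like the sketch below, which renders the same flat internal records as JSON, CSV, or XML using only the Python standard library. The function names, the `view` list (standing in for a user-defined view), and the sample record are all hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

def to_json(records, fields):
    """Emit only the fields named in the view, in view order."""
    return json.dumps([{f: r.get(f) for f in fields} for r in records])

def to_csv(records, fields):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

def to_xml(records, fields):
    root = ET.Element("records")
    for r in records:
        node = ET.SubElement(root, "record")
        for f in fields:
            # Field names go in an attribute so any name is allowed.
            ET.SubElement(node, "field", name=f).text = str(r.get(f, ""))
    return ET.tostring(root, encoding="unicode")

# Made-up internal record; 'internal_id' is not part of the view.
records = [{"name": "Joe's Pizza", "address.city": "Austin", "internal_id": 7}]
view = ["name", "address.city"]

as_json = to_json(records, view)
as_csv = to_csv(records, view)
as_xml = to_xml(records, view)
```

Keeping one internal form and one small writer per output format means a new view only changes the field list, not the writers.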
Future Work
● Memory Usage
● Speed
● Space
● Containers
Future Work: Memory
● Java 7 G1 (Garbage First) Collector
○ Ideal for large heaps
○ Big Data Sets
○ Bursty Workloads
Future Work: Speed
● Solaris Kernel Scheduler > Linux Kernel Scheduler
○ (At large number of cores)
● Drastically increase IOPS
○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
○ Reduce needed size of SSD
■ More, smaller SSDs in the ZFS pool
○ Fewer moving parts
Future Work: Space
● Caching at PCIe, Storing on SATA III
○ Cheaper larger storage via ZFS pools
○ Easier to grow
● ZFS Compression (LZ4)
○ Replaces Cassandra's Snappy compression
○ Very fast lossless compression (400 MB/s per core)
○ Scales to multiple CPUs
○ Hits the RAM speed limit
Future Work: Containers
● OS Level virtualization
○ Resource control
○ Boundary separation
● More control over Cassandra resources
● Better snapshots (whole machine)
● Hardware abstracted out
○ Many disks represented as single space
○ Easily add or remove hardware
Questions?
https://www.datafiniti.net
http://blog.datafiniti.net
@datafiniti
Addendum 1
ZFS Comparison
Name              Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 (r97)         2.084   410                  1810
LZO 2.06          2.106   409                  600
QuickLZ 1.5.1b6   2.237   373                  420
Snappy 1.1.0      2.091   323                  1070
LZF               2.077   270                  570
zlib 1.2.8 -1     2.730   65                   280
LZ4 HC (r97)      2.720   25                   2040
zlib 1.2.8 -6     3.099   21                   300
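The ratio and throughput columns can be measured the same way for any codec. The rough sketch below does this with Python's standard `zlib` module at levels 1 and 6, matching the two zlib rows in the table. The sample payload is made up, and timings taken from Python on an arbitrary machine should not be expected to match the table's figures.

```python
import time
import zlib

def measure(data, level):
    """Return (compression ratio, compression MB/s) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Lossless round trip: decompressing must return the original bytes.
    assert zlib.decompress(compressed) == data
    ratio = len(data) / len(compressed)
    throughput = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
    return ratio, throughput

# Made-up, highly repetitive payload; real crawl data will behave differently.
sample = b"name=Joe's Pizza;city=Austin;rating=5\n" * 20000

results = {level: measure(sample, level) for level in (1, 6)}
```

Running the same measurement on a representative slice of real data is the only reliable way to compare codecs for a given workload, since both ratio and speed depend heavily on the input.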

More Related Content

What's hot

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redis
inovia
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDB
William LaForest
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
Ashish Karki
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 Minutes
András Fehér
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
Eduard Tudenhoefner
 
NoSQL document oriented data access for .net systems with postgresql and marten
NoSQL document oriented data access for .net systems with postgresql and martenNoSQL document oriented data access for .net systems with postgresql and marten
NoSQL document oriented data access for .net systems with postgresql and marten
Bojan Veljanovski
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»
Olga Lavrentieva
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_data
Alex Pinkin
 
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB
 
Sql no sql comparision
Sql no sql comparisionSql no sql comparision
Sql no sql comparision
zwak1234
 
Pandas
PandasPandas
RubiX
RubiXRubiX
NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social Web
Bogdan Gaza
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
Shubham Tagra
 
Why MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - HabilelabsWhy MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - Habilelabs
Habilelabs
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developer
Jesus Rodriguez
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
Accumulo Summit
 
The DBpedia databus
The DBpedia databusThe DBpedia databus
The DBpedia databus
Leipziger Semantic Web Tag
 

What's hot (20)

Steam Learn: An introduction to Redis
Steam Learn: An introduction to RedisSteam Learn: An introduction to Redis
Steam Learn: An introduction to Redis
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
 
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Chicago 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDB
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
Using MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 MinutesUsing MongoDB For BigData in 20 Minutes
Using MongoDB For BigData in 20 Minutes
 
NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
NoSQL document oriented data access for .net systems with postgresql and marten
NoSQL document oriented data access for .net systems with postgresql and martenNoSQL document oriented data access for .net systems with postgresql and marten
NoSQL document oriented data access for .net systems with postgresql and marten
 
«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»«NoSQL Databases and Polyglot Persistence»
«NoSQL Databases and Polyglot Persistence»
 
Austin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_dataAustin bdug 2011_01_27_small_and_big_data
Austin bdug 2011_01_27_small_and_big_data
 
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep DiveMongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
MongoDB .local Houston 2019: MongoDB Atlas Data Lake Technical Deep Dive
 
Sql no sql comparision
Sql no sql comparisionSql no sql comparision
Sql no sql comparision
 
Pandas
PandasPandas
Pandas
 
RubiX
RubiXRubiX
RubiX
 
NoSQL in the context of Social Web
NoSQL in the context of Social WebNoSQL in the context of Social Web
NoSQL in the context of Social Web
 
Alluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the CloudAlluxio Data Orchestration Platform for the Cloud
Alluxio Data Orchestration Platform for the Cloud
 
Why MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - HabilelabsWhy MongoDB over other Databases - Habilelabs
Why MongoDB over other Databases - Habilelabs
 
Nosql databases for the .net developer
Nosql databases for the .net developerNosql databases for the .net developer
Nosql databases for the .net developer
 
Introducing Datawave
Introducing DatawaveIntroducing Datawave
Introducing Datawave
 
The DBpedia databus
The DBpedia databusThe DBpedia databus
The DBpedia databus
 

Similar to The Internet in Database: A Cassandra Use Case

TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
Hakan Ilter
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
TusharAgarwal49094
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
felixbarny
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
MartinStrycek
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
Michael Spector
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
Athens Big Data
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
Jihoon Son
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Databricks
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
Vinoth Chandar
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
Corey Huinker
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
Gruter
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
Jihoon Son
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks
 
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Michael Bohlig
 
Data Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DCData Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DC
Charlie Reverte
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
Amihay Zer-Kavod
 

Similar to The Internet in Database: A Cassandra Use Case (20)

TRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use CaseTRHUG 2015 - Veloxity Big Data Migration Use Case
TRHUG 2015 - Veloxity Big Data Migration Use Case
 
week1slides1704202828322.pdf
week1slides1704202828322.pdfweek1slides1704202828322.pdf
week1slides1704202828322.pdf
 
Elasticsearch as a time series database
Elasticsearch as a time series databaseElasticsearch as a time series database
Elasticsearch as a time series database
 
Piano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processingPiano Media - approach to data gathering and processing
Piano Media - approach to data gathering and processing
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
21st Athens Big Data Meetup - 1st Talk - Fast and simple data exploration wit...
 
Introduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big DataIntroduction to Apache Tajo: Data Warehouse for Big Data
Introduction to Apache Tajo: Data Warehouse for Big Data
 
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal...
 
Hoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on SparkHoodie: How (And Why) We built an analytical datastore on Spark
Hoodie: How (And Why) We built an analytical datastore on Spark
 
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB AtlasMongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
MongoDB World 2019: Packing Up Your Data and Moving to MongoDB Atlas
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
Amazon Redshift - Bay Area CloudSearch Meetup June 19, 2013
 
Data Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DCData Lessons Learned at Scale - Big Data DC
Data Lessons Learned at Scale - Big Data DC
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
 
Data Platform in the Cloud
Data Platform in the CloudData Platform in the Cloud
Data Platform in the Cloud
 

Recently uploaded

leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
saastr
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely
Azure API Management to expose backend services securely
Dinusha Kumarasiri
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Serial Arm Control in Real Time Presentation
Serial Arm Control in Real Time PresentationSerial Arm Control in Real Time Presentation
Serial Arm Control in Real Time Presentation
tolgahangng
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
UI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentationUI5 Controls simplified - UI5con2024 presentation
UI5 Controls simplified - UI5con2024 presentation
Wouter Lemaire
 
Trusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process MiningTrusted Execution Environment for Decentralized Process Mining
Trusted Execution Environment for Decentralized Process Mining
LucaBarbaro3
 

Recently uploaded (20)

leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
Overcoming the PLG Trap: Lessons from Canva's Head of Sales & Head of EMEA Da...
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Azure API Management to expose backend services securely
Azure API Management to expose backend services securelyAzure API Management to expose backend services securely

The Internet in a Database: A Cassandra Use Case

  • 1. The Internet in a Database: A Cassandra Use Case
  • 2. Data on the Web
    DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
    ● 48 billion pages on the Internet
    ● 56 million GB of data
    ● Incredibly powerful connections
    ● 70% of useful data is unstructured
    ● User generated data + facts
  • 3. Too Much Data…
  • 4. Search
    ● Modern search engines
      ○ Unstructured data
      ○ Unconnected data
      ○ Unnormalized data
  • 5.
    ● Goals
      ○ Collect vast amounts of data through web crawling
      ○ Normalize and deduplicate data
      ○ Make it searchable and meaningful
  • 6. Needs
    ● Speed
    ● Scale
    ● Adaptable
  • 7. The Solution
    ● Very fast
      ○ Log-structured storage
    ● Easily scalable
      ○ Decentralized rings
    ● Completely adaptable
      ○ Schema-less key/value store
  • 8. …Almost
    ● Useful searching was missing
      ○ Secondary indexes not flexible
      ○ No free text searches
      ○ No (reasonable) range queries
  • 9. What We Needed
    ● Pros: Full control over indexing
    ● Cons: Not scalable
  • 10. Putting It All Together
    ● Reasons to go with DSE
      ○ Combines Cassandra and Solr
      ○ Constant refinements and integrations
      ○ Support
  • 11. Our Stack
    [Diagram: Web Crawling, Normalization, three Cassandra/Solr nodes, Load Balancing]
  • 12. Cassandra / Solr Setup
    ● 3 column families / 3 cores
      ○ Locations
      ○ Products
      ○ People
    ● 73,114,909 records
  • 13. Data: Locations
    ● 29,818,644 records
    ● Interesting data
      ○ Reviews
      ○ Revenue
      ○ Contact information
    ● Businesses vs. Locations
      ○ Unique key
      ○ Location specific user data
  • 14. Data: Products
    ● 18,470,005 records
    ● Interesting data
      ○ Categories
      ○ Price
      ○ Reviews
    ● Challenges
      ○ Too many unique keys
  • 15. Data: People
    ● 24,826,260 records
    ● Interesting data
      ○ Work History
      ○ Education History
      ○ Location
    ● Challenges
      ○ Normalization
      ○ Identification
  • 16. Challenges
    ● Memory
    ● Speed
    ● Space
    ● Representation
  • 17. Challenges: Memory
    ● Multi-minute garbage collection
    ● Exponential increase in frequency
    ● Virtual memory confusion
    ● Solr + Cassandra
    ● Heap Size vs Buffer Cache
    ● Bash Scripts
  • 18. Challenges: Speed
    ● Upgrade
      ○ Better memory management
      ○ Smaller index size
    ● Reduce index size
    ● Future: Solaris
  • 19. Challenges: Speed
    ● Providing a real-time service
    ● Issues
      ○ Solr not inherently real time
      ○ Search speeds
      ○ I/O
  • 20. Challenges: Speed
    ● Solr Solution: DSE integration leverages
      ○ Cassandra's speed
      ○ Cassandra's caches
      ○ Cassandra's distribution
      ○ Solr caches less useful
  • 21. Challenges: Speed
    ● Search complexity solution
      ○ Text vs String indexing
      ○ Uniqueness vs Flexibility
      ○ Leveraging Cassandra
  • 22. Challenges: Speed
    ● I/O Solution
      ○ Cassandra's built-in mapping
      ○ Increase disk access speeds (SSDs)
        ■ Not cost effective
      ○ Future: Solaris
  • 23. Challenges: Space
    ● Field corruption
      ○ Caused by improper encoding
      ○ Exponential growth
      ○ Fills up Solr index
    ● Locate, inspect & remove corrupt records
  • 24. Challenges: Space
    ● Solr index issue
      ○ No compression (vs Cassandra)
      ○ Must adjust indexing
    ● Key things to keep in mind
      ○ Size of fields
      ○ Scale vs Flexibility
      ○ Index as little as possible
  • 25. Challenges: Representation
    ● Cassandra is flat
    ● Actual data is not flat
      ○ Reviews
      ○ Price information
    ● Many different output formats
      ○ CSV, JSON, XML, etc.
  • 26. Challenges: Representation
    ● Solution: Flatten when possible
      ○ E.g. Address object -> Separate fields
    ● Internal subgroup representation
      ○ Composite keys (Occasionally)
        ■ Known subgroups
        ■ Non multiple subgroups
      ○ Dynamic fields
        ■ Composite field + Dynamic tag
        ■ E.g. review.text_<tag>
  • 27. Challenges: Representation
    ● Robust and adaptable conversion package
    ● JSON -> Internal
      ○ Solr returns JSON
    ● Internal -> CSV, JSON, XML
      ○ User defined views
      ○ Specify field groupings
      ○ Specify partitioning
  • 29. Future Work
    ● Memory Usage
    ● Speed
    ● Space
    ● Containers
  • 30. Future Work: Memory
    ● Java 7 G1 (Garbage First) Collector
      ○ Ideal for large heaps
      ○ Big Data Sets
      ○ Bursty Workloads
  • 31. Future Work: Speed
    ● Solaris Kernel Scheduler > Linux Kernel Scheduler
      ○ (At large number of cores)
    ● Drastically increase IOPS
      ○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
      ○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
      ○ Reduce needed size of SSD
        ■ More smaller SSDs in ZFS pool
      ○ Fewer moving parts
  • 32. Future Work: Space
    ● Caching at PCIe, Storing on SATA III
      ○ Cheaper larger storage via ZFS pools
      ○ Easier to grow
    ● ZFS Compression (LZ4)
      ○ Replaces Cassandra's Snappy compression
      ○ Very fast lossless compression (400 MB/s per core)
      ○ Scales to multiple CPUs
      ○ Hits the RAM speed limit
  • 33. Future Work: Containers
    ● OS Level virtualization
      ○ Resource control
      ○ Boundary separation
    ● More control over Cassandra resources
    ● Better snapshots (whole machine)
    ● Hardware abstracted out
      ○ Many disks represented as single space
      ○ Easily add or remove hardware
  • 35. Addendum 1: ZFS Comparison
    Name              Ratio   Compression (MB/s)   Decompression (MB/s)
    LZ4 (r97)         2.084   410                  1810
    LZO 2.06          2.106   409                  600
    QuickLZ 1.5.1b6   2.237   373                  420
    Snappy 1.1.0      2.091   323                  1070
    LZF               2.077   270                  570
    zlib 1.2.8 -1     2.730   65                   280
    LZ4 HC (r97)      2.720   25                   2040
    zlib 1.2.8 -6     3.099   21                   300
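
The ratio column in the addendum is original size divided by compressed size. As a rough illustration only, the sketch below measures that ratio for the two zlib levels listed (-1 and -6) using Python's standard library. LZ4 and Snappy are not in the standard library, so they are left out. The `sample` payload is invented for this example, so its ratios will not match the benchmark figures in the table.

```python
import zlib


def compression_ratio(data: bytes, level: int) -> float:
    """Return original size divided by compressed size at a zlib level."""
    compressed = zlib.compress(data, level)
    # Lossless compression must restore the input exactly.
    if zlib.decompress(compressed) != data:
        raise ValueError("round trip failed")
    return len(data) / len(compressed)


# Hypothetical payload: a small JSON-like record repeated many times.
# Real crawl data is far less repetitive, so expect lower ratios there.
sample = b'{"name": "Example Cafe", "city": "Austin", "review": "Great coffee"}\n' * 2000

for level in (1, 6):
    print(f"zlib -{level}: ratio {compression_ratio(sample, level):.2f}")
```

Running the same function over a representative slice of real records would give a more honest comparison than any synthetic payload.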
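
The flattening approach from the "Challenges: Representation" slides (separate fields for a known subgroup such as an address, and a composite field plus dynamic tag such as review.text_<tag> for repeated subgroups) could look roughly like the sketch below. The function name `flatten_record`, the `group.member` naming for single subgroups, and the use of the list position as the tag are all assumptions made for illustration; the slides do not say how tags are generated.

```python
from typing import Any


def flatten_record(record: dict[str, Any]) -> dict[str, Any]:
    """Flatten one nested record into a single level of fields."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup: one field per member,
            # e.g. address -> address.street, address.city.
            for member, member_value in value.items():
                flat[f"{key}.{member}"] = member_value
        elif isinstance(value, list) and value and all(
            isinstance(item, dict) for item in value
        ):
            # Repeating subgroup: composite field + dynamic tag,
            # e.g. review.text_<tag>. The tag here is the list index
            # (an assumption, not the deck's actual scheme).
            for tag, item in enumerate(value):
                for member, member_value in item.items():
                    flat[f"{key}.{member}_{tag}"] = member_value
        else:
            # Already flat: copy through unchanged.
            flat[key] = value
    return flat


# Hypothetical location record with one subgroup and two reviews.
record = {
    "name": "Example Cafe",
    "address": {"street": "1 Main St", "city": "Austin"},
    "review": [
        {"text": "Great coffee", "rating": 5},
        {"text": "Slow service", "rating": 2},
    ],
}
flat = flatten_record(record)
for field in sorted(flat):
    print(field, "=", flat[field])
```

Keeping every value at a single level is what lets the same row sit in a flat Cassandra column family while the tagged names still allow reviews to be regrouped on output.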