Presentation on the use cases of NoSQL in media. June 27th, 2014.
Covering:
- key-value
- column
- document stores
- map/reduce
- graph
- search
- blob storage
Data Mesh in Practice: How Europe’s Leading Online Platform for Fashion Goes ...Databricks
The Data Lake paradigm is often considered the scalable successor of the more curated Data Warehouse approach when it comes to democratization of data. However, many who went out to build a centralized Data Lake came out with a data swamp of unclear responsibilities, a lack of data ownership, and sub-par data availability.
Our way from Drupal 6 to Thunder - Contentpool for PublishersOliverBerndt
This document outlines the transition of a publishing company from Drupal 6 to the Thunder content management system. It discusses how the company used Drupal 6 initially and evaluated different CMS options. It created a "Contentpool 1" solution in Drupal 7 to share content across sites. To improve on this, it is developing a "Contentpool 2" solution using the Thunder CMS for its modular architecture and ability to easily reuse and distribute content across subscribed sites. The document provides details on the requirements, architecture, and advantages of the new Thunder/Contentpool system over the previous Drupal implementation.
Prague data management meetup #30 2019-10-04Martin Bém
This document summarizes the agenda for the Prague Data Management Meetup on April 10, 2019. The meetup will feature a presentation from Jeff Pollock on next generation data integration patterns. The meetup series discusses topics related to data management, acquisition, storage, integration, analytics, and usage. It is an open professional group that has been running since 2015.
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
How Hewlett Packard Enterprise Gets Real with IoT AnalyticsArcadia Data
Learn how HPE uses visual analytics within a data lake to create an “Industrial Internet of Things” model that solves their data analytics problem at scale.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Managing Data with Voume Velocity, and Variety with Amazon ElastiCache for RedisAmazon Web Services
Learn how to use Amazon ElastiCache with AWS IoT and AWS Lambda to create serverless solutions that let you rapidly make use of large and multisource data sets.
Managing Data with Amazon ElastiCache for Redis - August 2016 Monthly Webinar...Amazon Web Services
Many data sets, such as time-series collections or Internet of Things (IoT) deployments can include huge numbers of sensor reports and other data points, which can be a challenge to manage and aggregate. Amazon ElastiCache for Redis provides an on-demand managed service with the performance and scalability to turn big data into useful information. Join us to learn how to use Amazon ElastiCache to create serverless solutions that lets you rapidly make use of large and multisource data sets.
Learning Objectives:
• Learn how to ingest and analyze sensor data using Amazon ElastiCache for Redis and the AWS IoT Service
• Learn how to use ElastiCache Redis for Time-Series data
This document provides an overview of a workshop on cloud big data architectures. The workshop covers:
1. Different types of big data solutions and when to use each, such as Hadoop, NoSQL and big relational databases.
2. Data pipelines, including ETL tools, load testing patterns and connecting clouds.
3. Querying and visualizing data through business analytics, predictive analytics and visualization tools.
4. A brief introduction to IoT and how it relates to big data.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the latest software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload through clusters of servers, is giving customers new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, which is pre-tested with Cloudera Hadoop distribution to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera and Hitachi Consulting will present together and explain how to get you there. Attend this WebTech and learn how to: Solve big-data problems with Hadoop. Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data. Implement Hadoop using HDS Hadoop reference architecture. For more information on Hitachi Data Systems Hadoop Solution please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
Tinder’s Quickfire Pipeline powers all things data at Tinder. It was originally built using AWS Kinesis Firehoses and has since been extended to use both Kafka and other event buses. It is the core of Tinder’s data infrastructure. This rich data flow of both client and backend data has been extended to service a variety of needs at Tinder, including Experimentation, ML, CRM, and Observability, allowing backend developers easier access to shared client side data. We perform this using many systems, including Kafka, Spark, Flink, Kubernetes, and Prometheus. Many of Tinder’s systems were natively designed in an RPC first architecture.
Things we’ll discuss decoupling your system at scale via event-driven architectures include:
– Powering ML, backend, observability, and analytical applications at scale, including an end to end walk through of our processes that allow non-programmers to write and deploy event-driven data flows.
– Show end to end the usage of dynamic event processing that creates other stream processes, via a dynamic control plane topology pattern and broadcasted state pattern
– How to manage the unavailability of cached data that would normally come from repeated API calls for data that’s being backfilled into Kafka, all online! (and why this is not necessarily a “good” idea)
– Integrating common OSS frameworks and libraries like Kafka Streams, Flink, Spark and friends to encourage the best design patterns for developers coming from traditional service oriented architectures, including pitfalls and lessons learned along the way.
– Why and how to avoid overloading microservices with excessive RPC calls from event-driven streaming systems
– Best practices in common data flow patterns, such as shared state via RocksDB + Kafka Streams as well as the complementary tools in the Apache Ecosystem.
– The simplicity and power of streaming SQL with microservices
The document discusses data warehousing and the Data Warehouse Network. It provides an overview of the Data Warehouse Network as Europe's premier data warehousing consultancy and membership organization. It then covers various aspects of data warehousing including the differences between operational and data warehouse environments, the conceptual architecture of a data warehouse, and the evolutionary process of planning, building, and managing a data warehouse over time.
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
A data pipeline is a unified system for capturing events for analysis and building products. It involves capturing user events from various sources, storing them in a centralized data warehouse, and performing analysis and building products using tools like Hadoop. Key components of a data pipeline include an event framework, message bus, data serialization, data persistence, workflow management, and batch processing. A Lambda architecture allows for both batch and real-time processing of data captured by the pipeline.
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
ScyllaDB, along side some of the other major distributed real-time technologies gives businesses a unique opportunity to achieve enterprise consciousness - a business platform that delivers data to the people that need when they need it any time, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/ReverseETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for Big Companies
- What can ScyllaDB do for smaller companies.
Data Culture Series - Keynote & Panel - 19h May - LondonJonathan Woodward
Big data. Small data. All data. You have access to an ever-expanding volume of data inside the walls of your business and out across the web. The potential in data is endless – from predicting election results to preventing the spread of epidemics. But how can you use it to your advantage to help move your business forward?
Data is growing exponentially and it’s now possible to mine and unlock insights from data in new and unexpected ways. Empower your business to take advantage of this data by harnessing the rich capabilities of Microsoft SQL Server and the familiarity of Microsoft Office to help organize, analyze, and make sense of your data—no matter the size.
UiPath Test Automation using UiPath Test Suite series, part 5DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Infrastructure Challenges in Scaling RAG with Custom AI modelsZilliz
Building Retrieval-Augmented Generation (RAG) systems with open-source and custom AI models is a complex task. This talk explores the challenges in productionizing RAG systems, including retrieval performance, response synthesis, and evaluation. We’ll discuss how to leverage open-source models like text embeddings, language models, and custom fine-tuned models to enhance RAG performance. Additionally, we’ll cover how BentoML can help orchestrate and scale these AI components efficiently, ensuring seamless deployment and management of RAG systems in the cloud.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
What do a Lego brick and the XZ backdoor have in common?Speck&Tech
ABSTRACT: At first glance, a Lego brick and the XZ backdoor might seem to have in common only that both are building blocks, or dependencies, of creative and software projects. In reality, a Lego brick and the XZ backdoor case have much more in common than that.
Join the presentation to dive into a story of interoperability, standards and open formats, and then discuss the important role contributors play in a sustainable open source community.
BIO: An advocate of free software and of standard, open formats. She has been an active member of the Fedora and openSUSE projects and co-founded the LibreItalia Association, where she was involved in several LibreOffice-related events, migrations and training efforts. She previously worked on LibreOffice migrations and training courses for several public administrations and private organizations. Since January 2020 she has worked at SUSE as a Software Release Engineer for Uyuni and SUSE Manager, and when she is not following her passion for computers and for Geeko she cultivates her curiosity about astronomy (which is where her nickname deneb_alpha comes from).
2. About me
Manager Core Services at Sanoma
Responsible for all common services, including the Big Data platform
Work:
– Centralized services
– Data platform
– Search
Like:
– Work
– Water(sports)
– Whiskey
– Tinkering: Arduino, Raspberry Pi, soldering stuff
3. Sanoma, a B2C publishing and learning company
2 Finnish newspapers
Over 100 magazines
5 TV channels in Finland and the Netherlands
200+ websites
100 mobile applications on various mobile platforms
7. Data models
Speed
Scalability
Partition tolerance
Availability / Redundancy
Cost per GB
Specialized focus
8. CAP Theorem
The CAP (or Brewer's) theorem says: “it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
– Consistency
– Availability
– Partition tolerance”
[Diagram: triangle with corners C (Consistency), A (Availability) and P (Partition tolerance)]
9. CAP Theorem
Availability – each client can always read and write
Partition tolerance – the system works well despite physical network partitions
Consistency – all clients always have the same view of the data
[Diagram: CAP triangle. RDBMS products (MySQL, Postgres, MS SQL, Oracle) sit on the Consistency–Availability side; the other two sides are labelled NOSQL]
14. Key/value stores
Stores objects by key
Based on the Dynamo paper (Werner Vogels)
Products:
– Riak
– Memcache/Membase
– Tokyo Cabinet
– Redis
– Voldemort
Use cases (see the Redis sketch below):
– Counting
– Top lists
– Caches
– Pre-calculated optimizations
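Counting, top lists and caching map naturally onto key-value operations. Below is a minimal sketch using Redis via the redis-py client (version 3+ assumed); the key names are illustrative, not taken from the slides.

```python
# A minimal key-value sketch; assumes a local Redis and redis-py >= 3.
import redis

r = redis.Redis(host="localhost", port=6379)

# Counting: one counter per article, plus a 5-minute bucket for "trending now".
r.incr("article:100:views")
r.incr("article:100:views:2015-04-24T10:05")

# Top lists: a sorted set keeps the most-viewed articles ordered by score.
r.zincrby("toplist:articles", 1, "article:100")
print(r.zrevrange("toplist:articles", 0, 9, withscores=True))

# Caches: store a pre-rendered fragment with a 5-minute TTL.
r.setex("cache:frontpage", 300, "<html>...</html>")
```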
15. Key/Value buckets
Bucket                | A    | B    | C
User                  | XXXX | YYYY | ZZZZ
Article               | 100  | 200  | 300
Article_<5 min. TIME> | 50   | 100  | 150
24. Column stores
Lineage: Google's BigTable paper
Records with many, many columns
Distinguish between hot and cold data
Versioning
Records and columns can be sharded
Products:
– HBase
– Cassandra
– Hypertable
Use cases (see the Cassandra sketch below):
– Analytics
– Messages
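To make the wide-row, analytics-oriented model concrete, here is a minimal sketch using Cassandra through the DataStax cassandra-driver. The keyspace, table and column names are illustrative assumptions, not from the slides.

```python
# A minimal column-store sketch; assumes a local Cassandra node and cassandra-driver.
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS analytics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS analytics.pageviews (
        article_id text,
        viewed_at  timeuuid,
        user_id    text,
        PRIMARY KEY (article_id, viewed_at)
    )
""")

# Wide rows: one partition per article, one cell per view, ordered by time.
session.execute(
    "INSERT INTO analytics.pageviews (article_id, viewed_at, user_id) VALUES (%s, now(), %s)",
    ("article-100", "user-42"),
)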
26. Big Data
Lineage: Google GFS & MapReduce papers
Distributed data storage and processing
Advanced analytics capabilities on raw data
Schema on read
Products:
– Hadoop
– MPP databases
Use cases (see the map/reduce sketch below):
– Ad-hoc querying of terabytes of data
– Data science
Predictive analytics
Model training
– Calculate recommendations
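A quick illustration of schema-on-read processing: a Hadoop Streaming style map/reduce job in Python that counts pageviews per article from raw tab-separated logs. The log layout and field positions are assumptions made for the example only.

```python
#!/usr/bin/env python
# A minimal Hadoop Streaming sketch; mapper and reducer are kept in one file for
# brevity, selected via the first command-line argument.
import sys
from itertools import groupby

def mapper(lines):
    # Schema on read: parse raw TSV lines, e.g. "2015-04-24\tarticle-100\tuser-42",
    # and emit (article_id, 1) for every pageview record.
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[1]}\t1")

def reducer(lines):
    # Hadoop sorts mapper output by key, so identical keys arrive grouped together.
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for article_id, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{article_id}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    # e.g. hadoop jar hadoop-streaming.jar -mapper "job.py map" -reducer "job.py reduce" ...
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```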
27. Big Data at Sanoma
Main use case is reporting and analytics, moving towards data science
A/B and MVT testing evaluations
Using QlikView as a front-end
Supplies data to other environments (SAS, Advertising, Behavioral Targeting)
Agile process for adding sources: from raw to intermediate to a modelled data warehouse
Sanoma standard data platform, used in all Sanoma countries
> 250 dashboard users
40 daily users: analysts & developers
43 source systems, with 125 different sources
400 tables in Hive
Platform:
– Cloudera Hadoop
– 40-60 nodes
– > 400TB storage
– ~2,000 jobs/day
Typical data node / task tracker:
– 1-2 CPUs, 4-12 cores
– 2 system disks (RAID 1)
– 4 data disks (2TB, 3TB or 4TB)
– 24-32GB RAM
30. Search
Keyword search can be combined with advanced forms of result ranking
Most fields are stored in an index
Facets can be used for analytics
The ranker can be replaced with custom logic
Products:
– Solr
– Elasticsearch
– MarkLogic
Use cases (see the search sketch below):
– Content Search
– Analytics / Faceted
– Percolation
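A minimal content-search sketch with a facet, using the official Elasticsearch Python client (the 8.x API is assumed); index name, fields and document contents are illustrative.

```python
# A minimal full-text search plus facet (terms aggregation); assumes a local
# Elasticsearch and the 8.x elasticsearch-py client.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.index(index="articles", id="100", document={
    "title": "NoSQL in media",
    "body": "Key-value, column, document, graph, search and blob storage...",
    "section": "tech",
}, refresh=True)

# Full-text query combined with a facet on the section field.
result = es.search(index="articles",
                   query={"match": {"body": "nosql"}},
                   aggs={"sections": {"terms": {"field": "section.keyword"}}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])
```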
32. Search, too
[Diagram: content and user signals are combined (Σ) into the result ranking]
33. Search, too
[Diagram: content, page and user signals are combined (Σ) into the result ranking]
34. Search – Percolation
Traditional queries run against an index of existing data
What if the data does not exist at the time of the query?
Percolation allows queries to be registered and returns the matching query IDs, e.g. to send a notification when new matches become available
Use case (see the percolation sketch below):
– Search for a tweet, and after the initial results keep receiving newly tweeted items as they come in
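A minimal percolation sketch using the Elasticsearch percolate query (Elasticsearch 5+ and the 8.x Python client are assumed); index and field names are illustrative.

```python
# Percolation: store queries, then ask which stored queries a new document matches.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The index holds registered queries in a 'percolator' field next to the document fields.
es.indices.create(index="alerts", mappings={"properties": {
    "query": {"type": "percolator"},
    "message": {"type": "text"},
}})

# Register a stored query: "tell me about new tweets mentioning nosql".
es.index(index="alerts", id="nosql-alert",
         document={"query": {"match": {"message": "nosql"}}}, refresh=True)

# When a new tweet arrives, ask which stored queries it matches.
result = es.search(index="alerts", query={
    "percolate": {"field": "query", "document": {"message": "NoSQL at Sanoma"}}
})
print([hit["_id"] for hit in result["hits"]["hits"]])  # -> ['nosql-alert']
```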
36. Graph databases
Lineage: Euler and graph theory.
Data model: nodes & edges, both of which can hold key-value pairs
Products:
– AllegroGraph
– InfoGrid
– Neo4j
Use cases (see the graph sketch below):
– Social relationships
– Content Linking (Entity linking)
[Diagram: an example graph linking the entities Jan Smit, 3JS, Nick en Simon and Volendam to Articles 1, 2 and 3]
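A minimal content-linking sketch against Neo4j using the official Python driver, loosely mirroring the example graph above; labels, properties and credentials are illustrative assumptions.

```python
# Nodes and relationships in Neo4j; assumes a local instance and the neo4j driver.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Both nodes and relationships can carry key-value properties.
    session.run("""
        MERGE (a:Article {id: 1, title: 'Article 1'})
        MERGE (p:Person  {name: 'Jan Smit'})
        MERGE (p)-[:MENTIONED_IN {confidence: 0.9}]->(a)
    """)
    # Content linking: which other articles mention the same person?
    result = session.run("""
        MATCH (a:Article {id: 1})<-[:MENTIONED_IN]-(p)-[:MENTIONED_IN]->(other:Article)
        RETURN DISTINCT other.title AS title
    """)
    print([record["title"] for record in result])

driver.close()
```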
38. Blob storage
Endless storage of binary data
Stores objects larger than a single machine can hold
“Lower” price/GB compared to SAN storage
Products
– Amazon S3
– CAStor
– (Hadoop)
Use cases (see the S3 sketch below):
– Media storage
– Archiving
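A minimal media-storage sketch with Amazon S3 via boto3; the bucket name and object keys are illustrative, and AWS credentials are assumed to be configured already.

```python
# Blob storage: upload, retrieve and list media objects in S3; assumes boto3.
import boto3

s3 = boto3.client("s3")

# Media storage: upload a video master and fetch it back on demand.
s3.upload_file("promo.mp4", "sanoma-media-archive", "video/2015/promo.mp4")
s3.download_file("sanoma-media-archive", "video/2015/promo.mp4", "/tmp/promo.mp4")

# Archiving: list what is stored under a prefix (assumes the prefix is non-empty).
for obj in s3.list_objects_v2(Bucket="sanoma-media-archive", Prefix="video/2015/")["Contents"]:
    print(obj["Key"], obj["Size"])
```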
40. Summary
RDBMS systems are good enough for many problems
For specific problems, NOSQL solutions provide a targeted solution
There’s a variety of NOSQL solutions with different characteristics
NOSQL solutions will require a higher engineering effort
41. Dream NOSQL Architecture – Content Delivery
[Diagram] CMS → Document storage (MongoDB / CouchDB), Blob storage (S3 / CAStor) and Search (ElasticSearch / Solr) → Website / Mobile Application (see the document-store sketch below)
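For the document-storage leg of this architecture, here is a minimal sketch with MongoDB via pymongo; database, collection and document fields are illustrative assumptions.

```python
# Document storage for content delivery; assumes a local MongoDB and pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client.cms.articles

# The CMS writes one self-contained document per article...
articles.insert_one({
    "_id": 100,
    "title": "NoSQL in media",
    "body": "...",
    "tags": ["nosql", "architecture"],
    "images": [{"s3_key": "img/100/header.jpg", "caption": "Header"}],
})

# ...and the website / mobile app reads it back in a single round trip.
print(articles.find_one({"tags": "nosql"})["title"])
```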
42. Dream NOSQL Architecture – Analytics
[Diagram] Event collection → Message queue (Kafka / Flume) → Event processing (Storm) →
– Key-value store (Redis) → real-time recommendations / targeting
– Column storage (Cassandra / HBase) → real-time dashboarding
– Big Data (Hadoop) → ad-hoc reporting & data science
(see the Kafka sketch below)
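A minimal event-collection sketch for the front of this pipeline, using the kafka-python package; the topic name and payload are illustrative, and the consumer merely stands in for a stream processor such as Storm.

```python
# Event collection through a message queue; assumes a local Kafka broker and kafka-python.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("pageviews", {"article_id": "article-100", "user_id": "user-42"})
producer.flush()

# A downstream processor (standing in for Storm) consumes the same topic and could
# update Redis counters, Cassandra tables and HDFS in parallel.
consumer = KafkaConsumer("pageviews", bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for message in consumer:
    print(message.value["article_id"])
    break
```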
43. CAP Theorem
Availability – each client can always read and write
Partition tolerance – the system works well despite physical network partitions
Consistency – all clients always have the same view of the data
[Diagram: the CAP triangle again, now with products placed on each side]
– Consistency + Availability: MySQL, Postgres, MS SQL, Oracle, Asterdata, Greenplum, Vertica
– Availability + Partition tolerance: Dynamo, Voldemort, Tokyo Cabinet, KAI, Cassandra, SimpleDB, CouchDB, Riak
– Consistency + Partition tolerance: BigTable, Hypertable, HBase, MongoDB, Terrastore, Scalaris, Berkeley DB, MemcacheDB, Redis
Data models
Relational databases
Key-value
Column-oriented
Document-oriented