Delivering a 'Big Data Ready' minimum viable product

DELIVERING A 'BIG DATA READY' MVP

Gregory Chomatas
Dublin Google Developers Group - 2013 July 30th

GREGORY CHOMATAS

Entrepreneur
SW Engineer
Betaconcept / Astroboa: Founder
Aquinetix: Co-founder / CTO

http://linkedin.com/in/gchomatas
http://www.astroboa.org
t: @gchomatas

LOTS OF OBJECT-RELATIONAL MISMATCH

DB IS NOT THE CENTER OF MY APPLICATION

Domain Driven Design / Behaviour Driven Design
vs
Database Driven Design

AT THAT TIME NOT MANY ALTERNATIVES EXISTED
so we decided to roll our own data store solution...

ASTROBOA TO THE RESCUE
Hybrid Document-Graph Store focused on data semantics
Similar to Google Datastore & OrientDB
External 'app independent' Semantic Data Model
Model as you go
Security per Entity instance / property
Versioned Entities
Automated REST APIs encapsulating the data layer
Hyperlinked Resources
Polyglot Persistence (Experimental) *
*Not available in the public version

THE 'BIGNESS' IN BIG DATA
Two main paths to the realization of 'BIGNESS'
Luckily both paths converge to common principles & tools that can
manage BIG Complexity & BIG Volume

BIG 'DATA PROBLEMS' (COMPLEXITY)
single point of failure / resilience
cross data center
human fault tolerance
store / search unstructured or semi-structured data
flexible data modeling (e.g. traverse relationships)
data versioning
polyglot programming
multitenancy
share / data as a service
semantic web / multiple formats - endpoints

Flexible Options / Ease of operations

'BIG DATA' PROBLEMS (VOLUME)
high volume
high velocity
real-time APIs / act in real time
data as a service from others / dirty data from open sources
log collection / aggregation

LINEAR / HORIZONTAL SCALING

I AM NOT A BIG DATA START-UP!
Start-up = Growth (5% - 10%) / week
1000 writes per aquaculture farm per day
120 farms on public beta = 120000 writes / day
1st month: 176 farms = 176000 writes / day
6th month: 1181 farms = 1.2M writes / day
1st year: 17045 farms = 17M writes / day (200/sec)
2nd year: 2421143 farms = 2.4B writes / day (27777 / sec)
ARE YOU SURE ?
a successful SaaS is a big data service
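
The figures on this slide follow from compounding 10% weekly growth from 120 farms. A minimal Python sketch of that arithmetic is given here, assuming 1000 writes per farm per day and 4 weeks per month for the first two rows (24 weeks for the "6th month" is what reproduces the slide's 1181). The function names are illustrative only.

```python
# Compound weekly growth, as on the slide: 120 farms at public beta,
# 10% growth per week, 1000 writes per farm per day.
SECONDS_PER_DAY = 24 * 60 * 60


def farms_after(weeks, start=120, weekly_growth=0.10):
    """Number of farms after `weeks` of compound weekly growth."""
    return start * (1 + weekly_growth) ** weeks


def writes_per_day(farms, writes_per_farm=1000):
    """Daily write volume for a given number of farms."""
    return farms * writes_per_farm


if __name__ == "__main__":
    # Week counts are an assumption: 4 weeks per month for the first two rows.
    for label, weeks in [("1st month", 4), ("6th month", 24),
                         ("1st year", 52), ("2nd year", 104)]:
        farms = farms_after(weeks)
        daily = writes_per_day(farms)
        print(f"{label}: {farms:,.0f} farms, {daily:,.0f} writes/day, "
              f"{daily / SECONDS_PER_DAY:,.0f} writes/sec")
```

Changing `weekly_growth` to 0.05 shows how sensitive the two-year figure is to the growth assumption.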

IT'S JUST AN MVP - WE WILL ADD ALL THIS BIG DATA STUFF LATER
A Big Data architecture can be simpler than a traditional one
The right data store can increase productivity
Keep it simple, but do not compromise the architectural concepts
Balance between technical debt & technical equity
An enterprise business system will usually win on underlying technological innovation, robustness and enterprise readiness
"In business, there is nothing more valuable than a technical advantage your competitors don't understand" - Paul Graham

KEY BIG DATA ARCHITECTURE FEATURES
Distributed Storage
APPLICATION database vs INTEGRATION database
Mix several data models / polyglot persistence
External Data Schema / Common Data Structures
Data Store encapsulated by an API (Data Services)
Append only / save changes vs state (event sourcing)
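
The last bullet (append only, save changes rather than state) can be illustrated with a small sketch. This is not taken from the talk: the `StockEvent` type and the feed-stock example are hypothetical, chosen only to show state being derived by replaying an append-only log.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StockEvent:
    """One immutable change to a cage's feed stock (hypothetical event type)."""
    cage: str
    delta_kg: float  # positive = feed delivered, negative = feed consumed


class EventLog:
    """Append-only log: events are added, never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[StockEvent] = []

    def append(self, event: StockEvent) -> None:
        self._events.append(event)

    def current_stock(self, cage: str) -> float:
        """Derive the current state by replaying all changes for one cage."""
        return sum(e.delta_kg for e in self._events if e.cage == cage)


log = EventLog()
log.append(StockEvent("cage-1", 100.0))
log.append(StockEvent("cage-1", -12.5))
log.append(StockEvent("cage-2", 40.0))
print(log.current_stock("cage-1"))
```

Because nothing is overwritten, a faulty change can be corrected by appending a compensating event, which is one way to get the human fault tolerance listed earlier.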

KEY BIG DATA ARCHITECTURE FEATURES
Distributed Computing
Asynchronous processing
Real Time Event Processing / Streaming
Simple decoupled services exposed through REST or RPC APIs (business services)
Thick web clients / mobile apps using the REST or Streaming APIs
Client-level multivariate data analysis & complex visualization

THE LAMBDA ARCHITECTURE
by Nathan Marz and James Warren

store raw, immutable, perpetual data
query = function(all data)
combine batch & real time stream processing to compute
arbitrary functions on arbitrary data
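
A toy sketch of the idea, assuming a simple count of records per farm as the query. The function names and sample records are made up for illustration, and real batch and stream layers would run on separate systems rather than in one process.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute the view from all raw, immutable records."""
    return Counter(record["farm"] for record in master_dataset)


def realtime_view(recent_records):
    """Speed layer: same function over records the last batch run has not seen."""
    return Counter(record["farm"] for record in recent_records)


def query(farm, batch, realtime):
    """Merge both views so the answer covers all data."""
    return batch[farm] + realtime[farm]


master = [{"farm": "A"}, {"farm": "A"}, {"farm": "B"}]  # already batch-processed
recent = [{"farm": "A"}]                                # arrived since then
print(query("A", batch_view(master), realtime_view(recent)))
```

The merge is trivial here because counts are additive; views that are not additive need a more careful merge step.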

ULTIMATE DESIGN RULE
KEEP it SIMPLE

THE CONVENTIONAL ARCHITECTURE
new data store criteria
Distributed
Easy to change schema & queries
Minimize impedance mismatch
Boost productivity
Simple to install, configure, operate
one component
auto-shard
peer-to-peer

DIRECTLY STORE MY AGGREGATES
{
  "date": "2013-02-28",
  "allocated_worker": "swp4jhi4Tm6VxY1nueX2yw",
  "cage": "1GuuHWTaQc-kpPcRV5uBGA",
  "feed": "7IWmy2FATcS9Vh0RB1onXQ",
  "quantity_approved": 12.5,
  "farm": "__uBZUr3RWOqOSkszfbRLw",
  "species": "KDU-2LCjRRynby9HLifc3g",
  "batch": "i6MgxixnSCGwGWb0037wlQ",
  "execution": {
    "feeder": "swp4jhi4Tm6VxY1nueX2yw",
    "quantity_fed": 12.5,
    "species_position_start": "top",
    "species_position_end": "middle",
    "start": "2013-02-28T07:59:57.668Z",
    "end": "2013-02-28T08:00:03.216Z",
    "feeder_position_end": {
      "lat_lon": {"lat": 37.7066959, "lon": 23.16831896},
      "altitude": 40,
      "accuracy": 12
    }
  }
}

THE CANDIDATES
Key-Value
Riak
Redis
Pr. Voldemort
MemcacheDB
DynamoDB

Document
MongoDB
CouchBase
OrientDB
ElasticSearch
Google Datastore

Column
Cassandra
HBase
Hypertable
Accumulo
SimpleDB

Graph
Neo4J
Infinite Graph
OrientDB
Titan
Virtuoso

MY COOL DATA STORE TIP
Elasticsearch document store
No other NoSQL store comes close to the out-of-the-box utility and usability of Elasticsearch
Schemaless, multi-tenant, replicating & sharding document store that implements extensible & advanced search features (geospatial, faceting, filtering, etc.)
REST API to CREATE / UPDATE (partially) / DELETE / READ aggregates / entities
REST Search API with full-text search out of the box
MULTI-TENANT friendly with REST API for creating / updating DBs & entity types
Dynamic / Semi-Dynamic / Fixed schema

ELASTIC SEARCH POWER
index over 95GB/h/node
8-node cluster: sub-200ms response for complex searches on 10B+ records
(oracle OR mysql) AND replication
apple AND ip*d
john AND city:Dublin
species:"Sea Bream" AND execution.date:[20130701 TO 20130730]
taxicab AND ("Dublin"^2 OR "Cork")
"facets" : {
  "locations" : { "terms" : {"field" : "city"} }
}

"terms" : [ {
  "term" : "Dublin",
  "count" : 130
}, {
  "term" : "Cork",
  "count" : 20
}, {
  "term" : "Galway",
  "count" : 1
} ]

HISTOGRAMS / GEO DISTANCE
"facets" : {
  "Feed_Histogram" : {
    "date_histogram" : {
      "key_field" : "date",
      "value_field" : "execution.quantity_fed",
      "interval" : "month"
    }
  }
}

"filter" : {
  "geo_distance_range" : {
    "from" : "200km",
    "to" : "400km",
    "pin.location" : {
      "lat" : 40,
      "lon" : -70
    }
  }
}

"filter" : {
  "geo_distance" : {
    "distance" : "200km",
    "pin.location" : {
      "lat" : 40,
      "lon" : -70
    }
  }
}

"filter" : {
  "geo_polygon" : {
    "person.location" : {
      "points" : [
        {"lat" : 40, "lon" : -70},
        {"lat" : 30, "lon" : -80},
        {"lat" : 20, "lon" : -90}
      ]
    }
  }
}
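
To make the distance filters concrete, here is a rough sketch of the check a geo-distance filter expresses: keep a document when its location lies within a given great-circle distance of a point. It uses the haversine formula with a mean Earth radius, which is an assumption of this sketch and not a statement about how the search engine computes distances internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_distance(location, center, distance_km):
    """True when `location` is at most `distance_km` from `center`."""
    return haversine_km(location["lat"], location["lon"],
                        center["lat"], center["lon"]) <= distance_km


pin = {"lat": 40, "lon": -70}
print(within_distance({"lat": 41, "lon": -70}, pin, 200))  # about 111 km away
print(within_distance({"lat": 45, "lon": -70}, pin, 200))  # about 556 km away
```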

THE TITAN GRAPH DB
Distributed
Pluggable storage (Cassandra, HBase, Berkeley DB)
Indexing with Elastic Search & Lucene
Blueprints Interface
Gremlin Query Language
Rexster Server adds JSON-based REST interface

EASY GRAPH TRAVERSAL WITH GREMLIN
// calculate basic collaborative filtering for user 'Gregory'
m = [:]
g.V('name','Gregory').out('likes').in('likes').out('likes').groupCount(m)
m.sort{-it.value}
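
For readers unfamiliar with Gremlin, the same three-hop traversal can be sketched in plain Python over an in-memory dictionary. The `likes` data below is made up; each loop is annotated with the traversal step it mirrors.

```python
from collections import Counter

# Hypothetical 'likes' edges: user -> items liked.
likes = {
    "Gregory": ["elasticsearch", "titan"],
    "Anna": ["elasticsearch", "cassandra"],
    "Brian": ["titan", "cassandra", "riak"],
}


def collaborative_filtering(user, likes):
    """Count items reached via user -> item -> co-liker -> item paths."""
    # Reverse index: item -> users who like it (needed for the in('likes') step).
    liked_by = {}
    for u, items in likes.items():
        for item in items:
            liked_by.setdefault(item, []).append(u)

    counts = Counter()
    for item in likes[user]:                 # out('likes')
        for co_liker in liked_by[item]:      # in('likes')
            for other in likes[co_liker]:    # out('likes')
                counts[other] += 1           # groupCount(m)
    return counts


counts = collaborative_filtering("Gregory", likes)
print(counts.most_common())  # highest counts first, like m.sort{-it.value}
```

Like the one-line traversal, this sketch does not exclude the starting user or the items he already likes; a real recommender would filter those out.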

DATA STORE SELECTION TIPS (1)
Use polyglot persistence with multiple data models
Start with a Document Store as your system of record
Mix it with a key-value Store for keeping sessions, shopping
cart, user prefs, counters, caching
Mix it with a Graph store to keep and traverse entity
relationships
Use a Column Store as your system of record if you need performance rather than flexibility and you know your data model & queries well
Keep a relational db for queries on transient data (reporting
on inter-aggregate relationships)

Prefer one-component stores rather than many moving
parts
Choose a store that makes it easy to experiment with
schema and query changes & supports easy data migrations
Prefer stores that can work with both dynamic & fixed
schemas (there is always an implicit schema)
In early prototypes avoid Column stores as they have a high
cost on schema and query changes

Choose stores that support auto-sharding
Prefer peer-to-peer replication rather than master-slave
Replication factor N = 3 is a good standard choice
Consistency Adjustment Quorum: W > N/2 , W+R > N
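
The quorum rule can be checked mechanically. In this small sketch N is the replication factor, W the number of replicas that must acknowledge a write, and R the number of replicas consulted on a read; the function name is illustrative.

```python
def is_strongly_consistent(n, w, r):
    """True when the quorums satisfy W > N/2 and W + R > N."""
    return w > n / 2 and w + r > n


print(is_strongly_consistent(3, 2, 2))  # quorum writes and quorum reads
print(is_strongly_consistent(3, 1, 1))  # fast, but reads may miss the latest write
```

W > N/2 prevents two conflicting writes from both reaching a majority, and W + R > N guarantees that every read set overlaps the latest write set in at least one replica.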

ALL THAT SAID...
APP CONTEXT is always the determining factor for selecting
your store
as well as...
Safety / Stability
Productivity
Community
Performance
Tooling / Ease of operation

DATA MODELING TIPS
Remember that you fit your model to the data store and not
Vice Versa (APPLICATION vs INTEGRATION DB)
Use a Schema
Build your aggregates or column families according to your
use cases, i.e. DENORMALIZE per your query requirements
Aggregates form the boundaries for ACID operations
(transactions)
Pre-compute Question Focused Datasets
(materialized views) to provide data organized differently
from their primary aggregates
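
As an illustration of a question-focused dataset, the sketch below pre-computes total feed per farm and month from feeding aggregates shaped like the one shown earlier. The sample values and the function name are hypothetical; in practice the view would be stored and refreshed rather than rebuilt in memory.

```python
from collections import defaultdict

# Feeding aggregates trimmed to the fields used here; values are made up.
feedings = [
    {"farm": "farm-1", "date": "2013-02-27", "execution": {"quantity_fed": 10.0}},
    {"farm": "farm-1", "date": "2013-02-28", "execution": {"quantity_fed": 12.5}},
    {"farm": "farm-2", "date": "2013-03-01", "execution": {"quantity_fed": 8.0}},
]


def monthly_feed_per_farm(feedings):
    """Pre-compute total feed per (farm, month) for a 'feed per month' question."""
    view = defaultdict(float)
    for feeding in feedings:
        month = feeding["date"][:7]  # "YYYY-MM" prefix of the ISO date
        view[(feeding["farm"], month)] += feeding["execution"]["quantity_fed"]
    return dict(view)


view = monthly_feed_per_farm(feedings)
print(view[("farm-1", "2013-02")])
```

The view is keyed by the question being asked (farm and month), not by the identity of the primary aggregate, so answering that question becomes a single lookup.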

ARE WE FINISHED YET?
NOT QUITE!
Do something with our monolithic app

SPLIT THE MONOLITHIC APPLICATION
Wrap data stores into DATA SERVICES
Create BUSINESS SERVICES on top of Data Services
Prefer RESTful APIs for services (ROA)
Use a Binary Serialization Framework to create RPC APIs if
performance is a concern (ROA / SOA)
Move MVC* to fat mobile / web client apps that consume
the APIs

JavaScript in the browser is one of the world's most widely
distributed execution environments & Deployment is trivial !

DECOUPLED
SERVICES
FAT CLIENT
SINGLE PAGE
APP

API FRAMEWORK / DSL
class API < Grape::API
  version 'v1', :using => :header, :vendor => 'aquinetix.com'
  default_format :json
  content_type :json, "application/json"
  content_type :tsv, "text/tab-separated-values"
  formatter :tsv, Aquinetix::TsvFormatter
  content_type :kml, "text/xml"
  formatter :kml, Aquinetix::KmlFormatter

  mount CageAPI
  mount CageEventsAPI
  mount DeviceAPI
  mount FeedAPI
  mount FeedingAPI
  mount LossCountEventAPI
  mount OxygenSamplingEventAPI
  mount SigninAPI
  mount TemperatureSamplingEventAPI
  mount UserAPI

  add_swagger_documentation markdown: true, base_path: "http://..."
end

API FRAMEWORK / DSL
class FeedingAPI < Grape::API
  resource :feedings do
    desc 'Create a new feeding'
    post do
      execute_farm_obj_create_request 'Feeding'
    end

    desc 'Perform a FULL or PARTIAL update of an existing feeding'
    params do
      requires :id, type: String, desc: "The id (UUID) of ..."
      optional :fields, type: String, desc: "Which fields ..."
    end
    put '/:id' do
      execute_farm_obj_update_request 'Feeding'
    end

    desc 'Get a feeding by its id (UUID)'
    params do
      requires :id, :type => String, :desc => "Feeding id."
    end
    get '/:id' do
      execute_farm_obj_instance_get_request 'Feeding'
    end
  end
end

MVC* AT THE CLIENT
Mobile app with backbone.js & phonegap
Management / BI Console with AngularJS
Visualization with D3.js
Multivariate Dataset Analysis at the browser with
crossfilter.js
App workflow & build with yeoman, grunt, bower

* MVP, MVVM, MVC, MVW

ASYNCHRONOUS / REAL TIME PROCESSING &
STREAMING API
RabbitMQ + RabbitMQ Web-Stomp Plugin at the server
SockJS, Stomp js libs at the client
Real-time event stream processing with ESPER

Alternative message brokers:
node.js + zeromq
kestrel
pusher
kafka (> 100k msg/sec)

Alternative Real-time stream processing: Storm

USE CASES
count ratings, votes, click-throughs
block abusive crawlers
rate-limit apis
detect spamming attempts
track performance and trigger alerts
batch process logs

SUBSCRIBE TO STOMP TOPICS FROM JS
ws = new SockJS('http://node1.aquinetix.com:15674/stomp')
@client = Stomp.over(ws)
# connect(login, passcode, connectCallback, errorCallback, vhost)
@client.connect 'aquinetix', 'password',
  ((x) => @on_connect(x)),
  @on_error, '/'

on_connect: (x) ->
  console.log "Connected to message broker"
  # subscribe to feeding events and hand them to the app's event bus
  @feeding_subscr_id = @client.subscribe '/topic/feeding', (message) =>
    feeding = JSON.parse(message.body)
    Aq_Manager.events.trigger 'feeding_execution:arrived', feeding
  # subscribe to worker position updates
  @position_subscr_id = @client.subscribe '/topic/position', (message) =>
    position = JSON.parse(message.body)
    Aq_Manager.events.trigger 'worker_position:arrived', position

# publish a feeding to the same topic
@client.send('/topic/feeding', {}, JSON.stringify(feeding_obj))

REAL TIME EVENT PROCESSING WITH ESPER

select count(*) as tps, max(retweetCount) as maxRetweets
from TwitterEvent.win:time_batch(1 sec)

select fraud.accountNumber as accntNum, fraud.warning as warn, withdraw.amount as amount,
       MAX(fraud.timestamp, withdraw.timestamp) as timestamp, 'withdrawlFraud' as desc
from FraudWarningEvent.win:time(30 min) as fraud,
     WithdrawalEvent.win:time(30 sec) as withdraw
where fraud.accountNumber = withdraw.accountNumber
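
The second statement joins two sliding time windows on the account number. A rough Python sketch of that idea follows; it is not how ESPER works internally. It assumes events arrive in timestamp order (seconds), evaluates the join on demand rather than continuously, and all class and function names are hypothetical.

```python
from collections import deque


class TimeWindow:
    """Keep only events no older than `length_s` seconds relative to 'now'."""

    def __init__(self, length_s):
        self.length_s = length_s
        self.events = deque()

    def add(self, event):
        self.events.append(event)

    def expire(self, now):
        # Events are assumed to arrive in timestamp order, so the oldest is first.
        while self.events and self.events[0]["timestamp"] <= now - self.length_s:
            self.events.popleft()


def join_on_account(fraud_window, withdraw_window, now):
    """Pair fraud warnings and withdrawals for the same account in both windows."""
    fraud_window.expire(now)
    withdraw_window.expire(now)
    return [
        {"accntNum": f["accountNumber"], "warn": f["warning"],
         "amount": w["amount"],
         "timestamp": max(f["timestamp"], w["timestamp"])}
        for f in fraud_window.events
        for w in withdraw_window.events
        if f["accountNumber"] == w["accountNumber"]
    ]


frauds = TimeWindow(30 * 60)   # 30 min
withdrawals = TimeWindow(30)   # 30 sec
frauds.add({"accountNumber": "A1", "warning": "stolen card", "timestamp": 1000})
withdrawals.add({"accountNumber": "A1", "amount": 200, "timestamp": 1600})
withdrawals.add({"accountNumber": "B2", "amount": 50, "timestamp": 1610})
matches = join_on_account(frauds, withdrawals, now=1615)
print(matches)
```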

LOG ACTIVITY AND OPERATIONAL DATA
Today a critical part of the production features of websites

Logstash + ElasticSearch + Kibana 3

WRAPUP
Should availability, robustness & scalability be added to your hypotheses & value proposition?
If YES, then:
Adopt an architecture with decoupled and distributed components at early stages. Build your team around it & balance technical debt / equity to get:
Increased team productivity, increased readiness and agility, sustainability

Build your data models around your use cases rather than around your database
and experiment with a polyglot persistence strategy

Start with the technologies that are easiest to install, configure & operate.

Keep it SIMPLE & SUSTAINABLE

LINKS / REFERENCES
Introduction to NoSQL - Martin Fowler goto; conference
Martin Fowler at NoSQL Matters conference
Book on the Lambda Architecture
Talk on Lambda Architecture
William Pietri - Going the Distance: Building a Sustainable Startup
Don't Let the Minimum Win Over the Viable - Harvard Business Review
Elastic Search Document DB & Search Engine
Cassandra Column DB
Titan Graph DB
Astroboa Semantic Document Store
http://www.rabbitmq.com/web-stomp.html
https://github.com/jmesnil/stomp-websocket/

LINKS / REFERENCES
https://github.com/sockjs/sockjs-client
https://github.com/robey/kestrel
https://github.com/JustinTulloss/zeromq.node
http://kafka.apache.org/index.html
https://github.com/nathanmarz/storm
https://developers.helloreverb.com/swagger/
https://github.com/wordnik/swagger-ui
