Big Data! Great! Now What? #SymfonyCon 2014

BIG DATA!
Great! Now what?
Ricard Clau
SymfonyCon 2014

HELLO WORLD!
• Ricard Clau, born and raised in Barcelona
• Server engineer at Another Place Productions
• Symfony2 lover and PHP believer (sometimes…)
• Open-source contributor, sometimes I give talks
• Twitter (@ricardclau) / Gmail ricard.clau@gmail.com

WE WILL TALK ABOUT…
• Where / How to store / query our “BIG” DATA
• SQL vs NoSQL, why did we end up here?
• Strengths and weaknesses of both approaches
• PHP / Symfony Status with these technologies
• Some war stories and recommendations

QUICK DISCLAIMERS
• Not your average PHP talk, not sure if you will
be able to use this next week at work
• I am still learning about all these technologies
• 100M records is NOT BIG DATA

“Big data is like teenage sex;
everyone talks about it,
nobody really knows how to do it,
everyone thinks everyone else is doing it,
so everyone claims they are doing it”.
Dan Ariely, Duke University

A BIT OF HISTORY
Maybe we have not learnt so much…

A (NOT SO) LONG TIME AGO
• Programmers processed files directly
• Lots of people were doing the same, so the
first databases appeared, with different APIs,
strengths and weaknesses
• In the early 70s IBM came up with the
SEQUEL (Structured English Query
Language) idea, and the rest is history

WHY DOES NOSQL EXIST?
• RDBMSs are not great at scaling horizontally
• Google, Amazon, Facebook, etc… started building
their own solutions to meet their unique needs
• When your data does not fit in one box, you need to
give up consistency or availability
• Some problems need a different approach

RDBMS SYSTEMS
Old rockers never die

SQL
• A “common” query language
• We can normalise data and query it
• Easy to do joins, filters, aggregations
• We don’t need to know in advance how we access data
• We rely on each database server’s query optimiser (and
sometimes we need a DBA)

ACID PROPERTIES
• Atomicity: transactions are all or nothing
• Consistency: a transaction is subject to a set of rules
• Isolation: transactions do not affect each other
• Durability: written data will not get lost
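
A minimal sketch of atomicity and consistency, assuming an in-memory SQLite database and Python (the talk itself is about PHP; Python is used here only as a neutral illustration). The `accounts` table and the helpers `make_bank`, `transfer` and `balances` are hypothetical names. The credit runs first on purpose, so that a failing debit shows the whole transaction being rolled back:

```python
import sqlite3


def make_bank():
    """Create a throwaway in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is the "set of rules" a transaction must respect.
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
    )
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; return True if the transaction committed."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is undone as well: all or nothing.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a plain dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer(conn, "alice", "bob", 500)` on a fresh bank returns `False` and leaves both balances untouched, even though the credit to `bob` had already been executed inside the transaction.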

WE NEED ACID
• Banking, logistics, finance, e-commerce,…
• Systems we started building 30 years ago… and we
still work on them generating millions of $ daily!
• There are many applications that still fit the relational
model and have structured data

USUAL PROBLEMS
• Sharding is achievable, if painful, but
you need to give up some ACID guarantees
• Tricky for unstructured data
• Not great for low read / write ratios (write-heavy loads)
• Some data structures are awkward to model
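
To make the sharding trade-off concrete, here is a rough sketch in Python, assuming plain dictionaries stand in for separate database servers. `shard_for` and `ShardedUsers` are hypothetical names, and hashing the key modulo the shard count is only one of several possible routing schemes:

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index with a stable hash (same key, same shard)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedUsers:
    """Toy sharded table: each dict stands in for an independent database."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, user_id, row):
        self.shards[shard_for(user_id, len(self.shards))][user_id] = row

    def get(self, user_id):
        # Single-key access touches exactly one shard.
        return self.shards[shard_for(user_id, len(self.shards))].get(user_id)

    def total_balance(self):
        # A cross-shard aggregate has to visit every shard (scatter-gather),
        # and no single transaction covers all of them.
        return sum(
            row["balance"] for shard in self.shards for row in shard.values()
        )
```

Single-key reads and writes stay cheap, but anything that spans users on different shards (joins, aggregates, multi-row transactions) is where the ACID guarantees start to slip.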

TRICKY SCENARIOS
• Geospatial queries for augmented reality
• Leaderboards for social activity, set operations
• Columnar aggregations on big tables
• Graph traversal to analyse your customers
• Search engines over big chunks of text

NOSQL SYSTEMS
Different problems, different solutions

BASE PROPERTIES
• Basically Available: appears
to work most of the time
• Soft state: state of the
system may change even
without a query
• Eventual consistency

CAP THEOREM
• A shared-data system cannot guarantee
simultaneously:
• Consistency: All clients have the same view of the data
• Availability: Each client can always read and write
• Partition tolerance: The system works well even
when there are network partitions

“During a network partition, a
distributed system must choose
between either Consistency or
Availability”

• Consistency + Availability: single node, mostly RDBMS
(MySQL, PostgreSQL, DB2, SQLite…)
• Availability + Partition tolerance: all nodes same role
(Cassandra, Riak, DynamoDB…)
• Consistency + Partition tolerance: special nodes
(Zookeeper, HBase, MongoDB, Redis…)

I TOTALLY NEED ACID!
Are you sure about that?

EVENTUAL CONSISTENCY
If you are using master-slave replication,
you already have eventual consistency in your reads
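
A toy model of that replication lag, assuming a single primary that ships its writes to one replica asynchronously. The `Primary` and `Replica` classes are hypothetical and written in Python purely for illustration; real replication involves binary logs, network delays and failure handling that are ignored here:

```python
from collections import deque


class Primary:
    """Accepts writes and queues them for later shipping to the replica."""

    def __init__(self):
        self.data = {}
        self.log = deque()  # writes not yet applied on the replica

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Serves reads from its own copy, which may lag behind the primary."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def catch_up(self, primary):
        # Apply pending writes in the order the primary accepted them.
        while primary.log:
            key, value = primary.log.popleft()
            self.data[key] = value
```

Between `write` and `catch_up`, a read on the replica still returns the old value: the data is only eventually consistent, which is what a read routed to a lagging slave looks like.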

ANALYTICS / STATS
We can possibly afford losing a small % of the data

TRANSACTIONS
Bank transfers happen asynchronously as well!

WHAT ABOUT PHP & SYMFONY?
Is there any hope for us?

PHP: BEST WEB PLATFORM?
• PHP is still heavily used, despite its many quirks
• Mature, actively maintained libraries for everything
• Composer makes things much easier these days
• Symfony bundles for almost everything
• Some databases treat PHP as a second-class citizen

Key-value, Graph, Column, Document

KEY-VALUE STORES
• Simple APIs, easy to install and use. You are
already using them for caching, sessions, etc…
• PHP Extensions: memcached, phpredis
• Libraries: nrk/predis, basho/riak, aws/aws-sdk-php
• Bundles: snc/redis-bundle, leaseweb/memcache-bundle,
kbrw/riak-bundle
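
The caching use case boils down to a get / set API with expiry. A small sketch in Python, assuming an in-process dictionary stands in for Memcached or Redis: `TinyKV` and `cached` are hypothetical names and not the API of any of the libraries listed above.

```python
import time


class TinyKV:
    """In-memory stand-in for a key-value store with optional expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = {}     # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return default
        return value


def cached(store, key, loader, ttl=60):
    """Cache-aside: return the cached value, or load it and store it."""
    value = store.get(key)
    if value is None:
        value = loader()
        store.set(key, value, ttl)
    return value
```

The application only ever asks for a key; whether the value came from the cache or from the slow source of truth is hidden behind `cached`.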

GRAPH DATABASES
• Very verbose queries, access via REST APIs
• Maybe not mature enough to be the source of truth
• Libraries: everyman/neo4jphp
• Bundles: klaussilveira/neo4j-ogm-bundle
• IMHO, one of the next big things

CYPHER QUERY EXAMPLES
• Top 5 sushi restaurants in New York for Philip’s friends
• 2nd degree co-actors who have never acted with Tom Hanks

COLUMN-BASED STORES
• Possibly the most suitable for Big Data
• Redshift supports SQL on a petabyte-scale
database
• Libraries: thobbs/phpcassa, pop/pop_hbase,
PDO for Redshift (with some quirks)
• IMHO, Cassandra will become THE database

DOCUMENT DATABASES
• MongoDB and Couchbase look very shiny… but the
Internet is FULL of scaling horror stories
• PHP Extensions: mongodb, couchbase
• Libraries: doctrine/mongodb
• Bundles: doctrine/mongodb-odm-bundle

SEARCH ENGINES
• Mostly Lucene based
• PHP Extensions: solr, sphinx
• Libraries: solarium/solarium,
elasticsearch/elasticsearch
• Bundles: nelmio/solarium-bundle,
friendsofsymfony/elastica-bundle

DATA ANALYSIS
All businesses need this!

QUERY VS PROCESSING
• SQL is great because we can query by any field
• There is no standard in NoSQL databases
• NoSQL systems are more limited, only keys (some
allow secondary indexes) or complex graph syntax
• We sometimes need processing for complex queries

HADOOP VS SPARK
• Techniques to extract subsets of the data (MAP) and
operate on them in parallel before aggregating (REDUCE)
• Not real time; Hadoop is the most popular
• Apache Spark opens a new paradigm for near real-time
• You need other languages for these techniques
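
The MAP / REDUCE idea can be sketched on a single machine. The following Python word count is an illustration only, assuming a thread pool stands in for a cluster of workers; `map_chunk`, `reduce_counts` and `word_count` are hypothetical names, and nothing here uses the Hadoop or Spark APIs:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """MAP: count the words in one independent subset of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """REDUCE: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    """Split the input, map the chunks in parallel, then reduce the partials."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())
```

Because each chunk is processed without looking at the others, the map step can be spread over as many workers as there are chunks; only the final merge needs to see every partial result.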

ENGINEERING CHALLENGES
• The Internet of Things will generate real BIG DATA
• SQL / ACID technologies are not going anywhere
• Be very careful when using NoSQL in production
• Databases… and life… are full of tradeoffs
• The next decade will be fascinating for the industry

QUESTIONS?
• Twitter: @ricardclau
• E-mail: ricard.clau@gmail.com
• Github: https://github.com/ricardclau
• Please rate the talk at https://joind.in/talk/view/12958
