Big Data! Great! Now What? #SymfonyCon 2014

Big Data is one of the new buzzwords in the industry. Everyone is using NoSQL databases. MySQL is not cool anymore. But... do we really have big data? Where should we store it? Are the traditional RDBMS databases dead? Is NoSQL the solution to our problems? And most importantly, how can PHP and Symfony2 help with it?

Published in: Data & Analytics, Engineering

  1. BIG DATA! Great! Now what? Ricard Clau SymfonyCon 2014
  2. HELLO WORLD! • Ricard Clau, born and raised in Barcelona • Server engineer at Another Place Productions • Symfony2 lover and PHP believer (sometimes…) • Open-source contributor, sometimes I give talks • Twitter (@ricardclau) / Gmail ricard.clau@gmail.com
  3. WE WILL TALK ABOUT… • Where / How to store / query our “BIG” DATA • SQL vs NoSQL, why did we end up here? • Strengths and weaknesses of both approaches • PHP / Symfony status with these technologies • Some war stories and recommendations
  4. QUICK DISCLAIMERS • Not your average PHP talk, not sure if you will be able to use this next week at work • Continuous learner about all these technologies • 100M records is NOT BIG DATA
  5. “Big data is like teenage sex; everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it”. Dan Ariely, Duke University
  6. 2 BIG PROBLEMS
  7. PROBLEM 1: STORAGE
  8. PROBLEM 2: QUERYING
  9. A BIT OF HISTORY Maybe we have not learnt so much…
  10. A (NOT SO) LONG TIME AGO • Programmers processed files directly • Lots of people were doing the same, so the first databases appeared, each with different APIs, strengths and weaknesses • In the early 70s IBM came up with the SEQUEL (Structured English Query Language) idea, and the rest is history
  11. WHY DOES NOSQL EXIST? • RDBMS are not great at scaling horizontally • Google, Amazon, Facebook, etc… started building their own solutions to meet their unique needs • When your data does not fit in one box, you need to give up consistency or availability • Some problems need a different approach
  12. THE CURRENT CHAOS
  13. RDBMS SYSTEMS Old rockers never die
  14. SQL • A “common” query language • We can normalise data and query it • Easy to do joins, filters, aggregations • We don’t need to know in advance how we access data • We rely on each database server’s query optimiser (and sometimes we need a DBA)
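As a rough illustration of the join, filter and aggregation point above, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns and the sample rows are hypothetical and not taken from the talk.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, normalised tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount INTEGER NOT NULL
        );
        INSERT INTO customers (id, name) VALUES (1, 'Ana'), (2, 'Bob'), (3, 'Eve');
        INSERT INTO orders (customer_id, amount)
            VALUES (1, 50), (1, 70), (2, 30), (3, 200);
    """)
    return conn


def top_customers(conn, limit=3):
    """Join, filter and aggregate: total spent per customer, highest first."""
    query = """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        WHERE o.amount > 0
        GROUP BY c.id, c.name
        ORDER BY total DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()
```

The query was not planned when the tables were designed; the optimiser works out how to combine them, which is the "we don't need to know in advance how we access data" point.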
  15. ACID PROPERTIES • Atomicity: transactions are all or nothing • Consistency: a transaction is subject to a set of rules • Isolation: transactions do not affect each other • Durability: written data will not get lost
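To make atomicity and consistency concrete, the following sketch uses Python's built-in `sqlite3` module. The `accounts` table, the `CHECK (balance >= 0)` rule and the function names are assumptions made for illustration only.

```python
import sqlite3


def open_bank():
    """In-memory database whose rule (consistency) is: no negative balances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "id INTEGER PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 20)")
    conn.commit()
    return conn


def transfer(conn, src, dst, amount):
    """Credit dst, then debit src; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit broke the CHECK rule, so the earlier credit was undone too.
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

The credit is deliberately issued first: when the debit fails, the rollback also discards the credit, which is the "all or nothing" behaviour.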
  16. WE NEED ACID • Banking, logistics, finance, e-commerce,… • Systems we started building 30 years ago… and we still work on them generating millions of $ daily! • There are many applications that still fit the relational model and have structured data
  17. USUAL PROBLEMS • You can painfully achieve sharding, but you need to give up some ACID goods • Tricky for unstructured data • Not great for small read / write ratios • Some data structures are hard to model
  18. TRICKY SCENARIOS • Geospatial queries for augmented reality • Leaderboards for social activity, set operations • Columnar aggregations on big tables • Graph traversal to analyse your customers • Search engines over big chunks of text
  19. NOSQL SYSTEMS Different problems, different solutions
  20. BASE PROPERTIES • Basically Available: appears to work most of the time • Soft state: state of the system may change even without a query • Eventual consistency
  21. CAP THEOREM • A shared-data system cannot guarantee all three of the following simultaneously: • Consistency: All clients have the same view of the data • Availability: Each client can always read and write • Partition tolerance: The system works well even when there are network partitions
  22. “During a network partition, a distributed system must choose between either Consistency or Availability”
  23. CAP diagram (Availability, Consistency, Partition Tolerance) • Single node, mostly RDBMS (MySQL, PostgreSQL, DB2, SQLite…) • All nodes same role (Cassandra, Riak, DynamoDB…) • Special nodes (Zookeeper, HBase, MongoDB, Redis…)
  24. CONSISTENT HASHING
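The slide only names the technique, so here is a minimal sketch of a consistent hash ring in Python. The `HashRing` class, the choice of SHA-256 and the 100 virtual points per node are illustrative assumptions, not details from the talk or from any particular database.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to nodes so that adding or removing a node moves few keys."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node, for smoother balance
        self._points = []         # sorted hash positions on the ring
        self._owner = {}          # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            del self._owner[point]
            self._points.remove(point)

    def node_for(self, key):
        """Walk clockwise from the key's position to the next node point."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[idx]]
```

With a plain `hash(key) % number_of_nodes` scheme, adding a node remaps most keys. On the ring, only the keys that fall just before the new node's points change owner.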
  25. I TOTALLY NEED ACID! Are you sure about that?
  26. EVENTUAL CONSISTENCY If you are using master-slave replication, you already have eventual consistency in your reads
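The stale-read effect of asynchronous replication can be sketched with a toy in-memory model. `ReplicatedStore` and its method names are hypothetical; real replication ships a log over the network, but the visible behaviour is similar.

```python
class ReplicatedStore:
    """Toy primary/replica pair with asynchronous, explicit replication."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes accepted by the primary but not yet shipped

    def write(self, key, value):
        """Writes go to the primary and are queued for the replica."""
        self.primary[key] = value
        self._pending.append((key, value))

    def read_from_replica(self, key):
        """Reads served by the replica may lag behind the primary."""
        return self.replica.get(key)

    def replicate(self):
        """Apply queued writes in order, as a replication thread would."""
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()
```

Until `replicate()` runs, a read from the replica returns the old value (or nothing), even though the write has already succeeded on the primary.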
  27. ANALYTICS / STATS We can possibly afford losing a small % of the data
  28. TRANSACTIONS Bank transfers happen asynchronously as well!
  29. WHAT ABOUT PHP & SYMFONY? Is there any hope for us?
  30. PHP: BEST WEB PLATFORM? • PHP is still heavily used, despite its many quirks • Mature, actively maintained libraries for everything • Composer makes things much easier these days • Symfony bundles for almost everything • Some databases consider PHP a second class citizen
  31. Key-value Graph Column Document
  32. KEY-VALUE STORES • Simple APIs, easy to install and use. You are already using them for caching, sessions, etc… • PHP Extensions: memcached, phpredis • Libraries: nrk/predis, basho/riak, aws/aws-sdk-php • Bundles: snc/redis-bundle, leaseweb/memcache-bundle, kbrw/riak-bundle
  33. GRAPH DATABASES • Very verbose queries, access via REST APIs • Maybe not mature enough for source of truth • Libraries: everyman/neo4jphp • Bundles: klaussilveira/neo4j-ogm-bundle • IMHO, one of the next big things
  34. CYPHER QUERY EXAMPLES Top 5 Sushi restaurants in New York for Philip’s friends 2nd degree co-actors who have never acted with Tom Hanks
  35. COLUMN-BASED STORES • Possibly the most suitable for Big Data • Redshift supports SQL on a petabyte-scale database • Libraries: thobbs/phpcassa, pop/pop_hbase, PDO for Redshift (with some quirks) • IMHO, Cassandra will become THE database
  36. DOCUMENT DATABASES • MongoDB and Couchbase look very shiny… but the Internet is FULL of horror scaling stories • PHP Extensions: mongodb, couchbase • Libraries: doctrine/mongodb • Bundles: doctrine/mongodb-odm-bundle
  37. SEARCH ENGINES • Mostly Lucene based • PHP Extensions: solr, sphinx • Libraries: solarium/solarium, elasticsearch/elasticsearch • Bundles: nelmio/solarium-bundle, friendsofsymfony/elastica-bundle
  38. DATA ANALYSIS All businesses need this!
  39. QUERY VS PROCESSING • SQL is great because we can query by any field • There is no standard in NoSQL databases • NoSQL systems are more limited: lookups by key only (some allow secondary indexes) or complex graph syntax • We sometimes need processing for complex queries
  40. MAP-REDUCE
  41. HADOOP VS SPARK • Techniques to extract subsets of the data (MAP) and operate on them in parallel before aggregating (REDUCE) • Not real time, Hadoop is the most popular • Apache Spark opens a new paradigm for near real-time • You need other languages for these techniques
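The MAP and REDUCE steps described above can be sketched on a single machine as a word count. This is only an illustration of the idea, assuming a thread pool as a stand-in for a cluster; it does not use Hadoop or Spark, and the function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """MAP: count the words in one chunk of the input, independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """REDUCE: merge two partial results into one."""
    return left + right


def word_count(lines, chunk_size=2, workers=4):
    """Split the input, map the chunks in parallel, then reduce the partials."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(reduce_counts, partials, Counter())
```

Because each chunk is processed without looking at the others, the map step can be spread over many machines; only the much smaller partial results have to be brought together.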
  42. FINAL THOUGHTS Now what?
  43. ENGINEERING CHALLENGES • The Internet of things will generate real BIG DATA • SQL / ACID technologies are not going anywhere • Be very careful when using NoSQL in production • Databases… and life… are full of tradeoffs • The next decade will be fascinating for the industry
  44. READ THE DOCS CAREFULLY
  45. CHOOSE THE RIGHT TOOL
  46. QUESTIONS? • Twitter: @ricardclau • E-mail: ricard.clau@gmail.com • Github: https://github.com/ricardclau • Please rate the talk at https://joind.in/talk/view/12958