Austin bdug 2011_01_27_small_and_big_data

Short overview of data infrastructure at Bazaarvoice. We use a combination of many different data stores such as MySQL, SOLR, Infobright, MongoDB and Hadoop.

Published in: Technology

  1. I CAN HAS BIG DATA? Small and Big Data at Bazaarvoice. Alex Pinkin, @apinkin
  2. whois apinkin
     ● Alex Pinkin: Software Engineering Lead, Data Infrastructure team, Bazaarvoice
     ● Loves both SQL and NoSQL. Can't commit to one! :-) @apinkin
  3. Big Data?
  4. A few facts about Bazaarvoice
     ● Bazaarvoice is a SaaS company powering user-generated content such as ratings and reviews on thousands of web sites
     ● Over 75 Million reviews
     ● 280 Billion impressions
     ● 5 Billion Page Views per month
  5. How Do We Do It?
     ● Client-side integration
     ● Code and Servers :)
  6. What Do We Run in Prod?
     ● SQL
       ○ MySQL
       ○ Infobright
     ● NoSQL
       ○ SOLR
       ○ ElasticSearch
       ○ MongoDB
       ○ CouchDB
       ○ Hadoop
  7. Four Pillars
  8. MySQL and Big Data?!!
     ● Yes, MySQL is our Master. Mostly used as K/V store.
     ● Scaling Reads: Replication
     ● Scaling Writes: Sharding
     ● HA: Hot Back-up, Multiple DC
     ● Pros
       ○ Rock solid
       ○ SQL
     ● Cons
       ○ Inflexible schema
       ○ Replication lag
       ○ Sharding not built-in
       ○ HA
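The read/write split on the MySQL slide can be illustrated with a small routing sketch. This is not Bazaarvoice's implementation: the shard layout, host names, and the hash-modulo scheme are all assumptions made for illustration. It only shows the general idea that writes for a key go to one shard's master while reads can be spread across that shard's replicas.

```python
import hashlib

# Hypothetical shard layout (host names are made up for this sketch).
# Each shard has one master that takes writes and replicas that serve reads.
SHARDS = [
    {"master": "shard0-master", "replicas": ["shard0-replica1", "shard0-replica2"]},
    {"master": "shard1-master", "replicas": ["shard1-replica1"]},
]


def shard_for(key: str) -> dict:
    """Pick a shard by hashing the key (simple hash-modulo, an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


def host_for_write(key: str) -> str:
    """Writes always go to the master of the key's shard."""
    return shard_for(key)["master"]


def host_for_read(key: str, attempt: int = 0) -> str:
    """Reads rotate across the replicas of the key's shard."""
    replicas = shard_for(key)["replicas"]
    return replicas[attempt % len(replicas)]
```

Because replicas apply changes asynchronously, a read routed this way may briefly return stale data, which is the "replication lag" con listed on the slide. Hash-modulo routing also makes adding shards expensive, one reason "sharding not built-in" counts as a con.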
  9. Search: SOLR/Lucene
     ● Document Store
     ● Inverted Index

       Term               Document IDs
       rating:5           1,2
       rating:4           3
       productId:12345    1,2,3
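The inverted index table on the search slide can be reproduced with a few lines of Python. This is a toy sketch of the data structure, not how Lucene stores its index; the `build_index` and `search` helpers and the three sample documents are assumptions chosen to match the table.

```python
from collections import defaultdict


def build_index(docs):
    """Map each "field:value" term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            index[f"{field}:{value}"].add(doc_id)
    return index


def search(index, *terms):
    """Return sorted IDs of documents that match ALL of the given terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical documents chosen to match the table on the slide.
docs = {
    1: {"rating": 5, "productId": 12345},
    2: {"rating": 5, "productId": 12345},
    3: {"rating": 4, "productId": 12345},
}
index = build_index(docs)
five_star_ids = search(index, "rating:5", "productId:12345")
```

A multi-term query becomes a set intersection over the posting lists, so the documents themselves are never scanned.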
  10. Analytics
  11. Analytics - Infobright
     ● Columnar storage
       ○ Compression (10x+)
       ○ Reduced disk I/O
     ● Partitioning
       ○ Horizontal: Data Packs
       ○ Vertical: Columns
     ● Knowledge grid
       ○ MIN(C), MAX(C), SUM(C), AVG(C), COUNT(DISTINCT(C))
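A rough sketch of why per-pack statistics help: if each data pack of a column carries its own MIN, MAX, and SUM, a query can skip packs that cannot match and answer from the statistics alone for packs that match entirely. The code below is a simplified illustration under that assumption, not Infobright's actual engine; the `DataPack` class, the pack size, and the sample column are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataPack:
    values: list   # raw column values (would be compressed on disk)
    minimum: int   # precomputed MIN(C) for this pack
    maximum: int   # precomputed MAX(C) for this pack
    total: int     # precomputed SUM(C) for this pack


def make_packs(column, pack_size):
    """Split one column horizontally into fixed-size packs with statistics."""
    packs = []
    for start in range(0, len(column), pack_size):
        chunk = column[start:start + pack_size]
        packs.append(DataPack(chunk, min(chunk), max(chunk), sum(chunk)))
    return packs


def sum_where_greater(packs, threshold):
    """Compute SUM(C) WHERE C > threshold; also report how many packs were read."""
    result = 0
    scanned = 0
    for pack in packs:
        if pack.maximum <= threshold:
            continue                      # no row can match: skip the pack
        if pack.minimum > threshold:
            result += pack.total          # every row matches: use the stored SUM
            continue
        scanned += 1                      # mixed pack: values must be read
        result += sum(v for v in pack.values if v > threshold)
    return result, scanned


# Hypothetical column split into packs of 4 values (a toy pack size).
packs = make_packs([1, 2, 3, 4, 10, 11, 12, 13, 3, 9, 4, 8], pack_size=4)
answer, packs_read = sum_where_greater(packs, threshold=5)
```

In this toy run only one of the three packs has to be read, which is the "reduced disk I/O" benefit in miniature.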
  12. Infobright - Pros and Cons
     ● Pros
       ○ 30x faster than MySQL on analytics queries
       ○ Open Source
     ● Cons
       ○ No DML in OSS version
       ○ No MPP (good for up to 5 TB)
  13. Hadoop Use Case
  14. Bazaarvoice EMR - Phase 1
  15. Bazaarvoice EMR - Phase 2
  16. Summary
     ● We use the best tool for the job
     ● NoSQL is maturing quickly. Query languages are still in flux though.
     ● Hadoop is here to stay
     ● We are (slowly) moving away from MySQL
  17. @apinkin