Big Data - architectural concerns for the new age

A brief introduction to Big Data and why you should care about polyglot storage

Published in: Technology

  1. Big Data - architectural concerns for the new age (Sunday, 2 December 12)
  2. Debasish Ghosh, CTO (a Nomura Research Institute group company)
  3. @debasishg on Twitter
      • code @ http://github.com/debasishg
      • blog @ Ruminations of a Programmer, http://debasishg.blogspot.com
  4. some numbers ..
  5. Facebook reaches 1 billion active users
  8. some more numbers ..
  9. • Walmart handles 1M transactions per hour
     • Google processes 24PB of data per day
     • AT&T transfers 30PB of data per day
     • 90 trillion emails are sent every year
     • World of Warcraft uses 1.3PB of storage
  10. Big Data - the positive feedback cycle
      1. new technologies make using big data efficient
      2. more adoption of big data
      3. generation of more big data
  11. new technologies .. new architectural concerns
  12. new ways to store data
  13. new techniques to retrieve data
  14. new ways to scale reads & writes
  15. transparent to the application
  16. new ways to consume data
  17. new techniques to analyze data
  18. new ways to visualize data
  19. at Web scale
  20. The Database Landscape so far ..
      • relational database - the bedrock of enterprise data
      • irrespective of application development paradigm
      • object-relational mapping considered to be the panacea for impedance mismatch
  21. “Object Relational Mapping is the Vietnam of Computer Science” - Ted Neward (2006), blogger, big geek and architectural consultant
  22. RDBMS & Big Data
      • once the data volume crosses the limit of a single server, you shard / partition
      • sharding implies a lookup node for the hash code => SPOF (single point of failure)
      • cross-shard joins and transactions don’t scale
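A minimal sketch of the sharding idea on slide 22, assuming the simplest possible scheme (shard index = stable hash of the key, modulo the number of shards). The function names and the `user:N` keys are hypothetical, chosen only for illustration; the slides do not prescribe a particular scheme. It also shows why resharding is painful: changing the shard count remaps most keys.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (not Python's salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical key set, for illustration only.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement, roughly four out of five keys are expected to change shards when a fifth shard is added, which is one motivation for the consistent hashing technique mentioned later in the deck.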
  23. RDBMS & Big Data
      • Cost of distributed transactions
          • synchronization overhead
          • 2 phase commit is a blocking protocol (can block indefinitely)
          • as slow as the slowest DB node + network latency
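The two-phase commit protocol named on slide 23 can be sketched as below. This is an in-memory illustration under simplifying assumptions (no network, no timeouts, no coordinator failure); `Participant` and `two_phase_commit` are hypothetical names, not the API of any real database.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """A database node taking part in a distributed transaction (illustrative)."""
    name: str
    can_commit: bool = True
    state: str = "init"

    def prepare(self) -> bool:
        # Phase 1: vote. After voting yes, a real participant must hold its
        # locks until the coordinator announces the outcome.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants) -> bool:
    """Run both phases; return True if the transaction committed."""
    # Phase 1: every participant votes, so the round is as slow as the slowest node.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: broadcast the decision to every participant.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


nodes = [Participant("db1"), Participant("db2"), Participant("db3", can_commit=False)]
print(two_phase_commit(nodes), [p.state for p in nodes])
```

The sketch does not model failures. In a real deployment, a coordinator crash between the two phases leaves prepared participants waiting with locks held, which is the blocking behaviour the slide refers to.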
  24. RDBMS & Big Data
      • Master/Slave replication
          • synchronous replication => slow
          • asynchronous replication => can lose data
          • writing to master is a bottleneck and SPOF
  25. Need Distributed Databases
      • data is automatically partitioned
      • transparent to the application
      • add capacity without downtime
      • failure tolerant
  26. 2 famous papers ..
      • Bigtable: A Distributed Storage System for Structured Data, 2006
      • Dynamo: Amazon’s Highly Available Key-value Store, 2007
  27. Addressing 2 Approaches
      • Bigtable: “how can we build a distributed database on top of GFS?”
      • Dynamo: “how can we build a distributed hash table appropriate for the data center?”
  28. Big Data recommendations
      • reduce accidental complexity in processing data
      • be less rigid (no rigid schema)
      • store data in a format closer to the domain model
      • hence no universal data model ..
  29. Polyglot Storage
      • unfortunately came to be known as NoSQL databases
      • document oriented (MongoDB, CouchDB)
      • key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
      • data structure based (Redis)
      • graph based (Neo4j)
  30. • reduced impedance mismatch
      • richer modeling capabilities
      • closer to domain model
  31. Asynchronous Replication to RDBMS using Message Oriented Middleware
  32. Hybrid Oracle MongoDB storage over Messaging backbone
  33. A relational database is just another option, not the only option, when the data set is BIG and semantically rich
  34. 10 things never to do with a Relational Database
      • Search
      • Media Repository
      • Recommendation
      • Email
      • High Frequency Trading
      • Classified ads
      • Product Cataloging
      • Time Series / Forecasting
      • User group / ACLs
      • Log Analysis
      Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
  35. Scalability, Availability ..
      • ACID => BASE
      • CAP Theorem & Eventual Consistency
      • Consistent Hashing
      • Hinted Hand-off & Read repair
      • Anti-entropy
      • Gossip Protocol
      • Vector Clocks
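Of the techniques listed on slide 35, vector clocks lend themselves to a short illustration. The sketch below assumes clocks are plain dictionaries mapping a node name to a counter; the helper names (`increment`, `merge`, `compare`) are hypothetical and not taken from any specific database.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: dict, b: dict) -> dict:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: conflicting versions


v1 = increment({}, "A")   # written on node A
v2 = increment(v1, "B")   # v1 updated on node B
v3 = increment(v1, "C")   # v1 updated independently on node C
print(compare(v1, v2), compare(v2, v3), merge(v2, v3))
```

Here `v2` and `v3` both descend from `v1` but not from each other, so they compare as concurrent. A store that uses vector clocks can detect such siblings and leave their reconciliation to the application or to a merge rule.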
  36. CAP Theorem
      • Consistency, Availability & Partition Tolerance
      • You can have only 2 of these in a distributed system
      • Eric Brewer postulated this in 2000
  37. ACID => BASE
      • Basically Available, Soft state, Eventual consistency
      • Rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state.
      • It’s ok to use stale data and it’s ok to give approximate answers
  38. Consistent Hashing
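Slide 38 names consistent hashing without spelling it out. A minimal sketch follows, assuming an MD5-based ring with virtual nodes; the class name `HashRing`, the `vnodes` parameter and the node names are hypothetical choices for illustration, not the implementation used by any particular store.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Stable position on the ring for a string."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes=(), vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each physical node owns several positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring is empty")
        # First ring position clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(10_000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-e")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved, all to the new node")
```

Unlike modulo-based placement, adding a node here only moves the keys that the new node takes over (about one fifth in expectation with five equally weighted nodes), and no lookup node is needed because every client can compute the owner locally.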
  39. Big Data in the wild
      • Hadoop
          • started as a batch processing engine (HDFS & Map/Reduce)
          • with bigger and bigger data, you need to make it available to users in near real time
          • stream processing, CEP (complex event processing) ..
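The Map/Reduce model mentioned on slide 39 can be illustrated with the classic word count, run here in a single process. This is a sketch of the programming model only, not of Hadoop's API; the function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data is big", "data at web scale"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on the machines that hold the data, and the shuffle step moves intermediate pairs across the network. The batch nature of that pipeline is what the real-time tools on the following slides try to complement.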
  40. complementing Map/Reduce in Hadoop
      • Hive: a data warehouse system for Hadoop for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop compatible file systems
      • Pig: a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
      • Cloudera Impala: real time ad hoc query capability for Hadoop, complementing traditional MapReduce batch processing
  41. Real time queries in Hadoop
      • currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
      • expensive and may need lots of data movement between the database & the Hadoop clusters
  42. .. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compliant with the current base ..
  43. Shark from UC Berkeley
      • a large scale data warehouse system for Spark, compatible with Hive
      • supports HiveQL, Hive data formats and user defined functions. In addition, Shark can be used to query data in HDFS, HBase and Amazon S3
  44. BI and Analytics
      • making Big Data available to developers
      • API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
      • analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
  45. Machine Learning
      • personalization
      • social network analysis
      • pattern discovery - click patterns, recommendations, ratings
      • apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
  46. Summary
      • Big Data will grow bigger - we need to embrace the changes in architecture
      • An RDBMS is NOT the panacea - pick the data model that’s closest to your domain
      • It’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
  47. Summary
      • Go for decentralized architectures, avoid SPOFs
      • With big volumes of data, streaming is your friend
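As a small illustration of the streaming advice on slide 47, the sketch below computes a running mean over an iterable one record at a time, so memory use stays constant regardless of data volume. The function name and the sample values are hypothetical.

```python
from typing import Iterable, Iterator


def running_mean(values: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all values seen so far, without storing them."""
    count = 0
    mean = 0.0
    for value in values:
        count += 1
        # Incremental update: shift the mean toward the new value.
        mean += (value - mean) / count
        yield mean


# Any iterable works, including a generator reading from a file or a socket.
means = list(running_mean([2, 4, 6]))
print(means)
```

The same shape (consume a record, update a small amount of state, emit a result) underlies the stream processing tools the deck alludes to, although those add distribution and fault tolerance on top.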
  48. Thank You!
  49. • http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
      • http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm
      • http://www.emich.edu/chhs/about-researchMETHODS.html
      • http://docs.basho.com/riak/latest/references/appendices/concepts/