Big Data - architectural concerns for the new age
A brief introduction to Big Data and why you should care about polyglot storage

    Presentation Transcript

    • Big Data - architectural concerns for the new age (Sunday, 2 December 12)
    • Debasish Ghosh, CTO (a Nomura Research Institute group company)
    • @debasishg on Twitter | code @ http://github.com/debasishg | blog @ Ruminations of a Programmer, http://debasishg.blogspot.com
    • some numbers ..
    • Facebook reaches 1 billion active users
    • some more numbers ..
      • Walmart handles 1M transactions per hour
      • Google processes 24PB of data per day
      • AT&T transfers 30PB of data per day
      • 90 trillion emails are sent every year
      • World of Warcraft uses 1.3PB of storage
    • Big Data - the positive feedback cycle: (1) new technologies make using big data efficient, (2) more adoption of big data, (3) generation of more big data
    • new technologies .. new architectural concerns
    • new ways to store data
    • new techniques to retrieve data
    • new ways to scale reads & writes
    • transparent to the application
    • new ways to consume data
    • new techniques to analyze data
    • new ways to visualize data
    • at Web scale
    • The Database Landscape so far ..
      • relational database - the bedrock of enterprise data
      • irrespective of application development paradigm
      • object-relational mapping considered to be the panacea for impedance mismatch
    • “Object Relational Mapping is the Vietnam of Computer Science” - Ted Neward (2006), blogger, big geek and architectural consultant
    • RDBMS & Big Data
      • once the data volume crosses the limit of a single server, you shard / partition
      • sharding implies a lookup node for the hash code => SPOF
      • cross-shard joins and transactions don’t scale
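The resharding cost behind this slide is easy to demonstrate. A toy sketch, not from the talk, using a deterministic toy hash and hypothetical key names: with naive modulo sharding, growing the cluster from 4 to 5 shards remaps most existing keys.

```python
# Naive hash-based sharding: shard = hash(key) % num_shards.
# Changing the shard count changes the modulus, so most keys move.

def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    return sum(ord(c) for c in key) % num_shards  # stable toy hash

keys = [f"user:{i}" for i in range(1000)]

before = {k: shard_for(k, 4) for k in keys}
after = {k: shard_for(k, 5) for k in keys}   # one shard added

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved} of {len(keys)} keys moved shards")
```

Roughly 4 out of 5 keys land on a different shard after the change, which is why distributed stores reach for consistent hashing instead (sketched further below in this deck's own "Consistent Hashing" slide).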
    • RDBMS & Big Data
      • cost of distributed transactions
      • synchronization overhead
      • 2-phase commit is a blocking protocol (can block indefinitely)
      • as slow as the slowest DB node + network latency
    • RDBMS & Big Data
      • Master/Slave replication
      • synchronous replication => slow
      • asynchronous replication => can lose data
      • writing to the master is a bottleneck and a SPOF
    • Need Distributed Databases
      • data is automatically partitioned
      • transparent to the application
      • add capacity without downtime
      • failure tolerant
    • 2 famous papers ..
      • Bigtable: A Distributed Storage System for Structured Data, 2006
      • Dynamo: Amazon’s Highly Available Key-value Store, 2007
    • 2 Approaches
      • Bigtable: “how can we build a distributed database on top of GFS?”
      • Dynamo: “how can we build a distributed hash table appropriate for the data center?”
    • Big Data recommendations
      • reduce accidental complexity in processing data
      • be less rigid (no rigid schema)
      • store data in a format closer to the domain model
      • hence no universal data model ..
    • Polyglot Storage
      • unfortunately came to be known as NoSQL databases
      • document oriented (MongoDB, CouchDB)
      • key/value (Dynamo, Bigtable, Riak, Cassandra, Voldemort)
      • data structure based (Redis)
      • graph based (Neo4J)
    • reduced impedance mismatch • richer modeling capabilities • closer to the domain model
    • Asynchronous Replication to RDBMS using Message Oriented Middleware
    • Hybrid Oracle / MongoDB storage over a Messaging backbone
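The replication pattern these two slides describe can be sketched in a few lines. This is a toy illustration, not from the talk: `queue.Queue` stands in for the message-oriented middleware, and the store names and keys are hypothetical.

```python
# Writes go synchronously to the primary store and are also published to a
# queue; a consumer drains the queue and applies the writes to the relational
# replica asynchronously, so the replica lags but the write path stays fast.
import queue

broker = queue.Queue()
primary_store = {}   # stands in for the fast / primary datastore
rdbms_replica = {}   # stands in for the relational copy

def write(key, value):
    primary_store[key] = value   # synchronous write
    broker.put((key, value))     # fire-and-forget publish to the backbone

def replicate_once():
    """Consumer: drain pending messages into the replica."""
    while not broker.empty():
        key, value = broker.get()
        rdbms_replica[key] = value

write("order:1", {"amount": 100})
write("order:2", {"amount": 250})
print(rdbms_replica)                   # still empty: replica lags
replicate_once()
print(rdbms_replica == primary_store)  # caught up after the consumer runs
```

In a real deployment the broker is durable (e.g. a JMS/AMQP system), so a replica crash only delays replication rather than losing writes.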
    • the Relational Database is just another option, not the only option, when the data set is BIG and semantically rich
    • 10 things never to do with a Relational Database
      • Search
      • Media Repository
      • Recommendation
      • Email
      • High Frequency Trading
      • Classified ads
      • Product Cataloging
      • Time Series / Forecasting
      • User groups / ACLs
      • Log Analysis
      Source: http://www.infoworld.com/d/application-development/10-things-never-do-relational-database-206944?page=0,0
    • Scalability, Availability ..
      • ACID => BASE
      • CAP Theorem & Eventual Consistency
      • Consistent Hashing
      • Vector Clocks
      • Anti-entropy
      • Gossip Protocol
      • Hinted Hand-off & Read repair
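Of the techniques listed on this slide, vector clocks are the easiest to show in miniature. A minimal sketch, not from the talk: each node keeps a counter per node, and one clock "happened before" another when every component is less than or equal and at least one is strictly smaller; clocks ordered neither way are concurrent (a conflict to reconcile).

```python
# Minimal vector clocks: per-node counters, causal ordering, and merge.

def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter bumped."""
    out = dict(clock)
    out[node] = out.get(node, 0) + 1
    return out

def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum: the clock after two histories are reconciled."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happened_before(a: dict, b: dict) -> bool:
    nodes = set(a) | set(b)
    return (all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
            and any(a.get(n, 0) < b.get(n, 0) for n in nodes))

def concurrent(a: dict, b: dict) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

v1 = increment({}, "A")     # write on node A
v2 = increment(v1, "B")     # causally after v1
v3 = increment(v1, "C")     # also after v1, but concurrent with v2

print(happened_before(v1, v2))  # True
print(concurrent(v2, v3))       # True: siblings the client must resolve
print(merge(v2, v3))
```

This is exactly the situation where a Dynamo-style store (e.g. Riak) returns sibling values and lets the application resolve the conflict.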
    • CAP Theorem
      • Consistency, Availability & Partition Tolerance
      • you can have only 2 of these in a distributed system
      • postulated by Eric Brewer in 2000
    • ACID => BASE
      • Basic Availability, Soft-state, Eventual consistency
      • rather than requiring consistency after every transaction, it’s enough for the database to eventually be in a consistent state
      • it’s ok to use stale data and it’s ok to give approximate answers
    • Consistent Hashing
    • Big Data in the wild
      • Hadoop started as a batch processing engine (HDFS & Map/Reduce)
      • with bigger and bigger data, you need to make it available to users in near real time
      • stream processing, CEP ..
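The Map/Reduce model the slide refers to fits in a few lines of pure Python: map emits (key, value) pairs, a shuffle groups them by key, and reduce folds each group. This toy word count is only an illustration of the programming model; Hadoop runs the same three phases distributed across a cluster, with HDFS holding the data.

```python
# Word count as map -> shuffle -> reduce, the canonical Map/Reduce example.
from collections import defaultdict

def map_phase(line: str):
    """Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: fold each group of counts into a total."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big problems", "big data big opportunities"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"])   # 4
print(counts["data"])  # 2
```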
    • Hive: a data warehouse system for Hadoop, for easy data summarization, ad-hoc queries & analysis of large datasets stored in Hadoop-compatible file systems, complementing Map/Reduce
    • Pig: a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs
    • Cloudera Impala: real-time ad-hoc query capability on Hadoop, complementing traditional MapReduce batch processing
    • Real time queries in Hadoop
      • currently people use Hadoop connectors to massively parallel databases to do real time queries in Hadoop
      • expensive, and may need lots of data movement between the database & the Hadoop clusters
    • .. and the Hadoop ecosystem continues to grow, with lots of real time tools being actively developed that are compliant with the current base ..
    • Shark from UC Berkeley
      • a large scale data warehouse system for Spark, compatible with Hive
      • supports HiveQL, Hive data formats and user defined functions; in addition, Shark can be used to query data in HDFS, HBase and Amazon S3
    • BI and Analytics
      • making Big Data available to developers
      • API / scripting abilities for writing rich analytic applications (Precog, Continuity, Infochimps)
      • analyzing user behaviors, network monitoring, log processing, recommenders, AI ..
    • Machine Learning
      • personalization
      • social network analysis
      • pattern discovery - click patterns, recommendations, ratings
      • apps that rely on machine learning - Prismatic, Trifacta, Google, Twitter ..
    • Summary
      • Big Data will grow bigger - we need to embrace the changes in architecture
      • an RDBMS is NOT the panacea - pick the data model that’s closest to your domain
      • it’s economical to limit data movement - process data in place and utilize the multiple cores of your hardware
    • Summary
      • go for decentralized architectures, avoid SPOFs
      • with big volumes of data, streaming is your friend
    • Thank You!
    • http://www.greenbookblog.org/2012/03/21/big-data-opportunity-or-threat-for-market-research/
      http://thailand.ipm-info.org/pesticides/survey_phitsanulok.htm
      http://www.emich.edu/chhs/about-researchMETHODS.html
      http://docs.basho.com/riak/latest/references/appendices/concepts/