Big data 2012 v1

Paradigm Shifts: Big Data
Pini Cohen, VP and Senior Analyst
STKI Summit 2012
"Tell me and I'll forget. Show me and I may remember. Involve me and I'll understand."
The "Magic" of internet companies
Source: http://venturebeat.com/2011/10/24/next-hot-internet-companies-not-in-us/internet-company-growth/
Pini Cohen's work Copyright STKI@2012. Do not remove source or attribution from any slide or graph.
Pinterest
Pinterest Architecture Update - 18 Million Visitors, 10x Growth, 12 Employees, 410 TB of Data
• 80 million objects stored in S3 with 410 terabytes of user data, 10x what they had in August. EC2 instances have grown by 3x. Around $39K for S3 and $30K for EC2 a month.
• Pay for what you use saves money. Most traffic happens in the afternoons and evenings, so they reduce the number of instances at night by 40%.
• 12 employees as of last December. Using the cloud, a site can grow dramatically while maintaining a very small team. Looks like 31 employees as of now.
Source: http://highscalability.com/blog/2012/5/21/pinterest-architecture-update-18-million-visitors-10x-growth.html
Instagram
• The Instagram philosophy:
  • Simplicity
  • Optimized for minimal operational burden
  • Instrument everything
Scaling Instagram
• Instagram went to 30+ million users in less than two years, and then rocketed to 40 million users 10 days after the launch of its Android application.
• After the Android release they had 1 million new users in 12 hours.
• 2 engineers in 2010
• 3 engineers in 2011
• 5 engineers in 2012, 2.5 on the backend. This includes iPhone and Android development.
Source: http://highscalability.com/blog/2012/4/16/instagram-architecture-update-whats-new-with-instagram.html
Tumblr – Microblogging social networking platform
• 500 million page views a day
• 15B+ page views a month
• Peak rate of ~40K requests per second
• 1+ TB/day into the Hadoop cluster
• Many TB/day into MySQL/HBase/Redis/Memcache
• Growing at 30% a month
• ~1,000 hardware nodes in production (not cloud)
• ~20 engineers (106 employees total)
Source: http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-billion-page-views-a-month-and-harder.html (STKI modifications)
Technology listing
• Hadoop MapReduce
• NoSQL DBMS (Cassandra, Mongo, HBase)
• Sharding
• In-memory DBMS
• Memcached
• MemSQL
• Solr
• Redis
• Django
• Python
• ELB – Amazon Elastic Load Balancing
Paradigm shifts agenda
• Big Data:
  • Big Data definition and background
  • Big Data value
  • Big Data technology
Source: http://www.b2binbound.com/blog/?Tag=paradigm%20shift
Big Data Definition – 4 V's (or more…)
• Volume – tens of TBs and more (15-20 TB+)
• Velocity – the speed at which data is added (10M items per hour and more), and the speed at which the data needs to be processed
• Variety – different types of data, structured and unstructured. In many cases deals with the internet of things and social media, but also with voice, video, etc.
• Variability – able to cope with new attributes and changing data types without interrupting the analytical process (without "import-export")
• Other optional V's – validity, volatility, viscosity (resistance to flow), etc.
Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
The origins of the 3V's:
• 2001 research by Doug Laney from META Group (now Gartner)
"Big Data" theme, main current usage:
• ""Big Data" is just marketing jargon." – Doug Laney, Gartner
  Source: http://www.computerweekly.com/blogs/cwdn/2011/11/datas-main-drivers-volume-velocity-variety-and-variability.html
  Image: http://winnbadisa.com/wp-content/uploads/2011/12/marketing-career-cloud.jpg
• STKI: doing something significantly different from what you've done until now
Big Data at work:
• Orbitz Worldwide has collected 750 terabytes of unstructured data on their consumers' behavior – detailed information from customer online visits and browsing sessions. Using Hadoop, models have been developed intended to improve search results and tailor the user experience based on everything from location, interest in family travel versus solo travel, and even the kind of device being used to explore travel options.
• The result? To date, a 7% increase in interaction rate, 37% growth in stickiness of sessions, and a net 2.6% in booking path engagement.
Source: http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/us_cons_techtrends2012_013112.pdf
DW appliances will be discussed later
• Teradata
• EMC Greenplum
• Oracle Exadata
• Microsoft Parallel Data Warehouse
Source: http://www.asugnews.com/2011/09/06/inside-saps-product-naming-strategies/
What is the business value of big data analytics?
• Big data is now a technology looking for a business need
• It can mean doing the same thing but better / faster (better segmentation, a more accurate analysis model)
• Or it can mean doing completely new things (telematics, sentiment analysis, recommendation engines, matching the competition's pricing in real time, being able to analyze data we haven't been able to analyze in the past)
Decision making – old school vs. new school (big data)
• Old School:
  • Phase 1: Analyze existing data and prepare a general model
  • Phase 2: Apply the general model to a specific client
  • This means applying the same model to many clients as they arrive
• Issues with Old School decision making:
  • Time gap between preparing and applying the model
  • The number of combinations might be too big for a general model (example: recommendations based on interest)
  • The general model generated is biased towards the "main stream" population
• New School (Big Data):
  • Phase 1: Prepare a specific model for the client and apply the model – instantly
Big data use cases
• Recommendation engines – match users to one another and provide recommendations based on similar users (examples: LinkedIn – people you may know; Amazon)
• Sentiment analysis (macro or individual user)
• Fraud detection – customer behavior, historical and transactional data combined. Same but more affordable
• Customer churn
• Social graph analysis – influencers
• Customer experience analysis – combine data from call center, web, social media, etc.
• Improved segmentation – more data (clickstream, call records) for more accurate analysis
• Improved customer retention
Technology: Elements & Concepts
• Storing data for analytics (mainly):
  • HDFS – Hadoop File System
  • MapReduce – programming method, mainly for analytics
  • Other "add-ons": Pig, Hive, JAQL (IBM)
• Storing and retrieving data – DBMS:
  • NoSQL DBMS (not only SQL):
    • Cassandra
    • MongoDB
    • CouchDB
    • HBase
Who Uses Hadoop?
• Amazon/A9
• AOL
• Facebook
• Fox Interactive Media
• Netflix
• New York Times
• PowerSet (now Microsoft)
• Quantcast
• Rackspace/Mailtrust
• Veoh
• Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
Who Uses Cassandra?
• Facebook
• Digg
• Despegar
• Ooyala
• Imagini
• SimpleGeo
• Rackspace
• Shazam
• SoftwareProjects
Big Data technologies (Hadoop etc.) vs. traditional IT

Traditional IT | Big Data
Centralized storage | Local storage
Brand redundant servers | Cheap HW white boxes
Standard infrastructure and virtual servers | Is standardization needed?! (at the HW level). No server virtualization.
Well-established backup and DRP procedures | Why do I need backup? How do I tackle DRP (compute clusters that are stretched over locations)?
Traditional vendors | Open source solutions
Mature products and procedures | In a new patch, for specific issues it sometimes says "not implemented yet"
Traditional programming, SQL | Different kind of programming (map-reduce), no joins

Will Big Data infrastructure be part of existing infrastructure, or will it be developed as a new domain?
New type of scale:
• Hadoop:
  • Up to 4,000 machines in a cluster
  • Up to 20 PB in a cluster
• Currently traditional IT technologies cannot handle this kind of scale.
• This scale comes with a cost!
Source: http://www.techsangam.com/wp-content/uploads/2012/01/i_love_scalability_mug.jpg
Brewer's (CAP) Theorem
• It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
  • Consistency (all nodes see the same data at the same time)
  • Availability (node failures do not prevent survivors from continuing to operate)
  • Partition tolerance (the system continues to operate across many partitions and despite arbitrary message loss)
Professor Eric A. Brewer
Source: Scalebase, STKI modifications
Dealing With CAP
• Drop Consistency
  • Welcome to the "Eventually Consistent" term.
    • In the end, everything will work out just fine – and hey, sometimes this is a good enough solution
  • When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent
  • For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service
  • Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
Source: Scalebase
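The "eventually consistent" behavior described above can be sketched with a toy replica set: a write is acknowledged after reaching one node, and a background anti-entropy pass propagates it until every node agrees. This is an illustration only, not any real system's implementation (real stores add vector clocks, hinted handoff, read repair, and so on):

```python
# Toy model of eventual consistency: a write lands on one replica,
# and an anti-entropy pass propagates the newest value (by timestamp)
# until all replicas converge.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

def write(replica, key, value, ts):
    """A client write is acknowledged after reaching ONE replica."""
    replica.data[key] = (ts, value)

def anti_entropy(replicas):
    """Gossip the newest version of every key to every replica."""
    newest = {}
    for r in replicas:
        for key, (ts, value) in r.data.items():
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    for r in replicas:
        r.data.update(newest)

nodes = [Replica("A"), Replica("B"), Replica("C")]
write(nodes[0], "user:42", "Pini", ts=1)           # only A has the value
stale = [r.data.get("user:42") for r in nodes]     # B and C still see nothing
anti_entropy(nodes)                                # ...eventually consistent
print([r.data["user:42"] for r in nodes])          # all three replicas now agree
```

Between the write and the anti-entropy pass, two of the three replicas return a stale (missing) value: that window is exactly what BASE accepts and ACID forbids.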
Hadoop
• Apache Hadoop is a software framework that supports data-intensive distributed applications
• It enables applications to work with thousands of nodes and petabytes of data
• Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers
• Contains (basically):
  • HDFS – Hadoop File System
  • The MapReduce programming model
HDFS – Hadoop File System
• Parallel
• Distributed on commodity elements
• Throughput over latency
• Reliable and self-healing
• For large scale – a typical file is gigabytes to terabytes (for one file!)
• Applications need a write-once-read-many access model (mainly analytics)
HDFS motivation
• What if you needed to write a program that distributes data on commodity HW (PCs or servers)? You would need to take care of:
  • Where the data is located
  • How to distribute data between the nodes
  • How many times you want to replicate the data
  • How to insert, select and update data
  • What to do if one node or more fails
  • How to add a node or take a node out
  • Managing and monitoring the environment
• The Hadoop File System does it for you!
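A toy version of the bookkeeping listed above — splitting a file into blocks and spreading replicas across nodes — might look like the sketch below. The `place_blocks` helper and its round-robin policy are made up for illustration; real HDFS placement is rack-aware and tracked by the namenode:

```python
import itertools

def place_blocks(file_size, block_size, replication, datanodes):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct datanodes, round-robin. Toy policy only."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = itertools.cycle(datanodes)
    placement = {}
    for block_id in range(n_blocks):
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement

# A 300 MB file, 128 MB blocks, 2 replicas, 4 datanodes:
plan = place_blocks(300, 128, 2, ["A", "B", "C", "D"])
print(plan)  # {0: ['A', 'B'], 1: ['C', 'D'], 2: ['A', 'B']}
```

The point is that HDFS makes all of these decisions (and re-makes them when a node fails) so the application never has to.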
HDFS: Hadoop Distributed File System
• Data nodes and Name node
• The client requests metadata about a file from the namenode
• Data is served directly from the datanodes
(Diagram: the HDFS client sends (file name, block id) to the namenode, gets back (block id, block location), then requests (block id, byte range) and reads the block data directly from an HDFS datanode, which stores blocks on its Linux local file system.)
Source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
Datanode Blockreports
• File "part-0" will be replicated twice and saved in blocks 1 and 3 (the file is big, so it has to be divided into 2 blocks)
• Block 1 is on data nodes A and C
Source: http://www.google.co.il/url?sa=t&rct=j&q=Rob+Jordan++Chris+Livdahl+hadoop+filetype%3Apptx&source=web&cd=1&ved=0CCIQFjAA
HDFS basic limitations
• Namenode is a single point of failure
• Write-once model
  • Plan to support appending writes
• A namespace with an extremely large number of files exceeds the Namenode's capacity to maintain
• Cannot be mounted by an existing OS
• Getting data in and out is tedious
• HDFS does not implement / support user quotas or access permissions
• Data balancing schemes
• No periodic checkpoints
Map Reduce programming model
• Very basically – brings the program to the data
• Contains two elements:
  • Map: this part of the job is performed in parallel, asynchronously, by each node
  • Reduce: gathers the results from the relevant nodes
• In more detail:
  • Map: returns (writes to a temp file) a list containing zero or more (k, v) pairs
    • Output can have a different key from the input
    • Outputs can share the same key
  • Reduce: returns a new list of reduced output from the input
MapReduce motivation
• What if you needed to write a program that processes data that's on distributed computers?
• You would need to write a distributed program that:
  • Finds where the data is located
  • Works on each node and then combines the results from all nodes
  • Decides where (on the local node) and how (in what format) to write the intermediate results
  • Detects when the jobs of all participating nodes have concluded and then starts the "aggregation" part
  • Decides what to do if a job is stuck (restart the job or turn to another node to perform the same job)
• Hadoop MapReduce is the framework for you!
MapReduce example:

  map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
      EmitIntermediate(w, "1");

  reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));
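The pseudocode above translates almost directly into runnable Python. This local, single-process sketch mimics the map, shuffle, and reduce phases for word count; the framework-side machinery (task scheduling, fault tolerance) is of course absent:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

# Same input as the "Dataflow in Hadoop" slides:
docs = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]
result = reduce_phase(shuffle(map_phase(docs)))
print(result)  # {'Hello': 2, 'World': 2, 'Bye': 1, 'Hadoop': 2, 'Goodbye': 1}
```

The output matches the final answer in the dataflow diagrams that follow (Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2).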
Dataflow in Hadoop (diagram sequence; Source: IBM Haifa Labs)
• A Word Count job is submitted to the master, which schedules map and reduce tasks across the nodes (all elements are standard HW).
• Map tasks read their input blocks from HDFS ("Hello World Bye World", "Hello Hadoop Goodbye Hadoop") and emit intermediate counts such as Hello 1, World 2, Bye 1, Hadoop 2, Goodbye 1.
• Each finished map task reports "finished + location"; its intermediate results sit on the node's local FS.
• Reduce tasks fetch the intermediate data from the mappers' local FS over HTTP GET.
• The reducers write the final answer to HDFS: Bye 1, Goodbye 1, Hadoop 2, Hello 2, World 2.
Components of a Cluster Node (flow-analysis stack, bottom-up):
• Hardware (CPU, HDD, Memory, NIC)
• Operating System (Linux)
• Java Virtual Machine
• Hadoop: HDFS and the MapReduce library
• Flow-tools: flow file input processor, flow analysis map/reduce tools
Source: www.caida.org/workshops/.../wide-casfi1004_wkang.ppt
Hive: MapReduce helper
• Code Example:
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
  hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(*) FROM invites a WHERE a.ds='2008-08-15';
  hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
  hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
NoSQL DBMS: storing and retrieving data
• Key/Value
  • A big hash table
  • Examples: Voldemort, Amazon's Dynamo
• Big Table
  • Big table, column families
  • Examples: HBase, Cassandra
• Document based
  • Collections of collections
  • Examples: CouchDB, MongoDB
• Graph databases
  • Based on graph theory
  • Example: Neo4j
• Each solves a different problem
Source: Scalebase
Pros/Cons
• Pros:
  • Performance
  • Big Data
  • Most solutions are open source
  • Data is replicated to nodes and is therefore fault-tolerant (partitioning)
  • Don't require a schema
  • Can scale up and down
• Cons:
  • Code change
  • No framework support
  • Not ACID
  • Ecosystem (BI, backup)
  • There is always a database at the backend
  • Some APIs are just too simple
Source: Scalebase
Apache Cassandra
• Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store
• Child of Google's BigTable and Amazon's Dynamo
• Peer-to-peer architecture. All nodes are equal
• Cassandra's replication factor (RF) is the total number of nodes onto which the data will be placed. An RF of at least 2 is highly recommended, keeping in mind that your effective number of nodes is (N total nodes / RF).
• CQL (Cassandra Query Language) command line
• Time stamp for each value written
Source: ids.snu.ac.kr/w/images/1/18/2011SS-03.ppt
Consistent Hashing
• Partition using consistent hashing (for the first node data is placed on), based on an MD5 distributed hash table algorithm
• Keys hash to a point on a fixed circular space
• The ring is partitioned into a set of ordered slots; servers and keys are hashed over these slots
• Nodes take positions on the circle
• Example: nodes A, B, and D exist. B is responsible for the AB range (for replication factor = 2, the default). D is responsible for the BD range, and A for the DA range.
• C joins: B and D split their ranges, and C gets the BC range from D.
Source: http://www.intertech.com/resource/usergroup/NoSQL.ppt
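The ring walk above can be sketched in a few lines: hash nodes and keys onto an MD5 ring, and route each key to the first node clockwise from its hash point. This is a minimal sketch of the idea, not Cassandra's actual partitioner (which also handles tokens, replication, and more):

```python
import bisect
import hashlib

def md5_point(s):
    """Hash a string to a point on the 128-bit MD5 ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring: each key belongs to the first
    node clockwise from the key's hash point (wrapping around)."""
    def __init__(self, nodes):
        self.points = sorted(md5_point(n) for n in nodes)
        self.by_point = {md5_point(n): n for n in nodes}

    def node_for(self, key):
        i = bisect.bisect_right(self.points, md5_point(key)) % len(self.points)
        return self.by_point[self.points[i]]

before = HashRing(["A", "B", "D"])
after = HashRing(["A", "B", "C", "D"])   # node C joins the ring
keys = ["row-%d" % i for i in range(1000)]
# Only keys in the range C took over change owner; every other key
# stays exactly where it was - the point of consistent hashing.
moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
```

Contrast this with naive `hash(key) % n_nodes` partitioning, where adding one node reshuffles almost every key.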
Cassandra's tunable consistency (write)

Level | Behavior
ANY | Ensure that the write has been written to at least 1 node, including HintedHandoff recipients.
ONE | Ensure that the write has been written to at least 1 replica's commit log and memory table before responding to the client.
TWO | Ensure that the write has been written to at least 2 replicas before responding to the client.
THREE | Ensure that the write has been written to at least 3 replicas before responding to the client.
QUORUM | Ensure that the write has been written to N / 2 + 1 replicas before responding to the client.
LOCAL_QUORUM | Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes within the local datacenter (requires NetworkTopologyStrategy).
EACH_QUORUM | Ensure that the write has been written to <ReplicationFactor> / 2 + 1 nodes in each datacenter (requires NetworkTopologyStrategy).
ALL | Ensure that the write is written to all N replicas before responding to the client. Any unresponsive replicas will fail the operation.
Source: wiki
Cassandra's data model structure
• Think of Cassandra as row-oriented
• A keyspace (settings, e.g. partitioner) contains column families (settings, e.g. comparator, type [Std]), which contain columns: name, value, clock
Source: http://assets.en.oreilly.com/1/event/51/Scaling%20Web%20Applications%20with%20Cassandra%20Presentation.ppt
Data Model – "flexible" scheme! ColumnFamily: Rockets

Key 1: name = "Rocket-Powered Roller Skates", toon = "Ready, Set, Zoom", inventoryQty = 5, brakes = false
Key 2: name = "Little Giant Do-It-Yourself Rocket-Sled Kit", toon = "Beep Prepared", inventoryQty = 4, brakes = false
Key 3: name = "Acme Jet Propelled Unicycle", toon = "Hot Rod and Reel", inventoryQty = 1, wheels = 1
Source: http://wenku.baidu.com/view/6e254321482fb4daa58d4b87.html
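The flexible scheme above is easy to picture as a dict of dicts: every row carries its own name-to-value map, so row 3 can have a `wheels` column that rows 1 and 2 lack, with no schema change ("import-export") required. A plain-Python illustration, not Cassandra's storage format:

```python
# ColumnFamily "Rockets" as a dict of rows; each row is its own
# column map, so rows need not share the same set of columns.
rockets = {
    1: {"name": "Rocket-Powered Roller Skates",
        "toon": "Ready, Set, Zoom", "inventoryQty": 5, "brakes": False},
    2: {"name": "Little Giant Do-It-Yourself Rocket-Sled Kit",
        "toon": "Beep Prepared", "inventoryQty": 4, "brakes": False},
    3: {"name": "Acme Jet Propelled Unicycle",
        "toon": "Hot Rod and Reel", "inventoryQty": 1, "wheels": 1},
}
# Row 3 was written with a different set of columns:
print(sorted(rockets[1]))  # ['brakes', 'inventoryQty', 'name', 'toon']
print(sorted(rockets[3]))  # ['inventoryQty', 'name', 'toon', 'wheels']
```

In an RDBMS, adding `wheels` would mean an ALTER TABLE for every row; here it is just another column in one row's map.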
Cassandra's CQL – Cassandra SQL Language
• SQL-like. Example:
  CREATE KEYSPACE test WITH strategy_class = 'SimpleStrategy' AND strategy_options:replication_factor=1;
  CREATE INDEX ON users (birth_date);
  SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
• However:
  • No joins
  • No UPDATES/DELETES
NoSQL benchmark – for scale!
Source: research.yahoo.com/files/ycsb-v4.pdf
Can we live with NoSQL limitations?
• Facebook has dropped Cassandra
• "...we found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure"
• Facebook has selected HBase (columnar DBMS)
http://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
What about other NoSQL DBMS?
• MongoDB
• HBase
• CouchDB
• Maybe next session…
Big Data potential implications on IT
• Will traditional RDBMS be obsolete? Surely not!
• Several areas are Big Data zones by definition – internet marketing, cyber, DW, etc.
• How well can we live with "Eventually Consistent", which in most cases means a 1-2 minute delay?!
• Can we decide that all batch data can live well on Big Data technologies?
• Will we see in the end (10 years from now) that only a small portion of data still resides on RDBMS, and most of the data resides on Big Data technologies?!
Big data challenges
• NLP in Hebrew (entity recognition is more difficult)
• Adapting analytical algorithms to match the big data world (anomaly detection needs to be redefined)
• Some problems with consistency
• Skills problem – BI staff need to program in Java; Hadoop and NoSQL knowledge
Example of big data technology: SPLUNK
• Splunk is a traditional IT vendor whose product is based on MapReduce (since 2009)
Thanks for your patience and hope you enjoyed
Here you can find the latest version of this presentation: http://www.slideshare.net/pini