The causes and consequences of too many bits


This is the first of a series of presentations on the challenges and opportunities of Big Data.

Published in: Business, Technology


  1. A brief exploration of the causes and consequences of the wealth of data. Dipesh Lall, dipeshlall@gmail.com, © 2011
  2. Before we start our journey, a bit about a bit, a byte and lots of bytes.
     • A bit (b) is short for binary digit, the 1-or-0 code computers use to store and process data.
     • Binary means base 2, just as decimal means base 10.
     • A byte (B) is the basic unit of computing, used to encode an English letter or number. One byte equals 8 bits.

     Unit            Size
     Bit (b)         1 or 0
     Byte (B)        8 bits
     Kilobyte (KB)   1,000 bytes (≈ 2^10 bytes)
     Megabyte (MB)   1,000 KB (≈ 2^20 bytes)
     Gigabyte (GB)   1,000 MB (≈ 2^30 bytes)
     Terabyte (TB)   1,000 GB (≈ 2^40 bytes)
     Petabyte (PB)   1,000 TB (≈ 2^50 bytes)
     Exabyte (EB)    1,000 PB (≈ 2^60 bytes)
     Zettabyte (ZB)  1,000 EB (≈ 2^70 bytes)
     Yottabyte (YB)  1,000 ZB (≈ 2^80 bytes)

     • One page of typed text is roughly 2 KB.
     • All books catalogued in the US Library of Congress total around 15 TB.
     • Google processes about 1 PB every hour.
     • Monthly internet data flows run around 21 EB.
     • The total amount of information in existence is around 1.2 ZB.
     • A YB is currently too big to imagine (per The Economist).
     • The International Bureau of Weights and Measures sets the names of the prefixes.
  3. A perfect storm of forces is conspiring to generate a lot of data.
     • Data storage costs ($/TB) are falling…
     • …data-creating devices (number of hosts) are growing…
     • …data processing costs ($/GFLOPS) are falling…
     • …connectivity (degree of connectivity) is growing…
     • …data moving costs ($/Mbps) are falling…while…
     • …performance expectations (speed of response) keep rising.
     The net result: a large volume of data of rich variety at various speeds, i.e. "Big Data". Please note that the slopes of the various trend lines differ, but they are directionally correct.
  4. Almost everything is instrumented, which means data is being generated in various formats (variety), at various speeds (velocity) and in various volumes (volume).
     • Structured data (tables, records)
     • Semi-structured data (XML and similar standards)
     • Complex data (hierarchical or legacy sources)
     • Event data (messages)
     • Unstructured data (human language, audio, video)
     • Social media data (blogs, tweets, social networks)
     • Web logs and click streams
     • Spatial data (long/lat, GPS)
     • Machine-generated data (sensors, RFID, devices, server logs)
     • Scientific data (genomes, proteomics, astronomy)
  5. Now, all this data is pure cost unless it is transformed into information from which insights can be drawn and the right actions taken to create or protect value.
     • The information value chain depicts the stages in the journey of data from its creation to its use: Data → Information → Insights → Decisions → Action → Value.
     • At each stage of the value chain, the right mix of business processes, human skills and technology capabilities is needed.
     • Relational database management systems (RDBMS) date back to the early 70s. They have worked well for transactional and structured data because this type of data can be stored in table format, with relationships between and amongst the tables. The query technology was developed at IBM (in San Jose) and was initially called SEQUEL (Structured English Query Language); it is now called SQL.
     • As more of the data generated shifts from structured to other formats, the traditional methods of managing data become impractical.
     • So here is what has happened in the management of data over time:
       – Vertical scaling: bigger RDBMS machines with more disk space, more horsepower, bigger data centers.
       – Horizontal scaling arrived as vertical scaling reached its limit from a data-volume standpoint; hence Massively Parallel Processing (MPP) machines.
       – But then came unstructured data (variety) and streaming data (velocity), so a whole new way to manage data was needed: Big Data (BD).
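The "tables with relationships" idea above can be sketched with Python's built-in sqlite3 module; the table and column names here are invented for illustration, not from the deck:

```python
import sqlite3

# Transactional, structured data fits naturally into related tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")
conn.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, 9.5), (2, 20.5)])

# SQL expresses the relationship between the tables declaratively, via a join.
total = conn.execute(
    "SELECT SUM(o.amount) FROM orders o "
    "JOIN customers c ON o.customer_id = c.id WHERE c.name = 'Ada'"
).fetchone()[0]
print(total)  # 30.0
```

This declarative, schema-first style is exactly what breaks down once the data stops fitting neatly into predefined tables.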
  6. How do RDBMSs really work (for the most part)?
     • Multiple interfaces
     • Slow: disk drives need time to read and write
     • Sequential
     • Indexing is a big challenge
     • The schema is not flexible
     The typical pipeline: data is generated in multiple channels → data is stored in databases → data is aggregated in data warehouses → data is analyzed in analytical applications → information is reported.
     • So the solution is to remove all these boxes (no pun intended) and get analytics as close as possible to the data. Hence you hear terms like in-database analytics (analytics moving into the database) or in-memory analytics (the database moving into memory). The pipeline collapses to: data is generated via multiple channels → data is stored, aggregated and analyzed on a single platform → information is reported.
  7. RDBMSs cannot scale indefinitely because their intrinsic constraints run up against a humbling rule: you cannot have everything in life, and you have to choose.
     • RDBMSs rely on the ACID principle:
       – Atomicity: all or nothing.
       – Consistency: every transaction takes the database from one valid state to another without impairing referential integrity.
       – Isolation: other operations cannot access data while a transaction is midstream.
       – Durability: the ability to recover from system failure.
     • Vertically scaled RDBMSs do honor the ACID principle, but horizontally scaled ones (MPP machines) cannot fully do so. This trade-off is captured by the CAP theorem, which says a distributed system can have at most two of the following three:
       – Consistency, which means you operate fully or not at all.
       – Availability, which means a node failure does not prevent surviving nodes from completing the task.
       – Partition tolerance (the distributed part), which means the system continues to operate despite arbitrary message loss.
     • The two bullets above mean that as you scale an RDBMS you run into a wall… actually a cap!
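The atomicity ("all or nothing") property above can be demonstrated with sqlite3, whose connection object commits a transaction on success and rolls it back on error when used as a context manager (the account data is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100.0), ("bob", 50.0)])
conn.commit()

try:
    with conn:  # one transaction: commit on success, rollback on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        raise RuntimeError("simulated crash before bob is credited")
except RuntimeError:
    pass

# Atomicity: the half-finished transfer was rolled back, so no money vanished.
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'").fetchone()[0]
print(balance)  # 100.0
```

Guaranteeing this across many machines that can lose messages to each other is exactly where the CAP trade-off bites.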
  8. Therefore, RDBMSs are not good at performing all types of analysis.
     • We need scalable database models that are not dependent on fixed data schemas: a new data architecture.
     • Vertical scaling (one application on one ever-bigger database) addresses volume growth; horizontal scaling (one application spread over many databases) addresses velocity growth; schema-agnostic scaling addresses variety growth.
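The horizontal-scaling idea above can be sketched as simple hash-based sharding: each key is routed to one of N database nodes, so no single machine holds everything. This is a toy illustration (real systems typically use consistent hashing so nodes can be added or removed gracefully):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Deterministically map a key to one of num_shards nodes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}  # stand-ins for separate servers

for user, city in [("ada", "London"), ("alan", "Wilmslow"), ("grace", "NYC")]:
    shards[shard_for(user, NUM_SHARDS)][user] = city

# A read goes straight to the shard that owns the key.
print(shards[shard_for("ada", NUM_SHARDS)]["ada"])  # London
```

The cost of this design is that cross-shard operations (joins, global transactions) become hard, which is the CAP trade-off from the previous slide in practice.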
  9. The rich variety of data has conspired to make data management a painnus posteriorus*.
     • While the volume (bad) and velocity (badder) of data are growing rapidly, it is the growing variety (baddest) of data that is the complexity multiplier in the management of all these bits.
     • The RDBMS and MPP approaches have exhausted the ability of current architectures to process the torrent of bits flowing.
     • Hence arrived what I call Big Data Architecture (BDA).
     • BDA does not replace existing investments in data management; BDA complements them. There is no need to rip-and-replace; it is more insert-and-augment.
     • BDA started in companies that had BD: essentially internet companies like Yahoo, Google, Facebook, Amazon, Twitter and LinkedIn that needed web-scale solutions to their data problems. They built this from scratch because there was nothing commercially available.
     • This revolution was called NOSQL (Not Only SQL).
     • The "NO" means that it is a technology that works in addition to SQL, not instead of it.
     • NOSQL databases were organically developed and are essentially schema-agnostic, meaning that some of the constraints of SQL databases are relaxed.
     *: painnus posteriorus is a contemporary acute discomfort of the lower thoracic region induced by unrelenting bit storms.
  10. NOSQL solves the complexity, volume and speed constraints of an SQL design by using four different data models.
     • Key value stores: a schema-less model of storing data.
     • Big table clones: compressed, high-performance database systems based on the Google File System.
     • Document databases: a method to store semi-structured data.
     • Graph databases: use graph structures (nodes, edges, etc.) that provide index-free lookups.

     NOSQL model         Based on          Examples
     Key value stores    Amazon Dynamo     Memcached, Dynamo, Voldemort, Tokyo Cabinet
     Big table clones    Google BigTable   HBase, Cassandra, HyperTable, AzureTS
     Document databases  Amazon Dynamo     Lotus Domino, CouchDB, MongoDB, Riak
     Graph databases     Graph theory      AllegroGraph, VertexDB, Neo4J, Active RDF
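The key-value model above is the simplest of the four: opaque values looked up by key, with no tables and no fixed schema. A minimal in-memory sketch (illustrative only, in the spirit of Memcached rather than its actual API):

```python
class KeyValueStore:
    """A toy schema-less store: any key maps to any value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
# Values need not share any structure: no schema to migrate, nothing to join.
store.put("user:42", {"name": "Ada", "likes": ["big data"]})
store.put("page:/home", "<html>…</html>")
print(store.get("user:42")["name"])  # Ada
```

Because keys are independent of each other, a store like this partitions across machines trivially, which is why the model underpins web-scale systems like Dynamo.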
  11. BDA is actually very effective.
     • Yahoo tested BDA by calculating Pi to the 2,000,000,000,000,000th digit.
     • It used 1,000 computers, and the calculation took 23 days. That is 23,000 computing days.
     • Using an RDBMS on one PC it would have taken about 500 years, essentially ~182,621 computing days. That is an ~87% improvement (using a very rough back-of-the-envelope calculation).
     • So yes, BDA works.
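The back-of-the-envelope arithmetic above checks out; all figures come from the slide itself:

```python
# Figures from the slide: 1,000 machines for 23 days vs. ~500 years on one PC.
cluster_days = 1_000 * 23            # total computing days on the cluster
single_pc_days = 500 * 365.2425      # ~500 years expressed in days
improvement = 1 - cluster_days / single_pc_days

print(round(single_pc_days))   # ~182,621 computing days, matching the slide
print(round(improvement, 2))   # ~0.87, i.e. the slide's ~87% figure
```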
  12. BDA works by breaking a problem into pieces, analyzing each piece separately and then aggregating the results into a single response.
     • HADOOP is an instance of NOSQL that has two main parts: MapReduce and HDFS.
       – MapReduce means mapping a problem out to worker nodes and then aggregating (reducing) the results: in the map phase a master node splits the problem into pieces 1…n and hands them to worker nodes; in the reduce phase the master aggregates their outputs into the result.
       – HDFS is the file management system that makes MapReduce work.
     • Example applications: Google searches; Amazon recommendations; PayPal real-time fraud detection; credit card unauthorized-charge detection; Loopt; directions from office to bar/pub (nearest vs. cheapest); genomics searching (needle-in-a-haystack); Zynga gaming; Facebook Friends; LinkedIn People-you-may-know (PYMK); GPS directions (as you drive); …
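The map-then-reduce flow above can be sketched in a few lines with a toy word count, the classic MapReduce example. This runs sequentially on one machine; in Hadoop each map call would run on a different worker node:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk: str) -> Counter:
    """A worker node counts words in its own piece, independently."""
    return Counter(chunk.split())

def reduce_phase(partials):
    """The master aggregates the partial counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())

chunks = ["big data big", "data works", "big wins"]  # pieces of the problem
partials = [map_phase(c) for c in chunks]            # map: one per worker
result = reduce_phase(partials)                      # reduce: single response
print(result["big"])  # 3
```

The key property is that the map calls share nothing, so adding workers scales the map phase almost linearly, which is what made the Pi calculation on the previous slide feasible.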
  13. What does the BDA landscape look like?
     • It depends on what the need is, but here is a simple, purely illustrative stack showing the various elements:
       – Data presentation: visualization, mobile, R
       – Data processing: Hadoop (batch); S4, Storm (streaming)
       – Data query: Pig, Hive
       – Database: Voldemort, Cassandra, HBase
       – Data collection: Kafka, Flume, Scribe
       – Supporting services: Zookeeper (coordination); Azkaban, Oozie (processing schedulers); Chukwa (displaying and monitoring logs); job tracker and task trackers
  14. BDA does not mean you need to throw away your investments in traditional data analytics infrastructure.
     • BDA works alongside the existing investments companies have made, feeding the same reporting and distribution layer as the traditional BI infrastructure. It is not rip-and-replace!
  15. Even NOSQL is getting challenged, but for now we got-to-dance-with-them-what-brung-you.
     • Zynga needs an additional 1,000 servers every week for its data needs.
     • Every search string you send to Google is divided and sent to 700-1,000 servers so that you get your response back in milliseconds, and thus do not waste a few seconds in which you could have destroyed civilization.
     • YouTube serves 1 billion videos every day.
     • 2.5 billion photos are uploaded to Facebook each month.
     • ~150,000 zombie computers are created every day (used in botnets for sending spam).
     • At the beginning of 2009 there were 187 million web sites; at the end of 2009 there were 234 million. That is 25% growth.
  16. And what is next? Big Data + Context + Interactivity =
  17. Smart Data…
  18. …which will make Minority Report scenarios look like…
  19. …Pong.
  20. New skills you should consider in the world of Big Data:
     – Cultivate expertise but be a strong generalist
     – Develop and grow relationships and networks
     – Develop communication skills
     – Refine presentation skills
     – Read up, a lot
     – Monitor competition
     – Understand business, I mean really understand it
     – Embrace* ambiguity: love the edge, and step outside your comfort zone, frequently
     – If you have the appetite, read a book or two on statistics
     – Think laterally; this just means do not be afraid to connect the dots
     *: At a minimum, learn to accept ambiguity.
